Experimental Data Ingestion

This document describes how to provide custom experimental data to the model, allowing users to substitute their own measurements for the reference data shipped with the repository.

Overview

The vEcoli model ships with curated reference data in reconstruction/ecoli/flat/. This data was compiled from public databases and literature and represents a “default” E. coli K-12 MG1655 grown in M9 minimal medium with glucose.

In many cases, users want to parameterize the model with their own experimental measurements—for example, RNA-seq data from a different strain, growth condition, or laboratory. The experimental data ingestion system provides a structured way to do this without modifying the core reference files.

Philosophy

The ingestion system follows these principles:

Like-for-like substitution: Custom data must match the format and semantics of the reference data it replaces. For RNA-seq, this means gene-level TPM values that can be mapped to the model’s gene set.
Schema validation: All ingested data is validated against Pandera schemas (see wholecell.io.schemas) to catch formatting errors early.
Manifest-based organization: Datasets are registered in manifest files that provide metadata (source, strain, condition) alongside file paths. This keeps the data self-documenting.
Config-driven selection: Users specify which dataset to use via configuration options, making it easy to switch between datasets without code changes.

Currently Supported Data Types

Data Type	Description	Status
RNA-seq (transcriptome)	Gene-level TPM expression values	✓ Supported
Proteomics	Protein abundance measurements	Planned, near term
Metabolomics	Metabolite concentrations	Under consideration
Metabolic fluxes	Flux values	Under consideration
Growth physiology	Growth rates, cell size, etc.	Under consideration

RNA-seq

RNA-seq data provides the gene expression levels that ParCa (the parameter calculator) uses to set basal transcription rates. By default, ParCa reads expression data from the reference files. With the ingestion system, you can substitute your own RNA-seq measurements.

File Organization

RNA-seq data is organized as:

reconstruction/
└── ecoli/
    └── experimental_data/
        └── rnaseq/
            ├── manifest.tsv        # Lists all available datasets
            ├── ref_0001.tsv        # TPM table for dataset ref_0001
            ├── ref_0002.tsv        # TPM table for dataset ref_0002
            ├── gbw_0001.tsv        # TPM table for dataset gbw_0001
            └── ...

The manifest.tsv file is the entry point—it lists all available datasets and their metadata. Each dataset has a corresponding TPM table file.

Manifest Schema

The manifest is a tab-separated file validated against RnaseqSamplesManifestSchema.

Required columns:

Column	Type	Description
`dataset_id`	string	Unique identifier for this dataset (e.g., `gbw_0001`). Referenced in config.
`dataset_description`	string	Human-readable description of the dataset.
`file_path`	string	Path to the TPM table file, either absolute or relative to the manifest file's location.
`data_source`	string	Origin of the data (e.g., `Ginkgo Bioworks`, `PNNL`).

Optional columns:

Column	Type	Description
`data_source_experiment_id`	string	Experiment identifier from the data source.
`data_source_date`	string	Date of the experiment (e.g., `2026-01-15`).
`strain`	string	Strain descriptor (e.g., `MG1655 rph+`).
`condition`	string	Cultivation condition (e.g., `M9, Glucose, Aerobic, 37C`).

Example manifest:

dataset_id  dataset_description             file_path     data_source      strain       condition
ref_0001    Reference M9 Glucose minus AAs  ref_0001.tsv  reference        MG1655       M9, Glucose, Aerobic
gbw_0001    MG1655 rph+ in Modified M9      gbw_0001.tsv  Ginkgo Bioworks  MG1655 rph+  Modified_M9_N_Fe
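
As a rough illustration of what the manifest structure implies, the following sketch parses manifest text with the Python standard library and checks the required columns listed above. It is not the RnaseqSamplesManifestSchema implementation; the function name read_manifest and the duplicate-ID check are assumptions made for this example.

```python
import csv
import io

# Required column names, taken from the manifest schema table above.
REQUIRED_COLUMNS = ("dataset_id", "dataset_description", "file_path", "data_source")


def read_manifest(text):
    """Parse manifest TSV text into {dataset_id: row} (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"manifest is missing required columns: {missing}")
    datasets = {}
    for row in reader:
        dataset_id = row["dataset_id"]
        # Assumption: "unique identifier" means a repeated ID is an error.
        if dataset_id in datasets:
            raise ValueError(f"duplicate dataset_id: {dataset_id!r}")
        datasets[dataset_id] = row
    return datasets


# The example manifest from above, written with real tab separators.
EXAMPLE_MANIFEST = (
    "dataset_id\tdataset_description\tfile_path\tdata_source\tstrain\tcondition\n"
    "ref_0001\tReference M9 Glucose minus AAs\tref_0001.tsv\treference\tMG1655\tM9, Glucose, Aerobic\n"
    "gbw_0001\tMG1655 rph+ in Modified M9\tgbw_0001.tsv\tGinkgo Bioworks\tMG1655 rph+\tModified_M9_N_Fe\n"
)

datasets = read_manifest(EXAMPLE_MANIFEST)
print(datasets["gbw_0001"]["data_source"])  # prints: Ginkgo Bioworks
```

Because several values contain spaces and commas, the columns must be separated by real tab characters rather than runs of spaces.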

TPM Table Schema

Each TPM table is a tab-separated file validated against RnaseqTpmTableSchema.

Required columns:

Column	Type	Description
`gene_id`	string	Gene identifier matching the model’s gene set (EcoCyc IDs, e.g., `EG10001`).
`tpm_mean`	float	Mean TPM (transcripts per million) for this gene. Must be ≥ 0.

Optional columns:

Column	Type	Description
`tpm_std`	float	Standard deviation of TPM across replicates. Must be ≥ 0.

Example TPM table:

gene_id  tpm_mean  tpm_std
EG10001  1234.56   45.2
EG10002  567.89    23.1
EG10003  0.0       0.0
...

Note

Gene IDs must match the EcoCyc identifiers used by the model. Genes not found in the model’s gene set will be ignored with a warning.
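
The rules above (non-negative tpm_mean, unknown genes ignored with a warning) can be sketched with the standard library as follows. This is only an illustration of the described behavior, not the RnaseqTpmTableSchema or ParCa code; read_tpm_table, MODEL_GENES, and the gene ID EG99999 are made up for the example.

```python
import csv
import io
import warnings


def read_tpm_table(text, model_gene_ids):
    """Return {gene_id: tpm_mean} for genes in the model (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    tpm = {}
    unknown = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        gene_id = row["gene_id"]
        value = float(row["tpm_mean"])
        # "not value >= 0" rejects negative values and NaN alike.
        if not value >= 0:
            raise ValueError(f"line {line_no}: tpm_mean must be >= 0, got {value}")
        if gene_id not in model_gene_ids:
            unknown.append(gene_id)
            continue
        tpm[gene_id] = value
    if unknown:
        warnings.warn(f"ignoring {len(unknown)} gene(s) not in the model: {unknown}")
    return tpm


EXAMPLE_TPM = (
    "gene_id\ttpm_mean\ttpm_std\n"
    "EG10001\t1234.56\t45.2\n"
    "EG10002\t567.89\t23.1\n"
    "EG99999\t10.0\t1.0\n"  # made-up ID standing in for a gene absent from the model
)
MODEL_GENES = {"EG10001", "EG10002", "EG10003"}  # hypothetical subset of the gene set

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tpm = read_tpm_table(EXAMPLE_TPM, MODEL_GENES)

print(tpm)
print(len(caught), "warning(s) raised")
```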

Configuration

To use custom RNA-seq data, add the following options under parca_options in your configuration JSON:

Option	Type	Description
`rnaseq_manifest_path`	string	Path to the manifest TSV file.
`rnaseq_basal_dataset_id`	string	The `dataset_id` to use as the basal transcriptome.
`basal_expression_condition`	string	Modeled condition name (default: `"M9 Glucose minus AAs"`).

Example configuration:

{
    "parca_options": {
        "cpus": 4,
        "outdir": "out/custom_rnaseq",
        "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
        "rnaseq_basal_dataset_id": "gbw_0001",
        "basal_expression_condition": "M9 Glucose minus AAs"
    }
}

Default behavior (backward compatible):

If rnaseq_manifest_path is null or omitted, ParCa uses the legacy reference data from reconstruction/ecoli/flat/rna_seq_data/.
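
The selection logic described here can be pictured with a small sketch. It assumes only what the text states (a null or omitted manifest path means the legacy data is used, and a manifest path requires a dataset ID); resolve_rnaseq_source and the returned dictionary layout are hypothetical, and the actual ParCa code may be organized differently.

```python
import json

# Legacy location named in the text above.
LEGACY_DIR = "reconstruction/ecoli/flat/rna_seq_data/"


def resolve_rnaseq_source(parca_options):
    """Decide which RNA-seq source a parca_options dict selects (hypothetical helper)."""
    manifest_path = parca_options.get("rnaseq_manifest_path")
    dataset_id = parca_options.get("rnaseq_basal_dataset_id")
    if manifest_path is None:
        # Null or omitted manifest path: fall back to the reference data.
        return {"mode": "legacy", "path": LEGACY_DIR}
    if dataset_id is None:
        raise ValueError(
            "rnaseq_manifest_path is set but rnaseq_basal_dataset_id is None"
        )
    return {"mode": "manifest", "path": manifest_path, "dataset_id": dataset_id}


config = json.loads("""
{
    "parca_options": {
        "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
        "rnaseq_basal_dataset_id": "gbw_0001"
    }
}
""")

custom = resolve_rnaseq_source(config["parca_options"])
legacy = resolve_rnaseq_source({"cpus": 4})
print(custom["mode"], custom["dataset_id"])  # prints: manifest gbw_0001
print(legacy["mode"])  # prints: legacy
```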

Validation Errors

The ingestion system validates data early and reports problems through the following errors:

Error	Cause
`ValueError: rnaseq_manifest_path is set but rnaseq_basal_dataset_id is None`	You specified a manifest but forgot to specify which dataset to use.
`FileNotFoundError: ...`	The manifest file or a TPM table file doesn’t exist.
`KeyError: Dataset_id 'xyz' not found in manifest`	The `dataset_id` you specified isn’t in the manifest.
`SchemaError: ...`	A file doesn’t match the expected schema (missing columns, wrong types, etc.).
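
To show where the FileNotFoundError and KeyError cases in this table could come from, here is a sketch of looking up a dataset and resolving its file_path relative to the manifest. The helper locate_tpm_file and the order of the checks are assumptions for illustration; the real ingestion code may check things differently.

```python
import os
import tempfile


def locate_tpm_file(manifest_path, manifest_rows, dataset_id):
    """Return the TPM file path for dataset_id (hypothetical helper).

    manifest_rows maps dataset_id -> row dict with a "file_path" entry.
    """
    if not os.path.isfile(manifest_path):
        raise FileNotFoundError(manifest_path)
    if dataset_id not in manifest_rows:
        raise KeyError(f"dataset_id {dataset_id!r} not found in manifest")
    file_path = manifest_rows[dataset_id]["file_path"]
    if not os.path.isabs(file_path):
        # Relative paths are taken relative to the manifest's directory.
        manifest_dir = os.path.dirname(os.path.abspath(manifest_path))
        file_path = os.path.join(manifest_dir, file_path)
    if not os.path.isfile(file_path):
        raise FileNotFoundError(file_path)
    return file_path


# Demonstration in a throwaway directory: only gbw_0001.tsv exists on disk.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "manifest.tsv")
    for name in ("manifest.tsv", "gbw_0001.tsv"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("")
    rows = {
        "gbw_0001": {"file_path": "gbw_0001.tsv"},
        "ref_0001": {"file_path": "ref_0001.tsv"},
    }
    found_name = os.path.basename(locate_tpm_file(manifest_path, rows, "gbw_0001"))
    outcomes = {}
    for dataset_id in ("xyz", "ref_0001"):
        try:
            locate_tpm_file(manifest_path, rows, dataset_id)
        except (KeyError, FileNotFoundError) as err:
            outcomes[dataset_id] = type(err).__name__

print(found_name)
print(outcomes)
```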

Python API

For programmatic access, use the functions in wholecell.io.ingestion:

from wholecell.io.ingestion import (
    ingest_rnaseq_manifest,
    ingest_rnaseq_tpm_table,
    ingest_transcriptome,
)

# Load and validate a manifest
manifest = ingest_rnaseq_manifest("reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv")

# Load a single TPM table
tpm_df = ingest_rnaseq_tpm_table("reconstruction/ecoli/experimental_data/rnaseq/gbw_0001.tsv")

# Convenience: load a dataset by ID (validates manifest + TPM table)
tpm_df, metadata = ingest_transcriptome(
    "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
    dataset_id="gbw_0001"
)

Adding Your Own Data

To add your own RNA-seq data:

1. Prepare your TPM table as a tab-separated file with gene_id and tpm_mean columns (tpm_std is optional). Ensure gene IDs are EcoCyc identifiers.
2. Place the file in reconstruction/ecoli/experimental_data/rnaseq/, or in any other location that the manifest's file_path entry can point to.
3. Add an entry to the manifest with a unique dataset_id, a description, the data source, and the path to your file.
4. Update your config to point to the manifest and specify your dataset_id.
5. Run ParCa to generate new simulation parameters using your data.
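
If your measurements are per-replicate TPM values, the table preparation step might look like the sketch below, which collapses replicates into tpm_mean and tpm_std columns. The function summarize_replicates and the replicate values are invented for illustration, and the use of the sample standard deviation is an assumption, since the schema does not say which estimator tpm_std expects.

```python
import csv
import io
import statistics


def summarize_replicates(replicates):
    """Turn {gene_id: [tpm per replicate]} into TPM-table TSV text (hypothetical helper)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id", "tpm_mean", "tpm_std"])
    for gene_id in sorted(replicates):
        values = replicates[gene_id]
        mean = statistics.fmean(values)
        # Assumption: sample standard deviation; 0.0 when only one replicate exists.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        writer.writerow([gene_id, f"{mean:.2f}", f"{std:.2f}"])
    return out.getvalue()


# Invented replicate values, for illustration only.
replicates = {
    "EG10001": [1200.0, 1250.0, 1300.0],
    "EG10002": [0.0, 0.0, 0.0],
}
tpm_table_text = summarize_replicates(replicates)
print(tpm_table_text)
```

The resulting text can be written to a .tsv file and registered in the manifest.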

Example workflow:

# 1. Add your TPM file
cp my_experiment_tpm.tsv reconstruction/ecoli/experimental_data/rnaseq/my_exp_001.tsv

# 2. Edit manifest.tsv to add a row for my_exp_001

# 3. Create a config file
cat > configs/my_experiment.json << 'EOF'
{
    "parca_options": {
        "cpus": 4,
        "outdir": "out/my_experiment",
        "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
        "rnaseq_basal_dataset_id": "my_exp_001"
    }
}
EOF

# 4. Run ParCa
python runscripts/parca.py --config configs/my_experiment.json
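
Step 2 of the workflow above (editing manifest.tsv) can also be scripted. The sketch below appends a row with the standard csv module and refuses duplicate IDs; append_manifest_row and the row values are hypothetical, and it assumes the existing manifest ends with a newline.

```python
import csv
import os
import tempfile


def append_manifest_row(manifest_path, new_row):
    """Append new_row (a dict) to a manifest TSV, keeping its column order (hypothetical helper)."""
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = reader.fieldnames
        existing_ids = {row["dataset_id"] for row in reader}
    if new_row["dataset_id"] in existing_ids:
        raise ValueError(f"dataset_id {new_row['dataset_id']!r} already in manifest")
    with open(manifest_path, "a", newline="") as handle:
        # Columns absent from new_row (e.g. optional ones) are left empty.
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t",
            lineterminator="\n", restval="",
        )
        writer.writerow(new_row)


# Demonstration on a temporary copy rather than the real manifest.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "manifest.tsv")
    with open(manifest_path, "w", newline="") as handle:
        handle.write("dataset_id\tdataset_description\tfile_path\tdata_source\tstrain\tcondition\n")
        handle.write("ref_0001\tReference M9 Glucose minus AAs\tref_0001.tsv\treference\tMG1655\tM9, Glucose, Aerobic\n")
    append_manifest_row(manifest_path, {
        "dataset_id": "my_exp_001",
        "dataset_description": "Example description",  # placeholder text
        "file_path": "my_exp_001.tsv",
        "data_source": "My lab",  # placeholder text
    })
    with open(manifest_path, newline="") as handle:
        manifest_lines = handle.read().splitlines()

print(manifest_lines[-1].split("\t"))
```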

References

Schemas: wholecell.io.schemas.rnaseq
Ingestion functions: wholecell.io.ingestion
ParCa configuration: Workflows