Experimental Data Ingestion

This document describes how to provide custom experimental data to the model, allowing users to substitute their own measurements for the reference data shipped with the repository.

Overview

The vEcoli model ships with curated reference data in reconstruction/ecoli/flat/. This data was compiled from public databases and literature and represents a “default” E. coli K-12 MG1655 grown in M9 minimal medium with glucose.

Users often want to parameterize the model with their own experimental measurements, for example RNA-seq data from a different strain, growth condition, or laboratory. The experimental data ingestion system provides a structured way to do this without modifying the core reference files.

Philosophy

The ingestion system follows these principles:

  1. Like-for-like substitution: Custom data must match the format and semantics of the reference data it replaces. For RNA-seq, this means gene-level TPM values that can be mapped to the model’s gene set.

  2. Schema validation: All ingested data is validated against Pandera schemas (see wholecell.io.schemas) to catch formatting errors early.

  3. Manifest-based organization: Datasets are registered in manifest files that provide metadata (source, strain, condition) alongside file paths. This keeps the data self-documenting.

  4. Config-driven selection: Users specify which dataset to use via configuration options, making it easy to switch between datasets without code changes.

Currently Supported Data Types

| Data Type | Description | Status |
|---|---|---|
| RNA-seq (transcriptome) | Gene-level TPM expression values | ✓ Supported |
| Proteomics | Protein abundance measurements | Planned, near term |
| Metabolomics | Metabolite concentrations | Under consideration |
| Metabolic fluxes | Flux values | Under consideration |
| Growth physiology | Growth rates, cell size, etc. | Under consideration |

RNA-seq

RNA-seq data provides the gene expression levels that ParCa (the parameter calculator) uses to set basal transcription rates. By default, ParCa takes expression data from the reference files. With the ingestion system, you can substitute your own RNA-seq measurements.

File Organization

RNA-seq data is organized as:

reconstruction/
└── ecoli/
    └── experimental_data/
        └── rnaseq/
            ├── manifest.tsv        # Lists all available datasets
            ├── ref_0001.tsv       # TPM table for dataset ref_0001
            ├── ref_0002.tsv       # TPM table for dataset ref_0002
            ├── gbw_0001.tsv       # TPM table for dataset gbw_0001
            └── ...

The manifest.tsv file is the entry point: it lists all available datasets and their metadata. Each dataset has a corresponding TPM table file.

Manifest Schema

The manifest is a tab-separated file validated against RnaseqSamplesManifestSchema.

Required columns:

| Column | Type | Description |
|---|---|---|
| dataset_id | string | Unique identifier for this dataset (e.g., gbw_0001). Referenced in config. |
| dataset_description | string | Human-readable description of the dataset. |
| file_path | string | Path to the TPM table file (relative to manifest or absolute). |
| data_source | string | Origin of the data (e.g., Ginkgo Bioworks, PNNL). |

Optional columns:

| Column | Type | Description |
|---|---|---|
| data_source_experiment_id | string | Experiment identifier from the data source. |
| data_source_date | string | Date of the experiment (e.g., 2026-01-15). |
| strain | string | Strain descriptor (e.g., MG1655 rph+). |
| condition | string | Cultivation condition (e.g., M9, Glucose, Aerobic, 37C). |

Example manifest:

dataset_id  dataset_description             file_path     data_source      strain       condition
ref_0001    Reference M9 Glucose minus AAs  ref_0001.tsv  reference        MG1655       M9, Glucose, Aerobic
gbw_0001    MG1655 rph+ in Modified M9      gbw_0001.tsv  Ginkgo Bioworks  MG1655 rph+  Modified_M9_N_Fe
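
As an illustration of the manifest layout, here is a minimal standard-library sketch that parses manifest text, checks the required columns, rejects duplicate dataset_id values, and resolves relative file_path entries against the manifest's directory. The helper name load_manifest is hypothetical and is not part of wholecell.io; the real loader validates against RnaseqSamplesManifestSchema and may behave differently.

```python
import csv
import io
from pathlib import Path

# Required columns as listed in the manifest schema above.
REQUIRED_COLUMNS = {"dataset_id", "dataset_description", "file_path", "data_source"}


def load_manifest(text, manifest_dir):
    """Parse manifest TSV text into {dataset_id: row} (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Manifest is missing required columns: {sorted(missing)}")

    datasets = {}
    for row in reader:
        dataset_id = row["dataset_id"]
        if dataset_id in datasets:
            raise ValueError(f"Duplicate dataset_id: {dataset_id!r}")
        path = Path(row["file_path"])
        # Relative paths are interpreted relative to the manifest's directory.
        if not path.is_absolute():
            path = Path(manifest_dir) / path
        row["file_path"] = path
        datasets[dataset_id] = row
    return datasets


# The example manifest from above, with explicit tab separators.
example = (
    "dataset_id\tdataset_description\tfile_path\tdata_source\tstrain\tcondition\n"
    "ref_0001\tReference M9 Glucose minus AAs\tref_0001.tsv\treference\tMG1655\tM9, Glucose, Aerobic\n"
    "gbw_0001\tMG1655 rph+ in Modified M9\tgbw_0001.tsv\tGinkgo Bioworks\tMG1655 rph+\tModified_M9_N_Fe\n"
)
manifest = load_manifest(example, "reconstruction/ecoli/experimental_data/rnaseq")
print(manifest["gbw_0001"]["file_path"])
```

Commas inside a field, as in the condition column, are safe here because only tabs separate columns.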

TPM Table Schema

Each TPM table is a tab-separated file validated against RnaseqTpmTableSchema.

Required columns:

| Column | Type | Description |
|---|---|---|
| gene_id | string | Gene identifier matching the model’s gene set (EcoCyc IDs, e.g., EG10001). |
| tpm_mean | float | Mean TPM (transcripts per million) for this gene. Must be ≥ 0. |

Optional columns:

| Column | Type | Description |
|---|---|---|
| tpm_std | float | Standard deviation of TPM across replicates. Must be ≥ 0. |

Example TPM table:

gene_id   tpm_mean   tpm_std
EG10001   1234.56    45.2
EG10002   567.89     23.1
EG10003   0.0        0.0
...

Note

Gene IDs must match the EcoCyc identifiers used by the model. Genes not found in the model’s gene set will be ignored with a warning.
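
The checks described in this section can be sketched with the standard library alone. The function read_tpm_table and the identifier NOT_A_GENE below are hypothetical, introduced only for illustration; the actual implementation validates against RnaseqTpmTableSchema and its warning text may differ.

```python
import csv
import io
import warnings


def read_tpm_table(text, model_gene_ids):
    """Return {gene_id: tpm_mean} for genes in model_gene_ids (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for column in ("gene_id", "tpm_mean"):
        if column not in (reader.fieldnames or []):
            raise ValueError(f"TPM table is missing required column {column!r}")

    tpm = {}
    unknown = []
    for row in reader:
        value = float(row["tpm_mean"])
        # "not value >= 0" rejects negative values and NaN alike.
        if not value >= 0:
            raise ValueError(f"Invalid tpm_mean for {row['gene_id']}: {value}")
        if row["gene_id"] in model_gene_ids:
            tpm[row["gene_id"]] = value
        else:
            unknown.append(row["gene_id"])
    if unknown:
        # Genes outside the model's gene set are skipped, not treated as errors.
        warnings.warn(f"Ignoring {len(unknown)} gene(s) not in the model: {unknown}")
    return tpm


table = (
    "gene_id\ttpm_mean\ttpm_std\n"
    "EG10001\t1234.56\t45.2\n"
    "EG10002\t567.89\t23.1\n"
    "NOT_A_GENE\t10.0\t1.0\n"
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tpm = read_tpm_table(table, {"EG10001", "EG10002", "EG10003"})
print(tpm)
```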

Configuration

To use custom RNA-seq data, add the following options under parca_options in your configuration JSON:

| Option | Type | Description |
|---|---|---|
| rnaseq_manifest_path | string | Path to the manifest TSV file. |
| rnaseq_basal_dataset_id | string | The dataset_id to use as the basal transcriptome. |
| basal_expression_condition | string | Modeled condition name (default: "M9 Glucose minus AAs"). |

Example configuration:

{
    "parca_options": {
        "cpus": 4,
        "outdir": "out/custom_rnaseq",
        "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
        "rnaseq_basal_dataset_id": "gbw_0001",
        "basal_expression_condition": "M9 Glucose minus AAs"
    }
}

Default behavior (backward compatible):

If rnaseq_manifest_path is null or omitted, ParCa uses the legacy reference data from reconstruction/ecoli/flat/rna_seq_data/.
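
The selection rule can be summarized in a few lines of Python. This is only a sketch of the behavior described here, not ParCa's actual code: the function resolve_rnaseq_source and its return format are assumptions made for illustration.

```python
import json

# Legacy location named in the text above.
LEGACY_RNASEQ_DIR = "reconstruction/ecoli/flat/rna_seq_data/"


def resolve_rnaseq_source(config):
    """Decide which RNA-seq source a config selects (hypothetical helper)."""
    options = config.get("parca_options", {})
    manifest_path = options.get("rnaseq_manifest_path")
    dataset_id = options.get("rnaseq_basal_dataset_id")

    if manifest_path is None:
        # Null or omitted: fall back to the legacy reference data.
        return {"source": "legacy", "path": LEGACY_RNASEQ_DIR}
    if dataset_id is None:
        # A manifest without a dataset choice is ambiguous, so fail early.
        raise ValueError(
            "rnaseq_manifest_path is set but rnaseq_basal_dataset_id is None"
        )
    return {"source": "manifest", "path": manifest_path, "dataset_id": dataset_id}


custom = json.loads("""
{
    "parca_options": {
        "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
        "rnaseq_basal_dataset_id": "gbw_0001"
    }
}
""")
legacy = json.loads('{"parca_options": {"rnaseq_manifest_path": null}}')

print(resolve_rnaseq_source(custom))
print(resolve_rnaseq_source(legacy))
```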

Validation Errors

The ingestion system validates data early and reports problems with the following errors:

| Error | Cause |
|---|---|
| ValueError: rnaseq_manifest_path is set but rnaseq_basal_dataset_id is None | You specified a manifest but forgot to specify which dataset to use. |
| FileNotFoundError: ... | The manifest file or a TPM table file doesn’t exist. |
| KeyError: Dataset_id 'xyz' not found in manifest | The dataset_id you specified isn’t in the manifest. |
| SchemaError: ... | A file doesn’t match the expected schema (missing columns, wrong types, etc.). |

Python API

For programmatic access, use the functions in wholecell.io.ingestion:

from wholecell.io.ingestion import (
    ingest_rnaseq_manifest,
    ingest_rnaseq_tpm_table,
    ingest_transcriptome,
)

# Load and validate a manifest
manifest = ingest_rnaseq_manifest("reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv")

# Load a single TPM table
tpm_df = ingest_rnaseq_tpm_table("reconstruction/ecoli/experimental_data/rnaseq/gbw_0001.tsv")

# Convenience: load a dataset by ID (validates manifest + TPM table)
tpm_df, metadata = ingest_transcriptome(
    "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
    dataset_id="gbw_0001"
)

Adding Your Own Data

To add your own RNA-seq data:

  1. Prepare your TPM table as a tab-separated file with gene_id and tpm_mean columns. Ensure gene IDs are EcoCyc identifiers.

  2. Place the file in reconstruction/ecoli/experimental_data/rnaseq/ (or another location, provided the file_path in the manifest points to it).

  3. Add an entry to the manifest with a unique dataset_id, description, and the path to your file.

  4. Update your config to point to the manifest and specify your dataset_id.

  5. Run ParCa to generate new simulation parameters using your data.
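
For step 1, if you start from per-replicate TPM values, a sketch like the following collapses them into the tpm_mean and tpm_std columns. It assumes at least two replicates per gene and uses the sample standard deviation; the document does not specify which estimator is expected, so treat that choice as an assumption. The function names and the replicate values are illustrative only.

```python
import csv
import io
import statistics


def summarize_replicates(replicate_tpm):
    """Collapse {gene_id: [tpm, ...]} into (gene_id, tpm_mean, tpm_std) rows."""
    rows = []
    for gene_id, values in sorted(replicate_tpm.items()):
        mean = statistics.mean(values)
        # statistics.stdev is the sample standard deviation and needs >= 2 values.
        std = statistics.stdev(values)
        rows.append((gene_id, mean, std))
    return rows


def to_tsv(rows):
    """Serialize summary rows as a tab-separated TPM table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id", "tpm_mean", "tpm_std"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up replicate values for two genes, for illustration only.
replicates = {
    "EG10001": [10.0, 12.0, 14.0],
    "EG10002": [0.0, 0.0, 0.0],
}
rows = summarize_replicates(replicates)
tsv_text = to_tsv(rows)
print(tsv_text)
```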

Example workflow:

# 1. Add your TPM file
cp my_experiment_tpm.tsv reconstruction/ecoli/experimental_data/rnaseq/my_exp_001.tsv

# 2. Edit manifest.tsv to add a row for my_exp_001

# 3. Create a config file
cat > configs/my_experiment.json << 'EOF'
{
    "parca_options": {
        "cpus": 4,
        "outdir": "out/my_experiment",
        "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
        "rnaseq_basal_dataset_id": "my_exp_001"
    }
}
EOF

# 4. Run ParCa
python runscripts/parca.py --config configs/my_experiment.json
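
Step 2 of the workflow (adding a manifest row) can also be scripted rather than edited by hand. The sketch below works on manifest text with the standard library; the function append_manifest_row and the description and data source values are hypothetical placeholders. Optional columns that the new row omits are left empty.

```python
import csv
import io


def append_manifest_row(manifest_text, new_row):
    """Return manifest TSV text with new_row appended (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    fieldnames = reader.fieldnames
    rows = list(reader)
    if any(row["dataset_id"] == new_row["dataset_id"] for row in rows):
        raise ValueError(f"dataset_id {new_row['dataset_id']!r} already exists")

    out = io.StringIO()
    # restval fills omitted optional columns with an empty string;
    # keys not present in the header raise ValueError (DictWriter default).
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n", restval=""
    )
    writer.writeheader()
    writer.writerows(rows + [new_row])
    return out.getvalue()


existing = (
    "dataset_id\tdataset_description\tfile_path\tdata_source\tstrain\tcondition\n"
    "ref_0001\tReference M9 Glucose minus AAs\tref_0001.tsv\treference\tMG1655\tM9, Glucose, Aerobic\n"
)
updated = append_manifest_row(
    existing,
    {
        "dataset_id": "my_exp_001",
        "dataset_description": "My experiment",  # placeholder text
        "file_path": "my_exp_001.tsv",
        "data_source": "My lab",  # placeholder text
    },
)
print(updated)
```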

References