Experimental Data Ingestion
This document describes how to provide custom experimental data to the model, allowing users to substitute their own measurements for the reference data shipped with the repository.
Overview
The vEcoli model ships with curated reference data in reconstruction/ecoli/flat/.
This data was compiled from public databases and literature and represents a
“default” E. coli K-12 MG1655 grown in M9 minimal medium with glucose.
In many cases, users want to parameterize the model with their own experimental measurements—for example, RNA-seq data from a different strain, growth condition, or laboratory. The experimental data ingestion system provides a structured way to do this without modifying the core reference files.
Philosophy
The ingestion system follows these principles:
Like-for-like substitution: Custom data must match the format and semantics of the reference data it replaces. For RNA-seq, this means gene-level TPM values that can be mapped to the model’s gene set.
Schema validation: All ingested data is validated against Pandera schemas (see wholecell.io.schemas) to catch formatting errors early.
Manifest-based organization: Datasets are registered in manifest files that provide metadata (source, strain, condition) alongside file paths. This keeps the data self-documenting.
Config-driven selection: Users specify which dataset to use via configuration options, making it easy to switch between datasets without code changes.
Currently Supported Data Types
| Data Type | Description | Status |
|---|---|---|
| RNA-seq (transcriptome) | Gene-level TPM expression values | ✓ Supported |
| Proteomics | Protein abundance measurements | Planned, near term |
| Metabolomics | Metabolite concentrations | Under consideration |
| Metabolic fluxes | Flux values | Under consideration |
| Growth physiology | Growth rates, cell size, etc. | Under consideration |
RNA-seq
RNA-seq data provides gene expression levels used by the ParCa (parameter calculator) to set basal transcription rates. By default, ParCa uses expression data from the reference files. With the ingestion system, you can substitute your own RNA-seq measurements.
File Organization
RNA-seq data is organized as:
reconstruction/
└── ecoli/
    └── experimental_data/
        └── rnaseq/
            ├── manifest.tsv    # Lists all available datasets
            ├── ref_0001.tsv    # TPM table for dataset ref_0001
            ├── ref_0002.tsv    # TPM table for dataset ref_0002
            ├── gbw_0001.tsv    # TPM table for dataset gbw_0001
            └── ...
The manifest.tsv file is the entry point—it lists all available datasets and
their metadata. Each dataset has a corresponding TPM table file.
Manifest Schema
The manifest is a tab-separated file validated against
RnaseqSamplesManifestSchema.
Required columns:
| Column | Type | Description |
|---|---|---|
| dataset_id | string | Unique identifier for this dataset (e.g., |
| dataset_description | string | Human-readable description of the dataset. |
| file_path | string | Path to the TPM table file (relative to manifest or absolute). |
| data_source | string | Origin of the data (e.g., |
Optional columns:
| Column | Type | Description |
|---|---|---|
| | string | Experiment identifier from the data source. |
| | string | Date of the experiment (e.g., |
| strain | string | Strain descriptor (e.g., |
| condition | string | Cultivation condition (e.g., |
Example manifest:
dataset_id dataset_description file_path data_source strain condition
ref_0001 Reference M9 Glucose minus AAs ref_0001.tsv reference MG1655 M9, Glucose, Aerobic
gbw_0001 MG1655 rph+ in Modified M9 gbw_0001.tsv Ginkgo Bioworks MG1655 rph+ Modified_M9_N_Fe
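The manifest rules above (required columns, unique dataset IDs, and file paths that may be relative to the manifest) can be illustrated with a minimal standard-library sketch. This is not the project's validator, which uses Pandera; the helper name `load_manifest` and the choice of `ValueError` are assumptions made for illustration only.

```python
import csv
from pathlib import Path

# Required columns, taken from the example manifest header above.
REQUIRED_COLUMNS = ("dataset_id", "dataset_description", "file_path", "data_source")


def load_manifest(manifest_path):
    """Read a manifest TSV and return {dataset_id: row} with resolved file paths.

    Hypothetical helper for illustration; not part of wholecell.io.ingestion.
    """
    manifest_path = Path(manifest_path)
    with manifest_path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            raise ValueError(f"Manifest is missing required columns: {missing}")
        datasets = {}
        for row in reader:
            dataset_id = row["dataset_id"]
            if dataset_id in datasets:
                raise ValueError(f"Duplicate dataset_id: {dataset_id}")
            # Relative paths are interpreted relative to the manifest's directory.
            path = Path(row["file_path"])
            if not path.is_absolute():
                path = manifest_path.parent / path
            row["file_path"] = str(path)
            datasets[dataset_id] = row
    return datasets
```

Keying the result by dataset ID makes the later lookup by `rnaseq_basal_dataset_id` a plain dictionary access.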
TPM Table Schema
Each TPM table is a tab-separated file validated against
RnaseqTpmTableSchema.
Required columns:
| Column | Type | Description |
|---|---|---|
| gene_id | string | Gene identifier matching the model’s gene set (EcoCyc IDs, e.g., |
| tpm_mean | float | Mean TPM (transcripts per million) for this gene. Must be ≥ 0. |
Optional columns:
| Column | Type | Description |
|---|---|---|
| tpm_std | float | Standard deviation of TPM across replicates. Must be ≥ 0. |
Example TPM table:
gene_id tpm_mean tpm_std
EG10001 1234.56 45.2
EG10002 567.89 23.1
EG10003 0.0 0.0
...
Note
Gene IDs must match the EcoCyc identifiers used by the model. Genes not found in the model’s gene set will be ignored with a warning.
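The behaviour described in the note (non-negative TPM values, unknown genes dropped with a warning) can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function name `filter_tpm_rows`, the in-memory row format, and the warning text are all hypothetical.

```python
import warnings


def filter_tpm_rows(rows, model_gene_ids):
    """Keep rows whose gene_id is in the model; warn once about the rest.

    rows: iterable of dicts with "gene_id" and "tpm_mean" keys (hypothetical format).
    model_gene_ids: set of EcoCyc gene IDs known to the model.
    """
    kept, unknown = [], []
    for row in rows:
        tpm_mean = float(row["tpm_mean"])
        if tpm_mean < 0:
            raise ValueError(f"tpm_mean must be >= 0 for gene {row['gene_id']}")
        if row["gene_id"] in model_gene_ids:
            kept.append({"gene_id": row["gene_id"], "tpm_mean": tpm_mean})
        else:
            unknown.append(row["gene_id"])
    if unknown:
        # One summary warning instead of one warning per gene.
        warnings.warn(f"Ignoring {len(unknown)} gene(s) not in the model: {unknown[:5]}")
    return kept


rows = [
    {"gene_id": "EG10001", "tpm_mean": "1234.56"},
    {"gene_id": "EG10002", "tpm_mean": "567.89"},
    {"gene_id": "NOT_A_MODEL_GENE", "tpm_mean": "1.0"},
]
kept = filter_tpm_rows(rows, model_gene_ids={"EG10001", "EG10002"})
```

Here `kept` holds only the two rows whose IDs appear in the model's gene set, and a single warning reports the dropped entry.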
Configuration
To use custom RNA-seq data, add the following options under parca_options
in your configuration JSON:
| Option | Type | Description |
|---|---|---|
| rnaseq_manifest_path | string | Path to the manifest TSV file. |
| rnaseq_basal_dataset_id | string | The |
| basal_expression_condition | string | Modeled condition name (default: |
Example configuration:
{
  "parca_options": {
    "cpus": 4,
    "outdir": "out/custom_rnaseq",
    "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
    "rnaseq_basal_dataset_id": "gbw_0001",
    "basal_expression_condition": "M9 Glucose minus AAs"
  }
}
Default behavior (backward compatible):
If rnaseq_manifest_path is null or omitted, ParCa uses the legacy
reference data from reconstruction/ecoli/flat/rna_seq_data/.
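The selection logic implied by these options can be sketched as a small decision function: no manifest means legacy data, while a manifest without a dataset ID is an error. The function name `select_rnaseq_source`, the returned dictionary shape, and the use of `ValueError` are assumptions for illustration and may not match how ParCa implements this.

```python
# Legacy location named in the text above.
LEGACY_DIR = "reconstruction/ecoli/flat/rna_seq_data/"


def select_rnaseq_source(parca_options):
    """Decide which RNA-seq source a parca_options dict points to.

    Hypothetical helper for illustration; not part of the ParCa code base.
    """
    manifest_path = parca_options.get("rnaseq_manifest_path")
    if manifest_path is None:
        # Option is null or omitted: fall back to the legacy reference data.
        return {"source": "legacy", "path": LEGACY_DIR}
    dataset_id = parca_options.get("rnaseq_basal_dataset_id")
    if dataset_id is None:
        raise ValueError(
            "rnaseq_manifest_path is set but rnaseq_basal_dataset_id is missing"
        )
    return {"source": "manifest", "manifest_path": manifest_path, "dataset_id": dataset_id}
```

Checking the configuration before any file is read lets a missing dataset ID surface before the manifest or TPM tables are opened.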
Validation Errors
The ingestion system validates data early and provides clear error messages:
| Error | Cause |
|---|---|
| | You specified a manifest but forgot to specify which dataset to use. |
| | The manifest file or a TPM table file doesn’t exist. |
| | The |
| | A file doesn’t match the expected schema (missing columns, wrong types, etc.). |
Python API
For programmatic access, use the functions in wholecell.io.ingestion:
from wholecell.io.ingestion import (
    ingest_rnaseq_manifest,
    ingest_rnaseq_tpm_table,
    ingest_transcriptome,
)

# Load and validate a manifest
manifest = ingest_rnaseq_manifest("reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv")

# Load a single TPM table
tpm_df = ingest_rnaseq_tpm_table("reconstruction/ecoli/experimental_data/rnaseq/gbw_0001.tsv")

# Convenience: load a dataset by ID (validates manifest + TPM table)
tpm_df, metadata = ingest_transcriptome(
    "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
    dataset_id="gbw_0001",
)
Adding Your Own Data
To add your own RNA-seq data:
1. Prepare your TPM table as a tab-separated file with gene_id and tpm_mean columns. Ensure gene IDs are EcoCyc identifiers.
2. Place the file in reconstruction/ecoli/experimental_data/rnaseq/ (or another location).
3. Add an entry to the manifest with a unique dataset_id, description, and the path to your file.
4. Update your config to point to the manifest and specify your dataset_id.
5. Run ParCa to generate new simulation parameters using your data.
Example workflow:
# 1. Add your TPM file
cp my_experiment_tpm.tsv reconstruction/ecoli/experimental_data/rnaseq/my_exp_001.tsv
# 2. Edit manifest.tsv to add a row for my_exp_001
# 3. Create a config file
cat > configs/my_experiment.json << 'EOF'
{
  "parca_options": {
    "cpus": 4,
    "outdir": "out/my_experiment",
    "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
    "rnaseq_basal_dataset_id": "my_exp_001"
  }
}
EOF
# 4. Run ParCa
python runscripts/parca.py --config configs/my_experiment.json
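The workflow above starts from an existing TPM file. If you only have per-replicate TPM values, a table in the documented format (gene_id, tpm_mean, tpm_std) can be produced with a short script. This is a sketch under assumptions: the helper name `write_tpm_table` and the input dictionary are hypothetical, and the sample standard deviation is used here because the document does not say whether sample or population standard deviation is expected.

```python
import csv
import statistics


def write_tpm_table(replicate_tpms, out_path):
    """Write a TPM table from {gene_id: [TPM value per replicate]}.

    Hypothetical helper for illustration; not part of the repository.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene_id", "tpm_mean", "tpm_std"])
        for gene_id, values in sorted(replicate_tpms.items()):
            mean = statistics.fmean(values)
            # Sample standard deviation needs at least two replicates; use 0.0 otherwise.
            std = statistics.stdev(values) if len(values) > 1 else 0.0
            writer.writerow([gene_id, f"{mean:.4f}", f"{std:.4f}"])
```

For example, `write_tpm_table({"EG10001": [10.0, 14.0]}, "my_experiment_tpm.tsv")` writes a header line followed by one row for EG10001.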
References
Schemas: wholecell.io.schemas.rnaseq
Ingestion functions: wholecell.io.ingestion
ParCa configuration: Workflows