===========================
Experimental Data Ingestion
===========================

This document describes how to provide custom experimental data to the model,
allowing users to substitute their own measurements for the reference data
shipped with the repository.

--------
Overview
--------

The vEcoli model ships with curated reference data in ``reconstruction/ecoli/flat/``.
This data was compiled from public databases and literature and represents a
"default" *E. coli* K-12 MG1655 grown in M9 minimal medium with glucose.

Users often want to parameterize the model with their own experimental
measurements, such as RNA-seq data from a different strain, growth condition,
or laboratory. The **experimental data ingestion** system provides a structured
way to do this without modifying the core reference files.

Philosophy
==========

The ingestion system follows these principles:

1. **Like-for-like substitution**: Custom data must match the format and semantics
   of the reference data it replaces. For RNA-seq, this means gene-level TPM values
   that can be mapped to the model's gene set.

2. **Schema validation**: All ingested data is validated against Pandera schemas
   (see :py:mod:`wholecell.io.schemas`) to catch formatting errors early.

3. **Manifest-based organization**: Datasets are registered in manifest files that
   provide metadata (source, strain, condition) alongside file paths. This keeps
   the data self-documenting.

4. **Config-driven selection**: Users specify which dataset to use via configuration
   options, making it easy to switch between datasets without code changes.

Currently Supported Data Types
==============================

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Data Type
     - Description
     - Status
   * - RNA-seq (transcriptome)
     - Gene-level TPM expression values
     - ✓ Supported
   * - Proteomics
     - Protein abundance measurements
     - Planned, near term
   * - Metabolomics
     - Metabolite concentrations
     - Under consideration
   * - Metabolic fluxes
     - Flux values
     - Under consideration
   * - Growth physiology
     - Growth rates, cell size, etc.
     - Under consideration

-------
RNA-seq
-------

RNA-seq data provides gene expression levels used by the ParCa (parameter calculator)
to set basal transcription rates. By default, ParCa uses expression data from the
reference files. With the ingestion system, you can substitute your own RNA-seq
measurements.

File Organization
=================

RNA-seq data is organized as:

.. code-block:: text

    reconstruction/
    └── ecoli/
        └── experimental_data/
            └── rnaseq/
                ├── manifest.tsv        # Lists all available datasets
                ├── ref_0001.tsv       # TPM table for dataset ref_0001
                ├── ref_0002.tsv       # TPM table for dataset ref_0002
                ├── gbw_0001.tsv       # TPM table for dataset gbw_0001
                └── ...

The ``manifest.tsv`` file is the entry point: it lists all available datasets and
their metadata. Each dataset has a corresponding TPM table file.

Manifest Schema
===============

The manifest is a tab-separated file validated against
:py:obj:`~wholecell.io.schemas.rnaseq.RnaseqSamplesManifestSchema`.

**Required columns:**

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Column
     - Type
     - Description
   * - ``dataset_id``
     - string
     - Unique identifier for this dataset (e.g., ``gbw_0001``). Referenced in config.
   * - ``dataset_description``
     - string
     - Human-readable description of the dataset.
   * - ``file_path``
     - string
     - Path to the TPM table file (relative to manifest or absolute).
   * - ``data_source``
     - string
     - Origin of the data (e.g., ``Ginkgo Bioworks``, ``PNNL``).

**Optional columns:**

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Column
     - Type
     - Description
   * - ``data_source_experiment_id``
     - string
     - Experiment identifier from the data source.
   * - ``data_source_date``
     - string
     - Date of the experiment (e.g., ``2026-01-15``).
   * - ``strain``
     - string
     - Strain descriptor (e.g., ``MG1655 rph+``).
   * - ``condition``
     - string
     - Cultivation condition (e.g., ``M9, Glucose, Aerobic, 37C``).

**Example manifest:**

.. code-block:: text

    dataset_id	dataset_description	file_path	data_source	strain	condition
    ref_0001	Reference M9 Glucose minus AAs	ref_0001.tsv	reference	MG1655	M9, Glucose, Aerobic
    gbw_0001	MG1655 rph+ in Modified M9	gbw_0001.tsv	Ginkgo Bioworks	MG1655 rph+	Modified_M9_N_Fe
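
The required-column rule above can be checked with a few lines of
standard-library Python. This is a minimal sketch, not the project's validator
(which uses the Pandera schema named above). The helper name
``check_manifest_columns`` and the duplicate-ID check are assumptions made for
illustration; only the column names come from this document.

.. code-block:: python

    import csv
    import io

    # Required manifest columns, as listed in the table above.
    REQUIRED_COLUMNS = {"dataset_id", "dataset_description", "file_path", "data_source"}


    def check_manifest_columns(handle):
        """Return manifest rows as dicts, raising ValueError on structural problems.

        Hypothetical helper for illustration; not part of wholecell.io.
        """
        reader = csv.DictReader(handle, delimiter="\t")
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"manifest is missing required columns: {sorted(missing)}")
        rows = list(reader)
        # Assumption: "unique identifier" means no dataset_id appears twice.
        ids = [row["dataset_id"] for row in rows]
        duplicates = sorted({i for i in ids if ids.count(i) > 1})
        if duplicates:
            raise ValueError(f"duplicate dataset_id values: {duplicates}")
        return rows


    example = (
        "dataset_id\tdataset_description\tfile_path\tdata_source\n"
        "ref_0001\tReference M9 Glucose minus AAs\tref_0001.tsv\treference\n"
        "gbw_0001\tMG1655 rph+ in Modified M9\tgbw_0001.tsv\tGinkgo Bioworks\n"
    )
    rows = check_manifest_columns(io.StringIO(example))
    print([row["dataset_id"] for row in rows])

Optional columns such as ``strain`` and ``condition`` pass through untouched,
because the check only looks for the required names.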

TPM Table Schema
================

Each TPM table is a tab-separated file validated against
:py:obj:`~wholecell.io.schemas.rnaseq.RnaseqTpmTableSchema`.

**Required columns:**

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Column
     - Type
     - Description
   * - ``gene_id``
     - string
     - Gene identifier matching the model's gene set (EcoCyc IDs, e.g., ``EG10001``).
   * - ``tpm_mean``
     - float
     - Mean TPM (transcripts per million) for this gene. Must be ≥ 0.

**Optional columns:**

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Column
     - Type
     - Description
   * - ``tpm_std``
     - float
     - Standard deviation of TPM across replicates. Must be ≥ 0.

**Example TPM table:**

.. code-block:: text

    gene_id	tpm_mean	tpm_std
    EG10001	1234.56	45.2
    EG10002	567.89	23.1
    EG10003	0.0	0.0
    ...

.. note::
   Gene IDs must match the EcoCyc identifiers used by the model. Genes not found
   in the model's gene set will be ignored with a warning.
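
The two rules described here (non-negative ``tpm_mean`` and ignoring unknown
genes with a warning) could look roughly like the sketch below. It uses only
the standard library and is not the project's Pandera-based validation. The
helper name ``read_tpm_table`` and the gene ID ``NOT_IN_MODEL`` are hypothetical.

.. code-block:: python

    import csv
    import io
    import warnings


    def read_tpm_table(handle, model_gene_ids):
        """Parse a TPM table into {gene_id: tpm_mean}, keeping only model genes.

        Hypothetical helper for illustration; not part of wholecell.io.
        """
        reader = csv.DictReader(handle, delimiter="\t")
        tpm_by_gene = {}
        unknown = []
        for row in reader:
            tpm = float(row["tpm_mean"])
            if tpm < 0:
                raise ValueError(f"negative tpm_mean for {row['gene_id']}: {tpm}")
            if row["gene_id"] in model_gene_ids:
                tpm_by_gene[row["gene_id"]] = tpm
            else:
                unknown.append(row["gene_id"])
        if unknown:
            # Unknown genes are dropped, not treated as an error.
            warnings.warn(f"ignoring {len(unknown)} gene(s) not in the model: {unknown}")
        return tpm_by_gene


    table = io.StringIO(
        "gene_id\ttpm_mean\ttpm_std\n"
        "EG10001\t1234.56\t45.2\n"
        "EG10002\t567.89\t23.1\n"
        "NOT_IN_MODEL\t5.0\t0.1\n"  # made-up ID to trigger the warning
    )
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        tpm_by_gene = read_tpm_table(table, {"EG10001", "EG10002", "EG10003"})
    print(tpm_by_gene)
    print(f"{len(caught)} warning(s) raised")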

Configuration
=============

To use custom RNA-seq data, add the following options under ``parca_options``
in your configuration JSON:

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Option
     - Type
     - Description
   * - ``rnaseq_manifest_path``
     - string
     - Path to the manifest TSV file.
   * - ``rnaseq_basal_dataset_id``
     - string
     - The ``dataset_id`` to use as the basal transcriptome.
   * - ``basal_expression_condition``
     - string
     - Modeled condition name (default: ``"M9 Glucose minus AAs"``).

**Example configuration:**

.. code-block:: json

    {
        "parca_options": {
            "cpus": 4,
            "outdir": "out/custom_rnaseq",
            "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
            "rnaseq_basal_dataset_id": "gbw_0001",
            "basal_expression_condition": "M9 Glucose minus AAs"
        }
    }

**Default behavior (backward compatible):**

If ``rnaseq_manifest_path`` is ``null`` or omitted, ParCa uses the legacy
reference data from ``reconstruction/ecoli/flat/rna_seq_data/``.
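
The selection rule can be summarized as a small decision function. This is a
sketch of the behavior described in this section, not ParCa's actual code:
the function name ``select_rnaseq_source`` and its return value are
hypothetical, while the option names and the error message are taken from this
document.

.. code-block:: python

    def select_rnaseq_source(parca_options):
        """Decide which RNA-seq source applies for a ``parca_options`` dict.

        Hypothetical helper mirroring the rules described in this section.
        """
        manifest_path = parca_options.get("rnaseq_manifest_path")
        dataset_id = parca_options.get("rnaseq_basal_dataset_id")
        if manifest_path is None:
            # Option omitted or null: fall back to the legacy reference data.
            return ("legacy", None, None)
        if dataset_id is None:
            raise ValueError(
                "rnaseq_manifest_path is set but rnaseq_basal_dataset_id is None"
            )
        return ("manifest", manifest_path, dataset_id)


    print(select_rnaseq_source({"cpus": 4}))
    print(
        select_rnaseq_source(
            {
                "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
                "rnaseq_basal_dataset_id": "gbw_0001",
            }
        )
    )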

Validation Errors
=================

The ingestion system validates data early and provides clear error messages:

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Error
     - Cause
   * - ``ValueError: rnaseq_manifest_path is set but rnaseq_basal_dataset_id is None``
     - You specified a manifest but forgot to specify which dataset to use.
   * - ``FileNotFoundError: ...``
     - The manifest file or a TPM table file doesn't exist.
   * - ``KeyError: Dataset_id 'xyz' not found in manifest``
     - The ``dataset_id`` you specified isn't in the manifest.
   * - ``SchemaError: ...``
     - A file doesn't match the expected schema (missing columns, wrong types, etc.).
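
To illustrate where the ``KeyError`` and ``FileNotFoundError`` cases can arise,
the sketch below looks up a dataset in a manifest and resolves its TPM table
path. The helper ``resolve_dataset_path`` is hypothetical, and resolving
relative paths against the manifest's directory is an assumption about what
"relative to manifest" means.

.. code-block:: python

    import csv
    import tempfile
    from pathlib import Path


    def resolve_dataset_path(manifest_path, dataset_id):
        """Return the TPM table path for ``dataset_id`` (illustrative helper)."""
        manifest_path = Path(manifest_path)
        # Opening a missing manifest raises FileNotFoundError on its own.
        with manifest_path.open(newline="") as handle:
            rows = {
                row["dataset_id"]: row
                for row in csv.DictReader(handle, delimiter="\t")
            }
        if dataset_id not in rows:
            raise KeyError(f"Dataset_id {dataset_id!r} not found in manifest")
        table_path = Path(rows[dataset_id]["file_path"])
        if not table_path.is_absolute():
            # Assumption: relative paths are resolved against the manifest's directory.
            table_path = manifest_path.parent / table_path
        if not table_path.is_file():
            raise FileNotFoundError(table_path)
        return table_path


    with tempfile.TemporaryDirectory() as tmp:
        manifest = Path(tmp) / "manifest.tsv"
        manifest.write_text(
            "dataset_id\tdataset_description\tfile_path\tdata_source\n"
            "gbw_0001\tMG1655 rph+ in Modified M9\tgbw_0001.tsv\tGinkgo Bioworks\n"
        )
        (Path(tmp) / "gbw_0001.tsv").write_text("gene_id\ttpm_mean\nEG10001\t1234.56\n")
        resolved_name = resolve_dataset_path(manifest, "gbw_0001").name
        try:
            resolve_dataset_path(manifest, "xyz")
            lookup_error = None
        except KeyError as error:
            lookup_error = error
    print(resolved_name)
    print(type(lookup_error).__name__)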

-----------
Python API
-----------

For programmatic access, use the functions in :py:mod:`wholecell.io.ingestion`:

.. code-block:: python

    from wholecell.io.ingestion import (
        ingest_rnaseq_manifest,
        ingest_rnaseq_tpm_table,
        ingest_transcriptome,
    )

    # Load and validate a manifest
    manifest = ingest_rnaseq_manifest("reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv")

    # Load a single TPM table
    tpm_df = ingest_rnaseq_tpm_table("reconstruction/ecoli/experimental_data/rnaseq/gbw_0001.tsv")

    # Convenience: load a dataset by ID (validates manifest + TPM table)
    tpm_df, metadata = ingest_transcriptome(
        "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
        dataset_id="gbw_0001"
    )

--------------------
Adding Your Own Data
--------------------

To add your own RNA-seq data:

1. **Prepare your TPM table** as a tab-separated file with ``gene_id`` and ``tpm_mean``
   columns. Ensure gene IDs are EcoCyc identifiers.

2. **Place the file** in ``reconstruction/ecoli/experimental_data/rnaseq/``, or in any
   other location that the manifest's ``file_path`` column can point to.

3. **Add an entry to the manifest** with a unique ``dataset_id``, description,
   and the path to your file.

4. **Update your config** to point to the manifest and specify your ``dataset_id``.

5. **Run ParCa** to generate new simulation parameters using your data.
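
For step 1, replicate-level TPM values usually need to be collapsed into the
``tpm_mean`` and ``tpm_std`` columns first. The following is a minimal sketch
using only the standard library. The helper name ``summarize_replicates`` and
the input values are hypothetical, and using the sample standard deviation
(with ``0.0`` for a single replicate) is an assumption, since this document
does not say which estimator is expected.

.. code-block:: python

    import csv
    import io
    import statistics


    def summarize_replicates(replicate_tpms):
        """Collapse {gene_id: [tpm, ...]} into (gene_id, tpm_mean, tpm_std) rows."""
        rows = []
        for gene_id, values in sorted(replicate_tpms.items()):
            mean = statistics.fmean(values)
            # Assumption: sample standard deviation; 0.0 when only one replicate exists.
            std = statistics.stdev(values) if len(values) > 1 else 0.0
            rows.append((gene_id, mean, std))
        return rows


    # Made-up replicate values for two genes.
    replicates = {"EG10001": [10.0, 12.0, 14.0], "EG10002": [0.0, 0.0, 0.0]}
    summary = summarize_replicates(replicates)

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id", "tpm_mean", "tpm_std"])
    writer.writerows(summary)
    print(buffer.getvalue())

Writing to a real file instead of ``io.StringIO`` produces a table with the
column layout described under TPM Table Schema.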

**Example workflow:**

.. code-block:: bash

    # 1. Add your TPM file
    cp my_experiment_tpm.tsv reconstruction/ecoli/experimental_data/rnaseq/my_exp_001.tsv

    # 2. Edit manifest.tsv to add a row for my_exp_001

    # 3. Create a config file
    cat > configs/my_experiment.json << 'EOF'
    {
        "parca_options": {
            "cpus": 4,
            "outdir": "out/my_experiment",
            "rnaseq_manifest_path": "reconstruction/ecoli/experimental_data/rnaseq/manifest.tsv",
            "rnaseq_basal_dataset_id": "my_exp_001"
        }
    }
    EOF

    # 4. Run ParCa
    python runscripts/parca.py --config configs/my_experiment.json

----------
References
----------

- Schemas: :py:mod:`wholecell.io.schemas.rnaseq`
- Ingestion functions: :py:mod:`wholecell.io.ingestion`
- ParCa configuration: :doc:`/workflows`