wholecell.io.ingestion

Utilities for ingesting experimental data (e.g. RNA-seq transcriptomes) using the canonical Pandera schemas in wholecell.io.schemas.

This module is intentionally narrow for now:
  • Load TSVs into pandas DataFrames.

  • Validate them against the RNA-seq schemas.

  • Provide a small convenience wrapper to fetch a single transcriptome given a manifest and dataset_id.

wholecell.io.ingestion._read_tsv(path)[source]

Read a tab-delimited file into a DataFrame.

Parameters:

path (str | Path) – Path to the tab-delimited file.

Return type:

DataFrame
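As a rough illustration of what reading a tab-delimited file into a DataFrame involves, the sketch below uses pandas.read_csv with a tab separator. It is not the module's actual implementation: the helper name read_tsv and the example columns gene_id and tpm are assumptions made for this example only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_tsv(path):
    """Hypothetical stand-in for _read_tsv: parse a tab-delimited file."""
    # sep="\t" tells pandas that fields are separated by tabs, not commas.
    return pd.read_csv(Path(path), sep="\t")


# Write a tiny example file; the column names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / "example.tsv"
    tsv_path.write_text("gene_id\ttpm\ngeneA\t12.5\ngeneB\t0.0\n")
    df = read_tsv(tsv_path)

print(df)
```

Both str and Path inputs work here because the argument is wrapped in Path before being handed to pandas.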

wholecell.io.ingestion.ingest_rnaseq_manifest(path)[source]

Load and validate an RNA-seq samples manifest.

Relative file_path entries are resolved relative to the manifest directory for convenience.

Parameters:

path (str | Path) – Path to the manifest TSV file.

Returns:

Validated manifest with file_path normalized to absolute paths.

Return type:

pandas.DataFrame
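The path normalization described above can be sketched with pathlib alone. This is a minimal illustration under assumptions, not the module's code: resolve_file_path is a hypothetical helper, and the example paths are invented.

```python
from pathlib import Path


def resolve_file_path(manifest_path, file_path):
    """Hypothetical helper: make file_path absolute using the manifest's directory."""
    candidate = Path(file_path)
    if candidate.is_absolute():
        # Absolute entries are left untouched.
        return candidate
    # Relative entries are anchored at the directory that contains the manifest.
    return Path(manifest_path).resolve().parent / candidate


# Example with invented paths: the entry is interpreted relative to "data/".
resolved = resolve_file_path(Path("data") / "manifest.tsv", "tpm/sample_a.tsv")
print(resolved)
```

Anchoring at the manifest directory rather than the current working directory means a manifest and its tables can be moved together without editing the file_path column.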

wholecell.io.ingestion.ingest_rnaseq_tpm_table(path)[source]

Load and validate a single RNA-seq TPM table.

Parameters:

path (str | Path) – Path to a TSV file with columns matching RnaseqTpmTableSchema.

Returns:

Validated DataFrame; extra columns are preserved but only the required/optional schema columns are validated.

Return type:

pandas.DataFrame
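To make the "extra columns are preserved" behavior concrete, here is a plain-pandas stand-in for the validation step. It deliberately does not use Pandera or the real RnaseqTpmTableSchema, whose columns are not listed here. The column names gene_id and tpm, the non-negativity check, and the helper check_tpm_table are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical required columns; the real schema may differ.
REQUIRED_COLUMNS = ("gene_id", "tpm")


def check_tpm_table(df):
    """Hypothetical check: validate required columns, leave extras untouched."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if not pd.api.types.is_numeric_dtype(df["tpm"]):
        raise ValueError("tpm must be numeric")
    # Assumed rule: TPM values cannot be negative.
    if (df["tpm"] < 0).any():
        raise ValueError("tpm must be non-negative")
    # Columns outside REQUIRED_COLUMNS are neither checked nor dropped.
    return df


table = pd.DataFrame(
    {"gene_id": ["geneA", "geneB"], "tpm": [12.5, 0.0], "note": ["x", "y"]}
)
validated = check_tpm_table(table)
print(list(validated.columns))
```

The unvalidated note column passes through unchanged, which mirrors the behavior described for the Returns value above.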

wholecell.io.ingestion.ingest_transcriptome(manifest_path, dataset_id)[source]

Ingest a single transcriptome (TPM table) specified by dataset_id.

This is a convenience wrapper that:
  1) Validates the manifest.

  2) Looks up the row with the given dataset_id.

  3) Loads and validates the corresponding TPM table.

Parameters:
  • manifest_path (str | Path) – Path to the RNA-seq samples manifest TSV.

  • dataset_id (str) – Identifier of the dataset to load (must match the dataset_id value of exactly one manifest row).

Returns:

  • Validated TPM table for the requested dataset.

  • Metadata dict for the selected manifest row.

Return type:

(pandas.DataFrame, dict)

Raises:
  • KeyError – If dataset_id is not found in the manifest.

  • ValueError – If multiple rows share the same dataset_id.
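The lookup step and its two error cases could look roughly like the sketch below. It is an illustration under assumptions rather than the function's real body: select_manifest_row is a hypothetical helper, and the manifest contents (including the file_path values) are invented.

```python
import pandas as pd


def select_manifest_row(manifest, dataset_id):
    """Hypothetical helper: return the single manifest row for dataset_id as a dict."""
    matches = manifest[manifest["dataset_id"] == dataset_id]
    if matches.empty:
        raise KeyError(f"dataset_id {dataset_id!r} not found in manifest")
    if len(matches) > 1:
        raise ValueError(f"dataset_id {dataset_id!r} matches {len(matches)} rows")
    # Exactly one match: convert the row to a plain metadata dict.
    return matches.iloc[0].to_dict()


# Invented manifest with one unique and one duplicated dataset_id.
manifest = pd.DataFrame(
    {
        "dataset_id": ["ds_a", "ds_b", "ds_b"],
        "file_path": ["/data/a.tsv", "/data/b1.tsv", "/data/b2.tsv"],
    }
)

metadata = select_manifest_row(manifest, "ds_a")
print(metadata)
```

Raising on duplicates instead of silently taking the first match keeps an ambiguous manifest from loading the wrong TPM table.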