`wholecell.io.data_qc`

Quality control utilities for comparing experimental data against reference datasets.

This module provides stateless, composable functions that operate on DataFrames (typically already validated by wholecell.io.ingestion).

class wholecell.io.data_qc.RnaseqComparisonResult(comparison_table, summary_stats, genes_only_in_ref, genes_only_in_expt)

Bases: object

Result of comparing two RNA-seq TPM tables.

Parameters:

comparison_table (DataFrame)
summary_stats (dict)
genes_only_in_ref (List[str])
genes_only_in_expt (List[str])

comparison_table

Outer join of the reference and experimental TPM tables on gene_id. Core columns: [gene_id, ref_tpm, expt_tpm]. If annotations are provided, the table also includes [gene_name, gene_essential].

Type: pandas.core.frame.DataFrame

summary_stats

Dictionary of summary statistics (correlation, RMSE, gene counts, etc.).

Type: dict

genes_only_in_ref

List of gene_ids present in the reference table but missing from the experimental table.

Type: List[str]

genes_only_in_expt

List of gene_ids present in the experimental table but missing from the reference table.

Type: List[str]
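The four attributes above can be pictured with a small stand-in container. This is a minimal sketch, not the module's class: `ComparisonResultSketch`, the placeholder ID `GENE_X`, and the `n_matched` key are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd


@dataclass
class ComparisonResultSketch:
    """Hypothetical stand-in mirroring the four documented attributes."""

    comparison_table: pd.DataFrame
    summary_stats: dict
    genes_only_in_ref: List[str]
    genes_only_in_expt: List[str]


# Toy outer-join table: NaN marks a gene missing from one side.
table = pd.DataFrame(
    {
        "gene_id": ["EG10068", "EG10117", "GENE_X"],  # GENE_X is a placeholder ID
        "ref_tpm": [12.0, 3.5, float("nan")],
        "expt_tpm": [10.0, float("nan"), 7.0],
    }
)

result = ComparisonResultSketch(
    comparison_table=table,
    summary_stats={"n_matched": 1},  # key name is an assumption
    genes_only_in_ref=["EG10117"],
    genes_only_in_expt=["GENE_X"],
)

# Rows with both TPM values present are the matched genes.
matched = result.comparison_table.dropna(subset=["ref_tpm", "expt_tpm"])
print(matched["gene_id"].tolist())
```

In an outer join, a gene missing from one table shows up as a NaN in that table's TPM column, so dropping rows with NaN recovers the matched subset.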


wholecell.io.data_qc._compute_summary_stats(ref_tpm, expt_tpm, n_ref_total, n_expt_total, n_only_ref, n_only_expt)

Compute summary statistics for matched gene pairs.

Parameters:

ref_tpm (ndarray) – Reference TPM values (matched genes only).
expt_tpm (ndarray) – Experimental TPM values (matched genes only).
n_ref_total (int) – Total number of genes in reference.
n_expt_total (int) – Total number of genes in experimental.
n_only_ref (int) – Genes present only in reference.
n_only_expt (int) – Genes present only in experimental.

Returns:

Summary statistics including correlations, RMSE, and gene counts.

Return type:

dict
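As a rough illustration of the kind of dictionary described above, the sketch below computes a Pearson correlation and an RMSE with NumPy. The function name `summary_stats_sketch`, every key name, and the log2(x + 1) transform are assumptions made for this example. The source does not specify which correlations are computed or on what scale.

```python
import numpy as np


def summary_stats_sketch(ref_tpm, expt_tpm, n_ref_total, n_expt_total,
                         n_only_ref, n_only_expt):
    """Illustrative only: key names and the log2(x + 1) transform are assumptions."""
    ref = np.asarray(ref_tpm, dtype=float)
    expt = np.asarray(expt_tpm, dtype=float)

    # Assumed transform: compress the dynamic range of TPM values.
    log_ref = np.log2(ref + 1.0)
    log_expt = np.log2(expt + 1.0)

    return {
        "n_matched": int(ref.size),
        "n_ref_total": n_ref_total,
        "n_expt_total": n_expt_total,
        "n_only_ref": n_only_ref,
        "n_only_expt": n_only_expt,
        "pearson_r": float(np.corrcoef(ref, expt)[0, 1]),
        "pearson_r_log2": float(np.corrcoef(log_ref, log_expt)[0, 1]),
        "rmse_log2": float(np.sqrt(np.mean((log_expt - log_ref) ** 2))),
    }


# Toy input: expt = 2 * ref + 1, so log2(x + 1) differs by exactly 1 per gene.
stats = summary_stats_sketch(
    ref_tpm=[1.0, 3.0, 7.0],
    expt_tpm=[3.0, 7.0, 15.0],
    n_ref_total=4,
    n_expt_total=5,
    n_only_ref=1,
    n_only_expt=2,
)
print(stats)
```

The count arguments are passed straight through, which matches the documented signature: the arrays hold matched genes only, so totals and one-sided counts have to come from the caller.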

wholecell.io.data_qc.compare_rnaseq_tables(ref_df, expt_df, gene_annotations=None, essential_genes=None)

Compare an experimental RNA-seq TPM table against a reference.

Both DataFrames must follow RnaseqTpmTableSchema (columns: gene_id, tpm_mean).

Parameters:

ref_df (DataFrame) – Reference TPM table.
expt_df (DataFrame) – Experimental TPM table.
gene_annotations (DataFrame | None) – Optional DataFrame with columns [gene_id, gene_name] for adding gene symbols. Can be loaded via load_gene_annotations().
essential_genes (Set[str] | None) – Optional set of essential gene IDs for adding essentiality flag. Can be loaded via load_essential_genes().

Returns:

Result object containing the comparison table, summary statistics, and the lists of genes missing from either table.

Return type:

RnaseqComparisonResult
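The comparison described above can be approximated with plain pandas. The following is a sketch under stated assumptions, not the module's implementation: it uses `DataFrame.merge` with `how="outer"` and `indicator=True`, and the toy data, the placeholder ID `GENE_X`, and the symbols `symA`/`symB` are invented for illustration.

```python
import pandas as pd

# Toy inputs following the documented schema (columns: gene_id, tpm_mean).
ref_df = pd.DataFrame({"gene_id": ["EG10068", "EG10117"], "tpm_mean": [12.0, 3.5]})
expt_df = pd.DataFrame({"gene_id": ["EG10068", "GENE_X"], "tpm_mean": [10.0, 7.0]})

# Outer join on gene_id; the indicator column records which side each row came from.
merged = ref_df.rename(columns={"tpm_mean": "ref_tpm"}).merge(
    expt_df.rename(columns={"tpm_mean": "expt_tpm"}),
    on="gene_id",
    how="outer",
    indicator=True,
)

genes_only_in_ref = merged.loc[merged["_merge"] == "left_only", "gene_id"].tolist()
genes_only_in_expt = merged.loc[merged["_merge"] == "right_only", "gene_id"].tolist()
comparison_table = merged.drop(columns="_merge")

# Optional annotation step (placeholder symbols, not real gene names).
gene_annotations = pd.DataFrame(
    {"gene_id": ["EG10068", "EG10117"], "gene_name": ["symA", "symB"]}
)
essential_genes = {"EG10068"}

comparison_table = comparison_table.merge(gene_annotations, on="gene_id", how="left")
comparison_table["gene_essential"] = comparison_table["gene_id"].isin(essential_genes)

print(comparison_table)
print(genes_only_in_ref, genes_only_in_expt)
```

A left join is used for the annotations so that genes without a known symbol stay in the table with a missing gene_name rather than being dropped.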

wholecell.io.data_qc.load_essential_genes(essential_genes_tsv_path=PosixPath('validation/ecoli/flat/essential_genes.tsv'))

Load the set of essential gene IDs from essential_genes.tsv.

Parameters: essential_genes_tsv_path (str | Path) – Path to essential_genes.tsv file.
Returns: Set of essential gene FrameIDs (e.g., {"EG10068", "EG10117", …}).
Return type: Set[str]
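Reading a TSV into a set of IDs could look like the sketch below. The layout of essential_genes.tsv is not documented here, so the column name `FrameID`, the extra `note` column, and the helper `read_essential_ids` are all assumptions. An in-memory string stands in for the file.

```python
import csv
import io

# Stand-in for the real file; the header names are assumed, not documented.
tsv_text = "FrameID\tnote\nEG10068\tplaceholder\nEG10117\tplaceholder\n"


def read_essential_ids(handle, id_column="FrameID"):
    """Collect the non-empty values of one column into a set (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column] for row in reader if row[id_column]}


essential = read_essential_ids(io.StringIO(tsv_text))
print(sorted(essential))
```

Returning a set makes the later membership check for each gene_id a constant-time lookup.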

wholecell.io.data_qc.load_gene_annotations(genes_tsv_path=PosixPath('reconstruction/ecoli/flat/genes.tsv'))

Load gene annotations (id -> symbol mapping) from genes.tsv.

Parameters: genes_tsv_path (str | Path) – Path to genes.tsv file.
Returns: DataFrame with columns [gene_id, gene_name].
Return type: pd.DataFrame
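The id-to-symbol mapping could be built along these lines. This is a sketch only: the source column names `id` and `symbol`, the extra `other` column, the placeholder symbols, and the helper `read_gene_annotations` are assumptions, since the layout of genes.tsv is not given in this reference.

```python
import io

import pandas as pd

# Stand-in for genes.tsv; header names and symbols are assumed placeholders.
tsv_text = "id\tsymbol\tother\nEG10068\tsymA\tx\nEG10117\tsymB\ty\n"


def read_gene_annotations(handle):
    """Keep two columns and rename them to the documented output names (hypothetical helper)."""
    raw = pd.read_csv(handle, sep="\t", usecols=["id", "symbol"])
    renamed = raw.rename(columns={"id": "gene_id", "symbol": "gene_name"})
    return renamed[["gene_id", "gene_name"]]


annotations = read_gene_annotations(io.StringIO(tsv_text))
print(annotations)
```

A frame in this shape can be passed as the gene_annotations argument of compare_rnaseq_tables, which expects columns [gene_id, gene_name].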
