wholecell.io.data_qc

Quality control utilities for comparing experimental data against reference datasets.

This module provides stateless, composable functions that operate on DataFrames (typically already validated by wholecell.io.ingestion).

class wholecell.io.data_qc.RnaseqComparisonResult(comparison_table, summary_stats, genes_only_in_ref, genes_only_in_expt)

Bases: object

Result of comparing two RNA-seq TPM tables.

Parameters:
  • comparison_table (DataFrame)

  • summary_stats (dict)

  • genes_only_in_ref (List[str])

  • genes_only_in_expt (List[str])

comparison_table

Outer join of the reference and experimental TPM tables on gene_id. Core columns: [gene_id, ref_tpm, expt_tpm]. If gene annotations are provided, the table also includes gene_name; if a set of essential genes is provided, it also includes gene_essential.

Type:

pandas.core.frame.DataFrame

summary_stats

Dictionary of summary statistics (correlation, RMSE, gene counts, etc.).

Type:

dict

genes_only_in_ref

Gene IDs present in the reference table but missing from the experimental table.

Type:

List[str]

genes_only_in_expt

Gene IDs present in the experimental table but missing from the reference table.

Type:

List[str]
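
The four attributes above can be pictured as a small container. The following is a minimal sketch, assuming the class behaves like a plain dataclass; ComparisonResultSketch is a hypothetical stand-in rather than the real class, the "n_matched" key is illustrative, and "EG99999" is a made-up placeholder ID.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd


@dataclass
class ComparisonResultSketch:
    """Hypothetical stand-in mirroring the four documented attributes."""
    comparison_table: pd.DataFrame
    summary_stats: dict
    genes_only_in_ref: List[str]
    genes_only_in_expt: List[str]


# Outer-join layout: a gene missing from one side has NaN in that column.
table = pd.DataFrame({
    "gene_id": ["EG10068", "EG10117", "EG99999"],  # EG99999 is a placeholder
    "ref_tpm": [12.0, 3.5, None],
    "expt_tpm": [10.0, None, 7.0],
})
result = ComparisonResultSketch(
    comparison_table=table,
    summary_stats={"n_matched": 1},  # key name is an assumption
    genes_only_in_ref=["EG10117"],
    genes_only_in_expt=["EG99999"],
)

# Matched genes are the rows with a TPM value on both sides.
matched = result.comparison_table.dropna(subset=["ref_tpm", "expt_tpm"])
```

In this layout, the rows removed by dropna correspond to the IDs listed in genes_only_in_ref and genes_only_in_expt.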

wholecell.io.data_qc._compute_summary_stats(ref_tpm, expt_tpm, n_ref_total, n_expt_total, n_only_ref, n_only_expt)

Compute summary statistics for matched gene pairs.

Parameters:
  • ref_tpm (ndarray) – Reference TPM values (matched genes only).

  • expt_tpm (ndarray) – Experimental TPM values (matched genes only).

  • n_ref_total (int) – Total number of genes in reference.

  • n_expt_total (int) – Total number of genes in experimental.

  • n_only_ref (int) – Genes present only in reference.

  • n_only_expt (int) – Genes present only in experimental.

Returns:

Summary statistics including correlations, RMSE, and gene counts.

Return type:

dict
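
As a rough illustration of the statistics described above, the sketch below computes a Pearson correlation and an RMSE over matched genes and passes the gene counts through. It is not the module's implementation: the function name, the dictionary keys, and the log2(TPM + 1) transform are all assumptions made for the example.

```python
import numpy as np


def summary_stats_sketch(ref_tpm, expt_tpm, n_ref_total, n_expt_total,
                         n_only_ref, n_only_expt):
    """Illustrative only; key names and the log transform are assumptions."""
    ref = np.asarray(ref_tpm, dtype=float)
    expt = np.asarray(expt_tpm, dtype=float)

    # Assumed transform: log2(TPM + 1) damps the influence of highly
    # expressed genes and keeps zero-TPM genes finite.
    log_ref = np.log2(ref + 1.0)
    log_expt = np.log2(expt + 1.0)

    # A correlation needs at least two matched genes.
    if ref.size >= 2:
        pearson = float(np.corrcoef(log_ref, log_expt)[0, 1])
    else:
        pearson = float("nan")

    # Root-mean-square error between the transformed values.
    if ref.size:
        rmse = float(np.sqrt(np.mean((log_ref - log_expt) ** 2)))
    else:
        rmse = float("nan")

    return {
        "n_matched": int(ref.size),
        "n_ref_total": n_ref_total,
        "n_expt_total": n_expt_total,
        "n_only_ref": n_only_ref,
        "n_only_expt": n_only_expt,
        "pearson_log2": pearson,
        "rmse_log2": rmse,
    }


stats = summary_stats_sketch([1.0, 3.0], [3.0, 1.0], 3, 2, 1, 0)
```

Here the two matched genes have swapped expression levels, so the correlation is negative and the RMSE is one log2 unit.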

wholecell.io.data_qc.compare_rnaseq_tables(ref_df, expt_df, gene_annotations=None, essential_genes=None)

Compare an experimental RNA-seq TPM table against a reference table.

Both DataFrames must follow RnaseqTpmTableSchema (columns: gene_id, tpm_mean).

Parameters:
  • ref_df (DataFrame) – Reference TPM table.

  • expt_df (DataFrame) – Experimental TPM table.

  • gene_annotations (DataFrame | None) – Optional DataFrame with columns [gene_id, gene_name] for adding gene symbols. Can be loaded via load_gene_annotations().

  • essential_genes (Set[str] | None) – Optional set of essential gene IDs for adding essentiality flag. Can be loaded via load_essential_genes().

Returns:

Result object containing the comparison table, summary statistics, and the lists of genes missing from either table.

Return type:

RnaseqComparisonResult
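
The outer join and the missing-gene lists described above can be sketched with pandas' merge and its indicator flag. This is a minimal sketch under assumptions, not the function's actual body: compare_tables_sketch is a hypothetical name, it returns a plain tuple instead of an RnaseqComparisonResult, and "EG99999" is a placeholder ID.

```python
import pandas as pd


def compare_tables_sketch(ref_df, expt_df, gene_annotations=None,
                          essential_genes=None):
    """Illustrative outer join of two (gene_id, tpm_mean) tables."""
    ref = ref_df[["gene_id", "tpm_mean"]].rename(columns={"tpm_mean": "ref_tpm"})
    expt = expt_df[["gene_id", "tpm_mean"]].rename(columns={"tpm_mean": "expt_tpm"})

    # indicator=True adds a "_merge" column recording which side had each gene.
    merged = ref.merge(expt, on="gene_id", how="outer", indicator=True)
    only_ref = merged.loc[merged["_merge"] == "left_only", "gene_id"].tolist()
    only_expt = merged.loc[merged["_merge"] == "right_only", "gene_id"].tolist()
    table = merged.drop(columns="_merge")

    # Optional enrichment, mirroring the two optional parameters.
    if gene_annotations is not None:
        table = table.merge(gene_annotations[["gene_id", "gene_name"]],
                            on="gene_id", how="left")
    if essential_genes is not None:
        table["gene_essential"] = table["gene_id"].isin(essential_genes)

    return table, only_ref, only_expt


ref_df = pd.DataFrame({"gene_id": ["EG10068", "EG10117"], "tpm_mean": [12.0, 3.5]})
expt_df = pd.DataFrame({"gene_id": ["EG10068", "EG99999"], "tpm_mean": [10.0, 7.0]})
table, only_ref, only_expt = compare_tables_sketch(
    ref_df, expt_df, essential_genes={"EG10068"}
)
```

Using the merge indicator avoids computing set differences separately, since the same join yields both the table and the two missing-gene lists.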

wholecell.io.data_qc.load_essential_genes(essential_genes_tsv_path=PosixPath('validation/ecoli/flat/essential_genes.tsv'))

Load the set of essential gene IDs from essential_genes.tsv.

Parameters:

essential_genes_tsv_path (str | Path) – Path to essential_genes.tsv file.

Returns:

Set of essential gene FrameIDs (e.g., {"EG10068", "EG10117", …}).

Return type:

Set[str]
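
Reading a TSV into a set of IDs might look like the sketch below. The real file layout is not documented here, so the "FrameID" column name, the "#" comment prefix, and the "note" column in the sample data are assumptions; the function name is hypothetical and an in-memory buffer stands in for the file.

```python
import io

import pandas as pd


def load_essential_genes_sketch(tsv_source, id_column="FrameID"):
    """Illustrative loader; the column name is an assumption."""
    # comment="#" skips lines that start with "#" (assumed header comments).
    df = pd.read_csv(tsv_source, sep="\t", comment="#", dtype=str)
    # A set removes duplicate IDs and gives fast membership checks.
    return set(df[id_column].dropna())


example_tsv = io.StringIO(
    "# placeholder comment line\n"
    "FrameID\tnote\n"
    "EG10068\tplaceholder\n"
    "EG10117\tplaceholder\n"
    "EG10068\tduplicate row\n"
)
essential = load_essential_genes_sketch(example_tsv)
```

The resulting set can be passed as the essential_genes argument of compare_rnaseq_tables.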

wholecell.io.data_qc.load_gene_annotations(genes_tsv_path=PosixPath('reconstruction/ecoli/flat/genes.tsv'))

Load gene annotations (id -> symbol mapping) from genes.tsv.

Parameters:

genes_tsv_path (str | Path) – Path to genes.tsv file.

Returns:

DataFrame with columns [gene_id, gene_name].

Return type:

DataFrame
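
An id-to-symbol table with the documented output columns could be built as in the sketch below. The source column names "id" and "symbol" are assumptions taken from the "id -> symbol" wording, the function name is hypothetical, and the symbols in the sample data are placeholders rather than real gene names.

```python
import io

import pandas as pd


def load_gene_annotations_sketch(tsv_source, id_column="id",
                                 symbol_column="symbol"):
    """Illustrative loader; the source column names are assumptions."""
    df = pd.read_csv(tsv_source, sep="\t", comment="#", dtype=str)
    # Keep only the two columns of interest and rename them to the
    # documented output schema [gene_id, gene_name].
    out = df[[id_column, symbol_column]].rename(
        columns={id_column: "gene_id", symbol_column: "gene_name"}
    )
    # One row per gene_id keeps a later left join from duplicating rows.
    return out.drop_duplicates(subset="gene_id").reset_index(drop=True)


example_tsv = io.StringIO(
    "id\tsymbol\tother\n"
    "EG10068\tgeneA\tx\n"
    "EG10117\tgeneB\ty\n"
)
annotations = load_gene_annotations_sketch(example_tsv)
```

A table in this shape matches what the gene_annotations parameter of compare_rnaseq_tables expects.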