wholecell.io.data_qc
Quality control utilities for comparing experimental data against reference datasets.
This module provides stateless, composable functions that operate on DataFrames
(typically already validated by wholecell.io.ingestion).
- class wholecell.io.data_qc.RnaseqComparisonResult(comparison_table, summary_stats, genes_only_in_ref, genes_only_in_expt)[source]
Bases: object
Result of comparing two RNA-seq TPM tables.
- Parameters:
- comparison_table
Outer join of reference and experimental TPMs. Core columns: [gene_id, ref_tpm, expt_tpm]. If annotations provided, also includes [gene_name, gene_essential].
- Type:
pandas.core.frame.DataFrame
- genes_only_in_ref
List of gene_ids present in reference but missing from experimental.
- Type:
List[str]
- genes_only_in_expt
List of gene_ids present in experimental but missing from reference.
- Type:
List[str]
- comparison_table: DataFrame
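The constructor signature above suggests a simple record type. A minimal sketch of such a container, assuming a frozen dataclass and a plain dict for summary_stats (neither is confirmed by this page; the class name ComparisonResultSketch and the gene IDs are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

import pandas as pd


@dataclass(frozen=True)
class ComparisonResultSketch:
    """Hypothetical stand-in mirroring the four documented constructor fields."""

    comparison_table: pd.DataFrame
    summary_stats: Dict[str, Any]  # assumed type; the page does not state it
    genes_only_in_ref: List[str] = field(default_factory=list)
    genes_only_in_expt: List[str] = field(default_factory=list)


# Illustrative values only; the gene IDs are made up.
example = ComparisonResultSketch(
    comparison_table=pd.DataFrame(
        {"gene_id": ["geneA"], "ref_tpm": [10.0], "expt_tpm": [12.0]}
    ),
    summary_stats={"n_matched": 1},
    genes_only_in_ref=["geneB"],
    genes_only_in_expt=["geneC"],
)
```

Fields are read by attribute, for example example.genes_only_in_ref.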
- wholecell.io.data_qc._compute_summary_stats(ref_tpm, expt_tpm, n_ref_total, n_expt_total, n_only_ref, n_only_expt)[source]
Compute summary statistics for matched gene pairs.
- Parameters:
ref_tpm (ndarray) – Reference TPM values (matched genes only).
expt_tpm (ndarray) – Experimental TPM values (matched genes only).
n_ref_total (int) – Total number of genes in reference.
n_expt_total (int) – Total number of genes in experimental.
n_only_ref (int) – Genes present only in reference.
n_only_expt (int) – Genes present only in experimental.
- Returns:
Summary statistics including correlations, RMSE, and gene counts.
- Return type:
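To make "correlations, RMSE, and gene counts" concrete, here is a rough sketch of how such statistics could be computed with NumPy. It is not the module's implementation: the function name sketch_summary_stats, the dictionary keys, and the log10(TPM + 1) transform are all assumptions made for illustration.

```python
import numpy as np


def sketch_summary_stats(ref_tpm, expt_tpm, n_ref_total, n_expt_total,
                         n_only_ref, n_only_expt):
    """Hypothetical sketch; key names and the log transform are assumptions."""
    ref = np.asarray(ref_tpm, dtype=float)
    expt = np.asarray(expt_tpm, dtype=float)
    if ref.shape != expt.shape:
        raise ValueError("matched TPM arrays must have the same shape")

    def pearson(x, y):
        # A correlation needs at least two matched pairs.
        if x.size < 2:
            return float("nan")
        return float(np.corrcoef(x, y)[0, 1])

    # Root-mean-square error on the raw TPM values.
    rmse = float(np.sqrt(np.mean((ref - expt) ** 2))) if ref.size else float("nan")

    return {
        "n_ref_total": n_ref_total,
        "n_expt_total": n_expt_total,
        "n_only_ref": n_only_ref,
        "n_only_expt": n_only_expt,
        "n_matched": int(ref.size),
        "pearson_r": pearson(ref, expt),
        # Assumed: a log-scale correlation is common for expression data.
        "pearson_r_log10": pearson(np.log10(ref + 1.0), np.log10(expt + 1.0)),
        "rmse": rmse,
    }


stats = sketch_summary_stats(
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([2.0, 4.0, 6.0, 8.0]),
    n_ref_total=5, n_expt_total=6, n_only_ref=1, n_only_expt=2,
)
```

In this toy input the experimental values are exactly twice the reference values, so the linear correlation is perfect while the RMSE is nonzero. The two metrics answer different questions: agreement in trend versus agreement in magnitude.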
- wholecell.io.data_qc.compare_rnaseq_tables(ref_df, expt_df, gene_annotations=None, essential_genes=None)[source]
Compare experimental RNA-seq TPM table against a reference.
Both DataFrames must follow RnaseqTpmTableSchema (columns: gene_id, tpm_mean).
- Parameters:
ref_df (DataFrame) – Reference TPM table.
expt_df (DataFrame) – Experimental TPM table.
gene_annotations (DataFrame | None) – Optional DataFrame with columns [gene_id, gene_name] for adding gene symbols. Can be loaded via load_gene_annotations().
essential_genes (Set[str] | None) – Optional set of essential gene IDs for adding the essentiality flag. Can be loaded via load_essential_genes().
- Returns:
Contains comparison table, summary statistics, and missing gene lists.
- Return type:
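The comparison described above (outer join on gene_id, optional annotation columns, lists of unmatched genes) can be sketched with plain pandas. This is an illustration under stated assumptions, not the module's code: the helper name sketch_compare, the use of the merge indicator column, and the example gene IDs are hypothetical.

```python
import pandas as pd


def sketch_compare(ref_df, expt_df, gene_annotations=None, essential_genes=None):
    """Hypothetical sketch of an outer-join comparison of two TPM tables."""
    ref = ref_df[["gene_id", "tpm_mean"]].rename(columns={"tpm_mean": "ref_tpm"})
    expt = expt_df[["gene_id", "tpm_mean"]].rename(columns={"tpm_mean": "expt_tpm"})

    # The outer join keeps genes found in either table; indicator=True adds
    # a "_merge" column recording which side each row came from.
    table = ref.merge(expt, on="gene_id", how="outer", indicator=True)
    only_ref = table.loc[table["_merge"] == "left_only", "gene_id"].tolist()
    only_expt = table.loc[table["_merge"] == "right_only", "gene_id"].tolist()
    table = table.drop(columns="_merge")

    if gene_annotations is not None:
        # Left join so genes without an annotation are kept (gene_name is NaN).
        table = table.merge(
            gene_annotations[["gene_id", "gene_name"]], on="gene_id", how="left"
        )
    if essential_genes is not None:
        table["gene_essential"] = table["gene_id"].isin(list(essential_genes))
    return table, only_ref, only_expt


# Made-up inputs following the documented (gene_id, tpm_mean) columns.
ref_df = pd.DataFrame({"gene_id": ["g1", "g2", "g3"], "tpm_mean": [5.0, 10.0, 0.5]})
expt_df = pd.DataFrame({"gene_id": ["g2", "g3", "g4"], "tpm_mean": [12.0, 0.4, 7.0]})
table, only_ref, only_expt = sketch_compare(ref_df, expt_df, essential_genes={"g2"})
```

Genes missing from one side get NaN in that side's TPM column, so the matched subset for summary statistics can be selected with table.dropna(subset=["ref_tpm", "expt_tpm"]).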
- wholecell.io.data_qc.load_essential_genes(essential_genes_tsv_path=PosixPath('validation/ecoli/flat/essential_genes.tsv'))[source]
Load set of essential gene IDs from essential_genes.tsv.