`ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps`

Plot one value per index via heatmap for new_gene_expression_and_translation_efficiency variant.

Possible Plots:

Percent of sims that successfully reached a given generation number
Average doubling time
Average cell volume, mass, dry cell mass, mRNA mass, protein mass
Average translation efficiency, weighted by cistron count
Average mRNA count, monomer count, mRNA mass fraction, protein mass fraction, RNAP portion, and ribosome portion for a capacity gene to measure burden on overall host expression
Average new gene copy number
Average new gene mRNA count
Average new gene mRNA mass fraction
Average new gene mRNA counts fraction
Average new gene NTP mass fraction
Average new gene protein count
Average new gene protein mass fraction
Average new gene protein counts fraction
Average new gene initialization rate for RNAP and ribosomes
Average new gene initialization probabilities for RNAP and ribosomes
Average count and portion of new gene ribosome initialization events per time step
Average number and proportion of RNAP on new genes at a given time step
Average number and proportion of ribosomes on new gene mRNAs at a given time step
Average number and proportion of RNAP making rRNAs at a given time step
Average number and proportion of RNAP and ribosomes making RNAP subunits at a given time step
Average number and proportion of RNAP and ribosomes making ribosomal proteins at a given time step
Average fraction of time new gene is overcrowded by RNAP and Ribosomes
Average overcrowding probability ratio for new gene RNA synthesis and polypeptide initiation
Average max_p probabilities for RNA synthesis and polypeptide initiation
Average number of overcrowded genes for RNAP and Ribosomes
Average number of total, active, and free ribosomes
Average number of ribosomes initialized at each time step
Average number of total, active, and free RNA polymerases
Average ppGpp concentration
Average rate of glucose consumption
Average new gene monomer yields - per hour and per fg of glucose

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.COUNT_INDEX = 32: Count the number of sims that reach this generation (remember that index 7 corresponds to generation 8).

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.DASHBOARD_FLAG = 2

Dashboard Flag

0: Separate Only (Each plot is its own file)
1: Dashboard Only (One file with all plots)
2: Both Dashboard and Separate

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.FONT_SIZE = 9

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.GENE_COUNTS_SQL = '\n WITH unnested_counts AS (\n SELECT unnest(gene_counts) AS gene_counts,\n generate_subscripts(gene_counts, 1)\n AS gene_idx, experiment_id, variant, lineage_seed, generation,\n agent_id\n FROM ({subquery})\n ),\n avg_per_cell AS (\n SELECT avg(gene_counts) AS avg_count,\n experiment_id, variant, gene_idx\n FROM unnested_counts\n GROUP BY experiment_id, variant, lineage_seed,\n generation, agent_id, gene_idx\n ),\n avg_per_variant AS (\n SELECT log10(avg(avg_count) + 1) AS avg_count,\n log10(stddev(avg_count) + 1) AS std_count,\n experiment_id, variant, gene_idx\n FROM avg_per_cell\n GROUP BY experiment_id, variant, gene_idx\n )\n SELECT variant, list(avg_count ORDER BY gene_idx) AS mean,\n list(std_count ORDER BY gene_idx) AS std,\n FROM avg_per_variant\n GROUP BY experiment_id, variant\n ': Generic SQL query that calculates the average of each element of a 1D array column per cell, then aggregates those averages per variant into log10(mean + 1) and log10(std + 1) columns.

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.HEATMAPS_TO_MAKE_LIST: Specify which subset of heatmaps should be made. Completed_gens heatmap is always made, because it is used to create the other heatmaps, and should not be included here. The order listed here will be the order of the heatmaps in the dashboard plot.

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.MAX_CELL_INDEX = 33: Plot data from generations [MIN_CELL_INDEX, MAX_CELL_INDEX). Note that early generations may not be representative of dynamics due to how they are initialized.

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.STD_DEV_FLAG = True

Standard Deviations Flag

True: Plot an additional copy of all plots with standard deviation displayed instead of the average
False: Plot no additional plots
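The generation constants mix a half-open index range with zero-based indexing, which is easy to misread. The following is a minimal sketch of that arithmetic only. The value of MIN_CELL_INDEX is not listed on this page, so the number used here is an assumption for illustration, and the helper names are hypothetical.

```python
COUNT_INDEX = 32
MAX_CELL_INDEX = 33
MIN_CELL_INDEX = 16  # assumed value for illustration; not documented above


def plotted_generation_indices(min_idx, max_idx):
    """Zero-based generation indices in the half-open range [min_idx, max_idx)."""
    return list(range(min_idx, max_idx))


def generation_number(index):
    """Convert a zero-based generation index to a one-based generation number."""
    return index + 1


indices = plotted_generation_indices(MIN_CELL_INDEX, MAX_CELL_INDEX)
# The last plotted index is MAX_CELL_INDEX - 1 because the range is half-open.
last_plotted_index = indices[-1]
# COUNT_INDEX = 32 therefore refers to the 33rd generation.
counted_generation = generation_number(COUNT_INDEX)
```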

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.avg_1d_array_over_scalar_sql(array_column, scalar_column)[source]

Create generic SQL query that calculates the average per cell of each element in a 1D array column divided by a scalar column, and aggregates those ratios per variant into mean and std columns.

Note

Time steps with 0 in the scalar column are assigned a ratio of 0.

Parameters:

array_column (str) – Name of 1D list column to aggregate
scalar_column (str) – Name of scalar column to divide array_column cell averages by

Return type:

str
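To illustrate the aggregation that the generated query is described as performing, here is a plain-Python sketch of the same arithmetic. It is not the SQL the function returns. The helper names are hypothetical, and the use of the sample standard deviation is an assumption based on the stddev aggregate that appears in GENE_COUNTS_SQL.

```python
from statistics import mean, stdev


def cell_average_ratio(array_rows, scalar_rows):
    """Average array / scalar over time steps for one cell, element by element.

    A time step whose scalar is 0 contributes a ratio of 0 for every element.
    """
    ratios = [
        [value / scalar if scalar != 0 else 0.0 for value in row]
        for row, scalar in zip(array_rows, scalar_rows)
    ]
    # zip(*ratios) groups the ratios of each array element across time steps.
    return [mean(column) for column in zip(*ratios)]


def variant_mean_and_std(cell_averages):
    """Aggregate per-cell averages into per-element mean and sample std."""
    columns = list(zip(*cell_averages))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


# Two cells of one variant; the second time step of cell 1 has scalar 0.
cell_1 = cell_average_ratio([[2, 4], [3, 6], [1, 1]], [2, 0, 1])
cell_2 = cell_average_ratio([[4, 4]], [2])
means, stds = variant_mean_and_std([cell_1, cell_2])
```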

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.avg_1d_array_sql(column)[source]

Create generic SQL query that calculates the average per cell of each element in a 1D array column and aggregates that per variant into mean and std columns.

Parameters:

column (str) – Name of 1D list column to aggregate

Return type:

str

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.avg_ratio_of_1d_arrays_sql(numerator, denominator)[source]

Create generic SQL query that calculates the average per cell of each element in two 1D list columns divided elementwise and aggregates those ratios per variant into mean and std columns.

Note

Time steps with 0 in the denominator are assigned a ratio of 0.

Parameters:

numerator (str) – Name of 1D list column that will be numerator in ratio
denominator (str) – Name of 1D list column that will be denominator in ratio

Return type:

str
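The zero-denominator rule in the note can be sketched in plain Python as follows. This mirrors the described behavior only, not the generated SQL, and the helper names are hypothetical.

```python
def elementwise_ratio(numerator, denominator):
    """Divide two equal-length lists elementwise; a 0 denominator gives 0."""
    return [n / d if d != 0 else 0.0 for n, d in zip(numerator, denominator)]


def cell_average(rows):
    """Average each element position over all time steps of one cell."""
    return [sum(column) / len(rows) for column in zip(*rows)]


# Two time steps for one cell, two elements per list.
numerators = [[2, 3], [4, 0]]
denominators = [[4, 0], [8, 5]]
ratios = [elementwise_ratio(n, d) for n, d in zip(numerators, denominators)]
cell_avg = cell_average(ratios)
```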

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.avg_sum_1d_array_over_scalar_sql(array_column, scalar_column)[source]

Create generic SQL query that calculates the average per cell of the sum of elements in a 1D array column divided by a scalar column, and aggregates those ratios per variant as mean and std columns.

Note

Time steps with 0 in the scalar column are assigned a ratio of 0.

Parameters:

array_column (str) – Name of 1D list column to aggregate
scalar_column (str) – Name of scalar column to divide array_column cell averages by

Return type:

str

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.avg_sum_1d_array_sql(column)[source]

Create generic SQL query that calculates the average per cell of the sum of elements in a 1D array column and aggregates that per variant into mean and std columns.

Parameters:

column (str) – Name of 1D list column to aggregate

Return type:

str

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_gene_count_fraction_sql(gene_indices, column, index_type)[source]

Construct generic SQL query that gets the average per cell of a select set of indices from a 1D list column divided by the total of all elements per row of that list column, and aggregates those ratios per variant into mean and std columns.

Parameters:

gene_indices (list[int] | list[list[int]]) – Indices to extract from 1D list column to get ratios for
column (str) – Name of 1D list column
index_type (str) – Either monomer or mRNA. For monomer, the function works exactly as described above. For mRNA, gene_indices is a list of lists of mRNA indices, because one gene can map to multiple mRNAs (transcription units). The elements corresponding to each gene are therefore summed before proceeding (see get_rnas_combined_as_genes_projection()).

Return type:

str
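The difference between the two index_type modes can be sketched for a single row (one time step) in plain Python. This is an illustration of the described arithmetic, not the generated SQL. The function name is hypothetical, indices are 1-based to match DuckDB lists, and returning 0 for an all-zero row is an assumption.

```python
def gene_fractions(row, gene_indices, index_type):
    """Fraction of the row total contributed by each selected gene."""
    total = sum(row)
    if index_type == "mRNA":
        # Each gene owns a list of mRNA indices whose counts are summed first.
        selected = [sum(row[i - 1] for i in group) for group in gene_indices]
    elif index_type == "monomer":
        selected = [row[i - 1] for i in gene_indices]
    else:
        raise ValueError(f"unknown index_type: {index_type}")
    # Assumption: an all-zero row yields fractions of 0 rather than an error.
    return [s / total if total != 0 else 0.0 for s in selected]


counts = [10, 30, 40, 20]
monomer_fractions = gene_fractions(counts, [2, 4], "monomer")
mrna_fractions = gene_fractions(counts, [[1, 2], [4]], "mRNA")
```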

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_gene_mass_prod_func(sim_data, index_type, gene_ids)[source]

Create a function to be passed as the post_func argument to get_mean_and_std_matrices() which multiplies the average and standard deviation 1D array columns by the mass of the gene ID for each element.

Parameters:

sim_data (SimulationDataEcoli) – Simulation data
index_type (str) – Either mRNA or monomer. If mRNA, gene_ids is list of lists of mRNA IDs, where inner lists correspond to mRNAs for each gene. Therefore, we sum the masses for the mRNAs of each inner list and multiply the input mean and std by this sum per gene.
gene_ids (list[str] | list[list[str]]) – IDs of genes in the order they appear in the 1D arrays of the query result

Return type:

Callable[[DataFrame], DataFrame]
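The closure pattern described here can be sketched without Polars or sim_data. In this illustration a plain dict of lists stands in for the DataFrame, a dict of masses stands in for the values that would come from sim_data, and all names and numbers are hypothetical.

```python
def make_mass_prod_func(mass_by_id, index_type, gene_ids):
    """Build a post-processing function that scales mean and std by gene mass."""
    if index_type == "mRNA":
        # Inner lists hold the mRNA IDs of one gene; their masses are summed.
        gene_masses = [sum(mass_by_id[i] for i in group) for group in gene_ids]
    else:
        gene_masses = [mass_by_id[i] for i in gene_ids]

    def post_func(result):
        # result is a stand-in for the query result: one list per variant.
        def scale(rows):
            return [[v * m for v, m in zip(row, gene_masses)] for row in rows]

        return {
            "variant": result["variant"],
            "mean": scale(result["mean"]),
            "std": scale(result["std"]),
        }

    return post_func


masses = {"rnaA": 2.0, "rnaB": 3.0, "rnaC": 1.5}  # hypothetical masses
post = make_mass_prod_func(masses, "mRNA", [["rnaA", "rnaB"], ["rnaC"]])
scaled = post({"variant": [1], "mean": [[2.0, 4.0]], "std": [[1.0, 2.0]]})
```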

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_indexes(conn, config_sql, index_type, ids)[source]

Retrieve DuckDB indices of a given type for a set of IDs. Note that DuckDB lists are 1-indexed.

Parameters:

conn (DuckDBPyConnection) – DuckDB database connection
config_sql (str) – DuckDB SQL query for sim config data (see dataset_sql())
index_type (str) – Type of indices to return (one of cistron, RNA, mRNA, or monomer)
ids (list[str] | list[list[str]]) – List of IDs to get indices for (must be monomer IDs if index_type is monomer, else mRNA IDs)

Returns:

List of requested indexes

Return type:

list[int | None] | list[list[int | None]]
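Because DuckDB lists are 1-indexed, positions taken from a Python list need an offset of one. The sketch below shows that offset for flat and nested ID lists. The helper is hypothetical and takes the full ID list directly instead of a connection, and returning None for an ID that is not found is an assumption suggested by the return type.

```python
def one_based_indexes(all_ids, ids):
    """Look up 1-based positions of ids within all_ids, keeping nesting."""
    position = {id_: i + 1 for i, id_ in enumerate(all_ids)}

    def lookup(item):
        if isinstance(item, list):
            return [lookup(sub) for sub in item]
        # dict.get returns None when the ID is absent.
        return position.get(item)

    return [lookup(item) for item in ids]


all_mrna_ids = ["rnaA", "rnaB", "rnaC"]  # hypothetical IDs
flat = one_based_indexes(all_mrna_ids, ["rnaC", "rnaX"])
nested = one_based_indexes(all_mrna_ids, [["rnaA", "rnaB"], ["rnaC"]])
```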

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_mRNA_ids_from_monomer_ids(sim_data, target_monomer_ids)[source]

Map monomer IDs back to the mRNA IDs that they were translated from.

Parameters:

target_monomer_ids (list[str]) – IDs of the monomers to map to mRNA IDs
sim_data (SimulationDataEcoli)

Returns:

List of mRNA ID lists, one for each monomer ID

Return type:

list[list[str]]

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_mean_and_std_matrices(conn, variant_mapping, variant_matrix_shape, history_sql, columns, remove_first=False, func=None, order_results=False, success_sql=None, custom_sql=None, post_func=None, num_digits_rounding=None, default_value=None)[source]

Reads one or more columns and calculates mean and std. dev. for each variant. If no custom SQL query is provided, this defaults to averaging per cell, then calculating the averages and standard deviations of all cells per variant.

Parameters:

conn (DuckDBPyConnection) – DuckDB connection
variant_mapping (dict[int, tuple[int, int]]) – Mapping of variant IDs to row and column in matrix of new gene translation efficiency and expression factor variants
variant_matrix_shape (tuple[int, int]) – Number of rows and columns in variant matrix
history_sql (str) – SQL subquery from ecoli.library.parquet_emitter.dataset_sql()
columns (list[str]) – See ecoli.library.parquet_emitter.read_stacked_columns()
remove_first (bool) – See ecoli.library.parquet_emitter.read_stacked_columns()
func (Callable | None) – See ecoli.library.parquet_emitter.read_stacked_columns()
order_results (bool) – See ecoli.library.parquet_emitter.read_stacked_columns()
success_sql (str | None) – See ecoli.library.parquet_emitter.read_stacked_columns()
custom_sql (str | None) – SQL string containing a placeholder with name subquery where the result of read_stacked_columns will be placed. Final query result must only have two columns in order: variant and a value for each variant. If not provided, defaults to average of averages
post_func (Callable | None) – Function that is called on Polars DataFrame resulting from query. Should return a Polars DataFrame with exactly three columns: variant for the variant IDs, mean for some mean aggregate value (can be N-D list column), and std for some standard deviation aggregate.
num_digits_rounding (int | None) – Number of decimal places to round to
default_value (Any | None) – Default value to put in output variant matrices if variant ID not included in query result (e.g. if variant failed in first generation and had no completed sims)
new_gene_NTP_fraction – Set to True for NTP fraction heatmap so query output is properly handled

Returns:

Tuple of Numpy matrices with first two dimensions variant_matrix_shape. Each cell in first matrix has the mean for that variant. Each cell in the second matrix has the std. dev. for that variant. These values can be Numpy arrays instead of scalar values (e.g. when calculating aggregates for many genes at once), in which case the matrices have shapes variant_matrix_shape + (num_genes,)

Return type:

tuple[ndarray, ndarray]
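The final step, placing per-variant values into a matrix using variant_mapping and default_value, can be sketched with nested lists. This is an illustration only: the helper name is hypothetical and the real function returns NumPy matrices.

```python
def fill_variant_matrix(values_by_variant, variant_mapping, shape, default_value=None):
    """Place each variant's value at its (row, column) position."""
    n_rows, n_cols = shape
    matrix = [[default_value] * n_cols for _ in range(n_rows)]
    for variant, (row, col) in variant_mapping.items():
        # Variants missing from the query result keep the default value.
        if variant in values_by_variant:
            matrix[row][col] = values_by_variant[variant]
    return matrix


mapping = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
query_means = {0: 42.5, 1: 40.1, 3: 55.0}  # variant 2 had no completed sims
mean_matrix = fill_variant_matrix(query_means, mapping, (2, 2), default_value=-1)
```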

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_new_gene_mRNA_NTP_fraction_sql(sim_data, new_gene_mRNA_idx, ntp_ids)[source]

Construct SQL query that gets, for each NTP, the fraction used by the mRNAs of each new gene, averages that per cell, and aggregates those fractions per variant into mean and std columns where each row is a 2D list with shape (# NTPs, # new genes).

Parameters:

sim_data (SimulationDataEcoli) – Simulation data
new_gene_mRNA_idx (list[list[int]]) – List of lists of mRNA indices for each new gene
ntp_ids (list[str]) – IDs for NTPs in same order that they appear in sim_data.process.transcription.rna_data["counts_ACGU"]

Return type:

str

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_overcrowding_sql(target_col, actual_col)[source]

Create generic SQL query that calculates the average number of genes that are overcrowded per time step for each cell, then aggregates that per variant into mean and std columns.

At every time step, if the element in target_col is greater than the corresponding element in actual_col, we say that the gene for that element is overcrowded. We average the number of overcrowded genes over all the time steps for each cell. Then, we average the per-cell averages over all cells in each variant.

Parameters:

target_col (str) – Name of 1D list column with target values
actual_col (str) – Name of 1D list column with actual values.

Return type:

str
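The overcrowding rule (target greater than actual) and the two levels of averaging can be sketched in plain Python. This mirrors the description above, not the generated SQL, and the helper names and numbers are hypothetical.

```python
def overcrowded_count(target, actual):
    """Number of genes whose target value exceeds the actual value."""
    return sum(t > a for t, a in zip(target, actual))


def cell_average_overcrowded(target_rows, actual_rows):
    """Average the overcrowded-gene count over all time steps of one cell."""
    counts = [overcrowded_count(t, a) for t, a in zip(target_rows, actual_rows)]
    return sum(counts) / len(counts)


# One cell, two time steps, three genes.
targets = [[0.5, 0.2, 0.1], [0.5, 0.2, 0.3]]
actuals = [[0.4, 0.2, 0.1], [0.4, 0.1, 0.3]]
cell_avg = cell_average_overcrowded(targets, actuals)

# The variant-level mean is the average of the per-cell averages.
variant_mean = sum([cell_avg, 0.5]) / 2
```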

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_ribosome_counts_projection(sim_data, bulk_ids)[source]

Return SQL projection to selectively read the bulk inactive ribosome count (defined as the minimum of free 30S and 50S subunits at any given moment).

Parameters:

sim_data (SimulationDataEcoli) – Simulation data
bulk_ids (list[str]) – List of all bulk IDs in order

Return type:

str
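The definition of the inactive ribosome count can be shown for a single time step in plain Python. This is a sketch of the definition only, not the SQL projection. The bulk IDs below are hypothetical placeholders, and in practice the subunit IDs would come from sim_data.

```python
def inactive_ribosome_count(bulk_counts, bulk_ids, s30_id, s50_id):
    """Inactive ribosomes are limited by the scarcer free subunit."""
    free_30s = bulk_counts[bulk_ids.index(s30_id)]
    free_50s = bulk_counts[bulk_ids.index(s50_id)]
    return min(free_30s, free_50s)


bulk_ids = ["WATER[c]", "S30_PLACEHOLDER[c]", "S50_PLACEHOLDER[c]"]  # hypothetical
bulk_counts = [1000, 120, 95]
inactive = inactive_ribosome_count(
    bulk_counts, bulk_ids, "S30_PLACEHOLDER[c]", "S50_PLACEHOLDER[c]"
)
```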

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_rnap_counts_projection(sim_data, bulk_ids)[source]

Return SQL projection to selectively read bulk inactive RNAP count.

Parameters:

sim_data (SimulationDataEcoli) – Simulation data
bulk_ids (list[str]) – List of all bulk IDs in order

Return type:

str

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_rnas_combined_as_genes_projection(column, rna_idx, name, cast_type=None)[source]

Create generic SQL projection that evaluates to a list column where each element is the sum of a subset of elements from the original list column. This is mainly used to sum up all RNA data that corresponds to a single gene / cistron / monomer.

Parameters:

column (str)
rna_idx (list[list[int]])
name (str)
cast_type (str | None)
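One way such a projection could be written is sketched below. This is not the library's implementation: the function name and output format are hypothetical, and the sketch assumes 1-based DuckDB list indexing with bracket notation and non-empty index groups.

```python
def combined_projection(column, rna_idx, name, cast_type=None):
    """Build a SQL list expression with one summed element per index group."""
    sums = []
    for group in rna_idx:
        terms = [f"{column}[{i}]" for i in group]
        if cast_type is not None:
            # Cast each term before summing, e.g. to avoid integer overflow.
            terms = [f"CAST({t} AS {cast_type})" for t in terms]
        sums.append(" + ".join(terms))
    return f"[{', '.join(sums)}] AS {name}"


projection = combined_projection("mRNA_counts", [[1, 2], [4]], "gene_counts")
```

With these inputs the first output element sums entries 1 and 2 of mRNA_counts and the second is entry 4 alone.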

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.get_variant_mask(conn, config_sql, variant_to_row_col, variant_matrix_shape)[source]

Get a boolean matrix where the rows represent the different translation efficiencies and the columns represent the different expression factors that were used to create variants. The matrix is True for each combination that was actually simulated and False otherwise.

Parameters:

conn (DuckDBPyConnection)
config_sql (str)
variant_to_row_col (dict[int, tuple[int, int]])
variant_matrix_shape (tuple[int, int])

Return type:

ndarray[tuple[int, …], dtype[bool]]
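The mask can be illustrated with nested lists. In this sketch the set of simulated variant IDs is passed in directly, whereas the real function reads it through conn and config_sql. The helper name is hypothetical.

```python
def build_variant_mask(simulated_variants, variant_to_row_col, shape):
    """Boolean matrix that is True where a variant was actually simulated."""
    n_rows, n_cols = shape
    mask = [[False] * n_cols for _ in range(n_rows)]
    for variant in simulated_variants:
        row, col = variant_to_row_col[variant]
        mask[row][col] = True
    return mask


# Rows are translation efficiencies, columns are expression factors.
mapping = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
mask = build_variant_mask({0, 3}, mapping, (2, 2))
```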

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.plot(params, conn, history_sql, config_sql, success_sql, sim_data_dict, validation_data_paths, outdir, variant_metadata, variant_names)[source]

Create either a single multi-heatmap plot or 1+ separate heatmaps of data for a grid of new gene variant simulations with varying expression and translation efficiencies.

Params (override corresponding hard-coded global variables): font_size, dashboard_flag, std_dev_flag, count_index, min_cell_index, max_cell_index

Parameters:

params (dict[str, Any])
conn (DuckDBPyConnection)
history_sql (str)
config_sql (str)
success_sql (str)
sim_data_dict (dict[str, dict[int, str]])
validation_data_paths (list[str])
outdir (str)
variant_metadata (dict[str, dict[int, Any]])
variant_names (dict[str, str])
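The way params entries override the module-level defaults can be sketched as a dictionary merge. This is an assumption about the pattern, not the function's actual code. The helper name is hypothetical, and min_cell_index is left out because its default is not listed on this page.

```python
# Defaults copied from the module constants documented above.
DEFAULTS = {
    "font_size": 9,
    "dashboard_flag": 2,
    "std_dev_flag": True,
    "count_index": 32,
    "max_cell_index": 33,
}


def resolve_settings(params):
    """Return the defaults with any recognized params entries applied on top."""
    overrides = {key: value for key, value in params.items() if key in DEFAULTS}
    return {**DEFAULTS, **overrides}


settings = resolve_settings({"font_size": 12, "std_dev_flag": False})
```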

ecoli.analysis.multivariant.new_gene_translation_efficiency_heatmaps.plot_heatmaps(heatmap_data, heatmap_details, new_gene_cistron_ids, ntp_ids, capacity_gene_common_names, total_heatmaps_to_make, is_dashboard, variant_mask, heatmap_x_label, heatmap_y_label, new_gene_expression_factors, new_gene_translation_efficiency_values, summary_statistic, figsize_x, figsize_y, plotOutDir, plot_suffix)[source]

Plots all heatmaps in order given by HEATMAPS_TO_MAKE_LIST.

Parameters:

is_dashboard – Boolean flag for whether we are creating a dashboard of heatmaps or a number of individual heatmaps
variant_mask – np.array of dimension (len(new_gene_translation_efficiency_values), len(new_gene_expression_factors)) with entries set to True if variant was run, False otherwise.
heatmap_x_label – Label for x axis of heatmap
heatmap_y_label – Label for y axis of heatmap
new_gene_expression_factors – New gene expression factors used in these variants
new_gene_translation_efficiency_values – New gene translation efficiency values used in these variants
summary_statistic – Specifies whether average (mean) or standard deviation (std_dev) should be displayed on the heatmaps
figsize_x – Horizontal size of each heatmap
figsize_y – Vertical size of each heatmap
plotOutDir – Output directory for plots
plot_suffix – Suffix to add to plot file names, usually specifying which generations were plotted
