Workflows
A typical simulation workflow has four main steps:
Run the parameter calculator (
runscripts.parca
) to generate simulation parameters from raw data.Create one or more variants (
runscripts.create_variants
) of the simulation parameters.Simulate cells (
ecoli.experiments.ecoli_master_sim
).Aggregate simulation results with analysis scripts (
runscripts.analysis
).
While each of these steps can be run manually by invoking their associated scripts,
it is common to run them in an automatically coordinated Nextflow workflow
using runscripts.workflow
.
As mentioned in JSON Config Files, the preferred method for supplying configuration
options to runscripts is via a JSON configuration file specified using the --config
command-line option. Please check the configuration JSON located at
default_config_path
for the most up-to-date default configuration options.
Note
Remember that when creating your own JSON configuration file, you only need to include the configuration options whose values are different from the defaults.
Below, we explore each runscript in further detail, including the configuration options unique to each.
ParCa
The parameter calculator (or ParCa) is a Python script that performs certain computations
on raw, curated experimental data (located in the reconstruction/ecoli/flat
folder)
to generate the parameters expected by processes in our model. It packages these parameters
in a pickled SimulationDataEcoli
object
whose path must be given via the sim_data_path
configuration option to all runscripts
in runscripts/
and to experiment modules in ecoli/experiments
(default used by
runscripts.workflow
is ecoli_master_sim
).
The code responsible for loading data from the raw flat files is contained in
KnowledgeBaseEcoli
. The actual logic
of the ParCa is mostly contained within a single file: fit_sim_data_1
.
The main interface for running the ParCa is runscripts.parca
.
Configuration
Configuration options for the ParCa are all located in a dictionary under the
parca_options
key inside a JSON configuration file. They include:
cpus
: Number of CPU cores to parallelize parts of the ParCa overoutdir
: Path to directory (local or absolute) for output pickle filesoperons
: If True, calculate parameters with operon gene structureribosome_fitting
: If False, ribosome expression is not fit to protein synthesis demandsrnapoly_fitting
: If False, RNA polymerase expression is not fit to protein synthesis demandsremove_rrna_operons
: If True, do not include rRNAs in operon gene structures.remove_rrff
: If True, remove rrfF gene that encodes for the extra 5S rRNA in the rrnD operonstable_rrna
: If True, set degradation rates of mature rRNAs to the values calculated from the half-life in sim_data.constants. If False, set degradation rates of mature rRNAs to the average reported degradation rates of mRNAs.new_genes
: String folder name inreconstruction/ecoli/flat/new_gene_data
containing necessary flat files to add new gene(s) to the model (see templates inreconstruction/ecoli/flat/new_gene_data/template
). By default,off
does nothing (no new genes).debug_parca
: If True, fit only one arbitrarily-chosen transcription factor in order to speed up a debug cycle.save_intermediates
: Save intermediate pickle files for each major step of the ParCa (initialize()
,input_adjustments()
,basal_specs()
,tf_condition_specs()
,fit_condition()
,promoter_binding()
,adjust_promoters()
,set_conditions()
,final_adjustments()
).intermediates_directory
: Path to folder where intermediate pickle files should be saved or loaded.load_intermediate
: The function name of the ParCa step to load sim_data and cell_specs from; functions prior to and including this one will be skipped but all subsequent functions will run. Can only be used if all ParCa steps up to and including named step were previously run successfully withsave_intermediates
set to True.variable_elongation_transcription
: If True, enable variable elongation for transcription.variable_elongation_translation
: If True, enable variable elongation for translation.
Note
If the top-level sim_data_path
option is not null, the ParCa is skipped
in favor of the pickled simulation data at the specified path. This applies
regardless of whether running with runscripts.workflow
or
runscripts.parca
.
Variants
In many cases, we would like to use the model to answer biological questions that
require running the model with different parameters. For example, we may want to
see how a cell responds differently when grown in different media conditions.
Since most process parameters in our model come from the pickled
SimulationDataEcoli
generated by
the ParCa, we need an easy way to modify this object. The
runscripts.create_variants
script was designed for this purpose.
Template
In essence, this script runs a “variant function” with one or more input
parameter combinations, with each invocation independently modifying the
baseline SimulationDataEcoli
object in some way. Variant functions are contained within Python files
located in the ecoli.variants
folder. They all have an apply_variant
function that follows the following template:
from typing import Any, TYPE_CHECKING
# This if statement prevents Python from unnecessarily importing this
# object when it is only needed for type hinting
if TYPE_CHECKING:
from reconstruction.ecoli.simulation_data import SimulationDataEcoli
def apply_variant(
sim_data: "SimulationDataEcoli", params: dict[str, Any]
) -> "SimulationDataEcoli":
"""
Modify sim_data using parameters from params dictionary.
Args:
sim_data: Simulation data to modify
params: Parameter dictionary of the following format::
{
"{name of parameter}": {type of parameter},
...
}
Returns:
Simulation data with the following attributes modified::
{attributes of sim_data that this function changes}
"""
# Modify sim_data as you see fit using params. Following is example
sim_data.attribute = params["param_1"]
return sim_data
Configuration
When running runscripts.create_variants
, users must specify the
variant function to use under the variants
key in the configuration JSON
following the general template:
{
"variants": {
"{name of variant function}": {
{variant function parameters}
}
}
}
The name of each variant function is the name of the file containing its
apply_variant
function. For example, to use the variant function
ecoli.variants.new_gene_internal_shift
, provide the name
new_gene_internal_shift
. If the variants
key points to an
empty dictionary (no variants), then only the only “variant” saved
by runscripts.create_variants
is the unmodified simulation
data object. Thus, when running a workflow with runscripts.workflow
,
at least one lineage of cells will always be run with the baseline
sim_data
. To avoid this (e.g. when running many batches of simulations
with the same variant function), set the top-level skip_baseline
option
to True
.
Warning
Only one variant function is supported at a time.
If you would like to modify the simulation data object using multiple
variant functions, create a new variant function that invokes the desired
combination of apply_variants
methods from other variant functions.
The format of the variant function parameters is described in
parse_variants()
. By using the
op
key, you can concisely generate a large array of parameter
combinations, each of which results in the creation of a variant of the
SimulationDataEcoli
object.
When manually running runscripts.create_variants
(as opposed to
running runscripts.workflow
), the configuration file must also include:
Top-level (not under
variants
key)outdir
option: path to directory in which to save variant simulation data objects as pickle filesTop-level (not under
variants
key)kb
option: path to directory containing ParCa output pickle files
Output
The generated variant simulation data objects are pickled and saved in the
directory given in the outdir
key of the configuration JSON.
They all have file names of the format {index}.cPickle
, where
index is an integer. If the top-level skip_baseline
option is not set
to True
, the unmodified simulation data object is always
saved as 0.cPickle
. Otherwise, the 0 index is skipped. The identity of
the other indices can be determined by referencing the metadata.json
file that is also saved in outdir
. This JSON maps the variant function
name to a mapping from each index to the exact parameter
dictionary passed to the variant function to create the
variant simulation data saved with that index as its file name. See
apply_and_save_variants()
for
more details.
Simulations
Refer to Experiments for more information about the main
script for running single-cell simulations,
ecoli_master_sim
.
Analyses
The runscripts.analysis
script is the main interface for running
analyses on simulation output data. Importantly, to use this interface,
simulations must be run (whether with ecoli_master_sim
or runscripts.workflow
) with the emitter
option set to parquet
and an output directory set using the out_dir
or out_uri
key under the
emitter_arg
option (see JSON Config Files). This tells vivarium-core to use
ParquetEmitter
to save the simulation
output as Hive partitioned Parquet files. See Output for more
details on Parquet and DuckDB, the primary library used to interact with the
saved files.
Analysis scripts must be one of the following types and placed into the corresponding folder:
single
: Limited to data for a single simulated cellmultidaughter
: Limited to data for daughter cell(s) of a single mother cellmultigeneration
: Limited to data for all cells across many generations for a given initial seed, variant simulation data object, and workflow run (same experiment ID, see Experiments)multiseed
: Limited to data for all cells across many generations and initial seeds for a given variant simulation data object and workflow runmultivariant
: Limited to data for all cells across many generations, initial seeds, and variant simulation data objects for a given workflow runmultiexperiment
: Limited to data for all cells across many generations, initial seeds, variant simulation data objects, and workflow runs
Note
These categories represent upper bounds on the data that can be accessed.
A multiseed
analysis, for example, can choose to only read data
from cells between generations 4 and 8 from the cells with the same
experiment ID, variant simulation data object, and initial seed that
it has access to.
Tip
If you would like to use an analysis script with many different scopes,
instead of duplicating the entire script in each analysis type
folder, you can just create stub files in the appropriate folders
that simply import the plot
function from a primary analysis script.
Configuration
The runscripts.analysis
script accepts the following configuration
options under the analysis_options
key:
single
,multidaughter
,multigeneration
,multiseed
,multivariant
multiexperiment
: Can pick one or more analysis types to run. Under each analysis type is a sub-dictionary of the following format:{ "{analysis name}": {optional dictionary of analysis parameters}, # Example: "mass_fraction_summary": {"font_size": 12} }
The name of an analysis is simply its file name without the
.py
extension.experiment_id
,variant
,lineage_seed
,generation
,agent_id
: List of experiment IDs, variant indices, etc. to filter data to before running analyses. Note that experiment IDs and agent IDs are strings while the rest are integers.experiment_id
is required while the others are optional. If not provided,runscripts.analysis
simply does not filter data by variant indices, initial seeds, etc. before running analyses.variant_range
,lineage_seed_range
,generation_range
: List of length 2 where the first element is the start and the second element is the end (exclusive) of a range of variant indices, initial seeds, or generations to filter data to before running analyses. Overrides corresponding non-range options.sim_data_path
: List of string paths to simulation data pickle files. If multiple variants are given viavariant
orvariant_range
, you must provide same number of paths in the same order using this option. This option is mainly meant for internal use. For a simpler alternative that also works if multiple experiment IDs are given withexperiment_id
(variant indices may correspond to completely different variant simulation data objects in different workflow runs), seevariant_data_dir
.variant_metadata_path
: String path tometadata.json
file saved byrunscripts.create_variants
(see Output). This option is mainly intended for internal use. For a simpler alternative that also works if multiple experiment IDs are given viaexperiment_id
, seevariant_data_dir
.variant_data_dir
: List of string paths to one or more directories containing variant simulation data pickles and metadata saved byrunscripts.create_variants
. Must provide one path for each experiment ID inexperiment_ID
and in the same order.validation_data_path
: List of string paths to validation data pickle files (generated by ParCa). Can pass any number of paths in any order and they will be passed as is to analysis scriptplot
functions.outdir
: Local (relative or absolute) path to directory that serves as a prefix to theoutdir
argument for analysis scriptplot
functions (see Template). A copy of the configuration options used to runrunscripts.analysis
is saved asoutdir/metadata.json
.n_cpus
: Number of CPU cores to let DuckDB useanalysis_types
: List of analysis types to run. By default (if this option is not used), all analyses provided under all the analysis type keys are run on all possible subsets of the data after applying the data filters given usingexperiment_id
,variant
, etc. For example, say 2 experiment IDs are given withexperiment_id
, 2 variants withvariant
, 2 seeds withlineage_seed
, and 2 generations withgeneration
. Assuming no simulations failed andsingle_daughter
was set to True, analyses under themultiexperiment
key (if any) will each run once with all data passing this filter.multivariant
analyses will each run twice, first with filtered data for one experiment ID then with filtered data for the other.multiseed
analyses will each run 4 times (2 exp IDs * 2 variants),multigeneration
analyses 8 times (4 * 2 seeds),multidaughter
analyses 16 times (8 * 2 generations), andsingle
analyses 16 times. If you only want to run thesingle
andmultivariant
analyses, specify["single", "multivariant"]
using this option.
Template
All analysis scripts must contain a plot
function with the following signature:
from typing import Any, TYPE_CHECKING
if TYPE_CHECKING:
from duckdb import DuckDBPyConnection
def plot(
params: dict[str, Any],
conn: "DuckDBPyConnection",
history_sql: str,
config_sql: str,
sim_data_paths: dict[str, dict[int, str]],
validation_data_paths: list[str],
outdir: str,
variant_metadata: dict[str, dict[int, Any]],
variant_names: dict[str, str],
):
"""
Args:
params: Dictionary of parameters given under analysis
name in configuration JSON.
conn: DuckDB database connection, automatically created
by runscripts/analysis.py with appropriate settings.
history_sql: DuckDB SQL query that filters simulation
output data to subset appropriate for analysis type
(e.g. single cell for ``single`` analyses).
config_sql: DuckDB SQL query that filters simulation
config data to subset appropriate for analyis type.
sim_data_paths: Mapping from experiment IDs to mapping
from variant indices to variant simulation data
pickle paths. Generated by runscripts/analysis.py
either using:
- Combination of ``sim_data_path``, ``variant``,
``variant_metadata_path``, and ``experiment_id``
configuration options
- Traversing directories in ``variant_data_dir`` and
matching discovered variants with experiment IDs
given in ``experiment_id`` (preferred route for
most use cases)
validation_data_paths: List of validation data pickle
paths taken directly from ``validation_data_path``
configuration option.
outdir: String path equal to ``outdir`` configuration option
prepended to Hive partitioned directory representing data
filters applied to ``history_sql``. For example, a ``single``
analysis script run on data for experiment ID "test",
variant index 1, lineage seed 3, generation 2, and agent ID
"00" will get: ``{outdir}/experiment_id=test/variant=1/
lineage_seed=3/generation=2/agent_id=00``. By convention,
analysis scripts should save their outputs in this folder.
variant_metadata: Mapping from experiment IDs to mapping
from variant indices to parameters used to create variant
simulation data object. Generated by runscripts/analysis.py
in one of the same two methods used for sim_data_paths.
variant_names: Mapping from experiment IDs to name of variant
function used to generate variant simulation data objects
for workflow run. Generated by runscripts/analysis.py
in one of the same two methods used for sim_data_paths.
"""
Refer to Output for more information about how to use DuckDB to read and analyze simulation output inside analysis scripts.
Workflows
Nextflow is a piece of software that abstracts the
complexity of orchestrating complex workflows on a variety of supported
platforms, including personal computers, computing clusters, and even cloud
computing services. runscripts.workflow
uses the template Nextflow
workflow scripts located in the runscripts/nextflow
folder along with
an input configuration JSON to create and run a complete workflow with all of
the steps described above.
Configuration
All the previously covered configuration options also apply to the configuration
JSON supplied to runscripts.workflow
. Those options govern the behavior
of the corresponding step in the workflow. For example, running
runscripts.workflow
with n_cpus
under parca_options
set to 4 will start the workflow by running the ParCa with 4 CPUs.
After creating some number of variant simulation data objects with
runscripts.create_variants
, the workflow automatically
starts at least one cell simulation for each variant using
ecoli_master_sim
. The exact number
of simulations started per variant is configured by the following
top-level configuration options:
n_init_sims
: Number of replicate simulations to run for each variant, where replicates differ in the initial seed used to initialize themlineage_seed
: Each integer in the half-open interval[lineage_seed, lineage_seed + n_init_sims)
is used to initialize the first generation of a lineage, where a lineage is defined as a group of cell simulations with the same first generation initial seed (called a lineage seed) and variant simulation data objectgenerations
: Integer number of generations to run each cell lineagesingle_daughters
: If False, simulates both daughter cells (append0
to mother agent ID to get one daughter agent ID and1
to get other) after cell division. Otherwise, continue lineage with one arbitrary daughter cell state (append0
to mother agent ID to get daughter agent ID)
This means that if a workflow is run with n_init_sims
set to 4, generations
set to 10, single_daughters
set to True, and variant_options
configured to create 4 different variant simulation data objects (5 including
baseline, unmodified 0.cPickle
, see Output),
4 * 10 * 1 * 5 = 200
total simulations will run. This is assuming no lineages fail
before reaching 10 generations due to fail_at_total_time
(see JSON Config Files)
or some other uncaught exception.
Unlike when running runscripts.analysis
manually, the configuration JSON
supplied to runscripts.workflow
only needs to provide the names and
parameters for analysis scripts to run using the analysis type options (e.g.
single
, multivariant
, etc.) and can omit the other options documented
in Configuration. This is because the Nextflow workflow is configured
to automatically pass the other required parameters like the paths to the variant
simulation data pickles created by runscripts.create_variants
earlier
in the worklow.
Sherlock
Note
The following information is intended for members of the Covert Lab only.
After cloning the model repository to your home directory, skip the other steps
in the README until reaching the instructions to install Nextflow. After installing
Nextflow in your home directory, add the following lines to your ~/.bash_profile
,
then close and reopen your ssh connection:
# Legacy environment variables so old scripts work
export PI_HOME=$GROUP_HOME
export PI_SCRATCH=$GROUP_SCRATCH
# Load group-wide settings
if [ -f "${PI_HOME}/etc/bash_profile" ]; then
. "${PI_HOME}/etc/bash_profile"
fi
# Environment variable required by pyenv
export PYENV_ROOT="${PI_HOME}/pyenv"
# Environment modules used by vEcoli
module load system git/2.45.1 parallel
module load wcEcoli/python3
# Need Java for nextflow
module load java/18.0.2
export PYTHONPATH="$HOME/vEcoli"
if [ -d "${PYENV_ROOT}" ]; then
export PATH="${PYENV_ROOT}/bin:${PATH}"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
fi
export PATH=$PATH:$HOME/.local/bin
# Use one thread for OpenBLAS (better performance and reproducibility)
export OMP_NUM_THREADS=1
Finally, inside the cloned repository, run pyenv local viv-ecoli
to load the Python virtual environment with all required packages installed.
For convenience, runscripts.workflow
accepts a boolean top-level
configuration option sherlock
. If set to True, runscripts.workflow
can be run on a login node to automatically submit a job that will run the
Nextflow workflow orchestrator with a 7-day time limit on the lab’s dedicated
partition (job should start fairly quickly and never get preempted by other
users). The workflow orchestrator will automatically submit jobs for each step
in the workflow: one for the ParCa, one to create variants, one for each cell,
and one for each analysis.
Importantly, the emitter output directory (see description of emitter_arg
in JSON Config Files) should be an absolute path to somewhere in your
$SCRATCH
directory (e.g. /scratch/users/{username}/out
). The path must
be absolute because Nextflow does not resolve environment variables like
$SCRATCH
in paths.
Warning
Running the workflow on Sherlock sets a 2 hour limit on all jobs in the
workflow, including analyses. Analysis scripts that take more than
2 hours to run should be excluded from workflow configurations and manually
run using runscripts.analysis
afterwards.
Progress
There are three main ways to monitor a workflow’s progress.
Check the command-line output of the Nextflow orchestrator. On a personal computer, Nextflow will periodically print its progress to the command line. On Sherlock, this output is written to the
slurm-{job ID}.out
file in the directory you started the workflow from.Open the file named
trace-{start timestamp}.txt
in the directory you started the workflow from. This file contains useful information about completed processes.Open the file named
.nextflow.log
in the directory you started the workflow from. This is a fairly verbose and technical log that may be useful for debugging purposes.
Fault Tolerance
Nextflow workflows can be configured to be highly fault tolerant. The following is a list workflow behaviors enabled in our model to handle unexpected errors.
When running on Sherlock, jobs that fail with exit codes 140 (hit job limits for RAM or runtime) or 143 (job was preempted by another user) are automatically retried up to a maximum of 3 tries. For the resource limit error code (140), Nextflow will automatically request more RAM and a higher runtime limit with each attempt:
4 * {attempt num}
GB of memory and2 * {attempt num}
hours of runtime. See thesherlock
profile inrunscripts/nextflow/config.template
.Additionally, some jobs may fail on Sherlock due to issues submitting them to the SLURM scheduler. Nextflow was configured to limit the rate of job sumission and job queue polling to keep these failures to a minimum. Furthermore, jobs that fail to submit are automatically retried with a relatively long 5 minute delay to hopefully avoid any transient scheduler issues.
Jobs that fail for any reason other than the Sherlock reasons described above are ignored. This is mainly to allow a workflow to finish running all programmed cell simulations even if some cells fail, terminating their corresponding lineages. For example, if generation 6 for a given initial seed and variant simulation data failed, then generation 7+ for that lineage cannot run but the lineages for different initial seeds and/or variant simulation data can still continue to run.
If you realize that a code issue is the cause of job failure(s), stop the workflow run if it is not already (e.g.
control + c
,scancel
, etc.), make the necessary code fixes, and rerunrunscripts.workflow
with the same configuration JSON and the--resume
command-line argument. Nextflow will intelligently resume workflow execution from the last successful job in each chain of job dependencies (e.g. generation 7 of a cell lineage depends on generation 6,runscripts.create_variants
depends onrunscripts.parca
, etc).
Output
A completed workflow will have the following directory structure underneath
the output directory specified via out_dir
or out_uri
under the
emitter_arg
option in the configuration JSON (see JSON Config Files):
{experiment ID}
: Folder in output directory named the experiment ID for the workflow. Allows many workflows to use the same output directory without overwriting data as long as they have different experiment IDs.history
: Hive-partitioned Parquet files of simulation output. See Output.configuration
: Hive-partitioned Parquet files of simulation configs. See Output.parca
: Pickle files saved byrunscripts.parca
.simData.cPickle
:SimulationDataEcoli
validationData.cPickle
:ValidationDataEcoli
rawData.cPickle
:KnowledgeBaseEcoli
rawValidationData.cPickle
:ValidationDataRawEcoli
variant_sim_data
: Output ofrunscripts.create_variants
.0.cPickle
: Unmodified, baseline simulation data object.metadata.json
: Mapping from variant function name to mapping from variant indices to parameter dictionaries used to create them.1.cPickle
,2.cPickle
, etc: Variant simulation data objects, if any.
daughter_states
: Hive-partitioned (with experiment ID partition omitted because all files are for the same experiment ID) directory structure containing daughter cell initial states as JSON files.analysis
: Hive-partitioned (with experiment ID partition omitted because all files are for the same experiment ID) directory structure containing output of analysis scripts in folder namedplot
at the level corresponding to the analysis type (e.g.multigeneration
analysis output will be in sub-folders of the formatvariant={}/lineage_seed={}/plot
). Eachplot
folder also contains ametadata.json
file with the configuration. options passed torunscripts.analysis
for the output in that folder.nextflow
: Nextflow-related files.main.nf
: Nextflow workflow script.nextflow.config
: Nextflow workflow configuration.{experiment ID}_report.html
: Contains detailed information about workflow run. Also serves to prevent users from accidentally running another workflow with the same experiment ID and overwriting data. If a user wishes to do so, they must first rename, move, or delete this file.workflow_config.json
: Configuration JSON passed torunscripts.workflow
.nextflow_workdirs
: Contains all working directories for Nextflow jobs. Required for resume functionality described in Fault Tolerance. Can also go to work directory for a job (consult files described in Progress or{experiment ID}_report.html
) and runbash .command.sh
with breakpoints set in the relevant code (import ipdb; ipdb.set_trace()
) for debugging.