HPC Clusters
vEcoli uses Nextflow and Apptainer containers to run on high-performance computing (HPC) clusters. For users with access to the Covert Lab’s partition on Sherlock, follow the instructions in the Sherlock section. For users looking to run the model on other HPC clusters, follow the instructions in the Other Clusters section.
To speed up HPC workflows, vEcoli supports the HyperQueue executor. See HyperQueue for more information.
Sherlock
On Sherlock, once a workflow is started with runscripts.workflow, runscripts/container/build-image.sh builds an Apptainer image with a minimal snapshot of your cloned repository. Nextflow starts containers using this image to run the steps of the workflow. To run or interact with the model outside of runscripts.workflow, start an interactive container by following the steps in Interactive Container.
Setup
Note
The following setup applies to members of the Covert Lab only.
After cloning the model repository to your home directory, add the following lines to your ~/.bash_profile, then close and reopen your SSH connection:
# Load newer Git, Java (for Nextflow), and PyArrow
module load system git java/21.0.4 py-pyarrow
# Include shared Nextflow and HyperQueue installations on PATH
export PATH=$PATH:$GROUP_HOME/vEcoli_env
Warning
If you have any lines in your ~/.bash_profile, ~/.bashrc, or ~/.profile that modify the PATH variable, make sure that which python3 returns the Sherlock-managed Python 3 from the py-pyarrow module: /share/software/user/open/python/3.xx.x/bin/python3. If not, comment out the lines that modify the PATH variable (for example, pyenv-related lines) until the right Python is found.
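To double-check before continuing, you can confirm which Python is found first on your PATH (a quick sanity check; the exact version in the printed path will differ):
module load system git java/21.0.4 py-pyarrow   # no-op if already loaded via ~/.bash_profile
which python3   # should print something like /share/software/user/open/python/3.xx.x/bin/python3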
Then, run the following to test your setup:
python3 runscripts/workflow.py --config ecoli/composites/ecoli_configs/test_sherlock.json
This will run a small workflow that:
Builds an Apptainer image with a snapshot of your cloned repository.
Runs the ParCa.
Runs one simulation.
Runs the mass fraction analysis.
All output files will be saved to a test_sherlock directory in your cloned repository. You can modify the workflow output directory by changing the out_dir option under emitter_arg in the config JSON.
See Configuration for a description of the Sherlock-specific
configuration options and Running Workflows for details about running
a workflow on Sherlock.
To run scripts on Sherlock outside a workflow, see Interactive Container. To run scripts on Sherlock through a SLURM batch script, see Non-Interactive Container.
Note
The above setup is sufficient to run workflows on Sherlock. However, if you
have a compelling reason to update the shared Nextflow or HyperQueue binaries,
navigate to $GROUP_HOME/vEcoli_env and run:
Nextflow:
NXF_EDGE=1 nextflow self-update
HyperQueue: See HyperQueue.
Configuration
To tell vEcoli that you are running on Sherlock, you MUST include the following keys in your configuration JSON (note the top-level sherlock key):
{
"sherlock": {
# Boolean, whether to build a fresh Apptainer image. If files that are
# not excluded by .dockerignore did not change since your last build,
# you can set this to false to skip building the image.
"build_image": true,
# Path (relative or absolute, including file name) of Apptainer image to
# build (or use directly, if build_image is false)
"container_image": "",
# Boolean, whether to use HyperQueue executor for simulation jobs
# (see HyperQueue section below)
"hyperqueue": true,
# Boolean, denotes that a workflow is being run as part of Jenkins
# continuous integration testing. Randomizes the initial seed and
# ensures that all STDOUT and STDERR is piped to the launching process
# so they can be reported by GitHub
"jenkins": false
}
}
In addition to these options, you MUST set the emitter output directory (see description of emitter_arg in JSON Config Files) to a path with enough space to store your workflow outputs. We recommend setting this to a location in your $SCRATCH directory (e.g. /scratch/users/{username}/out).
Warning
~ and environment variables like $SCRATCH are not expanded in the configuration JSON. See the warning box at Workflows.
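One way to get the literal path to paste into the config JSON is to expand it in the shell first (a sketch; your preferred output location may differ):
mkdir -p "$SCRATCH/out"      # create the output directory under your scratch space
readlink -f "$SCRATCH/out"   # print the absolute path to use as the emitter output directory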
Running Workflows
With these options in the configuration JSON, a workflow can be started by
running python3 runscripts/workflow.py --config {}, substituting in the path to your config JSON.
Warning
Remember to use python3 to start workflows instead of python.
This command should be run on a login node (no need to request a compute node).
If build_image is true in your config JSON, the terminal will report that
a SLURM job was submitted to build the container image. When the image build
job starts, the terminal will report the build progress.
Note
Files that match the patterns in .dockerignore are excluded from the image.
Warning
Do not make any changes to your cloned repository or close your SSH connection until the build has finished.
Once the build has finished, the terminal will report that a SLURM job was submitted for the Nextflow workflow orchestrator before exiting back to the shell. At this point, you are free to close your connection, start additional workflows, etc. Unlike workflows run locally, Sherlock workflows are containerized, so any changes made to the repository after the container image has been built will not affect the running workflow.
Once started, the Nextflow job will stay alive for the duration of the workflow (up to 7 days) and submit new SLURM jobs as needed.
If you are trying to run a workflow that takes longer than 7 days, you can use the resume functionality (see Fault Tolerance). Alternatively, consider running your workflow on Google Cloud, which has no maximum workflow runtime (see Google Cloud).
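Because the image build, the Nextflow orchestrator, and the workflow steps all run as ordinary SLURM jobs, you can check on them from a login node with standard SLURM commands, for example:
squeue -u $USER   # list your queued and running jobs, including the Nextflow orchestrator and workflow steps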
You can start additional, concurrent workflows that each build a new image
with different modifications to the cloned repository. However, if possible,
we recommend designing your code to accept options through the config JSON
which modify the behavior of your workflow without modifying core code. This
allows you to save time by reusing a previously built image as follows:
set build_image to false and container_image to the path of said image.
There is a 4-hour time limit on each job in the workflow, including analyses. This is a generous limit designed to accommodate very slow-dividing cells. Generally, we recommend that users exclude analysis scripts that take more than a few minutes from their workflow configuration. Instead, either run these manually following Interactive Container or create a SLURM batch script to run these analyses following Non-Interactive Container.
Interactive Container
Warning
The following steps should be run on a compute node. See the Sherlock documentation for details.
The maximum resource request for an interactive compute
node is 2 hours, 4 CPU cores, and 8GB RAM/core. Scripts that require more
resources should be submitted as SLURM batch scripts to the mcovert or owners partition (see Non-Interactive Container).
To run scripts on Sherlock, you must have either:
Previously run a workflow on Sherlock and have access to the built container image
Built a container image manually using runscripts/container/build-image.sh with the -a flag
Start an interactive container with your full image path (see the warning box at Workflows) by navigating to your cloned repository and running:
runscripts/container/interactive.sh -i container_image -a
Note
Inside the interactive container, you can safely use python directly in addition to the usual uv commands.
The above command launches a container containing a snapshot of your
cloned repository as it was when the image was built. This snapshot
is located at /vEcoli inside the container and is mostly intended
to guarantee reproducibility for troubleshooting failed workflow jobs.
More specifically, users who wish to debug a failed workflow job should:
Start an interactive container with the image used to run the workflow.
Use nano to add breakpoints (import ipdb; ipdb.set_trace()) to the relevant scripts in /vEcoli.
Navigate to the working directory (see Troubleshooting) for the job that you want to debug.
Invoke bash .command.sh to run the failing task and pause upon reaching your breakpoints, allowing you to inspect variables and step through the code.
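Putting these steps together, a debugging session might look roughly like the following (the script name and working directory below are placeholders; substitute the ones relevant to your failed job):
runscripts/container/interactive.sh -i container_image -a    # image used by the failed workflow
nano /vEcoli/runscripts/parca.py                             # example only: add import ipdb; ipdb.set_trace()
cd /path/to/failed/task/workdir                              # found as described in Troubleshooting
bash .command.sh                                             # reruns the task and stops at your breakpoint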
Warning
~ and environment variables like $SCRATCH do not work inside the container. Follow the instructions in the warning box at Workflows outside the container to get the full path to use inside the container.
Danger
Any changes that you make to /vEcoli inside the container are discarded when the container terminates.
To start an interactive container that reflects the current state of your
cloned repository, navigate to your cloned repository and run the above
command with the -d flag to start a “development” container:
runscripts/container/interactive.sh -i container_image -a -d
In this mode, instead of editing source files in /vEcoli, you can
directly edit the source files in your cloned repository and have those
changes immediately reflected when running those scripts inside the
container. Because you are just modifying your cloned repository, any
code changes you make will persist after the container terminates and
can be tracked using Git version control.
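For example, inside a development container you might iterate like this (reusing documented entry points; the config path is a placeholder):
nano runscripts/parca.py                 # edit the file in your clone, not in /vEcoli
python runscripts/parca.py --config {}   # rerun it; the edit takes effect immediately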
Note
If the image you use to start a development container was built with
an outdated version of uv.lock or pyproject.toml, there may be a long startup delay due to package updates. To avoid this, build a new image with runscripts/container/build-image.sh -i container_image -a, replacing container_image with a path for the image to build.
Non-Interactive Container
To run any script inside a container without starting an interactive session,
use the same command as Interactive Container but specify a command
using the -c flag. For example, to run the ParCa process, navigate to your cloned repository and run the following command, replacing container_image with the path to your container image and {} with the path to your configuration JSON:
runscripts/container/interactive.sh -i container_image -c "python /vEcoli/runscripts/parca.py --config {}"
This feature is intended for use in SLURM batch scripts to manually run analysis scripts with custom resource requests. Make sure to include one of the following directives at the top of your script:
#SBATCH --partition=owners: The big advantage of this partition is that you can request very large amounts of resources (for example, dozens of cores). The major downsides are that queue times may be long and other users may preempt your job at any moment, though this is anecdotally rare for jobs under an hour long.
#SBATCH --partition=mcovert: Best for high priority scripts (short queue time) that you cannot risk being preempted. The number of available cores is 32 minus whatever is currently being used by other users in the mcovert partition. Importantly, if all 32 cores are in use by mcovert users, not only will your script have to wait for resources to free up, so will any workflows. As such, treat this partition as a limited resource reserved for high priority jobs.
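As a rough template, a batch script along these lines might look like the following (the partition, resource requests, job name, and paths are placeholders, and the wrapped command simply reuses the ParCa example from above):
#!/bin/bash
#SBATCH --job-name=vecoli-analysis
#SBATCH --partition=owners
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G

# Run the containerized command non-interactively from the cloned repository
cd /path/to/vEcoli
runscripts/container/interactive.sh -i container_image -c "python /vEcoli/runscripts/parca.py --config {}"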
Just as with interactive containers, to run scripts directly from your
cloned repository and not the snapshot, add the -d flag and drop the /vEcoli/ prefix from script names. Note that changing files in your cloned repository may affect SLURM batch jobs submitted with this flag.
Other Clusters
Nextflow has support for a wide array of HPC schedulers. If your HPC cluster uses a supported scheduler, you can likely run vEcoli on it with fairly minimal modifications.
Prerequisites
The following are required:
Nextflow (requires Java)
PyArrow
Git clone vEcoli to a location that is accessible from all nodes in your cluster
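As a sketch of what satisfying these prerequisites might look like on a login node (the Nextflow install method, the repository URL, and the clone destination are assumptions that vary by cluster):
java -version                                               # Nextflow requires a recent Java runtime
curl -s https://get.nextflow.io | bash                      # one common way to install Nextflow
python3 -c "import pyarrow; print(pyarrow.__version__)"     # confirm PyArrow is importable
git clone https://github.com/CovertLab/vEcoli.git /shared/filesystem/vEcoli   # clone somewhere visible to all nodes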
If your cluster has Apptainer (formerly known as Singularity) installed,
check to see if it is configured to automatically mount all filesystems (see
Apptainer docs).
If not, you may run into errors when running workflows because Apptainer containers are read-only. You may be able to resolve this by adding --writable-tmpfs to containerOptions for the sherlock and sherlock-hq profiles in runscripts/nextflow/config.template.
Additionally, you will need to manually specify paths to mount when debugging
with interactive containers (see Interactive Container). This can be done
using the -p argument for runscripts/container/interactive.sh.
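For example, to mount an extra data directory when starting an interactive container (a sketch; the mount path is a placeholder):
runscripts/container/interactive.sh -i container_image -a -p /path/to/data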
If your cluster does not have Apptainer, you can try the following steps:
Completely follow the local setup instructions in the README (install uv, etc).
Delete the following lines from runscripts/nextflow/config.template:
process.container = 'IMAGE_NAME'
...
apptainer.enabled = true
Make sure to always set build_image to false in your config JSONs (see Configuration)
Cluster Options
If your HPC cluster uses the SLURM scheduler,
you can use vEcoli on that cluster by changing the queue option in runscripts/nextflow/config.template and all instances of --partition=QUEUE(S) in runscripts.workflow to the right queue(s) for your cluster.
If your HPC cluster uses a different scheduler, refer to the Nextflow
executor documentation
for more information on configuring the right executor. Beyond changing queue
names as described above, this could be as simple as modifying the executor
directives for the sherlock and sherlock_hq profiles in runscripts/nextflow/config.template. Additionally, you will need to replace the SLURM submission directives in runscripts.workflow.main() with equivalent directives for your scheduler.
HyperQueue
HyperQueue is a job scheduler that is designed to run on top of a traditional HPC scheduler like SLURM. It consists of a head server that can automatically allocate worker jobs using the underlying HPC scheduler. These worker jobs can be configured to persist for long enough to complete multiple tasks, greatly reducing the overhead of job submission and queuing, especially for shorter jobs.
HyperQueue is distributed as a pre-built binary on GitHub.
Unfortunately, this binary is built with a newer version of GLIBC
than is available on Sherlock, necessitating a rebuild from source. A binary
built in this way is available in $GROUP_HOME/vEcoli_env to users with access to the Covert Lab’s partition on Sherlock. This is added to PATH in the Sherlock setup instructions, and unless you have a compelling reason to update it, no further action is required.
Users who want or need to build from source should follow these instructions.
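For reference, a from-source build typically looks roughly like the following (HyperQueue is written in Rust; the repository URL and the need for a Rust toolchain are assumptions to verify against HyperQueue's own documentation):
git clone https://github.com/It4innovations/hyperqueue.git
cd hyperqueue
cargo build --release   # the hq binary ends up in target/release/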