Stores
Stores are upgraded dictionaries that store the simulation state. They can be nested within one another to form a store hierarchy. The core stores in vEcoli comprise the following simplified store hierarchy:
bulk
unique
active_replisomes
active_ribosome
active_RNAPs
chromosome_domains
chromosomal_segments
full_chromosomes
DnaA_boxes
genes
oriCs
promoters
RNAs
listeners
mass
dry_mass
cell_mass
…
replication_data
fork_coordinates
fork_domains
…
…
Individual stores in this hierarchy are identified using tuples representing
their path inside the nested hierarchy. For example, the dry mass listener store
has the tuple path ("listeners", "mass", "dry_mass")
. There are three top-level
stores corresponding to the three main types of simulation data: bulk, unique, and
listeners. The following sections provide descriptions for each of these store types
prefaced by a series of relevant attributes including:
- Path:
Tuple path of the store (e.g. for use in process topologies)
- Updater:
Function used to apply updates to the store. Click on the linked API documentation to see the format that updates to the store must be given in.
- Divider:
Function used to split the store during cell division
- Serializer:
Instance of
vivarium.core.registry.Serializer
used to serialize store data before being emitted (see Serializing Emits)- Schema:
Helper function to create store schema in process
ports_schema
methods- Helpers:
Other useful helper functions
Warning
The store values that vivarium-core
shows to processes when running them
are NOT copies, so processes must be careful not to unintentionally
modify mutable values. The Numpy arrays in the bulk and unique molecules stores
are made read-only for extra protection (see WRITEABLE
flag in numpy.ndarray.flags
).
Bulk Molecules
- Path:
("bulk",)
- Updater:
- Divider:
- Serializer:
- Schema:
- Helpers:
ecoli.library.schema.bulk_name_to_idx()
,ecoli.library.schema.counts()
Bulk molecules are named as such because they represent species for which all molecules are treated as interchangeable (e.g. water). The bulk molecules store holds a structured Numpy array with the following named fields:
id
(str
): Names of bulk molecules pulled from EcoCycEach end with a bracketed “location tag” (e.g.
[c]
) containing one of the abbreviations defined in thereconstruction/ecoli/flat/compartments.tsv
file (see Cell Component Ontology)
count
(numpy.int64
): Counts of bulk moleculesNote that the
evolve_state()
method ofPartitionedProcess
does not see the full structured array with named fields through its bulk port and instead sees a 1D array of partitioned counts from the allocator (see Partitioning).
{}_submass
(numpy.float64
): Field for each submassEight submasses are rRNA, tRNA, mRNA, miscRNA, nonspecific_RNA, protein, metabolite, water, DNA
Initialization
To create the initial value for this store, the model will go through the following three options in order of preference:
- Load custom initial state
Set
initial_state
option forEcoliSim
(see JSON Config Files)
- Load from saved state JSON
Set
initial_state_file
option forEcoliSim
- Generate from
sim_data
generate_initial_state()
uses thesim_data
object generated by the ParCa to calculate initial state
Partitioning
Motivation
To support the use of independent sub-models for different biological processes (e.g. FBA for metabolism, Gillespie for complexation, etc.), the model allows processes to run mostly independently. At a high level, over the course of a single simulation step, each process will see the simulation state as it was before any other process has run. Each process will then calculate an update to apply to the simulation state and all updates will be simultaneously applied once all processes have run.
This setup has a potential problem: two processes may both decide to deplete
the count of the same molecule, resulting in a final count that is negative.
To prevent this from happening, the model forces processes to first request
counts of bulk molecules via special process-specific request
stores. These
stores are read by an allocator process
(Allocator
). The allocator process
then divides the bulk molecules so that each process sees a functional count
proportional to its request.
For example, if process A requests 100 of molecule X and process B requests 400 of molecule X but the cell only has 400 molecules of X, the allocator will divide the molecules as follows:
Process A: \(\frac{100}{100 + 400} * 400 = 80\) molecules of X
Process B: \(\frac{400}{100 + 400} * 400 = 320\) molecules of X
Steps and Flows
In our model, many processes are dependent one another to an even greater extent
than that imposed by this request/allocate partitioning scheme. For example, since
molecule binding and complexation events occur on timescales much shorter than the
default 1 second simulation timestep, Equilibrium
and TwoComponentSystem
must wait for TfUnbinding
to update
the simulation state by freeing currently bound transcription factors. This gives
all transcription factors a chance to form complexes
or participate in other reactions, better reflecting the transient binding dynamics
of real cells. TfBinding
must wait for these
new active transcription factor counts,
TranscriptInitiation
must wait for the counts of transcription factors bound to promoters, and so on.
To allow processes to run in a pre-specified order within
each timestep, we can make use of a subclass of the typical Vivarium
Process
class:
Step
. Almost all “processes” in the model
are actually instances of Step
. These Steps
are configured to run in user-configured “execution layers” by way of a flow
that is included in the simulation configuration (see
ecoli_master_sim
).
A flow
is a dictionary that specifies the dependencies for each Step. For
example, if a user wants Step B to run only after Step A has updated the
simulation state, the user can include Step A as a dependency of Step B:
{
"Step B": [("Step A",)]
}
Steps can have multiple dependencies (e.g. [("Step A",), ("Step B",)]
)
and each dependency must be in the form of a tuple path. All processes are
top-level stores so these paths are usually just ("process name",)
.
Vivarium will parse the flow
to construct a directed acyclic graph
and figure out the order in which to run steps by stratifying them into
execution layers. For example, consider the following flow
:
{
"Step B": [("Step A",)],
"Step C": [("Step A",)],
"Step D": [("Step C",)]
}
Vivarium will parse this into the following sequence of execution layers:
Step A
Step B and Step C (order does not matter)
Step D
Each timestep, Step A will run and update the simulation state, Steps B and C will run with a view of the state that was updated by Step A, and finally Step D will run with a view of the state that was updated by every other step.
Implementation
All partitioned processes are instances of the
PartitionedProcess
class. This both
serves to identify the processes that require partitioning and also implements
a standard next_update
method that allows these processes to be run on
their own (as in
migration tests).
Warning
In instances of PartitionedProcess
,
all ports connected to the bulk molecule store MUST be called
bulk
to be properly partitioned. Conversely, ports that are not meant
to be partitioned should NEVER be called bulk
in any
PartitionedProcess
.
In the model, each partitioned process is used to create two separate steps:
a Requester
and an
Evolver
. For each execution layer
in the flow
given to EcoliSim
,
Ecoli
will create Requesters, Evolvers,
and other required Steps and arrange them to be executed in the following order:
- Requesters:
Each calls the
calculate_request()
method of aPartitionedProcess
in said layer and writes its requests to a process-specificrequest
store
- Allocator:
Once all Requesters in said layer have finished writing their requests, an instance of
Allocator
reads all the writtenrequest
stores and proportionally allocates bulk molecules to processes, writing allocated counts to process-specificallocate
stores
- Evolvers:
Each swaps the Numpy structured array of unpartitioned bulk counts in the
bulk
port with the 1D array of allocated counts in the correspondingallocate
store, calls theevolve_state()
method of itsPartitionedProcess
, and returns updates to the bulk molecule counts and unique molecule stores
- Unique updater:
An instance of
UniqueUpdate
that tells the unique molecule updaters to apply accumulated updates (seeUniqueNumpyUpdater
for why we accumulate updates and wait to apply them after all Evolvers in an execution layer have run)
Note
The Requester
and
Evolver
for each partitioned process
share the same PartitionedProcess
instance. This allows instance variables to be updated and shared between the
calculate_request()
and evolve_state()
methods of each PartitionedProcess
.
Accessing Non-Partitioned Counts
There are certain partitioned processes that require access to the total, non-partitioned
counts of certain bulk molecules in their
evolve_state()
methods. For example,
PolypeptideElongation
needs to know
the total counts to all amino acids to accurately implement tRNA charging. To give these
processes access to non-partitioned counts, an additional port is added to
their ports_schema
methods and topologies that is also connected to the
bulk molecules store. By convention, this port is called bulk_total
to
differentiate it from the partitioned bulk
port. As noted in Implementation,
Evolvers overwrite the port named bulk
with the allocated bulk counts. Due to being
named bulk_total
instead of bulk
, the non-partitioned port value is left
untouched and allows the Evolver to read non-partitioned counts at will.
Indexing
Processes typically use the ecoli.library.schema.bulk_name_to_idx()
helper function
to get the indices for a set of molecules (e.g. all NTPs). These indices are typically cached
as instance attributes (e.g. self.ntp_idx
) in the calculate_request
method of a
PartitionedProcess
. This way, we can ensure
that all the necessary indices are retrieved the very first time the
Requester
for that process is
run, making it available to the Evolver
(which shares the same PartitionedProcess
)
and subsequent runs of the Requester. See the branch beginning if self.proton_idx is None
in calculate_request()
for an example.
Though counts can be directly retrieved from the Numpy structured array (e.g.
states["bulk"]["count"][self.ntp_idx]
), this method of access does not work
for Evolver
(i.e. inside the
evolve_state()
method)
because they automatically replace the non-partitioned Numpy structured array
of bulk counts with the 1D array of partitioned bulk counts for that process
(see Implementation). To standardize count access across the
calculate_request()
and
evolve_state()
methods,
the helper function ecoli.library.schema.counts()
can handle both of
these scenarios and also guarantees that the returned counts can be safely
edited without unintentionally mutating the source array.
Unique Molecules
- Path:
("unique",)
- Updater:
- Dividers:
- Serializer:
- Schema:
- Helpers:
Unique molecules are named as such because they represent species for
which individual molecules are not treated as interchangeable (e.g.
different RNA molecules may have different sequences). The unique molecules
store holds substores for each unique molecule (e.g. ("unique", "RNAs")
,
("unique", "active_RNAPs")
). Each unique molecule substore contains a
structured Numpy array
with a variety of named fields, each representing an attribute of interest
for that class of unique molecules (e.g. coordinates
for a gene
unique
molecule). All unique molecules will have the following named fields:
unique_index
(int
): Unique identifier for each unique moleculeWhen processes add new unique molecules, the helper function
ecoli.library.schema.create_unique_indexes()
is used to generate unique indices for each molecule to be added.
_entryState
(numpy.int8
): 1 for active row, 0 for inactive rowWhen unique molecules are deleted (e.g. RNA degradation), all of their data, including the
_entryState
field, is set to 0. When unique molecues are added (e.g. RNA transcription), the updater places the data for these new molecules into the rows that are identified as inactive by the helper functionecoli.library.schema.get_free_indices()
, which also grows the array if necessary.
massDiff_{}
(numpy.float64
): Field for each dynamic submassThe eight submasses are rRNA, tRNA, mRNA, miscRNA, nonspecific_RNA, protein, metabolite, water, and DNA. An example of a dynamic submass is the constantly changing protein mass of the polypeptide associated with an actively translating ribosome.
Initialization
See Initialization.
Accessing
Processes use the ecoli.library.schema.attrs()
helper function to access
any number of attributes for all active (_entryState
is 1) unique molecules
of a given type (e.g. RNA, active RNAP, etc.).
Listeners
- Path:
("listeners",)
- Updater:
- Divider:
None (leave value as is upon division)
- Serializer:
None by default (leave value as is) but will automatically use registered serializers for data that is of a type that cannot be automatically serialized to JSON by
orjson
. For example, listeners holding values with Unum units will be serialized byUnumSerializer
, which is registered inecoli/__init__.py
. See Serializing Emits.- Schema:
- Helpers:
None
The listeners store contains many substores that hold data which is saved
for downstream analyses. For example, the ("listeners", "mass")
substore
contains substores for various masses of interest, such as cell_mass
,
dry_mass
, protein_mass
, etc.
Tip
When possible, we recommend that you put all stores whose data you wish to save inside the listeners store. This is to ensure consistency across the model and helps keep stores organized.
Initialization
Listener stores are initialized at the start of a simulation with the
default values specified in the
ports_schema()
methods of
the processes that connect to them. Refer to
ecoli.library.schema.listener_schema()
for information about how
to configure this default value as well as attach useful metadata to
specific listener values.
Listener stores must contain data of the same type (or data that is serialized
to the same type) for the duration of a simulation. This is becuase the Parquet
storage format (see Parquet Emitter) is a columnar format that requires
columns to have static data types. Some leeway is allowed for None
(null)
values in nested types. For example, []
and [0]
both work fine for a
column containing 1D lists of integers.
Warning
Double check the data type of default values for listeners. For example,
a listener of float values with a default value of 0
will incorrectly
coerce all subsequent values to integers in the saved output. The correct
default value in this case is 0.0
.