Stores

Stores are upgraded dictionaries that store the simulation state. They can be nested within one another to form a store hierarchy. The core stores in vEcoli comprise the following simplified store hierarchy:

  • bulk

  • unique

    • active_replisomes

    • active_ribosome

    • active_RNAPs

    • chromosome_domains

    • chromosomal_segments

    • full_chromosomes

    • DnaA_boxes

    • genes

    • oriCs

    • promoters

    • RNAs

  • listeners

    • mass

      • dry_mass

      • cell_mass

    • replication_data

      • fork_coordinates

      • fork_domains

Individual stores in this hierarchy are identified using tuples representing their path inside the nested hierarchy. For example, the dry mass listener store has the tuple path ("listeners", "mass", "dry_mass"). There are three top-level stores corresponding to the three main types of simulation data: bulk, unique, and listeners. The following sections provide descriptions for each of these store types prefaced by a series of relevant attributes including:

Path:

Tuple path of the store (e.g. for use in process topologies)

Updater:

Function used to apply updates to the store. Click on the linked API documentation to see the format that updates to the store must be given in.

Divider:

Function used to split the store during cell division

Serializer:

Instance of vivarium.core.registry.Serializer used to serialize store data before being emitted (see Serializing Emits)

Schema:

Helper function to create store schema in process ports_schema methods

Helpers:

Other useful helper functions

Warning

The store values that vivarium-core shows to processes when running them are NOT copies, so processes must be careful not to unintentionally modify mutable values. The Numpy arrays in the bulk and unique molecules stores are made read-only for extra protection (see WRITEABLE flag in numpy.ndarray.flags).

Bulk Molecules

Path:

("bulk",)

Updater:

ecoli.library.schema.bulk_numpy_updater()

Divider:

ecoli.library.schema.divide_bulk()

Serializer:

ecoli.library.schema.get_bulk_counts

Schema:

ecoli.library.schema.numpy_schema()

Helpers:

ecoli.library.schema.bulk_name_to_idx(), ecoli.library.schema.counts()

Bulk molecules are named as such because they represent species for which all molecules are treated as interchangeable (e.g. water). The bulk molecules store holds a structured Numpy array with the following named fields:

  1. id (str): Names of bulk molecules pulled from EcoCyc

    Each end with a bracketed “location tag” (e.g. [c]) containing one of the abbreviations defined in the reconstruction/ecoli/flat/compartments.tsv file (see Cell Component Ontology)

  2. count (numpy.int64): Counts of bulk molecules

    Note that the evolve_state() method of PartitionedProcess does not see the full structured array with named fields through its bulk port and instead sees a 1D array of partitioned counts from the allocator (see Partitioning).

  3. {}_submass (numpy.float64): Field for each submass

    Eight submasses are rRNA, tRNA, mRNA, miscRNA, nonspecific_RNA, protein, metabolite, water, DNA

Initialization

To create the initial value for this store, the model will go through the following three options in order of preference:

  1. Load custom initial state

    Set initial_state option for EcoliSim (see JSON Config Files)

  2. Load from saved state JSON

    Set initial_state_file option for EcoliSim

  3. Generate from sim_data

    generate_initial_state() uses the sim_data object generated by the ParCa to calculate initial state

Partitioning

Motivation

To support the use of independent sub-models for different biological processes (e.g. FBA for metabolism, Gillespie for complexation, etc.), the model allows processes to run mostly independently. At a high level, over the course of a single simulation step, each process will see the simulation state as it was before any other process has run. Each process will then calculate an update to apply to the simulation state and all updates will be simultaneously applied once all processes have run.

This setup has a potential problem: two processes may both decide to deplete the count of the same molecule, resulting in a final count that is negative. To prevent this from happening, the model forces processes to first request counts of bulk molecules via special process-specific request stores. These stores are read by an allocator process (Allocator). The allocator process then divides the bulk molecules so that each process sees a functional count proportional to its request.

For example, if process A requests 100 of molecule X and process B requests 400 of molecule X but the cell only has 400 molecules of X, the allocator will divide the molecules as follows:

  • Process A: \(\frac{100}{100 + 400} * 400 = 80\) molecules of X

  • Process B: \(\frac{400}{100 + 400} * 400 = 320\) molecules of X

Steps and Flows

In our model, many processes are dependent one another to an even greater extent than that imposed by this request/allocate partitioning scheme. For example, since molecule binding and complexation events occur on timescales much shorter than the default 1 second simulation timestep, Equilibrium and TwoComponentSystem must wait for TfUnbinding to update the simulation state by freeing currently bound transcription factors. This gives all transcription factors a chance to form complexes or participate in other reactions, better reflecting the transient binding dynamics of real cells. TfBinding must wait for these new active transcription factor counts, TranscriptInitiation must wait for the counts of transcription factors bound to promoters, and so on.

To allow processes to run in a pre-specified order within each timestep, we can make use of a subclass of the typical Vivarium Process class: Step. Almost all “processes” in the model are actually instances of Step. These Steps are configured to run in user-configured “execution layers” by way of a flow that is included in the simulation configuration (see ecoli_master_sim).

A flow is a dictionary that specifies the dependencies for each Step. For example, if a user wants Step B to run only after Step A has updated the simulation state, the user can include Step A as a dependency of Step B:

{
    "Step B": [("Step A",)]
}

Steps can have multiple dependencies (e.g. [("Step A",), ("Step B",)]) and each dependency must be in the form of a tuple path. All processes are top-level stores so these paths are usually just ("process name",).

Vivarium will parse the flow to construct a directed acyclic graph and figure out the order in which to run steps by stratifying them into execution layers. For example, consider the following flow:

{
    "Step B": [("Step A",)],
    "Step C": [("Step A",)],
    "Step D": [("Step C",)]
}

Vivarium will parse this into the following sequence of execution layers:

  1. Step A

  2. Step B and Step C (order does not matter)

  3. Step D

Each timestep, Step A will run and update the simulation state, Steps B and C will run with a view of the state that was updated by Step A, and finally Step D will run with a view of the state that was updated by every other step.

Implementation

All partitioned processes are instances of the PartitionedProcess class. This both serves to identify the processes that require partitioning and also implements a standard next_update method that allows these processes to be run on their own (as in migration tests).

Warning

In instances of PartitionedProcess, all ports connected to the bulk molecule store MUST be called bulk to be properly partitioned. Conversely, ports that are not meant to be partitioned should NEVER be called bulk in any PartitionedProcess.

In the model, each partitioned process is used to create two separate steps: a Requester and an Evolver. For each execution layer in the flow given to EcoliSim, Ecoli will create Requesters, Evolvers, and other required Steps and arrange them to be executed in the following order:

  1. Requesters:

    Each calls the calculate_request() method of a PartitionedProcess in said layer and writes its requests to a process-specific request store

  2. Allocator:

    Once all Requesters in said layer have finished writing their requests, an instance of Allocator reads all the written request stores and proportionally allocates bulk molecules to processes, writing allocated counts to process-specific allocate stores

  3. Evolvers:

    Each swaps the Numpy structured array of unpartitioned bulk counts in the bulk port with the 1D array of allocated counts in the corresponding allocate store, calls the evolve_state() method of its PartitionedProcess, and returns updates to the bulk molecule counts and unique molecule stores

  4. Unique updater:

    An instance of UniqueUpdate that tells the unique molecule updaters to apply accumulated updates (see UniqueNumpyUpdater for why we accumulate updates and wait to apply them after all Evolvers in an execution layer have run)

Note

The Requester and Evolver for each partitioned process share the same PartitionedProcess instance. This allows instance variables to be updated and shared between the calculate_request() and evolve_state() methods of each PartitionedProcess.

Accessing Non-Partitioned Counts

There are certain partitioned processes that require access to the total, non-partitioned counts of certain bulk molecules in their evolve_state() methods. For example, PolypeptideElongation needs to know the total counts to all amino acids to accurately implement tRNA charging. To give these processes access to non-partitioned counts, an additional port is added to their ports_schema methods and topologies that is also connected to the bulk molecules store. By convention, this port is called bulk_total to differentiate it from the partitioned bulk port. As noted in Implementation, Evolvers overwrite the port named bulk with the allocated bulk counts. Due to being named bulk_total instead of bulk, the non-partitioned port value is left untouched and allows the Evolver to read non-partitioned counts at will.

Indexing

Processes typically use the ecoli.library.schema.bulk_name_to_idx() helper function to get the indices for a set of molecules (e.g. all NTPs). These indices are typically cached as instance attributes (e.g. self.ntp_idx) in the calculate_request method of a PartitionedProcess. This way, we can ensure that all the necessary indices are retrieved the very first time the Requester for that process is run, making it available to the Evolver (which shares the same PartitionedProcess) and subsequent runs of the Requester. See the branch beginning if self.proton_idx is None in calculate_request() for an example.

Though counts can be directly retrieved from the Numpy structured array (e.g. states["bulk"]["count"][self.ntp_idx]), this method of access does not work for Evolver (i.e. inside the evolve_state() method) because they automatically replace the non-partitioned Numpy structured array of bulk counts with the 1D array of partitioned bulk counts for that process (see Implementation). To standardize count access across the calculate_request() and evolve_state() methods, the helper function ecoli.library.schema.counts() can handle both of these scenarios and also guarantees that the returned counts can be safely edited without unintentionally mutating the source array.

Unique Molecules

Path:

("unique",)

Updater:

ecoli.library.schema.UniqueNumpyUpdater.updater()

Dividers:

See ecoli.library.schema.UNIQUE_DIVIDERS

Serializer:

ecoli.library.schema.get_unique_fields

Schema:

ecoli.library.schema.numpy_schema()

Helpers:

ecoli.library.schema.attrs()

Unique molecules are named as such because they represent species for which individual molecules are not treated as interchangeable (e.g. different RNA molecules may have different sequences). The unique molecules store holds substores for each unique molecule (e.g. ("unique", "RNAs"), ("unique", "active_RNAPs")). Each unique molecule substore contains a structured Numpy array with a variety of named fields, each representing an attribute of interest for that class of unique molecules (e.g. coordinates for a gene unique molecule). All unique molecules will have the following named fields:

  1. unique_index (int): Unique identifier for each unique molecule

    When processes add new unique molecules, you can let updater generate the new unique indices. If you need to reference the unique indices of new molecules in the same process AND timestep in which they are generated, see ecoli.library.schema.create_unique_indices(). Note that unique indices are only unique within a single cell.

  2. _entryState (numpy.int8): 1 for active row, 0 for inactive row

    When unique molecules are deleted (e.g. RNA degradation), all of their data, including the _entryState field, is set to 0. When unique molecues are added (e.g. RNA transcription), the updater places the data for these new molecules into the rows that are identified as inactive by the helper function ecoli.library.schema.get_free_indices(), which also grows the array if necessary.

  3. massDiff_{} (numpy.float64): Field for each dynamic submass

    The eight submasses are rRNA, tRNA, mRNA, miscRNA, nonspecific_RNA, protein, metabolite, water, and DNA. An example of a dynamic submass is the constantly changing protein mass of the polypeptide associated with an actively translating ribosome.

Note

Unique molecules are instances of MetadataArray, a subclass of Numpy arrays that adds a metadata attribute. This attribute is used to store the next unique index to be assigned to a new unique molecule. If you wish to add a custom unique molecule type, after you have created a structured Numpy array with at least the above attributes, create a MetadataArray instance from it using MetadataArray(array, next_unique_index), where array is your structured Numpy array and next_unique_index is the next unique index to be assigned.

Initialization

See Initialization.

Accessing

Processes use the ecoli.library.schema.attrs() helper function to access any number of attributes for all active (_entryState is 1) unique molecules of a given type (e.g. RNA, active RNAP, etc.).

Listeners

Path:

("listeners",)

Updater:

update_set()

Divider:

None (leave value as is upon division)

Serializer:

None by default (leave value as is) but will automatically use registered serializers for data that is of a type that cannot be automatically serialized to JSON by orjson. For example, listeners holding values with Unum units will be serialized by UnumSerializer, which is registered in ecoli/__init__.py. See Serializing Emits.

Schema:

ecoli.library.schema.listener_schema()

Helpers:

None

The listeners store contains many substores that hold data which is saved for downstream analyses. For example, the ("listeners", "mass") substore contains substores for various masses of interest, such as cell_mass, dry_mass, protein_mass, etc.

Tip

When possible, we recommend that you put all stores whose data you wish to save inside the listeners store. This is to ensure consistency across the model and helps keep stores organized.

Initialization

Listener stores are initialized at the start of a simulation with the default values specified in the ports_schema() methods of the processes that connect to them. Refer to ecoli.library.schema.listener_schema() for information about how to configure this default value as well as attach useful metadata to specific listener values.

Listener stores must contain data of the same type (or data that is serialized to the same type) for the duration of a simulation. This is becuase the Parquet storage format (see Parquet Emitter) is a columnar format that requires columns to have static data types. Some leeway is allowed for None (null) values in nested types. For example, [] and [0] both work fine for a column containing 1D lists of integers.

Warning

Double check the data type of default values for listeners. For example, a listener of float values with a default value of 0 will incorrectly coerce all subsequent values to integers in the saved output. The correct default value in this case is 0.0.