======
Stores
======

Stores are upgraded dictionaries that store the simulation state. They can
be nested within one another to form a store hierarchy. The core stores in
vEcoli comprise the following simplified store hierarchy:

- ``bulk``
- ``unique``

    - ``active_replisomes``
    - ``active_ribosome``
    - ``active_RNAPs``
    - ``chromosome_domains``
    - ``chromosomal_segments``
    - ``full_chromosomes``
    - ``DnaA_boxes``
    - ``genes``
    - ``oriCs``
    - ``promoters``
    - ``RNAs``

- ``listeners``

    - ``mass``

        - ``dry_mass``
        - ``cell_mass``
        - ...

    - ``replication_data``

        - ``fork_coordinates``
        - ``fork_domains``
        - ...

    - ...

Individual stores in this hierarchy are identified using tuples representing
their path inside the nested hierarchy. For example, the dry mass listener store
has the tuple path ``("listeners", "mass", "dry_mass")``. There are three top-level
stores corresponding to the three main types of simulation data: bulk, unique, and
listeners. The following sections provide descriptions for each of these store types
prefaced by a series of relevant attributes including:

:Path: Tuple path of the store (e.g. for use in process topologies)
:Updater: Function used to apply updates to the store. Click on the linked API
    documentation to see the format that updates to the store must be given in.
:Divider: Function used to split the store during cell division
:Serializer: Instance of :py:class:`vivarium.core.registry.Serializer` 
    used to serialize store data before being emitted (see :ref:`serializing_emits`)
:Schema: Helper function to create store schema in process ``ports_schema`` methods
:Helpers: Other useful helper functions

.. WARNING::
    The store values that ``vivarium-core`` shows to processes when running them
    are **NOT** copies, so processes must be careful not to unintentionally
    modify mutable values. The Numpy arrays in the bulk and unique molecules stores
    are made read-only for extra protection (see ``WRITEABLE``
    flag in :py:attr:`numpy.ndarray.flags`).

.. _bulk:

--------------
Bulk Molecules
--------------

:Path: ``("bulk",)``
:Updater: :py:func:`ecoli.library.schema.bulk_numpy_updater`
:Divider: :py:func:`ecoli.library.schema.divide_bulk`
:Serializer: :py:class:`ecoli.library.schema.get_bulk_counts`
:Schema: :py:func:`ecoli.library.schema.numpy_schema`
:Helpers: :py:func:`ecoli.library.schema.bulk_name_to_idx`,
    :py:func:`ecoli.library.schema.counts`

Bulk molecules are named as such because they represent species for 
which all molecules are treated as interchangeable (e.g. water). The bulk
molecules store holds a
`structured Numpy array <https://numpy.org/doc/stable/user/basics.rec.html>`_ 
with the following named fields:

    1. ``id`` (:py:class:`str`): Names of bulk molecules pulled from `EcoCyc <https://ecocyc.org/>`_
        Each end with a bracketed "location tag" (e.g. ``[c]``) containing
        one of the abbreviations defined in the
        ``reconstruction/ecoli/flat/compartments.tsv`` file (see
        `Cell Component Ontology <http://brg.ai.sri.com/CCO/downloads/cco.html>`_)
    2. ``count`` (:py:attr:`numpy.int64`): Counts of bulk molecules
        Note that the :py:meth:`~ecoli.processes.partition.PartitionedProcess.evolve_state`
        method of :py:class:`~ecoli.processes.partition.PartitionedProcess` does not see the full
        structured array with named fields through its bulk port and instead sees a 1D array of
        partitioned counts from the allocator (see :ref:`partitioning`).
    3. ``{}_submass`` (:py:attr:`numpy.float64`): Field for each submass
        Eight submasses are rRNA, tRNA, mRNA, miscRNA, nonspecific_RNA, protein, metabolite, water, DNA

.. _initialization:

Initialization
==============
To create the initial value for this store, the model will go through 
the following three options in order of preference:

    1. Load custom initial state
        Set ``initial_state`` option for 
        :py:class:`~ecoli.experiments.ecoli_master_sim.EcoliSim`
        (see :ref:`json_config`)

    2. Load from saved state JSON
        Set ``initial_state_file`` option for 
        :py:class:`~ecoli.experiments.ecoli_master_sim.EcoliSim`

    3. Generate from ``sim_data``
        :py:meth:`~ecoli.library.sim_data.LoadSimData.generate_initial_state` 
        uses the ``sim_data`` object generated by the ParCa to calculate 
        initial state


.. _partitioning:

Partitioning
============

Motivation
----------
To support the use of independent sub-models for different biological processes 
(e.g. FBA for metabolism, Gillespie for complexation, etc.), the model allows 
processes to run mostly independently. At a high level, over the course of a 
single simulation step, each process will see the simulation state as it was 
before any other process has run. Each process will then calculate an update 
to apply to the simulation state and all updates will be simultaneously 
applied once all processes have run. 

This setup has a potential problem: two processes may both decide to deplete 
the count of the same molecule, resulting in a final count that is negative. 
To prevent this from happening, the model forces processes to first request
counts of bulk molecules via special process-specific ``request`` stores. These
stores are read by an allocator process 
(:py:class:`~ecoli.processes.allocator.Allocator`). The allocator process 
then divides the bulk molecules so that each process sees a functional count 
proportional to its request.

For example, if process A requests 100 of molecule X and process B requests 
400 of molecule X but the cell only has 400 molecules of X, the allocator 
will divide the molecules as follows:

- Process A: :math:`\frac{100}{100 + 400} * 400 = 80` molecules of X 
- Process B: :math:`\frac{400}{100 + 400} * 400 = 320` molecules of X

Steps and Flows
---------------
In our model, many processes are dependent on one another even outside this
request/allocate partitioning scheme. For example, in each simulation time step,
:py:class:`~ecoli.processes.metabolism.Metabolism` must wait for
:py:class:`~ecoli.processes.polypeptide_elongation.PolypeptideElongation`
to finish calculating the counts of amino acids, GTP, etc. consumed
before it can solve the FBA problem to optimize growth.

To allow processes to run in a pre-specified order within 
each timestep, we can make use of a subclass of the typical Vivarium 
:py:class:`~vivarium.core.process.Process` class: 
:py:class:`~vivarium.core.process.Step`. Almost all "processes" in the model 
are actually instances of :py:class:`~vivarium.core.process.Step`. These Steps 
are configured to run in user-configured "execution layers" by way of a ``flow`` 
that is included in the simulation configuration (see 
:py:mod:`~ecoli.experiments.ecoli_master_sim`).

A ``flow`` is a dictionary that specifies the dependencies for each Step. For 
example, if a user wants Step B to run only after Step A has updated the 
simulation state, the user can include Step A as a dependency of Step B:

.. code-block::

    {
        "Step B": [("Step A",)]
    }

Steps can have multiple dependencies (e.g. ``[("Step A",), ("Step B",)]``)
and each dependency must be in the form of a tuple path. All processes are
top-level stores so these paths are usually just ``("process name",)``.

Vivarium will parse the ``flow`` to construct a directed acyclic graph  
and figure out the order in which to run steps by stratifying them into 
execution layers. For example, consider the following ``flow``:

.. code-block::

    {
        "Step B": [("Step A",)],
        "Step C": [("Step A",)],
        "Step D": [("Step C",)]
    }

Vivarium will parse this into the following sequence of execution layers: 

1. Step A
2. Step B and Step C (order does not matter)
3. Step D

Each timestep, Step A will run and update the simulation state, Steps B and C 
will run with a view of the state that was updated by Step A, and finally 
Step D will run with a view of the state that was updated by every other step.

.. _implementation:

Implementation
--------------
All partitioned processes are instances of the 
:py:class:`~ecoli.processes.partition.PartitionedProcess` class. This both 
serves to identify the processes that require partitioning and also implements 
a standard ``next_update`` method that allows these processes to be run on 
their own.

.. WARNING::
    In instances of :py:class:`~ecoli.processes.partition.PartitionedProcess`, 
    all ports connected to the bulk molecule store **MUST** be called 
    ``bulk`` to be properly partitioned. Conversely, ports that are not meant 
    to be partitioned should **NEVER** be called ``bulk`` in any 
    :py:class:`~ecoli.processes.partition.PartitionedProcess`.

In the model, each partitioned process is used to create two separate steps: 
a :py:class:`~ecoli.processes.partition.Requester` and an 
:py:class:`~ecoli.processes.partition.Evolver`. For each execution layer 
in the ``flow`` given to :py:class:`~ecoli.experiments.ecoli_master_sim.EcoliSim`, 
:py:class:`~ecoli.composites.ecoli_master.Ecoli` will create Requesters, Evolvers,
and other required Steps and arrange them to be executed in the following order:

1. Requesters:
    Each calls the
    :py:meth:`~ecoli.processes.partition.PartitionedProcess.calculate_request`
    method of a :py:class:`~ecoli.processes.partition.PartitionedProcess`
    in said layer and writes its requests to a process-specific ``request`` store

2. Allocator:
    Once all Requesters in said layer have finished writing their requests, an
    instance of :py:class:`~ecoli.processes.allocator.Allocator`
    reads all the written ``request`` stores and proportionally allocates bulk
    molecules to processes, writing allocated counts to process-specific
    ``allocate`` stores

3. Evolvers:
    Each swaps the Numpy structured array of unpartitioned bulk counts in the
    ``bulk`` port with the 1D array of allocated counts in the corresponding
    ``allocate`` store, calls the
    :py:meth:`~ecoli.processes.partition.PartitionedProcess.evolve_state` 
    method of its :py:class:`~ecoli.processes.partition.PartitionedProcess`, 
    and returns updates to the bulk molecule counts and unique molecule stores

4. Unique updater: 
    An instance of 
    :py:class:`~ecoli.processes.unique_update.UniqueUpdate` that tells the
    unique molecule updaters to apply accumulated updates 
    (see :py:class:`~ecoli.library.schema.UniqueNumpyUpdater` for why we accumulate
    updates and wait to apply them after all Evolvers in an execution layer have run)

.. note::
    The :py:class:`~ecoli.processes.partition.Requester` and 
    :py:class:`~ecoli.processes.partition.Evolver` for each partitioned process 
    share the same :py:class:`~ecoli.processes.partition.PartitionedProcess` 
    instance. This allows instance variables to be updated and shared between the 
    :py:meth:`~ecoli.processes.partition.PartitionedProcess.calculate_request` 
    and :py:meth:`~ecoli.processes.partition.PartitionedProcess.evolve_state`
    methods of each :py:class:`~ecoli.processes.partition.PartitionedProcess`.

Accessing Non-Partitioned Counts
--------------------------------
There are certain partitioned processes that require access to the total, non-partitioned 
counts of certain bulk molecules in their
:py:meth:`~ecoli.processes.partition.PartitionedProcess.evolve_state` methods. For example, 
:py:class:`~ecoli.processes.polypeptide_elongation.PolypeptideElongation` needs to know
the total counts to all amino acids to accurately implement tRNA charging. To give these 
processes access to non-partitioned counts, an additional port is added to 
their ``ports_schema`` methods and topologies that is also connected to the 
bulk molecules store. By convention, this port is called ``bulk_total`` to 
differentiate it from the partitioned ``bulk`` port. As noted in :ref:`implementation`,
Evolvers overwrite the port named ``bulk`` with the allocated bulk counts. Due to being
named ``bulk_total`` instead of ``bulk``, the non-partitioned port value is left
untouched and allows the Evolver to read non-partitioned counts at will.


Indexing
========
Processes typically use the :py:func:`ecoli.library.schema.bulk_name_to_idx` helper function 
to get the indices for a set of molecules (e.g. all NTPs). These indices are typically cached 
as instance attributes (e.g. ``self.ntp_idx``) in the ``calculate_request`` method of a
:py:class:`~ecoli.processes.partition.PartitionedProcess`. This way, we can ensure
that all the necessary indices are retrieved the very first time the
:py:class:`~ecoli.processes.partition.Requester` for that process is
run, making it available to the :py:class:`~ecoli.processes.partition.Evolver`
(which shares the same :py:class:`~ecoli.processes.partition.PartitionedProcess`)
and subsequent runs of the Requester. See the branch beginning ``if self.proton_idx is None``
in :py:meth:`~ecoli.processes.polypeptide_elongation.PolypeptideElongation.calculate_request`
for an example.

Though counts can be directly retrieved from the Numpy structured array (e.g. 
``states["bulk"]["count"][self.ntp_idx]``), this method of access does not work
for :py:class:`~ecoli.processes.partition.Evolver` (i.e. inside the
:py:meth:`~ecoli.processes.partition.PartitionedProcess.evolve_state` method)
because they automatically replace the non-partitioned Numpy structured array
of bulk counts with the 1D array of partitioned bulk counts for that process
(see :ref:`implementation`). To standardize count access across the
:py:meth:`~ecoli.processes.partition.PartitionedProcess.calculate_request` and
:py:meth:`~ecoli.processes.partition.PartitionedProcess.evolve_state` methods,
the helper function :py:func:`ecoli.library.schema.counts` can handle both of
these scenarios and also guarantees that the returned counts can be safely
edited without unintentionally mutating the source array.


----------------
Unique Molecules
----------------

:Path: ``("unique",)``
:Updater: :py:meth:`ecoli.library.schema.UniqueNumpyUpdater.updater`
:Dividers: See :py:data:`ecoli.library.schema.UNIQUE_DIVIDERS`
:Serializer: :py:class:`ecoli.library.schema.get_unique_fields`
:Schema: :py:func:`ecoli.library.schema.numpy_schema`
:Helpers: :py:func:`ecoli.library.schema.attrs`

Unique molecules are named as such because they represent species for 
which individual molecules are not treated as interchangeable (e.g. 
different RNA molecules may have different sequences). The unique molecules
store holds substores for each unique molecule (e.g. ``("unique", "RNAs")``,
``("unique", "active_RNAPs")``). Each unique molecule substore contains a 
`structured Numpy array <https://numpy.org/doc/stable/user/basics.rec.html>`_ 
with a variety of named fields, each representing an attribute of interest 
for that class of unique molecules (e.g. ``coordinates`` for a ``gene`` unique 
molecule). All unique molecules will have the following named fields:

    1. ``unique_index`` (:py:class:`int`): Unique identifier for each unique molecule
        When processes add new unique molecules, you can let
        :py:meth:`updater <ecoli.library.schema.UniqueNumpyUpdater.updater>`
        generate the new unique indices. If you need to reference the unique
        indices of new molecules in the same process AND timestep in which they
        are generated, see :py:meth:`ecoli.library.schema.create_unique_indices`.
        Note that unique indices are only unique within a single cell.
    2. ``_entryState`` (:py:attr:`numpy.int8`): 1 for active row, 0 for inactive row
        When unique molecules are deleted (e.g. RNA degradation), all of their data, 
        including the ``_entryState`` field, is set to 0. When unique molecues are 
        added (e.g. RNA transcription), the updater places the data for these new 
        molecules into the rows that are identified as inactive by the helper function 
        :py:func:`ecoli.library.schema.get_free_indices`, which also grows the array 
        if necessary. 
    3. ``massDiff_{}`` (:py:attr:`numpy.float64`): Field for each dynamic submass
        The eight submasses are rRNA, tRNA, mRNA, miscRNA, nonspecific_RNA, protein, 
        metabolite, water, and DNA. An example of a dynamic submass is the constantly
        changing protein mass of the polypeptide associated with an actively 
        translating ribosome.

.. note::
    Unique molecules are instances of :py:class:`~ecoli.library.schema.MetadataArray`,
    a subclass of Numpy arrays that adds a ``metadata`` attribute. This attribute
    is used to store the next unique index to be assigned to a new unique molecule.
    If you wish to add a custom unique molecule type, after you have
    created a structured Numpy array with at least the above attributes,
    create a :py:class:`~ecoli.library.schema.MetadataArray` instance from it using
    ``MetadataArray(array, next_unique_index)``, where ``array`` is your structured
    Numpy array and ``next_unique_index`` is the next unique index to be assigned.


Initialization
==============
See :ref:`initialization`.

Accessing
=========
Processes use the :py:func:`ecoli.library.schema.attrs` helper function to access 
any number of attributes for all active (``_entryState`` is 1) unique molecules 
of a given type (e.g. RNA, active RNAP, etc.).


---------
Listeners
---------

:Path: ``("listeners",)``
:Updater: :py:func:`~vivarium.core.registry.update_set`
:Divider: None (leave value as is upon division)
:Serializer: None by default (leave value as is) but will automatically
    use registered serializers for data that is of a type that cannot be
    automatically serialized to JSON by ``orjson``. For example,
    listeners holding values with Unum units will be serialized by
    :py:class:`~ecoli.library.serialize.UnumSerializer`, which is
    registered in ``ecoli/__init__.py``. See :ref:`serializing_emits`.
:Schema: :py:func:`ecoli.library.schema.listener_schema`
:Helpers: None

The listeners store contains many substores that hold data which is saved
for downstream analyses. For example, the ``("listeners", "mass")`` substore
contains substores for various masses of interest, such as ``cell_mass``,
``dry_mass``, ``protein_mass``, etc.

.. tip::
    When possible, we recommend that you put all stores whose data
    you wish to save inside the listeners store. This is to ensure
    consistency across the model and helps keep stores organized.

Initialization
==============
Listener stores are initialized at the start of a simulation with the
default values specified in the
:py:meth:`~vivarium.core.process.Process.ports_schema` methods of
the processes that connect to them. Refer to
:py:func:`ecoli.library.schema.listener_schema` for information about how
to configure this default value as well as attach useful metadata to
specific listener values.

Listener stores must contain data of the same type (or data that is serialized
to the same type) for the duration of a simulation. This is becuase the Parquet
storage format (see :ref:`parquet_emitter`) is a columnar format that requires
columns to have static data types. Some leeway is allowed for ``None`` (null)
values in nested types. For example, ``[]`` and ``[0]`` both work fine for a
column containing 1D lists of integers.

.. warning::
    Double check the data type of default values for listeners. For example,
    a listener of float values with a default value of ``0`` will incorrectly
    coerce all subsequent values to integers in the saved output. The correct
    default value in this case is ``0.0``.