Output
Simulation output can come in one of two formats and can contain data from as many or as few stores as you desire.
Stores to Emit
To indicate that you want to save the data in a simulation store for later,
set the _emit key to True in the port schema for all ports connecting
to that store. By default, we always emit data for:

- Bulk molecules store located at ("bulk",): The numpy_schema() helper function that we use to create the schema for ports to the bulk store automatically sets _emit to True when the name argument is bulk.
- Listeners located at ("listeners",): The listener_schema() helper function that we use to create the schema for ports to stores located somewhere in the hierarchy under the listeners store automatically sets _emit to True.
Serializing Emits
Serialization is the process of converting data into a format that can be stored
or transmitted, then later reconstructed into its original format. By default, both of
the available data output formats in vEcoli serialize
data by first converting the store hierarchy to save (see Stores to Emit) to JSON using
orjson, which natively serializes Python’s built-in types
as well as basic 1D Numpy arrays. For stores containing data that is not one of these types,
vivarium-core allows users to specify custom serializers either on a per-store basis using the
_serialize
schema key or for all stores of a given type using the
Serializer
API (see
vivarium-core documentation).
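As a rough illustration of this first step (not the exact code path in vEcoli), orjson can serialize built-in types and basic Numpy arrays directly:

import numpy as np
import orjson

# Emit-style data containing built-in types and a 1D Numpy array
emit = {"listeners": {"mass": {"cell_mass": 1234.5}}, "bulk": np.array([1, 2, 3])}

# OPT_SERIALIZE_NUMPY lets orjson handle the Numpy array natively
serialized = orjson.dumps(emit, option=orjson.OPT_SERIALIZE_NUMPY)
restored = orjson.loads(serialized)  # the Numpy array comes back as a plain list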
For details about reading data back after it has been saved, refer to Querying for the in-memory data format and DuckDB for the persistent storage format.
In-Memory Emitter
When timeseries
is specified using the emitter
option in a configuration JSON,
simulation output is stored transiently in-memory in a dictionary keyed by time that
looks like the following:
{
# Data for time = 0
0.0: {
# Store hierarchy as nested dictionary containing all stores and sub-stores
# where ``_emit`` is True
"store_1": value,
"store_2": {
"inner_store_1": value,
...
},
...
},
# Data for time = 1
1.0: {...},
...
}
This data format is mainly intended for ad-hoc analysis scripts (e.g. Jupyter notebooks) where a single-cell simulation is run and probed for model development. Importantly, the data saved by this emitter is lost when the Python program used to run the cell simulation terminates.
Querying
Data can be read from the RAM emitter by calling
query()
on the EcoliSim
object used to run
the simulation. To deserialize data (reconstitute it after serialization),
the deserialize_value()
function is called, which
calls the deserialize()
method
of the Serializer
instance whose
can_deserialize()
method returns
True on the data to deserialize.
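For example, an ad-hoc single-cell run and query might look like the following sketch. query() is described above; the from_file() and build_ecoli() calls are assumptions about the EcoliSim interface, and the configuration must select the timeseries emitter:

# Sketch only: from_file() and build_ecoli() are assumed EcoliSim methods
from ecoli.experiments.ecoli_master_sim import EcoliSim

sim = EcoliSim.from_file()  # load the default configuration JSON
sim.build_ecoli()           # construct the simulation
sim.run()                   # run with "emitter": "timeseries"
data = sim.query()          # {0.0: {...}, 1.0: {...}, ...}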
Parquet Emitter
When parquet
is specified using the emitter
option in a configuration JSON,
simulation output is stored in a tabular file format called Parquet inside a nested
directory structure that follows the Hive partitioning scheme.
Hive Partitioning
In Hive partitioning, certain keys in data are used to partition the data into folders:
key_1=value_1/key_2=value_2/...
In the vEcoli Parquet emitter, the keys used for this purpose are the experiment ID, variant index, lineage seed (initial seed for cell lineage), generation, and agent ID. These keys uniquely identify a single cell simulation, meaning each simulation process will write data to its own folder in the final output with a path like:
experiment_id={}/variant={}/lineage_seed={}/generation={}/agent_id={}
This allows workflows that run simulations with many variant simulation data objects, lineage seeds, generations, and agent IDs to all write data to the same main output folder without simulations overwriting one another.
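As a concrete illustration (all values below are made up), the partition path for a single cell is assembled from its identifying keys:

# Illustration only: how one cell's Hive partition path is formed
keys = {
    "experiment_id": "exp_1",
    "variant": 0,
    "lineage_seed": 7,
    "generation": 2,
    "agent_id": "0101",
}
partition = "/".join(f"{key}={value}" for key, value in keys.items())
# -> "experiment_id=exp_1/variant=0/lineage_seed=7/generation=2/agent_id=0101"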
Parquet Files
Because Parquet is a tabular file format (think in terms of columns like a Pandas
DataFrame), additional serialization steps must be taken after the emit data
has been converted to JSON format in accordance with Serializing Emits.
The Parquet emitter (ParquetEmitter
)
first calls flatten_dict()
in order to
flatten the nested store hierarchy into unnested key-value pairs where keys
are paths to leaf values concatenated with double underscores and values are
leaf values. For example, take the following nested dictionary:
{
"a": {
"b": 1,
"c": {
"d": 2,
"e": 3
},
"f": 4
},
"g": 5
}
This is flattened to:
{
"a__b": 1,
"a__c__d": 2,
"a__c__e": 3,
"a__f": 4,
"g": 5
}
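A minimal sketch of this flattening step (the real flatten_dict() may differ in its details) is:

# Sketch of double-underscore flattening, not the actual flatten_dict()
def flatten(d, prefix=""):
    flat = {}
    for key, value in d.items():
        full_key = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

assert flatten({"a": {"b": 1, "c": {"d": 2, "e": 3}, "f": 4}, "g": 5}) == {
    "a__b": 1, "a__c__d": 2, "a__c__e": 3, "a__f": 4, "g": 5}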
Then, get_encoding() is used to determine the type of the Parquet column that
will be created for each key-value pair in the flattened dictionary, where each
key is the column name and each value is one entry in the column. Parquet files
are strongly typed, so emitted store data must always serialize to the same type
as in the first time step (the default or initial value). The exceptions to this
rule are columns that can contain null values or nested types containing null
values (e.g. empty lists). For these columns, all values except the null entries
must be of the same type (e.g. a column of integer lists where some entries are
empty lists).
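The typing rule can be illustrated directly with PyArrow (this is not the emitter's internal code):

import pyarrow as pa

# A column of integer lists where some entries are empty is still one type
ok = pa.array([[1, 2], [], [3]])  # inferred type: list<int64>

# Mixing element types across entries would fail when building the column
# pa.array([1, "a"])              # raises pyarrow.lib.ArrowInvalid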
Currently, the Parquet emitter does not support saving data from stores that are dynamically added or removed over the course of a simulation or workflow. This means that all cells in a workflow must emit a consistent store hierarchy that flattens to a dictionary with the same double-underscore concatenated keys (resulting Parquet files all have the same columns). This limitation may be relaxed in the future via an optional flag that will come with an unknown performance penalty.
The Parquet emitter saves the serialized tabular data to two Hive-partitioned
directories in the output folder (out_dir
or out_uri
option under
emitter_arg
in JSON Config Files):
- configuration: Copy of all configuration options (e.g. from JSON, CLI) that were used to run the simulation, as well as store-specific metadata
- history: Actual saved simulation output
configuration
Each simulation will save a single Parquet file named config.pq
inside
its corresponding Hive partition under the configuration
folder.
Many of the columns inside this Parquet file come from flattening the configuration
JSON used to run the simulation and can be read back in analysis scripts (see
Analyses) using the helper function
get_config_value()
.
Additionally, this file can contain metadata for each store to emit. This metadata
can be specified under the _properties
key in a port schema as follows:
{
"_properties": {
"metadata": Put anything here.
}
}
Schemas constructed with the listener_schema()
helper
function can populate this metadata concisely. These metadata values are compiled for
all stores in the simulation state hierarchy by
get_output_metadata()
. In the
saved configuration Parquet file, the metadata values will be located in
columns with names equal to the double-underscore concatenated store path
prefixed by output_metadata__
. For convenience, the
get_field_metadata()
helper function can be used in
analysis scripts to read this metadata.
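For instance, an analysis script might read metadata for a listener column roughly as follows. This is only a sketch: the argument order of the helper is an assumption, and the column name is just an example, so check the docstrings in ecoli.library.parquet_emitter:

from ecoli.library.parquet_emitter import get_field_metadata

def cell_mass_metadata(conn, config_sql):
    # conn and config_sql are provided to analysis scripts by
    # runscripts.analysis; the argument order here is an assumption
    return get_field_metadata(conn, config_sql, "listeners__mass__cell_mass")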
history
Each simulation will save Parquet files containing serialized simulation output data
inside its corresponding Hive partition under the history
folder. The columns in
these Parquet files come from flattening the hierarchy of emitted stores. To leverage
Parquet’s columnar compression and efficient reading, we batch many time steps worth
of emits into a temporary newline-delimited JSON file before reading them into a
PyArrow table where each row
contains the column values for a single time step. This PyArrow table is then
written to a Parquet file named {batch size * number of batches}.pq
(e.g.
400.pq
, 800.pq
, etc. for a batch size of 400). The default batch size of
400 has been tuned for our current model but can be adjusted via emits_to_batch
under the emitter_arg
option in a configuration JSON.
DuckDB
DuckDB is the main library that we use to read and query Parquet files. It offers class-leading performance and a fairly user-friendly SQL dialect for constructing complex queries. Refer to the DuckDB documentation to learn more.
We provide a variety of helper functions in ecoli.library.parquet_emitter
to read data using DuckDB. These include:
- get_dataset_sql(): Construct basic SQL queries to read data from the history and configuration folders. This is mainly intended for ad-hoc Parquet reading (e.g. in a Jupyter notebook). Analysis scripts (see Analyses) receive a history_sql and config_sql that read data from Parquet files with filters applied when run using runscripts.analysis.
- num_cells(): Quickly get a count of the number of cells whose data is included in a SQL query
- skip_n_gens(): Add a filter to an SQL query to skip the first N generations worth of data
- ndlist_to_ndarray(): Convert a PyArrow column of nested lists into an N-D Numpy array
- ndarray_to_ndlist(): Convert an N-D Numpy array into a PyArrow column of nested lists
- ndidx_to_duckdb_expr(): Get a DuckDB SQL expression which can be included in a SELECT statement and uses Numpy-style indexing to retrieve values from a nested list Parquet column
- named_idx(): Get a DuckDB SQL expression which can be included in a SELECT statement and extracts values at certain indices from each row of a nested list Parquet column, returning them as individually named columns
- get_field_metadata(): Read saved store metadata (see configuration)
- get_config_value(): Read an option from the configuration JSON used to run the simulation
- read_stacked_columns(): Main interface for reading simulation output from the history folder. Can either immediately read all data in the specified columns into memory by supplying the conn argument or return a DuckDB SQL query that can be iteratively built upon (useful when data is too large to read into memory all at once).
Warning
Parquet lists are 1-indexed. ndidx_to_duckdb_expr()
and named_idx()
automatically add 1 to
user-supplied indices.
Construct SQL Queries
The true power of DuckDB is unlocked when SQL queries are iteratively constructed. This can be accomplished in one of two ways:
- For simpler queries, you can wrap a complete DuckDB SQL expression in parentheses to use as the input table to another query. For example, to calculate the average cell and dry mass over all time steps for all cells accessible to an analysis script:

  SELECT avg(*) FROM (
      SELECT listeners__mass__dry_mass, listeners__mass__cell_mass
      FROM (history_sql)
  )
  history_sql can be slotted in programmatically using an f-string (see the Python sketch after this list).

- For more advanced, multi-step queries, you can use common table expressions (CTEs). For example, to run the same query as above but first averaging over all time steps for each cell before averaging those per-cell averages over all cells:

  WITH cell_avgs AS (
      SELECT avg(listeners__mass__dry_mass) AS avg_dry_mass,
             avg(listeners__mass__cell_mass) AS avg_cell_mass
      FROM (history_sql)
      GROUP BY experiment_id, variant, lineage_seed, generation, agent_id
  )
  SELECT avg(*) FROM cell_avgs
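In Python, such queries are typically assembled with f-strings and executed through DuckDB, roughly like the following sketch (the column name is just an example):

import duckdb

def average_dry_mass(history_sql: str) -> float:
    # history_sql is the SQL string handed to analysis scripts by
    # runscripts.analysis (or built with get_dataset_sql() for ad-hoc use)
    query = f"""
        SELECT avg(listeners__mass__dry_mass) AS avg_dry_mass
        FROM ({history_sql})
    """
    return duckdb.sql(query).fetchone()[0]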
See new_gene_translation_efficiency_heatmaps for examples of complex queries, as well as helper functions that create SQL expressions for common query patterns.
Other Workflow Output
We provide helper functions in ecoli.library.parquet_emitter
to read other
workflow output.
- open_arbitrary_sim_data(): Intended for use in analysis scripts. Accepts the sim_data_paths dictionary given as input to analysis scripts by runscripts.analysis and picks a single arbitrary path in that dictionary to read and unpickle.
- open_output_file(): When opening any workflow output file in a Python script, use this function instead of the built-in open (e.g. with open_output_file({path}, "r") as f:). This is mainly intended to future-proof analysis scripts for Google Cloud support.
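For example (the path below is a placeholder):

from ecoli.library.parquet_emitter import open_output_file

# Works for local paths today and is intended to extend to cloud storage
with open_output_file("out/analysis_output.json", "r") as f:
    contents = f.read()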