Output
Simulation output can come in one of two different formats and contain data from as many or as few stores as you desire.
Stores to Emit
To indicate that you want to save the data in a simulation store for later,
set the _emit key to True in the port schema for all ports connecting
to that store. By default, we always emit data for:

- Bulk molecules store located at ("bulk",): The numpy_schema() helper function
  that we use to create the schema for ports to the bulk store automatically
  sets _emit to True when the name argument is bulk.
- Listeners located at ("listeners",): The listener_schema() helper function
  that we use to create the schema for ports to stores located somewhere in the
  hierarchy under the listeners store automatically sets _emit to True.
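For any other store, the process that connects to it can set the flag itself.
Below is a minimal sketch of such a port schema; the port name, default value,
and updater are hypothetical, and the _emit flag is the only point being made:

def ports_schema(self):
    # Part of a vivarium-core Process subclass. Setting ``_emit`` to True
    # tells the emitter to save this store's value at every emit.
    return {
        "my_store": {
            "_default": 0.0,    # hypothetical default value
            "_updater": "set",  # hypothetical updater
            "_emit": True,      # include this store in simulation output
        }
    }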
Serializing Emits
Serialization is the process of converting data into a format that can be stored
or transmitted and later reconstructed into its original form. By default, both of
the available data output formats in vEcoli serialize
data by first converting the store hierarchy to save (see Stores to Emit) to JSON using
orjson, which natively serializes Python’s built-in types
as well as basic 1D Numpy arrays. For stores containing data that is not one of these types,
vivarium-core allows users to specify custom serializers either on a per-store basis using the
_serialize
schema key or for all stores of a given type using the
Serializer
API (see
vivarium-core documentation).
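As a rough sketch only (the base-class interface and registration mechanism are
defined by vivarium-core, so treat the import path, method names, and logic below
as assumptions and consult the vivarium-core documentation), a custom serializer
for a type that orjson cannot handle might look like:

from vivarium.core.registry import Serializer  # assumed import path

class SetSerializer(Serializer):
    """Hypothetical serializer that stores Python sets as sorted lists."""

    def serialize(self, value):
        # Convert to something orjson can serialize natively.
        return sorted(value)

    def can_deserialize(self, data):
        # Decide whether this serializer should handle ``data`` when
        # deserialize_value() is called (see Querying below).
        return isinstance(data, list)

    def deserialize(self, data):
        # Reconstitute the original Python type.
        return set(data)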
For details about reading data back after it has been saved, refer to Querying for the in-memory data format and DuckDB for the persistent storage format.
In-Memory Emitter
When timeseries
is specified using the emitter
option in a configuration JSON,
simulation output is stored transiently in memory in a dictionary keyed by time that
looks like the following:
{
    # Data for time = 0
    0.0: {
        # Store hierarchy as nested dictionary containing all stores and
        # sub-stores where ``_emit`` is True
        "store_1": value,
        "store_2": {
            "inner_store_1": value,
            ...
        },
        ...
    },
    # Data for time = 1
    1.0: {...},
    ...
}
This data format is mainly intended for ad-hoc analysis scripts (e.g. Jupyter notebooks) where a single-cell simulation is run and probed for model development. Importantly, the data saved by this emitter is lost when the Python program used to run the cell simulation terminates.
Querying
Data can be read from the RAM emitter by calling query() on the EcoliSim object
used to run the simulation. To deserialize data (reconstitute it after serialization),
the deserialize_value() function is called, which calls the deserialize() method of
the Serializer instance whose can_deserialize() method returns True on the data to
deserialize.
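A minimal usage sketch (assuming a configuration that sets the emitter to
timeseries; the exact EcoliSim construction calls and the listener path shown
are examples and may need adjusting for your setup):

from ecoli.experiments.ecoli_master_sim import EcoliSim

sim = EcoliSim.from_file()  # load the default configuration JSON
sim.build_ecoli()
sim.run()

data = sim.query()  # {time: {emitted store hierarchy}}, already deserialized
dry_mass = [emit["listeners"]["mass"]["dry_mass"] for emit in data.values()]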
Parquet Emitter
When parquet
is specified using the emitter
option in a configuration JSON,
simulation output is stored in a tabular file format called Parquet inside a nested
directory structure called Hive partitioning. For details on the available JSON
configuration options, see ParquetEmitter
.
Hive Partitioning
In Hive partitioning, certain keys in data are used to partition the data into folders:
key_1=value_1/key_2=value_2/...
In the vEcoli Parquet emitter, the keys used for this purpose are the experiment ID, variant index, lineage seed (initial seed for cell lineage), generation, and agent ID. These keys uniquely identify a single cell simulation, meaning each simulation process will write data to its own folder in the final output with a path like:
experiment_id={}/variant={}/lineage_seed={}/generation={}/agent_id={}
This allows workflows that run simulations with many variant simulation data objects, lineage seeds, generations, and agent IDs to all write data to the same main output folder without simulations overwriting one another.
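For example, two sibling cells in the second generation of a workflow might
write to partitions like the following (all values here are hypothetical):

out/experiment_id=exp_1/variant=0/lineage_seed=0/generation=2/agent_id=00/
out/experiment_id=exp_1/variant=0/lineage_seed=0/generation=2/agent_id=01/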
Parquet Files
Because Parquet is a tabular file format (think in terms of columns like a Pandas
DataFrame), additional serialization steps must be taken after the emit data
has been converted to JSON format in accordance with Serializing Emits.
The Parquet emitter (ParquetEmitter
)
first calls flatten_dict()
in order to
flatten the nested store hierarchy into unnested key-value pairs where keys
are paths to leaf values concatenated with double underscores and values are
leaf values. For example, take the following nested dictionary:
{
    "a": {
        "b": 1,
        "c": {
            "d": 2,
            "e": 3
        },
        "f": 4
    },
    "g": 5
}
This is flattened to:
{
    "a__b": 1,
    "a__c__d": 2,
    "a__c__e": 3,
    "a__f": 4,
    "g": 5
}
Then, np_dtype() is used to get the type of the Parquet column that will be
created for each key-value pair in the flattened dictionary, where each key is
the column name and each value is one entry in the column. Parquet files are
strongly typed, so emitted store data must always be serialized to the same type
as it was in the first time step (default or initial value). The exceptions to
this rule are columns that can contain null values or nested types containing
null values (e.g. empty list). For these columns, all values except the null
entries must be of the same type (e.g. a column of lists of integers where some
entries are empty lists).
Warning
The Parquet emitter is poorly suited for storing large listeners that have more than a single dimension per time step. We recommend splitting these listeners up if possible, especially if you plan to read specific indices along those dimensions.
The Parquet emitter saves the serialized tabular data to two Hive-partitioned
directories in the output folder (out_dir
or out_uri
option under
emitter_arg
in JSON Config Files):
- configuration: Copy of all configuration options (e.g. from JSON, CLI) that
  were used to run the simulation, as well as store-specific metadata
- history: Actual saved simulation output
configuration
Each simulation will save a single Parquet file named config.pq
inside
its corresponding Hive partition under the configuration
folder.
Many of the columns inside this Parquet file come from flattening the configuration
JSON used to run the simulation and can be read back in analysis scripts (see
Analyses) using the helper function
config_value()
.
Additionally, this file can contain metadata for each store to emit. This metadata
can be specified under the _properties
key in a port schema as follows:
{
    "_properties": {
        "metadata": "Put anything here."
    }
}
Schemas constructed with the listener_schema() helper function can populate this
metadata concisely. These metadata values are compiled for all stores in the
simulation state hierarchy by output_metadata(). In the saved configuration
Parquet file, the metadata values will be located in columns with names equal to
the double-underscore concatenated store path prefixed by output_metadata__.
For convenience, the field_metadata() function can be used in analysis scripts
to read this metadata.
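A hedged sketch of reading these values in an analysis script (the field names
are hypothetical and the argument order is an assumption; check the docstrings
in ecoli.library.parquet_emitter):

from ecoli.library.parquet_emitter import config_value, field_metadata

# Assumption: both helpers take an open DuckDB connection, the configuration
# SQL query given to analysis scripts, and the name of the field to read.
time_step = config_value(conn, config_sql, "time_step")
mrna_ids = field_metadata(conn, config_sql, "listeners__rna_counts__mRNA_counts")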
history
Each simulation will save Parquet files containing serialized simulation output data
inside its corresponding Hive partition under the history
folder. The columns in
these Parquet files come from flattening the hierarchy of emitted stores. To leverage
Parquet’s columnar compression and efficient reading, we batch many time steps’ worth
of emits into either NumPy arrays (constant dimensions) or lists of Polars Series (variable
dimensions). These batched emits are efficiently converted into a Polars DataFrame and
written to a Parquet file named {batch size * number of batches}.pq
(e.g.
400.pq
, 800.pq
, etc. for a batch size of 400). The default batch size of
400 has been tuned for our current model but can be adjusted via batch_size
under the emitter_arg
option in a configuration JSON.
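A minimal configuration sketch (paths and values are hypothetical) that selects
the Parquet emitter, sets the output folder, and sets the batch size explicitly:

{
    "emitter": "parquet",
    "emitter_arg": {
        "out_dir": "out",
        "batch_size": 400
    }
}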
DuckDB
DuckDB is the main library that we use to read and query Parquet files. It offers class-leading performance and a fairly user-friendly SQL dialect for constructing complex queries. Refer to the DuckDB documentation to learn more.
We provide a variety of helper functions in ecoli.library.parquet_emitter
to read data using DuckDB. These include:
- dataset_sql(): Construct basic SQL queries to read data from the history and
  configuration folders. This is mainly intended for ad-hoc Parquet reading
  (e.g. in a Jupyter notebook; see the sketch after this list). Analysis scripts
  (see Analyses) receive a history_sql and config_sql that read data from
  Parquet files with filters applied when run using runscripts.analysis.
- union_by_name(): Modify an SQL query from dataset_sql() to use DuckDB's
  union_by_name. This is useful when reading data from simulations with
  different columns.
- num_cells(): Quickly get a count of the number of cells whose data is
  included in an SQL query.
- skip_n_gens(): Add a filter to an SQL query to skip the first N generations'
  worth of data.
- ndlist_to_ndarray(): Convert a column of nested lists read from Parquet into
  an N-D Numpy array (use polars.Series to do the opposite conversion).
- ndidx_to_duckdb_expr(): Get a DuckDB SQL expression which can be included in
  a SELECT statement that uses Numpy-style indexing to retrieve values from a
  nested list Parquet column.
- named_idx(): Get a DuckDB SQL expression which can be included in a SELECT
  statement that extracts values at certain indices from each row of a nested
  list Parquet column and returns them as individually named columns.
- field_metadata(): Read saved store metadata (see configuration).
- config_value(): Read an option from the configuration JSON used to run the
  simulation.
- read_stacked_columns(): Main interface for reading simulation output from the
  history folder. Can either immediately read all data in specified columns
  into memory by supplying the conn argument or return a DuckDB SQL query that
  can be iteratively built upon (useful when the data is too large to read into
  memory all at once).
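For ad-hoc reading (e.g. in a Jupyter notebook), these helpers might be combined
as in the following sketch; the exact return values and argument order of
dataset_sql() and num_cells() are assumptions, so check their docstrings:

import duckdb
from ecoli.library.parquet_emitter import dataset_sql, num_cells, read_stacked_columns

conn = duckdb.connect()
# Assumption: dataset_sql takes the output folder and returns SQL for the
# history and configuration datasets.
history_sql, config_sql = dataset_sql("out")
print(num_cells(conn, config_sql))  # how many cells the query covers
# Supplying ``conn`` reads the requested columns into memory immediately.
mass_data = read_stacked_columns(
    history_sql,
    ["listeners__mass__dry_mass", "listeners__mass__cell_mass"],
    conn=conn,
)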
Warning
Column names that contain special characters (e.g. spaces, dashes, etc.) must be
enclosed in double quotes when used in DuckDB SQL queries. This is automatically
handled by most of the helper functions above with the notable exception of
read_stacked_columns()
.
Warning
Parquet lists are 1-indexed. ndidx_to_duckdb_expr()
and named_idx()
automatically add 1 to
user-supplied indices.
Construct SQL Queries
The true power of DuckDB is unlocked when SQL queries are iteratively constructed. This can be accomplished in one of two ways:
- For simpler queries, you can wrap a complete DuckDB SQL expression in
  parentheses and use it as the input table to another query. For example, to
  calculate the average cell mass and dry mass over all time steps for all
  cells accessible to an analysis script:

  SELECT avg(*) FROM (
      SELECT listeners__mass__dry_mass, listeners__mass__cell_mass
      FROM (history_sql)
  )

  history_sql can be slotted in programmatically using an f-string.
- For more advanced, multi-step queries, you can use common table expressions
  (CTEs). For example, to run the same query as above but first averaging over
  all time steps for each cell before averaging those averages over all cells:

  WITH cell_avgs AS (
      SELECT
          avg(listeners__mass__dry_mass) AS avg_dry_mass,
          avg(listeners__mass__cell_mass) AS avg_cell_mass
      FROM (history_sql)
      GROUP BY experiment_id, variant, lineage_seed, generation, agent_id
  )
  SELECT avg(*) FROM cell_avgs
Tip
DuckDB will efficiently read only the rows and columns necessary to complete your query.
However, if you are reading a column of lists (e.g. bulk molecule counts every time step)
or nested lists, DuckDB reads the entire nested value for every relevant row in that column,
even if you only care about a small subset of indices. To avoid repeatedly incurring this
cost, we recommend using named_idx()
to select all
indices of interest to be read in one go. As long as the final result fits in RAM, this
should be much faster than reading each index individually.
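Continuing in that spirit, here is a hedged sketch of selecting a few indices
from a nested list column with named_idx(), reusing conn and history_sql from
the sketch above (the argument order and index values are assumptions; see the
function docstring):

from ecoli.library.parquet_emitter import named_idx

# Hypothetical indices of two bulk molecules of interest, normally looked up
# by molecule ID from the saved metadata or sim_data.
names = ["molecule_a", "molecule_b"]
idx = [10, 100]
bulk_expr = named_idx("bulk", names, idx)  # assumed argument order
# The returned expression can be embedded in a SELECT over the history data,
# e.g. with an f-string as described above.
query = f"SELECT time, {bulk_expr} FROM ({history_sql})"
counts = conn.sql(query).pl()  # DuckDB relation -> Polars DataFrame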
See new_gene_translation_efficiency_heatmaps for examples of complex queries,
as well as helper functions to create SQL expressions for common query patterns.
Other Workflow Output
We provide helper functions in ecoli.library.parquet_emitter
to read other
workflow output.
- open_arbitrary_sim_data(): Intended for use in analysis scripts. Accepts the
  sim_data_paths dictionary given as input to analysis scripts by
  runscripts.analysis and picks a single arbitrary path in that dictionary to
  read and unpickle.
- open_output_file(): When opening any workflow output file in a Python script,
  use this function instead of the built-in open (e.g.
  with open_output_file({path}, "r") as f:). This is mainly intended to
  future-proof analysis scripts for Google Cloud support.