Output
Simulation output can come in one of two different formats and contain data from as many or as few stores as you desire.
Stores to Emit
To indicate that you want to save the data in a simulation store for later,
set the _emit key to True in the port schema for all ports connecting
to that store. By default, we always emit data for:

- Bulk molecules store located at ("bulk",): The numpy_schema() helper function
  that we use to create the schema for ports to the bulk store automatically
  sets _emit to True when the name argument is bulk.
- Listeners located at ("listeners",): The listener_schema() helper function
  that we use to create the schema for ports to stores located somewhere in the
  hierarchy under the listeners store automatically sets _emit to True.
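For any other store, the process that connects to it can set the flag itself.
Below is a minimal sketch of such a port schema; the port name, default value,
and updater are hypothetical, and the _emit flag is the only point being made:

def ports_schema(self):
    # Part of a vivarium-core Process subclass. Setting ``_emit`` to True
    # tells the emitter to save this store's value at every emit.
    return {
        "my_store": {
            "_default": 0.0,    # hypothetical default value
            "_updater": "set",  # hypothetical updater
            "_emit": True,      # include this store in simulation output
        }
    }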
Serializing Emits
Serialization is the process of converting data into a format that can be stored
or transmitted and later reconstructed into its original form. By default, both of
the available data output formats in vEcoli serialize
data by first converting the store hierarchy to save (see Stores to Emit) to JSON using
orjson, which natively serializes Python’s built-in types
as well as basic 1D Numpy arrays. For stores containing data that is not one of these types,
vivarium-core allows users to specify custom serializers either on a per-store basis using the
_serialize
schema key or for all stores of a given type using the
Serializer
API (see
vivarium-core documentation).
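As a rough sketch only (the base-class interface and registration mechanism are
defined by vivarium-core, so treat the import path, method names, and logic below
as assumptions and consult the vivarium-core documentation), a custom serializer
for a type that orjson cannot handle might look like:

from vivarium.core.registry import Serializer  # assumed import path

class SetSerializer(Serializer):
    """Hypothetical serializer that stores Python sets as sorted lists."""

    def serialize(self, value):
        # Convert to something orjson can serialize natively.
        return sorted(value)

    def can_deserialize(self, data):
        # Decide whether this serializer should handle ``data`` when
        # deserialize_value() is called (see Querying below).
        return isinstance(data, list)

    def deserialize(self, data):
        # Reconstitute the original Python type.
        return set(data)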
For details about reading data back after it has been saved, refer to Querying for the in-memory data format and DuckDB for the persistent storage format.
In-Memory Emitter
When timeseries
is specified using the emitter
option in a configuration JSON,
simulation output is stored transiently in memory in a dictionary keyed by time that
looks like the following:
{
    # Data for time = 0
    0.0: {
        # Store hierarchy as nested dictionary containing all stores and
        # sub-stores where ``_emit`` is True
        "store_1": value,
        "store_2": {
            "inner_store_1": value,
            ...
        },
        ...
    },
    # Data for time = 1
    1.0: {...},
    ...
}
This data format is mainly intended for ad-hoc analysis scripts (e.g. Jupyter notebooks) where a single-cell simulation is run and probed for model development. Importantly, the data saved by this emitter is lost when the Python program used to run the cell simulation terminates.
Querying
Data can be read from the RAM emitter by calling query() on the EcoliSim object
used to run the simulation. To deserialize data (reconstitute it after serialization),
the deserialize_value() function is called, which calls the deserialize() method of
the Serializer instance whose can_deserialize() method returns True on the data to
deserialize.
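A minimal usage sketch (assuming a configuration that sets the emitter to
timeseries; the exact EcoliSim construction calls and the listener path shown
are examples and may need adjusting for your setup):

from ecoli.experiments.ecoli_master_sim import EcoliSim

sim = EcoliSim.from_file()  # load the default configuration JSON
sim.build_ecoli()
sim.run()

data = sim.query()  # {time: {emitted store hierarchy}}, already deserialized
dry_mass = [emit["listeners"]["mass"]["dry_mass"] for emit in data.values()]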
Parquet Emitter
When parquet
is specified using the emitter
option in a configuration JSON,
simulation output is stored in a tabular file format called Parquet inside a nested
directory structure called Hive partitioning. For details on the available JSON
configuration options, see ParquetEmitter
.
Hive Partitioning
In Hive partitioning, certain keys in data are used to partition the data into folders:
key_1=value_1/key_2=value_2/...
In the vEcoli Parquet emitter, the keys used for this purpose are the experiment ID, variant index, lineage seed (initial seed for cell lineage), generation, and agent ID. These keys uniquely identify a single cell simulation, meaning each simulation process will write data to its own folder in the final output with a path like:
experiment_id={}/variant={}/lineage_seed={}/generation={}/agent_id={}
This allows workflows that run simulations with many variant simulation data objects, lineage seeds, generations, and agent IDs to all write data to the same main output folder without simulations overwriting one another.
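For example, two sibling cells in the second generation of a workflow might
write to partitions like the following (all values here are hypothetical):

out/experiment_id=exp_1/variant=0/lineage_seed=0/generation=2/agent_id=00/
out/experiment_id=exp_1/variant=0/lineage_seed=0/generation=2/agent_id=01/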
Parquet Files
Because Parquet is a tabular file format (think in terms of columns like a Pandas
DataFrame), additional serialization steps must be taken after the emit data
has been converted to JSON format in accordance with Serializing Emits.
The Parquet emitter (ParquetEmitter
)
first calls flatten_dict()
in order to
flatten the nested store hierarchy into unnested key-value pairs where keys
are paths to leaf values concatenated with double underscores and values are
leaf values. For example, take the following nested dictionary:
{
    "a": {
        "b": 1,
        "c": {
            "d": 2,
            "e": 3
        },
        "f": 4
    },
    "g": 5
}
This is flattened to:
{
    "a__b": 1,
    "a__c__d": 2,
    "a__c__e": 3,
    "a__f": 4,
    "g": 5
}
Then, np_dtype() is used to get the type of the Parquet column that will be
created for each key-value pair in the flattened dictionary, where each key is
the column name and each value is one entry in the column. Parquet files are
strongly typed, so emitted store data must always be serialized to the same type
as it was in the first time step (default or initial value). The exceptions to
this rule are columns that can contain null values or nested types containing
null values (e.g. empty list). For these columns, all values except the null
entries must be of the same type (e.g. a column of lists of integers where some
entries are empty lists).
Warning
The Parquet emitter is poorly suited for storing large listeners that have more than a single dimension per time step. We recommend splitting these listeners up if possible, especially if you plan to read specific indices along those dimensions.
The Parquet emitter saves the serialized tabular data to two Hive-partitioned
directories in the output folder (out_dir
or out_uri
option under
emitter_arg
in JSON Config Files):
- configuration: Copy of all configuration options (e.g. from JSON, CLI) that
  were used to run the simulation, as well as store-specific metadata
- history: Actual saved simulation output
configuration
Each simulation will save a single Parquet file named config.pq
inside
its corresponding Hive partition under the configuration
folder.
Many of the columns inside this Parquet file come from flattening the configuration
JSON used to run the simulation and can be read back in analysis scripts (see
Analyses) using the helper function
config_value()
.
Additionally, this file can contain metadata for each store to emit. This metadata
can be specified under the _properties
key in a port schema as follows:
{
    "_properties": {
        "metadata": "Put anything here."
    }
}
Schemas constructed with the listener_schema() helper function can populate this
metadata concisely. These metadata values are compiled for all stores in the
simulation state hierarchy by output_metadata(). In the saved configuration
Parquet file, the metadata values will be located in columns with names equal to
the double-underscore concatenated store path prefixed by output_metadata__.
For convenience, the field_metadata() function can be used in analysis scripts
to read this metadata.
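A hedged sketch of reading these values in an analysis script (the field names
are hypothetical and the argument order is an assumption; check the docstrings
in ecoli.library.parquet_emitter):

from ecoli.library.parquet_emitter import config_value, field_metadata

# Assumption: both helpers take an open DuckDB connection, the configuration
# SQL query given to analysis scripts, and the name of the field to read.
time_step = config_value(conn, config_sql, "time_step")
mrna_ids = field_metadata(conn, config_sql, "listeners__rna_counts__mRNA_counts")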
history
Each simulation will save Parquet files containing serialized simulation output data
inside its corresponding Hive partition under the history
folder. The columns in
these Parquet files come from flattening the hierarchy of emitted stores. To leverage
Parquet’s columnar compression and efficient reading, we batch many time steps’ worth
of emits into either NumPy arrays (constant dimensions) or lists of Polars Series (variable
dimensions). These batched emits are efficiently converted into a Polars DataFrame and
written to a Parquet file named {batch size * number of batches}.pq
(e.g.
400.pq
, 800.pq
, etc. for a batch size of 400). The default batch size of
400 has been tuned for our current model but can be adjusted via batch_size
under the emitter_arg
option in a configuration JSON.
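A minimal configuration sketch (paths and values are hypothetical) that selects
the Parquet emitter, sets the output folder, and sets the batch size explicitly:

{
    "emitter": "parquet",
    "emitter_arg": {
        "out_dir": "out",
        "batch_size": 400
    }
}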
DuckDB
DuckDB is the main library that we use to read and query Parquet files. It offers class-leading performance and a fairly user-friendly SQL dialect for constructing complex queries. Refer to the DuckDB documentation to learn more.
We provide a variety of helper functions in ecoli.library.parquet_emitter
to read data using DuckDB. These include:
- dataset_sql(): Construct basic SQL queries to read data from the history and
  configuration folders. This is mainly intended for ad-hoc Parquet reading
  (e.g. in a Jupyter notebook; see the sketch after this list). Analysis scripts
  (see Analyses) receive a history_sql and config_sql that read data from
  Parquet files with filters applied when run using runscripts.analysis.
- union_by_name(): Modify an SQL query from dataset_sql() to use DuckDB's
  union_by_name. This is useful when reading data from simulations with
  different columns.
- num_cells(): Quickly get a count of the number of cells whose data is
  included in an SQL query.
- skip_n_gens(): Add a filter to an SQL query to skip the first N generations'
  worth of data.
- ndlist_to_ndarray(): Convert a column of nested lists read from Parquet into
  an N-D Numpy array (use polars.Series to do the opposite conversion).
- ndidx_to_duckdb_expr(): Get a DuckDB SQL expression which can be included in
  a SELECT statement that uses Numpy-style indexing to retrieve values from a
  nested list Parquet column.
- named_idx(): Get a DuckDB SQL expression which can be included in a SELECT
  statement that extracts values at certain indices from each row of a nested
  list Parquet column and returns them as individually named columns.
- field_metadata(): Read saved store metadata (see configuration).
- config_value(): Read an option from the configuration JSON used to run the
  simulation.
- read_stacked_columns(): Main interface for reading simulation output from the
  history folder. Can either immediately read all data in specified columns
  into memory by supplying the conn argument or return a DuckDB SQL query that
  can be iteratively built upon (useful when the data is too large to read into
  memory all at once).
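For ad-hoc reading (e.g. in a Jupyter notebook), these helpers might be combined
as in the following sketch; the exact return values and argument order of
dataset_sql() and num_cells() are assumptions, so check their docstrings:

import duckdb
from ecoli.library.parquet_emitter import dataset_sql, num_cells, read_stacked_columns

conn = duckdb.connect()
# Assumption: dataset_sql takes the output folder and returns SQL for the
# history and configuration datasets.
history_sql, config_sql = dataset_sql("out")
print(num_cells(conn, config_sql))  # how many cells the query covers
# Supplying ``conn`` reads the requested columns into memory immediately.
mass_data = read_stacked_columns(
    history_sql,
    ["listeners__mass__dry_mass", "listeners__mass__cell_mass"],
    conn=conn,
)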
Warning
Column names that contain special characters (e.g. spaces, dashes, etc.) must be
enclosed in double quotes when used in DuckDB SQL queries. This is automatically
handled by most of the helper functions above with the notable exception of
read_stacked_columns()
.
Warning
Parquet lists are 1-indexed. ndidx_to_duckdb_expr()
and named_idx()
automatically add 1 to
user-supplied indices.
Construct SQL Queries
The true power of DuckDB is unlocked when SQL queries are iteratively constructed. This can be accomplished in one of two ways:
- For simpler queries, you can wrap a complete DuckDB SQL expression in
  parentheses and use it as the input table to another query. For example, to
  calculate the average cell mass and dry mass over all time steps for all
  cells accessible to an analysis script:

  SELECT avg(*) FROM (
      SELECT listeners__mass__dry_mass, listeners__mass__cell_mass
      FROM (history_sql)
  )

  history_sql can be slotted in programmatically using an f-string.
- For more advanced, multi-step queries, you can use common table expressions
  (CTEs). For example, to run the same query as above but first averaging over
  all time steps for each cell before averaging those averages over all cells:

  WITH cell_avgs AS (
      SELECT
          avg(listeners__mass__dry_mass) AS avg_dry_mass,
          avg(listeners__mass__cell_mass) AS avg_cell_mass
      FROM (history_sql)
      GROUP BY experiment_id, variant, lineage_seed, generation, agent_id
  )
  SELECT avg(*) FROM cell_avgs
Tip
DuckDB will efficiently read only the rows and columns necessary to complete your query.
However, if you are reading a column of lists (e.g. bulk molecule counts every time step)
or nested lists, DuckDB reads the entire nested value for every relevant row in that column,
even if you only care about a small subset of indices. To avoid repeatedly incurring this
cost, we recommend using named_idx()
to select all
indices of interest to be read in one go. As long as the final result fits in RAM, this
should be much faster than reading each index individually.
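Continuing in that spirit, here is a hedged sketch of selecting a few indices
from a nested list column with named_idx(), reusing conn and history_sql from
the sketch above (the argument order and index values are assumptions; see the
function docstring):

from ecoli.library.parquet_emitter import named_idx

# Hypothetical indices of two bulk molecules of interest, normally looked up
# by molecule ID from the saved metadata or sim_data.
names = ["molecule_a", "molecule_b"]
idx = [10, 100]
bulk_expr = named_idx("bulk", names, idx)  # assumed argument order
# The returned expression can be embedded in a SELECT over the history data,
# e.g. with an f-string as described above.
query = f"SELECT time, {bulk_expr} FROM ({history_sql})"
counts = conn.sql(query).pl()  # DuckDB relation -> Polars DataFrame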
See new_gene_translation_efficiency_heatmaps for examples of complex queries,
as well as helper functions to create SQL expressions for common query patterns.
Other Workflow Output
We provide helper functions in ecoli.library.parquet_emitter
to read other
workflow output.
- open_arbitrary_sim_data(): Intended for use in analysis scripts. Accepts the
  sim_data_paths dictionary given as input to analysis scripts by
  runscripts.analysis and picks a single arbitrary path in that dictionary to
  read and unpickle.
- open_output_file(): When opening any workflow output file in a Python script,
  use this function instead of the built-in open (e.g.
  with open_output_file({path}, "r") as f:). This is mainly intended to
  future-proof analysis scripts for Google Cloud support.