Output

Simulation output can be saved in one of two formats and can contain data from as many or as few stores as you desire.

Stores to Emit

To indicate that you want to save the data in a simulation store for later, set the _emit key to True in the port schema for all ports connecting to that store. By default, we always emit data for:

  • Bulk molecules store located at ("bulk",): The numpy_schema() helper function that we use to create the schema for ports to the bulk store automatically sets _emit to True when the name argument is bulk.

  • Listeners located at ("listeners",): The listener_schema() helper function that we use to create the schema for ports to stores located in the hierarchy under the listeners store automatically sets _emit to True.
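
For any other store, a process can request emission explicitly in its port schema by setting _emit alongside the usual schema keys. A minimal sketch (the store name and default value are made up for illustration):

def ports_schema(self):
    return {
        "my_listener_value": {
            "_default": 0.0,    # initial value for this store
            "_updater": "set",  # how updates are applied
            "_emit": True,      # save this store's value at every emit
        }
    }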

Serializing Emits

Serialization is the process of converting data into a format that can be stored or transmitted and later reconstructed into its original form. By default, both of the available output formats in vEcoli serialize data by first converting the store hierarchy to be saved (see Stores to Emit) to JSON using orjson, which natively serializes Python’s built-in types as well as basic 1D Numpy arrays. For stores containing data of other types, vivarium-core allows users to specify custom serializers, either on a per-store basis using the _serialize schema key or for all stores of a given type using the Serializer API (see the vivarium-core documentation).
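
For example, orjson can handle built-in types and basic 1D Numpy arrays directly (a minimal sketch with made-up data):

import numpy as np
import orjson

emit = {"time": 1.0, "bulk": np.array([10, 20, 30])}
# OPT_SERIALIZE_NUMPY lets orjson serialize basic Numpy arrays natively
serialized = orjson.dumps(emit, option=orjson.OPT_SERIALIZE_NUMPY)
# b'{"time":1.0,"bulk":[10,20,30]}'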

For details about reading data back after it has been saved, refer to Querying for the in-memory data format and DuckDB for the persistent storage format.

In-Memory Emitter

When timeseries is specified using the emitter option in a configuration JSON, simulation output is stored transiently in memory in a dictionary keyed by time that looks like the following:

{
    # Data for time = 0
    0.0: {
        # Store hierarchy as nested dictionary containing all stores and sub-stores
        # where ``_emit`` is True
        "store_1": value,
        "store_2": {
            "inner_store_1": value,
            ...
        },
        ...
    },
    # Data for time = 1
    1.0: {...},
    ...
}
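
Selecting this emitter in a configuration JSON is as simple as (all other options omitted):

{
    "emitter": "timeseries"
}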

This data format is mainly intended for ad-hoc analysis scripts (e.g. Jupyter notebooks) where a single-cell simulation is run and probed for model development. Importantly, the data saved by this emitter is lost when the Python program used to run the cell simulation terminates.

Querying

Data can be read from the RAM emitter by calling query() on the EcoliSim object used to run the simulation. To deserialize the data (reconstitute it after serialization), the deserialize_value() function is called; it dispatches to the deserialize() method of the Serializer instance whose can_deserialize() method returns True for that data.
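
A minimal sketch of this workflow, assuming the configuration selects the timeseries emitter (the setup calls shown follow the typical EcoliSim pattern and may differ slightly in your version):

from ecoli.experiments.ecoli_master_sim import EcoliSim

sim = EcoliSim.from_file()  # loads the default configuration JSON
sim.build_ecoli()           # construct the simulation
sim.run()                   # run to completion
data = sim.query()          # deserialized dict keyed by time, as shown above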

Parquet Emitter

When parquet is specified using the emitter option in a configuration JSON, simulation output is stored in a tabular file format called Parquet inside a nested directory structure that follows the Hive partitioning convention.

Hive Partitioning

In Hive partitioning, certain keys in data are used to partition the data into folders:

key_1=value_1/key_2=value_2/...

In the vEcoli Parquet emitter, the keys used for this purpose are the experiment ID, variant index, lineage seed (initial seed for cell lineage), generation, and agent ID. These keys uniquely identify a single cell simulation, meaning each simulation process will write data to its own folder in the final output with a path like:

experiment_id={}/variant={}/lineage_seed={}/generation={}/agent_id={}

This allows workflows that run simulations with many variant simulation data objects, lineage seeds, generations, and agent IDs to all write data to the same main output folder without simulations overwriting one another.

Parquet Files

Because Parquet is a tabular file format (think in terms of columns, like a Pandas DataFrame), additional serialization steps must be taken after the emitted data has been converted to JSON as described in Serializing Emits. The Parquet emitter (ParquetEmitter) first calls flatten_dict() to flatten the nested store hierarchy into un-nested key-value pairs, where each key is the path to a leaf value with its elements joined by double underscores and each value is that leaf value. For example, take the following nested dictionary:

{
    "a": {
        "b": 1,
        "c": {
            "d": 2,
            "e": 3
        },
        "f": 4
    },
    "g": 5
}

This is flattened to:

{
    "a__b": 1,
    "a__c__d": 2,
    "a__c__e": 3,
    "a__f": 4,
    "g": 5
}
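
A minimal sketch of this flattening behavior (not the actual flatten_dict() implementation):

def flatten(d, path=()):
    """Flatten nested dicts into {"a__b__c": leaf_value} pairs."""
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            out.update(flatten(value, path + (key,)))
        else:
            out["__".join(path + (key,))] = value
    return out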

Then, get_encoding() is used to get the type of the Parquet column that will be created for each key-value pair in the flattened dictionary, where each key is the column name and each value is one entry in the column. Parquet files are strongly typed, so emitted store data must always serialize to the same type as it had in the first time step (default or initial value). The exception to this rule is columns that can contain null values or nested types containing null values (e.g. empty lists). For these columns, all values except the null entries must be of the same type (e.g. a column of integer lists where some entries are empty lists).
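
This constraint mirrors PyArrow's own typing rules, illustrated here independently of the emitter code:

import pyarrow as pa

pa.array([[1, 2], [], [3]])  # OK: empty lists fit a list<int64> column
pa.array([1, None, 3])       # OK: nulls fit an int64 column
pa.array([1, "a"])           # raises ArrowInvalid: mixed types in one column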

Currently, the Parquet emitter does not support saving data from stores that are dynamically added or removed over the course of a simulation or workflow. This means that all cells in a workflow must emit a consistent store hierarchy that flattens to a dictionary with the same double-underscore concatenated keys (resulting Parquet files all have the same columns). This limitation may be relaxed in the future via an optional flag that will come with an unknown performance penalty.

The Parquet emitter saves the serialized tabular data to two Hive-partitioned directories in the output folder (out_dir or out_uri option under emitter_arg in JSON Config Files):

  • configuration: Copy of all configuration options (e.g. from JSON, CLI) that were used to run the simulation as well as store-specific metadata

  • history: Actual saved simulation output

configuration

Each simulation will save a single Parquet file named config.pq inside its corresponding Hive partition under the configuration folder. Many of the columns inside this Parquet file come from flattening the configuration JSON used to run the simulation and can be read back in analysis scripts (see Analyses) using the helper function get_config_value().

Additionally, this file can contain metadata for each store to emit. This metadata can be specified under the _properties key in a port schema as follows:

{
    "_properties": {
        "metadata": value  # put anything here
    }
}

Schemas constructed with the listener_schema() helper function can populate this metadata concisely. These metadata values are compiled for all stores in the simulation state hierarchy by get_output_metadata(). In the saved configuration Parquet file, the metadata values are located in columns whose names are the double-underscore concatenated store paths prefixed with output_metadata__. For convenience, the get_field_metadata() helper can be used in analysis scripts to read this metadata.
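
For example, a port schema like the following (store names and metadata values are hypothetical) attaches a list of molecule names to an emitted listener; listener_schema() can generate schemas of this shape for you:

{
    "listeners": {
        "rna_counts": {
            "mRNA_counts": {
                "_default": [],
                "_emit": True,
                "_properties": {"metadata": ["RNA_A", "RNA_B", "RNA_C"]},
            }
        }
    }
}

In the configuration Parquet file, this metadata would appear in the column output_metadata__listeners__rna_counts__mRNA_counts, which get_field_metadata() can read back in an analysis script.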

history

Each simulation will save Parquet files containing serialized simulation output data inside its corresponding Hive partition under the history folder. The columns in these Parquet files come from flattening the hierarchy of emitted stores. To leverage Parquet’s columnar compression and efficient reading, we batch many time steps’ worth of emits into a temporary newline-delimited JSON file before reading them into a PyArrow table in which each row contains the column values for a single time step. This PyArrow table is then written to a Parquet file named {batch size * number of batches}.pq (e.g. 400.pq, 800.pq, etc. for a batch size of 400). The default batch size of 400 has been tuned for our current model but can be adjusted via emits_to_batch under the emitter_arg option in a configuration JSON.
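
Putting the Parquet emitter options together, a configuration JSON might contain something like the following (values are illustrative):

{
    "emitter": "parquet",
    "emitter_arg": {
        "out_dir": "out",
        "emits_to_batch": 400
    }
}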

DuckDB

DuckDB is the main library that we use to read and query Parquet files. It offers class-leading performance and a fairly user-friendly SQL dialect for constructing complex queries. Refer to the DuckDB documentation to learn more.
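
For instance, DuckDB can query the Hive-partitioned output directly (a sketch with a made-up output path and experiment ID):

import duckdb

conn = duckdb.connect()
cell_mass = conn.sql("""
    SELECT time, listeners__mass__cell_mass
    FROM read_parquet('out/history/**/*.pq', hive_partitioning = true)
    WHERE experiment_id = 'my_experiment'
    ORDER BY time
""").df()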

We provide a variety of helper functions in ecoli.library.parquet_emitter to read data using DuckDB. These include:

  • get_dataset_sql(): Construct basic SQL queries to read data from the history and configuration folders. This is mainly intended for ad-hoc Parquet reading (e.g. in a Jupyter notebook). Analysis scripts (see Analyses) run via runscripts.analysis instead receive a history_sql and a config_sql that read data from the Parquet files with the appropriate filters already applied.

  • num_cells(): Quickly get a count of the number of cells whose data is included in a SQL query

  • skip_n_gens(): Add a filter to an SQL query to skip the first N generations worth of data

  • ndlist_to_ndarray(): Convert a PyArrow column of nested lists into an N-D Numpy array

  • ndarray_to_ndlist(): Convert an N-D Numpy array into a PyArrow column of nested lists

  • ndidx_to_duckdb_expr(): Get a DuckDB SQL expression which can be included in a SELECT statement that uses Numpy-style indexing to retrieve values from a nested list Parquet column

  • named_idx(): Get a DuckDB SQL expression which can be included in a SELECT statement that extracts values at certain indices from each row of a nested list Parquet column and returns them as individually named columns

  • get_field_metadata(): Read saved store metadata (see configuration)

  • get_config_value(): Read option from configuration JSON used to run simulation

  • read_stacked_columns(): Main interface for reading simulation output from the history folder. Can either immediately read all data in the specified columns into memory by supplying the conn argument or return a DuckDB SQL query that can be iteratively built upon (useful when the data is too large to read into memory all at once).

Warning

Parquet lists are 1-indexed. ndidx_to_duckdb_expr() and named_idx() automatically add 1 to user-supplied indices.

Construct SQL Queries

The true power of DuckDB is unlocked when SQL queries are iteratively constructed. This can be accomplished in one of two ways:

  • For simpler queries, you can wrap a complete DuckDB SQL expression in parentheses to use as the input table to another query. For example, to calculate the average cell mass and dry mass over all time steps for all cells accessible to an analysis script:

    SELECT avg(*) FROM (
        SELECT listeners__mass__dry_mass, listeners__mass__cell_mass FROM (
            history_sql
        )
    )
    

    history_sql can be slotted in programmatically using an f-string, as shown in the sketch after this list.

  • For more advanced, multi-step queries, you can use common table expressions (CTEs). For example, to run the same query as above, but first average over all time steps for each cell and then average those per-cell averages over all cells:

    WITH cell_avgs AS (
        SELECT avg(listeners__mass__dry_mass) AS avg_dry_mass,
            avg(listeners__mass__cell_mass) AS avg_cell_mass
        FROM (history_sql)
        GROUP BY experiment_id, variant, lineage_seed, generation, agent_id
    )
    SELECT avg(*) FROM cell_avgs
    
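In an analysis script, the history_sql placeholder in the examples above is slotted in with an f-string and the finished query handed to DuckDB, roughly as follows (a sketch; history_sql is the SQL string provided by runscripts.analysis):

import duckdb

# history_sql is provided to analysis scripts by runscripts.analysis
query = f"""
    WITH cell_avgs AS (
        SELECT avg(listeners__mass__dry_mass) AS avg_dry_mass
        FROM ({history_sql})
        GROUP BY experiment_id, variant, lineage_seed, generation, agent_id
    )
    SELECT avg(avg_dry_mass) FROM cell_avgs
"""
avg_of_avgs = duckdb.sql(query).df()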

See new_gene_translation_efficiency_heatmaps for examples of complex queries, as well as helper functions to create SQL expressions for common query patterns.

Other Workflow Output

We provide helper functions in ecoli.library.parquet_emitter to read other workflow output.

  • open_arbitrary_sim_data(): Intended for use in analysis scripts. Accepts the sim_data_paths dictionary given as input to analysis scripts by runscripts.analysis and picks a single arbitrary path in that dictionary to read and unpickle.

  • open_output_file(): When opening any workflow output file in a Python script, use this function instead of the built-in open (e.g. with open_output_file({path}, "r") as f:). This is mainly intended to future-proof analysis scripts for Google Cloud support.