Quickstart Guide

Cytomulate has an incredibly simple API for our two simulation modes:

  • Creation Mode

  • Emulation Mode

In this tutorial, we will walk you through the basics of each mode along with some easy tips and tricks that may be beneficial to you as you start cytomulating away in your everyday life.

Before we get started, let’s set a seed so that we can get reproducible results:

import numpy as np
np.random.seed(42)
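If you want to convince yourself of what the seed buys you, here is a minimal sketch using only NumPy. It shows that resetting the global seed replays the same random draws. Whether every code path inside Cytomulate draws from NumPy's global generator is an assumption here, not something this guide confirms.

```python
import numpy as np

# Seed NumPy's global generator and draw a few values.
np.random.seed(42)
first_draw = np.random.normal(size=5)

# Resetting the same seed replays exactly the same sequence.
np.random.seed(42)
second_draw = np.random.normal(size=5)

print(np.array_equal(first_draw, second_draw))  # True
```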

Now, let’s start our journey!


Creation Mode: Probabilistic Model-Based Simulation

Creation mode, as its name implies, allows Cytomulate to formulate its own recipe to create datasets. In other words, we don’t mimic any existing datasets. Rather, you can specify all the parameters and get exactly that in return (think of it as customizing your frozen yogurt instead of having a set menu item).

>>> from cytomulate import CreationCytofData

>>> cytof_data = CreationCytofData()
>>> cytof_data.initialize_cell_types()
>>> expression_matrices, labels, _, _ = cytof_data.sample(n_samples = 100)

By default, the Creation Mode simulates 1 batch of 10 cell types with 20 markers and your specified number of cells. Now, let’s look at the results we get:

>>> expression_matrices
{0: array([[0.0174032 , 0.56522606, 0.30617146, ..., 1.45473617, 0.31720944,
            0.56082489],
           [1.51403051, 0.58787876, 1.47135531, ..., 1.46453964, 1.47289462,
            0.56293968],
           [1.50258188, 0.30403334, 0.55254835, ..., 0.55964366, 0.11598993,
            0.11538903],
           ...,
           [0.02217756, 0.58209763, 0.29106511, ..., 1.43931079, 0.31428325,
            0.55701247],
           [0.52981196, 1.46558442, 0.56604271, ..., 1.46975045, 0.1134796 ,
            0.10429606],
           [1.47349014, 0.62232074, 1.46480141, ..., 1.45502123, 1.4689097 ,
            0.56239449]])}
>>> labels
{0: array([2, 0, 7, 4, 2, 6, 1, 4, 0, 8, 5, 2, 0, 2, 7, 1, 2, 5, 0, 2, 4, 0,
           0, 2, 0, 1, 7, 2, 2, 2, 3, 2, 1, 2, 5, 2, 7, 7, 2, 3, 5, 0, 9, 2,
           8, 2, 2, 2, 4, 2, 0, 2, 0, 4, 7, 7, 0, 5, 4, 2, 2, 4, 4, 7, 7, 3,
           0, 9, 6, 0, 7, 2, 9, 2, 7, 5, 0, 1, 0, 8, 1, 2, 7, 2, 4, 0, 2, 0,
           6, 2, 2, 4, 0, 5, 2, 0, 0, 2, 5, 0])}

As some of you keen-eyed readers may notice, the outputs are dictionaries: this is because Cytomulate can accommodate multiple samples, indexed by the dictionary keys. Of course, you can proceed to extract the expression matrix and then work with it in downstream analyses.
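To make the dictionary layout concrete, here is a minimal sketch of pulling out sample 0. It uses random stand-in arrays with the default shape described above (100 cells, 20 markers, 10 cell types) rather than real Cytomulate output, so it runs without the package installed. The names expression_matrix and cell_labels are ours, chosen for illustration.

```python
import numpy as np

# Stand-ins for the dictionaries returned by cytof_data.sample(n_samples=100):
# one sample (key 0) with 100 cells and 20 markers. These are random
# placeholders, not real Cytomulate output.
rng = np.random.default_rng(42)
expression_matrices = {0: rng.random((100, 20))}
labels = {0: rng.integers(0, 10, size=100)}

# Index by the sample key to get plain NumPy arrays.
expression_matrix = expression_matrices[0]
cell_labels = labels[0]

print(expression_matrix.shape)  # (100, 20): cells x markers
print(cell_labels.shape)        # (100,): one label per cell

# With several samples, loop over the keys instead.
for sample_id, matrix in expression_matrices.items():
    print(sample_id, matrix.shape, labels[sample_id].shape)
```

The same indexing applies to the real outputs, since each dictionary value is an ordinary NumPy array.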

PyCytoData Output

For those of you who are familiar with PyCytoData or want a cleaner interface to work with, Cytomulate can output a PyCytoData object.

Note

PyCytoData must be installed for this to work. Since it is an optional dependency, read our Installation Guide for further details. Once installed, PyCytoData is fully compatible with Cytomulate.

To do this, simply use the following method instead:

>>> from cytomulate import CreationCytofData

>>> cytof_data = CreationCytofData()
>>> cytof_data.initialize_cell_types()
>>> dataset = cytof_data.sample_to_pycytodata(n_samples = 100)

Now, let’s look at the object and access the expression matrix and labels:

>>> type(dataset)
>>> dataset.expression_matrix
array([[0.03390264, 0.03323944, 0.79319831, ..., 0.00431289, 1.56704157,
        0.11522665],
       [0.00213084, 0.32081423, 0.04375508, ..., 0.03236736, 1.59080603,
        0.03952084],
       [0.03625106, 0.03741527, 0.80485429, ..., 0.00405477, 1.5738564 ,
        0.07236162],
       ...,
       [0.78996499, 0.80564232, 0.03399493, ..., 0.06597879, 0.03527863,
        0.31189172],
       [0.        , 0.32236194, 0.05363561, ..., 0.02794309, 1.58739998,
        0.0293298 ],
       [0.03699177, 0.04021649, 0.80394265, ..., 0.00309843, 1.57274021,
        0.07097411]])
>>> dataset.cell_types
array([1, 7, 1, 9, 4, 7, 7, 6, 9, 1, 6, 1, 6, 4, 3, 4, 1, 1, 1, 1, 9, 9,
       4, 6, 0, 4, 1, 7, 1, 4, 4, 4, 3, 1, 1, 3, 7, 3, 3, 1, 1, 5, 4, 3,
       1, 1, 4, 6, 1, 1, 1, 1, 1, 9, 6, 6, 1, 3, 1, 1, 4, 3, 1, 4, 1, 4,
       7, 1, 7, 1, 6, 6, 3, 9, 6, 1, 6, 3, 6, 9, 4, 1, 6, 6, 1, 9, 6, 6,
       4, 1, 4, 6, 4, 4, 4, 6, 3, 3, 7, 1])
>>> dataset.sample_index
array(['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0'], dtype='<U1')

As you can see, PyCytoData uses simple arrays instead of dictionaries because it has its own capabilities for managing batches and samples. Of course, the details of this package are beyond the scope of this project, but to find out more about PyCytoData, you can read the detailed documentation written by its lovely devs here.

As always, you can use the results for downstream analyses.
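As a small taste of what downstream work on these flat arrays can look like, here is a hedged sketch. It uses plain NumPy placeholders laid out like dataset.expression_matrix, dataset.cell_types, and dataset.sample_index above, not an actual PyCytoData object, and the variable names are illustrative only.

```python
import numpy as np

# Placeholder arrays with the same layout as the attributes shown above
# (100 cells, 20 markers, one sample labelled '0'). Not real output.
rng = np.random.default_rng(0)
expression_matrix = rng.random((100, 20))
cell_types = rng.integers(0, 10, size=100)
sample_index = np.array(["0"] * 100)

# Boolean masks select cells by sample and by cell type.
in_sample_0 = sample_index == "0"
is_type_1 = cell_types == 1
subset = expression_matrix[in_sample_0 & is_type_1]
print(subset.shape)

# Count how many cells each cell type contributes.
types, counts = np.unique(cell_types, return_counts=True)
for cell_type, count in zip(types, counts):
    print(cell_type, count)
```

Because everything lives in aligned arrays, a single mask can filter the expression matrix, cell types, and sample index in the same way.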


Emulation Mode: Real Data-Based Simulation

If you already have the expression matrices and datasets of your dreams but you still want to experience the glory of Cytomulate, let us introduce Emulation Mode. In this mode, Cytomulate uses an existing dataset as a basis for generating new expressions. The key advantage of this mode is that it can quickly replicate existing data without the need for resampling.

To use this mode, we require prior information on cell types, which will ensure the best approximation. To do this, let’s use PyCytoData again! First, let’s load our existing datasets:

>>> from PyCytoData import DataLoader
>>> data = DataLoader.load_dataset(dataset="Levine13")
>>> data.expression_matrix
array([[ 1.24334908e+02,  6.28371582e+01, -6.17444396e-01, ...,
         1.06896072e+02,  6.39934635e+00,  7.14621687e+00],
       [ 1.22633148e+02,  5.52684593e+01, -3.17519844e-01, ...,
         1.27218781e+02, -3.17452759e-01,  1.12626851e+00],
       [ 3.30561943e+01,  1.73848724e+01, -7.71313131e-01, ...,
         3.32087189e+02, -2.46072114e-01,  8.84189606e-01],
       ...,
       [ 3.49014664e+01,  4.32544184e+00,  8.33491230e+00, ...,
         2.79086884e+02,  1.60285759e+01,  3.90819855e+01],
       [ 1.70956116e+01,  9.30270076e-01, -1.08385071e-01, ...,
         3.84983948e+02,  4.54559469e+00,  9.67729034e+01],
       [ 1.04753265e+01, -7.23805502e-02, -5.91436803e-01, ...,
         5.08439331e+02,  2.38833976e+00,  1.06308832e+01]])
>>> data.cell_types
array(['Plasmacytoid DC', 'Plasmacytoid DC', 'Plasmacytoid DC', ...,
       'MEP', 'MEP', 'MEP'], dtype='<U17')

For those of you who are familiar with the Levine13 dataset, this will feel right at home! For others, this is a well-known benchmark dataset.

Now, to start cytomulating, the overall interface is very similar but with a different class:

>>> from cytomulate import EmulationCytofData

>>> cytof_data = EmulationCytofData()
>>> cytof_data.initialize_cell_types(expression_matrix=data.expression_matrix,
                                     labels=data.cell_types)
>>> expression_matrices, labels, _, _ = cytof_data.sample(n_samples = 100)

Now, let’s look at our outputs:

>>> expression_matrices
{0: array([[9.89849695e+01, 9.49059169e+00, 8.13185395e-01, ...,
            3.45029702e+01, 3.18472044e-01, 5.29247895e+02],
           [2.15393928e+02, 2.98800278e+01, 1.59270843e+00, ...,
            1.67047977e+02, 5.34828676e+00, 1.03305364e+02],
           [3.82536701e+02, 2.91190531e+02, 1.05922645e+02, ...,
            0.00000000e+00, 0.00000000e+00, 4.10342314e+01],
           ...,
           [2.21592856e+02, 1.08856275e+00, 6.48076690e-01, ...,
            0.00000000e+00, 0.00000000e+00, 3.48087446e+02],
           [1.72226786e+01, 0.00000000e+00, 7.60774586e+00, ...,
            0.00000000e+00, 4.61570855e+00, 1.45485442e+02],
           [0.00000000e+00, 1.21741897e+01, 2.83614833e+00, ...,
            5.37901263e+00, 6.80561586e+00, 3.41060042e+01]])}
>>> labels
{0: array(['Mature CD4+ T', 'NotGated', 'NotGated', 'Mature CD38lo B',
           'NotGated', 'NotGated', 'NotGated', 'NotGated', 'Naive CD4+ T',
           'NotGated', 'NotGated', 'CD11bhi Monocyte', 'CD11bmid Monocyte',
           'NotGated', 'Mature CD4+ T', 'Naive CD8+ T', 'CD11b- Monocyte',
           'NotGated', 'NotGated', 'Mature CD4+ T', 'NotGated',
           'Mature CD4+ T', 'Megakaryocyte', 'NotGated', 'NotGated',
           'NotGated', 'NotGated', 'Mature CD4+ T', 'NotGated', 'NotGated',
           'NotGated', 'NotGated', 'Megakaryocyte', 'NotGated',
           'Mature CD8+ T', 'NotGated', 'Mature CD8+ T', 'Mature CD4+ T',
           'NotGated', 'NotGated', 'Naive CD4+ T', 'NotGated',
           'CD11bhi Monocyte', 'NotGated', 'NotGated', 'NotGated', 'NotGated',
           'Megakaryocyte', 'NotGated', 'NK', 'NotGated', 'CD11bhi Monocyte',
           'Naive CD8+ T', 'Naive CD8+ T', 'NotGated', 'NotGated',
           'Mature CD4+ T', 'Naive CD8+ T', 'NotGated', 'NotGated',
           'Mature CD8+ T', 'NotGated', 'Mature CD38lo B', 'NotGated', 'NK',
           'NotGated', 'Mature CD8+ T', 'NotGated', 'NotGated',
           'Mature CD8+ T', 'CD11bhi Monocyte', 'NotGated', 'NotGated',
           'Mature CD8+ T', 'NotGated', 'HSC', 'Erythroblast', 'NotGated',
           'Mature CD8+ T', 'NotGated', 'NotGated', 'NotGated', 'NotGated',
           'Erythroblast', 'Mature CD8+ T', 'Mature CD4+ T', 'Megakaryocyte',
           'Mature CD8+ T', 'NotGated', 'NotGated', 'NotGated',
           'Megakaryocyte', 'NotGated', 'NotGated', 'Naive CD4+ T',
           'NotGated', 'NotGated', 'Mature CD4+ T', 'NotGated',
           'Erythroblast'], dtype='<U17')}
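One quick check you might run at this point is to see how the cell type mix of the emulated sample compares with the reference labels. The sketch below is only an illustration: it uses short placeholder label arrays instead of the real Levine13 and Cytomulate outputs, and cell_type_proportions is a hypothetical helper of ours, not part of either package.

```python
import numpy as np

def cell_type_proportions(labels):
    """Return a dict mapping each cell type to its fraction of all cells."""
    types, counts = np.unique(labels, return_counts=True)
    total = counts.sum()
    return {str(t): float(c) / float(total) for t, c in zip(types, counts)}

# Placeholder labels standing in for data.cell_types and labels[0].
reference_labels = np.array(["NK"] * 20 + ["HSC"] * 30 + ["NotGated"] * 50)
emulated_labels = np.array(["NK"] * 2 + ["HSC"] * 3 + ["NotGated"] * 5)

reference = cell_type_proportions(reference_labels)
emulated = cell_type_proportions(emulated_labels)

# Print reference and emulated fractions side by side for each cell type.
for cell_type in sorted(reference):
    print(cell_type, reference[cell_type], emulated.get(cell_type, 0.0))
```

With real data, you would pass data.cell_types and labels[0] to the helper instead of the placeholders.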

PyCytoData Output

If you have fallen in love with PyCytoData, good news: Emulation Mode is compatible with PyCytoData output as well! The procedure is exactly the same as in Creation Mode:

>>> from cytomulate import EmulationCytofData

>>> cytof_data = EmulationCytofData()
>>> cytof_data.initialize_cell_types(expression_matrix=data.expression_matrix,
                                     labels=data.cell_types)
>>> dataset = cytof_data.sample_to_pycytodata(n_samples = 100)

It’s as simple as this! The rest is the same as the Creation Mode!

Congratulations!! You’ve officially made it through the Quickstart Guide! You’re on track to become a Cytomulate expert! Now, you can read more about settings and complex simulation situations in this tutorial.