# Emulation Mode

Sometimes when we already have data on hand and we can’t help but wanting more datasets, then this is the
time to Cytomulate with the **Emulation Mode**. As introduced in the Quickstart Guide,
the Emulation Mode allows users to input their own datasets and then simulate data based on the given
dataset. This is useful in a number of scenarios:

More datasets are needed to validate a method.

The current dataset has only 1 batch, but more than 1 is desired.

Biological interpretation with real protein channels and cell types are desired.

Analyses call for a basis on real data.

Of course, there are many more reasons to pick Emulation Mode over Creation Mode. The only prerequisite is that users have to have their own dataset already. If not, they will have to find a dataset online or actually perform some experiments in the lab. But for the sake of this tutorial, we will assume that everyone has a dataset available to be emulated.

## Setting Things Up

Before we start, let’s make some necessary imports:

```
>>> from cytomulate import EmulationCytofData
>>> from PyCytoData import DataLoader, FileIO, PyCytoData
```

Throughout the rest of the tutorial, we will use these functions directly.

Note

`PyCytoData`

is optional for the demonstration here, and it is not a required dependency
for `Cytomulate`

unless the CLI is used. We include examples with `PyCytoData`

for
its convenience in Python.

## Load Datasets

First things first, we need to load our existing dataset into Python. For this tutorial, we would
like to assume that the data are stored in plain text (*e.g.* txt, csv, etc.) rather than the `fcs`

format. We understand that the latter is popular in the field, but we do not yet have a good tutorial
on this. We would defer to the documentation of tools such as fcspasrser
or flowCore. If you do have
`fcs`

files, you can use these tools to load the files, extract the expression matrices and cell
types, and then save the results as plain text files.

In this section, we will showcase a few methods of loading the datasets. For all methods, we will
save the expression matrix as `dataset`

whereas the cell types as `cell_types`

, both of which
are NumPy arrays. Further, we generally assume that you don’t have instrument channels, such as
“Time” and “Bead”, in your dataset! If you do, you may need some preprocssing before passing them to
Cytomulate.

### The PyCytoData Method

We highly recommend the usage of PyCytoData to load the data into Python because it is quite easy to do! If you have the data on hand, you can quickly load the expression matrix and cell types with the following snippet of code:

```
>>> dataset = FileIO.load_delim("<Path_to_Expressions>", skiprows = 1, delim = "\t", dtype = float)
>>> cell_types = FileIO.load_delim("<Path_to_Cell_Types>", skiprows = 0, delim = "\t", dtype = str)
```

You will replace the the paths with the actual paths to your files. Note that in the above example, we assume that the expression matrix has the first row as protein marker names, which will be skipped. We further assume that the file is saved as a tab-delimited file. You can customize the process to your own files.

If you would like to use the `PyCytoData`

object, you can use the following example:

```
>>> data = FileIO.load_expression("<Path_to_Expressions>", col_names = True, delim = "\t", dtype = float)
>>> cell_types = FileIO.load_delim("<Path_to_Cell_Types>", skiprows = 0, delim = "\t", dtype = str)
>>> data.cell_types = cell_types
```

This has the advantage of storing the channel names in the object via the `data.channels`

attribute. Now,
the only difference is that you need to use `data.expression_matrix`

to access your expression
matrix.

PyCytoData has many other features and details which are all helpful in some way, but we cannot rewrite the documentation here. You can read go ahead and read the documentation if interested.

### The numpy Method

The PyCytoData’s IO methods are actually built on top of the NumPy package. If you wish to use the NumPy directly, it is also easy to do so:

```
>>> import numpy as np
>>> dataset = np.loadtxt("<Path_to_Expressions>", skiprows = 1, delim = "\t", dtype = float)
>>> cell_types = np.loadtxt("<Path_to_Cell_Types>", skiprows = 0, delim = "\t", dtype = str)
```

which has a similar interface to the PyCytoData interface. This is arguably a simpler interface. However, NumPy does not have the benefits of a PyCytoData object, which you may or may not need.

### The pandas Method

Of course, if you would like to read data using Pandas, you can do so with ease as well. We are going to read the data and extract the necessary components to be used by Cytomulate. Here, we assume a few things for demonstration purposes:

Everything is in one file, including a column with cell types.

The cell type column is named “Cell_Types”.

Column names are available.

```
>>> import pandas as pd
>>> df = pd.read_csv("<Path_to_Expressions>", sep = "\t", header = 0)
>>> cell_types = df["Cell_Types"].to_numpy()
>>> dataset = df.drop("Cell_Types", axis = 1).to_numpy()
```

As with all other methods, there are tons of variations based on the data available! You likely need some daat wrangling, but the principles are simple!

## Simple Simulation

Once you have your datasets, you can run your simplest simulation:

```
>>> cytof_data = EmulationCytofData(n_batches = 1)
>>> cytof_data.initialize_cell_types(expression_matrix = dataset,
labels = cell_types,
max_components = 9,
min_components = 1,
covariance_types = ("full", "tied", "diag", "spherical"))
>>> expression_matrices, labels, _, _ = cytof_data.sample(n_samples = 100)
```

As opposed to the Quickstart Guide, here
we are more explicit about the parameters that users can supply or change. Note that all above
listed parameters are defaults, except for the `expression_matrix`

and `labels`

that need to be
user supplied. In the above function calls, we are doing the following:

Initialize a

`EmulationCytofData`

object with 1 batch.Ask the object to emulate the given

`dataset`

with the given`cell_types`

. The GMM model is allowed to use between 1 and 9 components with one of the four covariance types.We sample 100 cells in total for the dataset.

We would like to note that `n_samples`

indicates the sample size, not the number of batches. In this case, we have
the following outputs:

```
>>> expression_matrices
{0: array([[158.04318784, 143.54615373, 14.55784182, ..., 28.23468037,
2.053792 , 0. ],
[1.00986273, 6.6360988 , 0. , ..., 6.54024249,
1.7951401 , 19.13473223],
[4.30063734, 5.7733213 , 0. , ..., 23.30509231,
8.80240499, 21.81913369],
...,
[191.03235818, 18.41559666, 0. , ..., 333.3101886 ,
0. , 215.16717053],
[310.09391901, 123.0910225 , 0. , ..., 0. ,
0. , 164.39680483],
[382.54861993, 8.39272611, 3.83182708, ..., 48.24234018,
3.20327527, 184.68550649]])}
>>> labels
{0: array(['Mature CD38lo B', 'Erythroblast', 'Erythroblast',
'CD11bhi Monocyte', 'Mature CD38lo B', 'NotGated', 'NotGated',
'CD11b- Monocyte', 'NotGated', 'NotGated', 'NotGated', 'NotGated',
'NotGated', 'NotGated', 'NotGated', 'NotGated', 'Mature CD4+ T',
'NotGated', 'NotGated', 'NotGated', 'NotGated', 'Mature CD8+ T',
'NotGated', 'Mature CD4+ T', 'NotGated', 'Mature CD4+ T',
'NotGated', 'NotGated', 'NotGated', 'NotGated', 'Mature CD4+ T',
'NotGated', 'NotGated', 'NotGated', 'NotGated', 'Erythroblast',
'NotGated', 'NotGated', 'CD11bhi Monocyte', 'NotGated', 'NotGated',
'CD11bhi Monocyte', 'CD11bmid Monocyte', 'NotGated',
'Naive CD4+ T', 'Erythroblast', 'Plasmacytoid DC', 'Naive CD8+ T',
'NotGated', 'Erythroblast', 'NotGated', 'CD11bhi Monocyte',
'Megakaryocyte', 'NotGated', 'Mature CD4+ T', 'NotGated',
'Mature CD4+ T', 'Mature CD4+ T', 'NotGated', 'NK', 'NotGated',
'Naive CD8+ T', 'NotGated', 'NotGated', 'Mature CD8+ T',
'NotGated', 'NK', 'NotGated', 'Mature CD8+ T', 'NotGated',
'NotGated', 'NotGated', 'Mature CD4+ T', 'Mature CD38lo B',
'CD11bhi Monocyte', 'NotGated', 'Mature CD38lo B', 'Naive CD8+ T',
'Mature CD4+ T', 'NotGated', 'NotGated', 'Erythroblast',
'NotGated', 'NotGated', 'NotGated', 'NotGated', 'Mature CD4+ T',
'Mature CD4+ T', 'Erythroblast', 'Mature CD38lo B', 'Erythroblast',
'NotGated', 'NotGated', 'Naive CD8+ T', 'NotGated',
'Mature CD8+ T', 'NotGated', 'NotGated', 'Naive CD8+ T',
'Mature CD8+ T'], dtype='<U17')}
```

Notice that each of the output is a dictionary of length 1 because we are only simulating 1 batch. We can
see that inside of the dictionary, each is an array. You can extract these arrays or use the
`expression_matrices`

and `labels`

directly for downstream analyses.

Another thing to note is that the `cytof_data.sample(n_samples = 100)`

call has 4 returns in a Tuple,
and the example above simply used Tuple unpacking. As the names suggest, the first is expression matrices,
and the second is the labels. In this case, we are not using the third and fourth returns, which are reserved
for pseudo-time and cell hierarchy in the cellular trajectory simulation from the
Complex Simulation.

## GMM and Model Selection

Cytomulate supports numerous model configurations for each dataset. Since the overall model is based on the Gaussian Mixture Model (GMM), there are two major tuning parameters that are needed:

The number of components

The covariance matrix structure

As you can see from the example above, you can specify these manually using the simple simulation or go with the defaults. If you have a specific configuration in mind (e.g. 3 components with a full covariance matrix), you can simulate with the following:

```
>>> cytof_data = EmulationCytofData(n_batches = 1)
>>> cytof_data.initialize_cell_types(expression_matrix = dataset,
... labels = cell_types,
... max_components = 3,
... min_components = 3,
... covariance_types = "full")
```

which tells Cytomulate to use 3 components exactly with a full covariance matrix.

For components, you can choose any you like (except that they should be positive integers! We can’t owe you
any components even if you wish!). For covariance matrix, you have four choices: “full”, “diag”, “tied”, or
“spherical”. Look at `sklearn`

’s
documentation here
for an explanation of how each of the four options work. In short, if you want everything, the “full” is a
good choice; or if you want independent channels, you need to choose “diag”. While we don’t recommend the
latter as we saw that correlation between protein channels are present, you can certainly try it if you
see a need.

### Model Selection

As evident from the example above, there are many choices regarding these two crucial parameters. By default, Cytomulate performs a model selection procedure using the Bayesian Information Criterion (BIC) for model selection. We perform this for both the number of components and the covariance structure.

For example, if you wish to slect between 1 and 5 components (inclusive) but with the full covariance matrix, you can run the following code:

```
>>> cytof_data = EmulationCytofData(n_batches = 1)
>>> cytof_data.initialize_cell_types(expression_matrix = dataset,
... labels = cell_types,
... max_components = 5,
... min_components = 1,
... covariance_types = "full")
```

Or if you wish to fix the number of components while selecting the type of covariance matrices, run this instead:

```
>>> cytof_data = EmulationCytofData(n_batches = 1)
>>> cytof_data.initialize_cell_types(expression_matrix = dataset,
... labels = cell_types,
... max_components = 5,
... min_components = 5,
... covariance_types = ("full", "tied", "diag", "spherical"))
```

Note that as long as `max_component == min_components`

, the specified number will be used. Also,
you can specify a subset of teh `covariance_types`

without having to run all four.

Whether you wish to run model selection or not is up to your dataset and need. By default, Cytomulate does perform model selection to allow for the best fit for the data. The downside is that this process does slow down the algorithm because the model needs to be fit multiple times.

### Defaults and Recommendations

As mentioned, model selection is performed by Cytomulate. If you would like to select the parameters yourself, we have a few heuristics that may be helpful:

When in doubt, use the “full” covariance matrix to allow for the most flexible model.

the default betwene 1 and 9 components per cell type is often sufficient (hence the default).

If setting the components manually, perform an exploratory data analysis on your data first to see the data. Increase the numner of components if Cytomulate doesn’t fit well.

Of course, there is no one-size-fit-all recommendation! Try it and find out!

## Cell Abundance

First, let us define cell abundance: The proportion of each cell type in a given sample. From a first glance, this is incredibly vague because, and then, it may occur to people that this is somewhat trivial if we are not dealing with simulated datasets. However, being a good simulator, Cytomulate allows users to tweak the cell abundance in their dataset or use a real-data-based decision rule for simulation. But before diving into details, we will discuss a bit of context for Cytomulate.

As defined in our paper, Cytomulate fits a GMM for each cell type, meaning that there are as many GMMs
as cell types. The cell types are user specified naturally. It may be tempting to fit one single GMM
to all the data, which is definitely possible, but the caveate is that a very complex model is needed.
So Cytomulate tries to take advantage of cell types or clusters as given by the user. Then, when the GMMs
are estimated, the next step is to sample from them, which is the `sample()`

method seen earlier. Recall
that the method takes only 1 argument, which is the number of cells in the output rather than the number
of cells in each cell type. Thus, this section deals with how we estimate the makeup of our dataset.

### Default: Data-Based

By default, we learn from the dataset itself. So if you use the following snippet as shown earlier,

```
>>> cytof_data = EmulationCytofData(n_batches = 1)
>>> cytof_data.initialize_cell_types(expression_matrix = dataset,
... labels = cell_types,
... max_components = 9,
... min_components = 1,
... covariance_types = ("full", "tied", "diag", "spherical"))
>>> expression_matrices, labels, _, _ = cytof_data.sample(n_samples = 100)
```

Cytomulate will estimate the proportion of cells from real data. This is generally not affected by other parameters, such as the number of components and the number of batches. We recommend the default if you wish the simulated dataset to be as similar to the simulated data as possible.

### Random Cell Abundance

There are some situations in which you may want to change the cell abundance, such as

inducing more variance

wanting more rare cell types

needing some difference from the reference dataset.

In such cases, we can allow cell abundance to be randomly generated. There are in general two settings you
can choose. If you wish each cell type to have **equal probability**, then you should run the following
snippet:

```
>>> cytof_data = EmulationCytofData(n_batches = 1)
>>> cytof_data.initialize_cell_types(expression_matrix = dataset,
... labels = cell_types,
... max_components = 9,
... min_components = 1,
... covariance_types = ("full", "tied", "diag", "spherical"))
>>> cytof_data.generate_cell_abundance(use_observed = False,
is_random = False)
```

Note

It is uncessary to call `generate_cell_abundance`

if you wish to use the observed cell abundance
since it’s the default behavior.

On the other hand, if you wish to add some randomness into the mix, the probability for each cell type
can be generated with the Dirichlet distribution by simply setting `is_random`

to `True`

:

```
>>> cytof_data = EmulationCytofData(n_batches = 1)
>>> cytof_data.initialize_cell_types(expression_matrix = dataset,
... labels = cell_types,
... max_components = 9,
... min_components = 1,
... covariance_types = ("full", "tied", "diag", "spherical"))
>>> cytof_data.generate_cell_abundance(use_observed = False,
is_random = True)
```

This last example is especially useful is you wish to have a sample that is slightly different from real samples.