Creation Mode
Creation Mode, by design, creates new datasets from ~~thin air~~ our well-designed GMM model without the need for existing datasets! All you need to do is to decide the key parameters, such as the number of cell types, the number of cells, etc., and you are on your way to a great dataset! There are many scenarios in which you may want to try the Creation Mode:
You don’t have access to real data!
You are running benchmarks and need precise controls on things such as # of protein markers.
You would like to experiment quickly without being bogged down by real data’s preprocessing, etc.
You need to perform analysis on many different kinds of datasets, which are hard to collect in real life.
Of course, there are many more reasons to pick Creation Mode over Emulation Mode. Here, as long as you have Cytomulate installed, you’re good to go!
Setting Things Up
As always, let’s make some necessary imports before we start:
>>> from cytomulate import CreationCytofData
In this tutorial, we’re not showcasing PyCytoData since it’s purely downstream from what we’re doing! See
our QuickStart Guide on how to output a
PyCytoData
object from Cytomulate and we will point out at the appropriate place as well.
Simple Simulation
In the simplest setting, a Creation Mode simulation looks like the following:
Again, we risk being a broken record by writing out all the defaults in this simulation! Given the nature of the creation mode, there are many more options to contend with. Here is a list of default options, and we explain some of the more obscure ones in detail:
1 batch
10 cell types
20 protein markers
2 differentiation trees: Note that trees merely represent relationships between cell types, which is necessary even without trajectory.
100 cells in total
Then, all the parameters in the initialized_cell_types
methods pertain to the simulation model, which will be explained later in more
complex simulation situations.
Note
If you wish to use PyCytoData as output format, replace the cytof_data.sample()
method
with cytof_data.sample_to_pycytodata()
. The interface otherwise remains unchanged.
Fine Tuning of Cytomulate Models
The Creation Mode also uses Gaussian Mixture Models (GMMs), which sounds awfully similar to the Emulation Mode, but the key difference is that we simulate the marker pattens and statistics necessary for each GMM rather than learning from data. This part of the tutorial walks you through some of the details in the implementation.
Cell Types and Expressions
With the Creation Mode, we have the assumption that cell types are identified via gating in real life. For example, we use CD4 and CD8 as markers to identify T cells. For a given cell type, we can thus some channels to be highly expressed whereas other channels to be not as highly expressed. This framework serves as the foundation of Cytomulate’s Creation Mode. Internally, we use a matrix to encode the expression pattern for each cell type, where as levels of expression for each channel is encoded. You can tune these parameters to make sure that diverse pattrns can be generated.
The first parameter you can tune is the levels of expressions for all channels, which is the L
parameter in the initialize_cell_types
method. Intuitively, we at least need two levels because otherwise all cell types will have the same pattern. By default, we
used 4 levels, where the smallest level represents “not expressed at all” and the highest level represents “most highly expressed”.
For simple datasets with few cell types, you a smaller L
may be fine, but in general, we think that the default will work well.
You can tune this parameter by using the following snippet (all other parameters are defaults):
>>> cytof_data = CreationCytofData()
>>> cytof_data.initialize_cell_types(L = 2)
Once you have the levels determined, you can decide how different these levels are because the levels are generated using a
truncated normal distribution. If you wish to have more spead-out levels, a large variance is needed and vice verse. In
Cytomulate, we use the scale
parameter because the truncated Normal belongs to the Location-Scale family. By default,
the scale (standard deviation) is 0.5, and you can change it to any positive real you prefer:
>>> cytof_data = CreationCytofData()
>>> cytof_data.initialize_cell_types(scale = 0.9)
Generating Expression Patterns
Once the above parameters are set, the expression patterns will be generated by calling the initialize_cell_types
method
as shown. The expression pattern is not shown and Cytomulate will have it taken care of. An example of tuning both parameters
is like this:
>>> cytof_data = CreationCytofData()
>>> cytof_data.initialize_cell_types(L = 2,
... scale = 0.9)
Fitting GMMs
Finally, we fit a GMM for each cell type using the expression patterns generated above, which are used as mean parameters. Here, we recommend the defaults, which just uses 1 component. In real data, we often need more components because the expressions can have many modes and cell-typing is not always a perfect procedures. However, the Creation Mode makes things a bit easier. If you wish to use more components, you still can:
>>> cytof_data = CreationCytofData()
>>> cytof_data.initialize_cell_types(n_components = 2)
Besides the number if components, there is an additional parameter called variance_mode
, which is used to
calculate shape and rate parameters used for the Inverse Wishert Distribution. We do not recommend changing this
parameter.