Benchmark Analyses

Since Cytomulate is open source, we would also like to share the pipelines we used to obtain the benchmark results in our paper. This brief tutorial explains how to use the code and how to benefit from it.

Downloading the Source Code

We published our source code as a GitHub release, which is linked here. All you have to do is the following:

Download benchmark.zip under the “Assets” tab.
Decompress the archive with the software available for your OS and access the contents.

While we provided documentation in the form of comments in the files themselves, here we would like to point out a few more things that may be helpful in your Cytomulate journey.

Dependencies

To run the code, you will need the following software installed on your system:

Python with Cytomulate (You can download the source code on the same page or use the Cytomulate v0.2.0 release if you prefer.)
An R installation with all the packages listed in the library() calls of the R scripts.
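
Before running the Python scripts, it can help to confirm that Cytomulate is visible to your interpreter. The sketch below is not part of the benchmark code: it assumes the package is installed under the distribution name cytomulate, and the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "cytomulate" is assumed to be the installed distribution name.
    version = installed_version("cytomulate")
    if version is None:
        print("Cytomulate is not installed in this environment.")
    else:
        print(f"Found Cytomulate {version}")
```

The R side can be checked in a similar spirit by loading each package named in the library() calls from an interactive R session.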

Datasets

You will also have to download the datasets used in our paper. The accession methods and availability of each dataset are described in the XXX section of our paper.

Code and Functionality

In this section, we explain what each part of the code (organized by directory) does and how it relates to the analyses in our paper.

Note

The FileIO class and the KLdivergence function are duplicated in multiple Python scripts for convenience only. You are welcome to move them into a separate module instead. In fact, the FileIO class is now part of PyCytoData, which makes life easier.
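
If you do consolidate the shared helpers, a separate module could look roughly like the sketch below. The function name kl_divergence is hypothetical, and the function is a simplified KL divergence between two discrete distributions (for example, histogram counts over the same bins). It is an illustration of the module layout, not the KLdivergence implementation shipped in the benchmark scripts.

```python
import math


def kl_divergence(p, q):
    """Simplified discrete KL divergence D(p || q) in nats.

    `p` and `q` are non-negative weights over the same bins; both are
    normalized to sum to one before the divergence is computed.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    p_total, q_total = sum(p), sum(q)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("p and q must each have positive total weight")

    divergence = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # bins with zero mass in p contribute nothing
        if q_i == 0:
            return math.inf  # p has mass where q has none
        p_norm, q_norm = p_i / p_total, q_i / q_total
        divergence += p_norm * math.log(p_norm / q_norm)
    return divergence
```

Each script could then import the helper from that one module instead of carrying its own copy.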

Directory: batch

This directory contains the code used to benchmark batch correction methods, as shown in the Comparing Batch Normalization Methods using Cytomulate section. First, data_gen.py generates datasets with multiple batches using the complex simulation functionality of Cytomulate; then, batch_correction.R performs batch correction. Finally, benchmark.py computes the benchmarks shown in the paper.
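
The three steps run in a fixed order, which can be sketched as a small driver. This is only an illustration under assumptions: it presumes each script can be launched without arguments from the command line and that Rscript is on your PATH, neither of which is stated here. The scripts may also need their input and output paths edited first. The function names batch_pipeline_commands and run_pipeline are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def batch_pipeline_commands(directory):
    """Build the commands for the batch benchmark in execution order."""
    base = Path(directory)
    return [
        [sys.executable, str(base / "data_gen.py")],    # 1. simulate batches
        ["Rscript", str(base / "batch_correction.R")],  # 2. correct batches in R
        [sys.executable, str(base / "benchmark.py")],   # 3. compute benchmarks
    ]


def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Assumes the decompressed "batch" directory is in the working directory.
    run_pipeline(batch_pipeline_commands("batch"))
```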

Directory: clustering

This directory contains the code used to benchmark clustering methods, as shown in the Validating Clustering Performance using Cytomulate section. The overall structure is similar to that of the batch correction code, except that clustering.R performs clustering rather than batch correction and benchmark.py uses a different metric.

Directory: masking

This code randomly masks cells in the Levine_32dim dataset to assess the performance of all the methods. data_gen.py generates the masked cell types and datasets, whereas compute_kl.py computes the KL benchmark for each method, as presented in Fig. 5 of our paper. These results are presented in the Cytomulate is robust against cell-type resolution section.
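
To make the idea of masking concrete, here is a minimal sketch of one possible reading: a random subset of cell types has its labels replaced by a single placeholder label. This interpretation, the function name mask_cell_types, and the placeholder "masked" are all assumptions for illustration; data_gen.py is the authoritative description of the procedure used in the paper.

```python
import random


def mask_cell_types(labels, n_masked, masked_label="masked", seed=None):
    """Replace the labels of `n_masked` randomly chosen cell types.

    Returns the new label list and the set of cell types that were masked.
    """
    rng = random.Random(seed)
    cell_types = sorted(set(labels))  # sorted for reproducibility with a seed
    if not 0 <= n_masked <= len(cell_types):
        raise ValueError("n_masked must be between 0 and the number of cell types")
    chosen = set(rng.sample(cell_types, n_masked))
    masked = [masked_label if label in chosen else label for label in labels]
    return masked, chosen
```

For example, mask_cell_types(["B", "T", "NK", "T"], 1, seed=0) hides every cell of one randomly selected type while leaving the others untouched.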

Directory: metric_computation

This directory contains the code that computes the main metrics used in our paper. The three main metrics, based on the mean, covariance, and KL divergence, are computed in Python. The pMSE metric from synthpop is computed in R.
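
As a rough illustration of what mean-based and covariance-based metrics can look like, the sketch below compares a real and a simulated expression matrix (cells in rows, markers in columns) using the root-mean-square difference of their mean vectors and of their covariance matrices. The exact formulas in the paper may differ; the function name and the choice of root-mean-square error are assumptions.

```python
import numpy as np


def mean_and_covariance_error(real, simulated):
    """Return (mean RMSE, covariance RMSE) between two cell-by-marker matrices."""
    real = np.asarray(real, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    if real.ndim != 2 or simulated.ndim != 2 or real.shape[1] != simulated.shape[1]:
        raise ValueError("both inputs must be 2-D with the same number of markers")

    # Difference between the per-marker mean vectors.
    mean_diff = real.mean(axis=0) - simulated.mean(axis=0)
    mean_rmse = np.sqrt(np.mean(mean_diff ** 2))

    # Difference between the marker-by-marker sample covariance matrices.
    cov_diff = np.cov(real, rowvar=False) - np.cov(simulated, rowvar=False)
    cov_rmse = np.sqrt(np.mean(cov_diff ** 2))

    return float(mean_rmse), float(cov_rmse)
```

Both values are zero when the two matrices share the same means and covariances, and they grow as the simulated data drifts away from the real data.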

Directory: processing_time

The R script in this directory benchmarks the processing times of Cytomulate’s competitors in R. Since Cytomulate is the only Python method in our paper, it is not included in the R script. If you want to benchmark Cytomulate as well, you can use /usr/bin/time on Linux to time Cytomulate’s CLI or use one of the timing modules in Python.
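
For the Python route, a minimal timing helper built on the standard library might look like the sketch below. The helper name time_call is hypothetical, and the call being timed is left as a placeholder: substitute whichever Cytomulate call you want to measure.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call `func` `repeats` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder workload; replace it with the simulation call to benchmark.
    runs = time_call(sorted, list(range(100_000, 0, -1)), repeats=5)
    print(f"fastest run: {min(runs):.4f} s")
```

Repeating the call and reporting the fastest or median run reduces the influence of background activity on the measurement.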

The results of processing time and Cytomulate’s efficiency are included in the Cytomulate is efficient section of our paper.

Directory: simulations

This directory contains the code that generates the simulation results for each method. Each script is named after its method. Note that only Cytomulate is in Python; all other methods use R. These results are used throughout our paper.