The impact of package selection and versioning on single-cell RNA-seq analysis
Seurat and Scanpy are two of the most widely used tools for analyzing single-cell RNA-sequencing (scRNA-seq) data, and they are generally thought to implement the standard analysis pipeline very similarly. However, we find that the two packages produce drastically different output when their standard pipelines are compared. The steps that differ by default include the highly variable gene selection algorithm, Principal Component Analysis (PCA) after scaling, the approximate k-nearest neighbors (KNN) graph algorithm, the shared nearest neighbor graph construction method, the clustering algorithm, the Uniform Manifold Approximation and Projection (UMAP) method, marker gene filtering, log-fold change calculation, and p-value calculation and adjustment. Only some of these steps can be made to behave identically through deliberate selection of function arguments. The differences between packages are large enough to lead to divergent biological conclusions downstream, and are comparable in magnitude to the differences introduced by downsampling a dataset to as little as 4% of the original reads or 16% of the original cells. Additionally, the version of Seurat or Scanpy can affect differential expression analysis, particularly marker gene selection and, in the case of Seurat, the method of log-fold change calculation. The version of Cell Ranger used to generate the count matrix can affect every step of the analysis, including cell and gene filtering, especially when the intron inclusion setting differs. Taken together, our results make the case for careful and consistent package selection and version control in scRNA-seq data analysis in order to minimize technical noise.
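To make concrete how a step such as log-fold change calculation can diverge between implementations, the following is a minimal sketch in plain Python. The two functions are hypothetical conventions chosen for illustration and are not claimed to reproduce the exact formula of any Seurat or Scanpy version; the function names, the pseudocount values, and the toy expression values are all assumptions.

```python
import math


def log2fc_mean_then_exp(group, rest, eps=1e-9):
    """Average the log1p values first, then back-transform with expm1.

    Hypothetical convention; eps only guards against division by zero.
    """
    mean_g = sum(group) / len(group)
    mean_r = sum(rest) / len(rest)
    return math.log2((math.expm1(mean_g) + eps) / (math.expm1(mean_r) + eps))


def log2fc_exp_then_mean(group, rest, pseudocount=1.0):
    """Back-transform each value first, then average and add a pseudocount.

    Hypothetical convention; the pseudocount of 1 is an assumption.
    """
    mean_g = sum(math.expm1(x) for x in group) / len(group)
    mean_r = sum(math.expm1(x) for x in rest) / len(rest)
    return math.log2(mean_g + pseudocount) - math.log2(mean_r + pseudocount)


# Made-up log1p-normalized expression of one gene in two groups of cells.
group = [math.log1p(v) for v in (0.0, 0.0, 8.0, 8.0)]
rest = [math.log1p(v) for v in (1.0, 1.0, 1.0, 1.0)]

fc_a = log2fc_mean_then_exp(group, rest)
fc_b = log2fc_exp_then_mean(group, rest)
print(f"mean-then-exp: {fc_a:.3f}, exp-then-mean: {fc_b:.3f}")
```

On this toy input the first convention gives a log2 fold change of about 1.0 and the second about 1.32, even though both start from the same normalized values. The gap comes only from the order of averaging and back-transformation and from the pseudocount, which is the kind of implementation detail that can shift marker gene rankings.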