BioC2024

Singlet: Fast and Interpretable Dimension Reduction of Single-cell Data
07-25, 09:00–09:08 (US/Eastern), Tomatis Auditorium

Non-negative Matrix Factorization (NMF) is a popular and interpretable method for dimension reduction. NMF can be used in the same analysis pipelines as Principal Component Analysis, but it also provides direct insight into patterns of gene co-expression (biological processes) and the context of cell similarities (soft clustering). However, NMF implementations have historically been too slow to scale to the size of datasets available today. We implemented very fast and scalable NMF in our "singlet" R package, which natively supports sparse matrices and can scale to datasets with tens of millions of cells. We trained an NMF model on 28 million human transcriptomes from the Chan Zuckerberg Initiative CellCensus dataset. Our embeddings are publicly available as CellCensus models for transfer learning. We also developed new data structures for single-cell data that compress ~4x better than BPCells and exceed its performance, which in theory allows in-core NMF of over 100 million single-cell transcriptomes on high-memory HPC nodes. Here we show how a standard Seurat object can make use of "singlet" NMF and what insights can be derived from these data, share some use cases for our new CellCensus model, and show how our new data structures enable in-core analysis of single-cell datasets that would normally require distributed computing.
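
To make the idea of NMF concrete, here is a minimal sketch in Python with NumPy. It is not the "singlet" implementation and makes no claim about the algorithm that package uses. It applies the classic multiplicative-update rule to a small dense toy matrix, and the function name `nmf_multiplicative`, its parameters, and the toy data are all hypothetical. It factors a non-negative genes-by-cells matrix A into W (gene loadings per factor) and H (factor weights per cell), then normalizes the columns of H to read them as soft cluster memberships.

```python
import numpy as np


def nmf_multiplicative(A, k, n_iter=500, seed=0, eps=1e-10):
    """Approximate non-negative A (genes x cells) as W @ H with W, H >= 0.

    Illustrative multiplicative updates minimizing the Frobenius error;
    this is a teaching sketch, not the algorithm used by any package.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = A.shape
    # Random positive starting points keep every entry strictly positive.
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_cells)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio,
        # so W and H can never become negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H


# Toy data (assumption): 50 genes x 30 cells generated from 2 hidden programs.
rng = np.random.default_rng(1)
A = rng.random((50, 2)) @ rng.random((2, 30))

W, H = nmf_multiplicative(A, k=2)

# Relative reconstruction error of the rank-2 model.
rel_error = np.linalg.norm(A - W @ H) / np.linalg.norm(A)

# Column-normalized H: each cell's fractional weight on each factor.
soft_clusters = H / H.sum(axis=0, keepdims=True)
```

Each column of W can be read as a gene co-expression pattern and each column of `soft_clusters` as one cell's mix of those patterns. Real single-cell matrices are sparse and far larger, which is the scaling problem the abstract describes.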

Zach DeBruine is Assistant Professor of Computing at Grand Valley State University and Research Scientist in the GVSU Applied Computing Institute. He completed his Ph.D. and postdoctoral studies at Van Andel Institute, first in structural biology, then in bioinformatics, and finally in high-performance machine learning. His lab is currently building large multimodal foundation models on genomics and biobank data.
