Zachary DeBruine
Zach DeBruine is an Assistant Professor of Computing at Grand Valley State University and a Research Scientist in the GVSU Applied Computing Institute. He completed his Ph.D. and postdoctoral studies at Van Andel Institute, working first in structural biology, then in bioinformatics, and finally in high-performance machine learning. His lab is currently building large multimodal foundation models on genomics and biobank data.
Sessions
Non-negative Matrix Factorization (NMF) is a popular and interpretable method for dimension reduction. NMF can be used in the same analysis pipelines as Principal Component Analysis, but it also provides direct insight into patterns of gene co-expression (biological processes) and into the context of cell similarities (soft clustering). However, NMF implementations have historically been too slow to scale to the size of datasets available today. We implemented very fast and scalable NMF in our "singlet" R package, which natively supports sparse matrices and can scale to datasets with tens of millions of cells. We trained an NMF model on 28 million human transcriptomes from the Chan Zuckerberg Initiative CellCensus dataset, and our embeddings are publicly available as CellCensus models for transfer learning. We also developed new data structures for single-cell data that compress roughly 4x better than BPCells and also exceed it in performance, which in theory allows in-core NMF of over 100 million single-cell transcriptomes on high-memory HPC nodes. Here we show how a standard Seurat object can make use of "singlet" NMF and what insights can be derived from the resulting model, share some use cases for our new CellCensus model, and show how our new data structures enable in-core analysis of single-cell datasets that would normally require distributed computing.
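To make the idea of NMF concrete, the following is a minimal sketch in Python with NumPy. It uses the textbook multiplicative-update rules on a small dense toy matrix. It is not the algorithm or the API of the "singlet" package, and the function name `nmf`, its parameters, and the toy data are all hypothetical and chosen only for illustration. The gene-by-cell matrix `A` is approximated by the product of two non-negative factors: `W` (gene loadings per factor, i.e. co-expression patterns) and `H` (factor weights per cell, which can be normalized into soft cluster memberships).

```python
import numpy as np


def nmf(A, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative matrix A (genes x cells) as W @ H.

    Uses textbook multiplicative updates that minimize the Frobenius
    reconstruction error. Illustrative only; not the "singlet" algorithm.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = A.shape
    W = rng.random((n_genes, k))   # gene loadings for each of k factors
    H = rng.random((k, n_cells))   # factor weights for each cell
    errors = []
    for _ in range(n_iter):
        # Element-wise updates keep W and H non-negative throughout.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
        errors.append(np.linalg.norm(A - W @ H))
    return W, H, errors


# Hypothetical toy data: 50 genes x 80 cells generated from 3 hidden factors.
rng = np.random.default_rng(1)
A = rng.random((50, 3)) @ rng.random((3, 80))

W, H, errors = nmf(A, k=3)

# Normalize each cell's factor weights to sum to 1 (soft clustering).
soft_clusters = H / H.sum(axis=0, keepdims=True)
```

Each column of `soft_clusters` gives one cell's fractional membership across the three factors, and each column of `W` ranks genes by their contribution to a factor. A real single-cell analysis would operate on a sparse count matrix rather than a small dense array.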
Foundation models of single-cell transcriptomics data promise unprecedented capability to reason about any new information in the context of vast knowledge about gene co-expression and regulation patterns. While current work in this space largely uses transformer- or GPT-based architectures, we are building multimodal generative architectures that rely on mixture-of-experts variational autoencoders with generative adversarial feedback. Our approaches permit scalable integration of transcriptomes across any supervised metadata labels (e.g. organ, disease, assay, dataset id) and modalities (e.g. RNA, ADT, ATAC), and also across species without making any a priori assumptions about gene homology. Here we present our work to date on these new architectures, involving models trained on ~40 million single-cell transcriptomes from human and mouse in the Chan Zuckerberg Initiative CellCensus, as well as zebrafish data from several other datasets. These foundation models open new opportunities for highly powered prediction tasks, even on small sample sizes, by leveraging vast knowledge of transcriptomics.