BioC2024

Ryan Thompson

Ryan C. Thompson, PhD, is an Assistant Professor in the Division of Data Driven and Digital Medicine (D3M) in the Department of Medicine in the Icahn School of Medicine at Mount Sinai. He completed his PhD at Scripps Research in La Jolla, CA. He has a long history of applying and adapting bioinformatics methods to address both anticipated and unanticipated research challenges, even developing new methods when the need arises. His broad methodological background includes normalization, differential expression, machine learning, gene set testing, and data visualization with both sequencing and array-based data. In addition, he has a broad background in general biology and in immunology in particular, along with a strong foundation in statistics. In his current work at Mount Sinai, he has developed a colored graph visualization tool to aid in quick, accurate identification and correction of mislabeled samples in a data set of 2,385 RNA-seq and whole-genome sequencing samples. His broad understanding of commonly used statistical and computational methods enables him to continue adapting to and overcoming similar unexpected analysis challenges that may arise in the course of the proposed research. Outside the lab, Ryan considers it a success if his D&D group has to pull out a physics textbook to figure out what happens next.

Sessions

07-26
12:05
8min
A scalable and flexible network-based approach to identify, diagnose, and resolve mislabeled samples in molecular data
Charles Deng, Ryan Thompson

Mislabeled data can result in diminished statistical power and, more critically, introduce systematic biases that affect the accuracy of results. Here, we present a scalable network-based algorithm that identifies and resolves mislabels in any dataset containing multiple samples per individual and sufficient genetic information to genotype samples. In our approach, two networks are constructed: a label network linking samples with the same subject label, and a genotype network linking samples with the same genotype. An ensemble of custom search algorithms then proposes mislabel corrections by resolving discrepancies between these two networks. Our approach can also use constraints from the experimental design to rule out improbable events, such as swaps between different data types, and therefore resolve mislabeling events that would otherwise be ambiguous. In over one million simulated datasets with up to 100,000 samples and 20% mislabeling, our algorithm corrects mislabels with median 99% accuracy. When constraints from experimental design are incorporated, our algorithm remains effective (median accuracy over 90%) even on datasets with 50% mislabeling. On the Mount Sinai COVID-19 Biobank, which spans 3,756 RNA-seq, WGS, and genotype array samples across 696 individuals, our algorithm automatically identifies 190 mislabels. Of these suggested corrections, 186 (98%) matched those previously determined by manual review, 3 (1.6%) were previously unidentified, and 1 (0.5%) was determined to be incorrect. The algorithm’s scalability was further tested on 17,350 RNA-seq samples across 948 individuals from GTEx, where the algorithm reported no discrepancies between the label and genotype networks, suggesting perfect sample labeling, as expected for this heavily studied dataset. Our approach, along with its accompanying visualization tools, is implemented as an R package.
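
The idea of comparing a label network against a genotype network can be illustrated with a minimal Python sketch. This is not the authors' R package or their search algorithms; it only detects discrepancies and proposes no corrections. The function names, the sample identifiers, and the assumption that genotype matches arrive as a list of sample pairs are all hypothetical, chosen for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components using a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def find_discrepancies(labels, genotype_edges):
    """Compare the label grouping with the genotype grouping.

    labels: dict mapping sample id -> subject label (hypothetical input;
            samples sharing a label form the implicit label network).
    genotype_edges: pairs of sample ids judged to share a genotype
            (hypothetical input).

    Returns (mixed, split):
      mixed: genotype clusters containing more than one subject label
      split: subject labels spread over more than one genotype cluster
    """
    clusters = connected_components(list(labels), genotype_edges)
    mixed = []
    label_to_clusters = defaultdict(set)
    for idx, cluster in enumerate(clusters):
        if len({labels[s] for s in cluster}) > 1:
            mixed.append(sorted(cluster))
        for s in cluster:
            label_to_clusters[labels[s]].add(idx)
    split = sorted(lab for lab, c in label_to_clusters.items() if len(c) > 1)
    return sorted(mixed), split


# Toy example: the two WGS samples appear to have been swapped.
labels = {"rna_1": "A", "wgs_1": "A", "rna_2": "B", "wgs_2": "B"}
swapped = [("rna_1", "wgs_2"), ("rna_2", "wgs_1")]
mixed, split = find_discrepancies(labels, swapped)
print(mixed)  # genotype clusters with conflicting labels
print(split)  # labels spanning several genotype clusters
```

When the two groupings agree, both returned lists are empty, which corresponds to the "no discrepancies" outcome described for GTEx. Proposing which labels to change, and using design constraints to break ties, is the harder step that the abstract attributes to the ensemble of search algorithms; this sketch does not attempt it.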

Tomatis Auditorium