BioC2024

Charles Deng

Charles is an associate bioinformatician in the Beckmann Lab at Mount Sinai. After graduating from Brown in 2019 with a major in Applied Math-Economics, he worked for three years building trading algorithms in quantitative finance before leaving his job to pursue a career in medicine. He will start medical school in fall 2024.


Sessions

07-26
12:05
8min
A scalable and flexible network-based approach to identify, diagnose, and resolve mislabeled samples in molecular data
Charles Deng, Ryan Thompson

Mislabeled data can result in diminished statistical power and, more critically, introduce systematic biases that affect the accuracy of results. Here, we present a scalable network-based algorithm that identifies and resolves mislabels in any dataset containing multiple samples per individual and sufficient genetic information to genotype samples. Our approach constructs two networks: a label network linking samples with the same subject label, and a genotype network linking samples with the same genotype. An ensemble of custom search algorithms then proposes mislabel corrections by resolving discrepancies between these two networks. Our approach can also use constraints from the experimental design to rule out improbable events, such as swaps between different data types, and therefore resolve mislabeling events that would otherwise be ambiguous. In over one million simulated datasets with up to 100,000 samples and 20% mislabeling, our algorithm corrects mislabels with median 99% accuracy. When constraints from experimental design are incorporated, our algorithm remains effective (median accuracy over 90%) even on datasets with 50% mislabeling. On the Mount Sinai COVID-19 Biobank, which spans 3,756 RNAseq, WGS, and genotype array samples across 696 individuals, our algorithm automatically identifies 190 mislabels. Of these suggested corrections, 186 (98%) matched those previously determined by manual review, 3 (1.6%) were previously unidentified, and 1 (0.5%) was determined to be incorrect. The algorithm’s scalability was further tested on 17,350 RNAseq samples across 948 individuals from GTEx, where the algorithm reported no discrepancies between the label and genotype networks, suggesting perfect sample labeling, as expected for this heavily studied dataset. Our approach, along with its accompanying visualization tools, is implemented as an R package.
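
To make the two-network idea concrete, here is a minimal Python sketch. It is not the authors' algorithm or their R package's API. It assumes each sample already carries a subject label and a genotype cluster ID, and it swaps in a simple majority vote within each genotype group for the ensemble of search algorithms described above. The function names and the sample IDs are hypothetical.

```python
from collections import Counter, defaultdict


def group_by_genotype(samples):
    """Group sample IDs by genotype cluster (a stand-in for the genotype network)."""
    groups = defaultdict(list)
    for sample_id, (_label, genotype) in samples.items():
        groups[genotype].append(sample_id)
    return groups


def propose_corrections(samples):
    """Propose relabels where subject labels disagree within one genotype group.

    `samples` maps sample ID -> (subject label, genotype cluster ID).
    Majority voting is an illustrative assumption, not the published method.
    """
    corrections = {}
    for members in group_by_genotype(samples).values():
        counts = Counter(samples[s][0] for s in members).most_common()
        if len(counts) == 1:
            continue  # labels and genotype agree for this group
        if counts[0][1] == counts[1][1]:
            continue  # tie: ambiguous without extra constraints, so skip
        majority_label = counts[0][0]
        for s in members:
            if samples[s][0] != majority_label:
                corrections[s] = majority_label
    return corrections


# Toy data: the array samples of subjects A and B have swapped labels.
samples = {
    "rna_1": ("A", "g1"),
    "wgs_1": ("A", "g1"),
    "arr_1": ("B", "g1"),
    "rna_2": ("B", "g2"),
    "wgs_2": ("B", "g2"),
    "arr_2": ("A", "g2"),
}
print(propose_corrections(samples))
```

On this toy input the sketch proposes relabeling `arr_1` to subject A and `arr_2` to subject B. Tied groups are left untouched, which is the kind of ambiguity the abstract says experimental-design constraints can resolve.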

Tomatis Auditorium