BioC2024

OmicsMLRepo: Ontology-leveraged metadata harmonization to improve AI/ML-readiness of omics data in Bioconductor
07-26, 11:35–11:43 (US/Eastern), Tomatis Auditorium

Significant efforts have been made at national and institutional levels to establish comprehensive biological data repositories. Despite the large volume of data collected from diverse studies, cross-study analysis across those repositories and joint modeling of omics and non-omics data remain largely limited because non-omics metadata lack standardization and are highly complex and heterogeneous. This lack of metadata harmonization also hinders the development and application of machine learning tools, which can play a pivotal role in managing and analyzing complex, high-dimensional multi-omics data.
To address this issue, we initiated the OmicsMLRepo project, which harmonizes and standardizes metadata from omics data resources. This process involved manual review of metadata schemas, consolidation of similar or identical information, and incorporation of ontologies. As a result, we have harmonized the metadata of hundreds of metagenomics and cancer genomics studies accessible through two R/Bioconductor packages: curatedMetagenomicData and cBioPortalData. Furthermore, we developed a software package, OmicsMLRepoR, that allows users to leverage ontologies in metadata search. In summary, the OmicsMLRepo project simplifies cross-study, multi-faceted data analyses through metadata harmonization and standardization, making omics data more AI/ML-ready.
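
To illustrate why ontologies help with metadata search, here is a minimal Python sketch of the general idea: a query for a broad term is expanded to all of its descendant terms before filtering sample records. It is not the OmicsMLRepoR API. The toy ontology, the sample records, and the function names `descendants` and `search_metadata` are all hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its direct child terms.
# A real project would load terms from a curated ontology instead.
ONTOLOGY_CHILDREN = {
    "intestinal disease": ["inflammatory bowel disease", "colorectal cancer"],
    "inflammatory bowel disease": ["Crohn's disease", "ulcerative colitis"],
}

# Hypothetical harmonized sample metadata (one dict per sample).
SAMPLES = [
    {"sample_id": "S1", "disease": "Crohn's disease"},
    {"sample_id": "S2", "disease": "ulcerative colitis"},
    {"sample_id": "S3", "disease": "colorectal cancer"},
    {"sample_id": "S4", "disease": "healthy"},
]


def descendants(term, children=ONTOLOGY_CHILDREN):
    """Return the term itself plus every term below it in the ontology."""
    seen = {term}
    queue = deque([term])
    while queue:  # breadth-first walk down the hierarchy
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def search_metadata(records, field, query):
    """Keep records whose `field` value is the query term or a descendant."""
    matches = descendants(query)
    return [record for record in records if record.get(field) in matches]


# Searching for the parent term also finds samples annotated with child terms.
ibd_samples = search_metadata(SAMPLES, "disease", "inflammatory bowel disease")
print([record["sample_id"] for record in ibd_samples])
```

In this sketch, a plain string match on "inflammatory bowel disease" would return no samples, whereas the ontology-expanded search returns the samples annotated with either of its two child terms. This kind of expansion only works once metadata values have been mapped to ontology terms.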

I am an Assistant Professor at CUNY SPH, with expertise in both experimental biology and bioinformatics. As a molecular biologist by training, I studied DNA repair and telomere maintenance mechanisms during my doctoral and postdoctoral research. As a bench scientist, I began to notice how difficult it was to argue that my findings in cell lines actually occur in living organisms and are relevant to public health, and this made me interested in the potential of large public datasets. I made a career transition from bench scientist to bioinformatics scientist and joined Dr. Waldron’s lab at CUNY SPH as a postdoctoral researcher in 2017. Since then, I have worked on many research projects, published papers, and developed a wide collaborative network along with extensive experience in large public omics data analysis, statistical method development for high-dimensional data, cloud-based computing, AnVIL workspace and workflow development, and user-friendly software development. Currently, I am working on an NIH-funded project to construct an omics data repository designed for the easy application of Artificial Intelligence and Machine Learning tools. My overarching career goal is to facilitate interdisciplinary research through the development of intuitive bioinformatics infrastructure and user-friendly tools that lower barriers across different disciplines and resources. In my free time, I enjoy ballroom dancing and exploring different neighborhoods in New York.