07-24, 14:45–14:53 (US/Eastern), Tomatis Auditorium
Background and Rationale: Antibiotic misuse has led to the rapid evolution of antimicrobial resistance (AMR) among bacteria, including multidrug resistance in the ESKAPE pathogens. Identifying AMR in clinical isolates is critical for treatment, but AMR arises from many mechanisms spanning multiple molecular scales (e.g., genes, protein domains). Prior in silico research has applied machine learning (ML) to AMR-labeled bacterial genome sequence data to predict AMR and discover associated genes. However, few studies, if any, thoroughly explore cross-species and multi-drug AMR models, despite an expected overlap in AMR genes due to mechanisms such as horizontal gene transfer. To identify these mechanisms and predict resistance, we use abundant publicly available whole-genome sequences and AMR phenotype data from across the ESKAPE pathogens to train supervised ML models. Our comprehensive ML framework identifies the top AMR features across ESKAPE pathogens and across drug classes, spanning at least two molecular scales (genes, protein domains). We also develop a companion R package, amR, which provides a programmatic interface (with optional Docker images) to all relevant ESKAPE pangenome data and metadata, along with ML models and top genomic features for further research, benchmarking, and experimental follow-up.
Approach: amR is an R package (installable via GitHub) that includes datasets, ML models, and top AMR genomic (gene/protein-domain) predictors. We start with all publicly available ESKAPE genomes, annotate them with PGAP, and construct pangenomes within and across species. These pangenomes are abstracted as gene presence/absence feature matrices (also transformed from genes to protein domains) with corresponding resistance/susceptibility labels per genome (from BV-BRC), which serve as inputs to our ML models (e.g., logistic regression (LR), random forest (RF), and boosting algorithms). We trained each binary classification model to predict AMR phenotypes per bug (ESKAPE species), per drug (antibiotic, drug class), and across bugs and drugs. All amR feature matrices, labels, and models are loadable as RData objects. For each model, users can access the top model coefficients (i.e., for gene/protein-domain predictors) and the adjusted p-values from feature-wise hypothesis testing against the resistance phenotype (Fisher’s exact test with Benjamini–Hochberg correction). The amR package includes functions with roxygen2 documentation to reproduce all of these steps, plus additional functions to summarize data availability by metadata (e.g., temporal, geographic, clinical vs. environmental isolate information), model performance, and multiscale genomic predictors (genes/domains) of AMR.
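The feature-wise testing step described above can be illustrated with a minimal, standard-library Python sketch. This is not amR code and does not reflect the package's API: the function names, the toy presence/absence matrix, and the labels are all hypothetical, and the two-sided definition of the test is an assumption, since the abstract does not say which alternative is used. Each feature is cross-tabulated against the resistance label, tested with Fisher's exact test, and the resulting p-values are adjusted with the Benjamini–Hochberg procedure.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rank_features(presence, labels):
    """Test each feature against the labels; return (name, p, adjusted p) sorted by adjusted p.

    presence: dict mapping feature name -> list of 0/1 per genome (hypothetical layout).
    labels:   list of 1 (resistant) / 0 (susceptible), one per genome.
    """
    names = list(presence)
    pvals = []
    for name in names:
        col = presence[name]
        a = sum(1 for x, y in zip(col, labels) if x and y)          # present, resistant
        b = sum(1 for x, y in zip(col, labels) if x and not y)      # present, susceptible
        c = sum(1 for x, y in zip(col, labels) if not x and y)      # absent, resistant
        d = sum(1 for x, y in zip(col, labels) if not x and not y)  # absent, susceptible
        pvals.append(fisher_exact_two_sided(a, b, c, d))
    adjusted = benjamini_hochberg(pvals)
    return sorted(zip(names, pvals, adjusted), key=lambda t: t[2])


# Toy data (illustrative only): six genomes, three made-up features.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_presence = {
    "geneA": [1, 1, 1, 0, 0, 0],  # tracks resistance perfectly
    "geneB": [1, 0, 1, 0, 1, 0],  # weak association
    "geneC": [1, 1, 1, 1, 1, 1],  # core gene, carries no signal
}
toy_ranking = rank_features(toy_presence, toy_labels)
```

With only six genomes, even a perfectly associated feature cannot reach a small p-value, which is one reason this kind of test benefits from the thousands of genomes described in the abstract.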
Results: The amR package carries 10k unique ESKAPE genomes with corresponding genomic metadata and AMR phenotype labels to train, validate, and test each prediction model. ML models built for each species-antibiotic combination achieved consistently high median auPRCs, ranging from 0.7 to 1. These classifiers perform well for AMR prediction across species and drugs, and consistently recover top features (genes, protein domains) known to be associated with AMR and horizontal gene transfer (e.g., tetK for tetracycline; mecA for methicillin resistance; N-acetyltransferase domain for levofloxacin). Alongside known mechanisms, several highly ranked genes are not well characterized for AMR and are prime targets for the discovery of novel AMR contributors. With built-in AMR classification models, thousands of genomes with metadata and phenotypes, their ranked predictor gene/domain sets, and custom data summarization/visualization functions, the amR package provides the first comprehensive, programmatic method to study AMR, starting with the notorious ESKAPE pathogens.
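For readers unfamiliar with the metric, the following is a minimal Python sketch of how an area under the precision-recall curve (auPRC) and a median across models might be computed. It uses the average-precision estimator, which is an assumption: the abstract does not state how auPRC was calculated. The function name, the scores, and the species-drug keys are hypothetical and are not results from amR. Tied scores are not given special treatment in this simplified version.

```python
from statistics import median


def average_precision(labels, scores):
    """Estimate auPRC as average precision: mean precision at each true positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive (resistant) label")
    # Rank genomes from highest to lowest predicted resistance score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    true_pos = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y:
            true_pos += 1
            total += true_pos / k  # precision at this recall step
    return total / n_pos


# Made-up predictions for two hypothetical species-drug models.
toy_models = {
    ("species_1", "drug_1"): ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    ("species_2", "drug_1"): ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2]),
}
toy_auprc = {key: average_precision(y, s) for key, (y, s) in toy_models.items()}
toy_median = median(toy_auprc.values())
```

Unlike accuracy, this metric stays informative when resistant and susceptible genomes are heavily imbalanced, which is common in public AMR phenotype data.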
I am an Assistant Professor at the University of Colorado Anschutz Medical Campus, Dept. of Biomedical Informatics, Center for Health Artificial Intelligence (with ties to the Dept. of Immunology and Microbiology). I completed my PhD in Computational Biology at Virginia Tech and postdoctoral research at the Public Health Research Institute, Rutgers Biomedical and Health Sciences. I recently moved from the Depts. of Pathobiology & Diagnostic Investigation and Microbiology & Molecular Genetics at Michigan State University.
We develop general-purpose computational approaches that integrate large-scale, heterogeneous public datasets to build a mechanistic understanding of microbial genotypes, phenotypes, and diseases.
Specifically, we focus on two key questions:
- How do we link microbial genotypes to phenotypic traits?
We use a combination of protein sequence-structure-function relationships, comparative genomics, and machine learning to bridge the genotype-phenotype gap (e.g., for phenotypes such as antimicrobial resistance, host specificity, and microbial pathogenesis).
- How do we delineate molecular mechanisms underlying host response to infection and discover host-directed therapeutics?
We use comparative transcriptomics, disease-drug signatures, and machine learning to learn about host response and drug repurposing.
Our methods are generally pathogen- and disease-agnostic. We also release open data/software and easy-to-use web applications for wide use by the biomedical community.
I am also actively engaged in training, education, and outreach. I am committed to creating and sustaining a diverse and inclusive ecosystem in data science and R for learners and professionals alike, with a focus on increasing the participation of underrepresented minorities in data science and R programming. To this end, I founded R-Ladies East Lansing and R-Ladies Aurora, and co-founded Women+ Data Science and AsiaR. I also co-chair the R/Bioconductor Community Advisory Board.