BioC2024

07:30
60min
Breakfast and Registration
Tomatis Auditorium
08:30
30min
Welcome Session

Welcome to BioC2024!

Tomatis Auditorium
09:00
15min
Break
Tomatis Auditorium
09:15
90min
Applying tidy principles to investigating chromatin composition and architecture
Jacques Serizay

The integration of tidy tools into genomics analysis catalyzes a paradigm shift in the way researchers approach data manipulation and interpretation. By adhering to the principles of tidy data organization and the elegant syntax of tidyverse packages, researchers can navigate the complexities of genomic datasets with unprecedented ease and efficiency. I will unveil two novel packages: plyinteractions, specifically developed to manipulate chromatin conformation capture data (3C, Hi-C, Micro-C, etc.), and tidyCoverage, to manipulate and extract coverage tracks within the tidyomics framework. The plyinteractions and tidyCoverage packages introduce novel SummarizedExperiment-derived S4 classes to store genomics data and expand tidy methods. They synergize the existing functionalities of tidyverse and Bioconductor to seamlessly intertwine data manipulation, aggregation, visualization, and modeling within a unified framework. This integrated workflow not only enhances the clarity and reproducibility of analyses but also empowers researchers to extract deeper biological insights from their data, offering a glimpse into the future of genomics analysis.
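The aggregation step that packages like tidyCoverage perform over genomic ranges can be illustrated with a toy example. The sketch below is not the package's API (which operates on SummarizedExperiment-derived objects in R); it only shows the idea of summarising a per-base coverage track into one value per range, and all names (`mean_coverage`, `track`, `peaks`) are hypothetical:

```python
# Toy illustration (not the tidyCoverage API): aggregate a per-base
# coverage track over a set of ranges, yielding one summary value per
# range, in the spirit of tidy "group by range, then summarise".

def mean_coverage(coverage, ranges):
    """coverage: list of per-base depths; ranges: list of (name, start, end),
    0-based half-open. Returns {name: mean depth over the range}."""
    out = {}
    for name, start, end in ranges:
        window = coverage[start:end]
        out[name] = sum(window) / len(window) if window else 0.0
    return out

track = [0, 0, 4, 8, 8, 4, 0, 0, 2, 2]
peaks = [("peak1", 2, 6), ("peak2", 8, 10)]
print(mean_coverage(track, peaks))  # → {'peak1': 6.0, 'peak2': 2.0}
```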

Workshops
Tomatis Auditorium
09:15
90min
Introduction to Package Development

Abstract to be submitted.

Workshops
Room 3104-5
10:45
15min
Break
Tomatis Auditorium
11:00
90min
Exploratory spatial data analysis from single molecules to multiple samples
Lambda Moses

Imaging-based spatial transcriptomics technologies with single-molecule resolution, such as MERFISH and Xenium, are commercialized and increasingly adopted. Meanwhile, many studies have collected spatial data from multiple subjects across biological conditions, sometimes across multiple modalities. While many data analysis tools have been written to integrate data across tissue sections and modalities, fewer spatial data analysis tools go below the single-cell resolution to analyze subcellular transcript localization, or go above adjacent sections to compare spatial characteristics of different biological conditions. Different spatial phenomena can be observed at different scales in the same region. Here we present a new version of the Bioconductor packages SpatialFeatureExperiment (SFE) and Voyager to perform exploratory spatial data analysis from single molecules to multiple samples, and scales in between. We demonstrate new functions to read output from commercial imaging-based technologies, coupled with a language-agnostic serialization of the SFE object to reformat the transcript spot data for spatial operations and faster reading. The transcript spots are used in spatial point process analyses. Functionalities on images have also been expanded, to support OME-TIFF with BioFormats and to extract data from raster images with vector geometries to relate gene expression data to histology. Between single molecules and tissue sections, we demonstrate spatial binning at different scales at molecular or cellular levels as an exploratory spatial data analysis tool on length scales. For example, Moran’s I can flip signs from single-cell resolution to coarser bins. Above the tissue section level, we spatially align a Visium and a MALDI-MSI lipidomics dataset, and find genes that are spatially autocorrelated in wild-type but not mutant mice on a high-fat diet in a mouse adipose dataset.
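The scale-dependent sign of Moran's I mentioned in this abstract can be reproduced on a toy one-dimensional transect. The sketch below uses the standard global Moran's I formula and is independent of Voyager; the two example patterns stand in for fine-scale and coarse-scale structure:

```python
# Sketch (standard Moran's I formula, not Voyager code): global spatial
# autocorrelation for values along a 1-D transect with nearest-neighbour
# weights, illustrating that its sign depends on the scale of structure.

def morans_i(values, pairs):
    """values: list of numbers; pairs: unordered neighbour index pairs,
    each listed once (weight 1 per pair, counted in both directions)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(2 * dev[i] * dev[j] for i, j in pairs)   # w_ij = w_ji = 1
    den = sum(d * d for d in dev)
    w_sum = 2 * len(pairs)
    return (n / w_sum) * (num / den)

neighbours = [(i, i + 1) for i in range(7)]            # 8 points in a row
alternating = [1, 0, 1, 0, 1, 0, 1, 0]                 # fine-scale pattern
blocky      = [1, 1, 1, 1, 0, 0, 0, 0]                 # coarse-scale pattern
print(morans_i(alternating, neighbours))  # negative: neighbours disagree
print(morans_i(blocky, neighbours))       # positive: neighbours agree
```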

Workshops
Tomatis Auditorium
11:00
90min
Interoperability between R and Python using BiocPy

The BiocPy project aims to facilitate development of Bioconductor workflows in Python. This workshop will provide an overview of the core data structures implemented in BiocPy (e.g., GenomicRanges, SummarizedExperiment) that were ported from R. Participants will be guided through two Bioconductor-derived workflows in Python. The first will involve reading an RDS file containing genomic ranges and performing downstream range-based analyses. The second will use the scRNAseq package to access public single-cell RNA-seq datasets, followed by cell type annotation using the SinglePy package. Attendees will learn how to represent and manipulate their datasets in Python in the same manner as in R/Bioconductor. All packages in BiocPy are published to PyPI and the code is open source at https://github.com/BiocPy.

Workshops
Room 3104-5
12:30
60min
Lunch
Tomatis Auditorium
13:30
8min
A hierarchical Bayesian model for the identification of technical length variants in miRNA sequencing data
Hannah Swan

MicroRNAs (miRNAs) are small, single-stranded non-coding RNA molecules found in a range of organisms including plants, animals, and some viruses. MiRNAs regulate gene expression through imprecise base pairing of the miRNA molecule to a target messenger RNA (mRNA) molecule. Canonical synthesis of miRNAs begins with transcription of a primary or pri-miRNA molecule, approximately 70 nucleotides in length. The pri-miRNA then goes through a series of processing steps, including cleavage by the Drosha and Dicer enzymes. MiRNA biosynthesis results in a mature, single-stranded miRNA molecule that is loaded onto an Argonaute (AGO) protein to form the RNA-induced silencing complex (RISC). Certain steps of the miRNA synthesis pathway, such as cleavage by Drosha and Dicer, can result in miRNA isoforms that differ from the canonical miRNA sequence in nucleotide sequence and/or length. These miRNA isoforms, called isomiRs, which may differ from the canonical sequence by as few as one or two nucleotides, can have different mRNA targets and stability than the corresponding canonical miRNA. As the body of research demonstrating the role of isomiRs in disease grows, the need for differential expression analysis of miRNA data at a scale finer than the miRNA level grows too. Unfortunately, errors during the amplification and sequencing processes can result in technical miRNA isomiRs identical to biological isomiRs, making resolving variation at this scale challenging. We present a novel algorithm for the identification and correction of technical miRNA length variants in miRNA sequencing data. The algorithm assumes that the transformed degradation rate of canonical miRNA sequences in a sample follows a hierarchical normal Bayesian model. The algorithm then draws from the posterior predictive distribution and constructs 95% posterior predictive intervals to determine if the observed counts of degraded sequences are consistent with our error model. We present the theory underlying the model and assess the performance of the model using an experimental benchmark data set.
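The flagging step described in this abstract can be sketched in a few lines. The sketch is a simplification, not the authors' algorithm: hyperparameters are treated as known rather than estimated, and the 95% posterior predictive interval is obtained by plain Monte Carlo sampling:

```python
# Minimal Monte Carlo sketch of the flagging idea (not the authors'
# estimator): with hyperparameters treated as known, draw from the
# posterior predictive of a normal hierarchical model and flag observed
# degradation values that fall outside the central 95% interval.
import random

random.seed(1)
mu0, tau, sigma = 0.0, 1.0, 0.5           # assumed hyperparameters

def predictive_interval(n_draws=20000, level=0.95):
    # mu_i ~ N(mu0, tau^2); y ~ N(mu_i, sigma^2)
    draws = sorted(random.gauss(random.gauss(mu0, tau), sigma)
                   for _ in range(n_draws))
    lo = draws[int(n_draws * (1 - level) / 2)]
    hi = draws[int(n_draws * (1 + level) / 2) - 1]
    return lo, hi

lo, hi = predictive_interval()
observed = [0.3, -1.2, 6.0]               # 6.0 is far outside the model
flags = [not (lo <= y <= hi) for y in observed]
print(flags)                              # → [False, False, True]
```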

Tomatis Auditorium
13:40
8min
Assessing differential expression strategies for small RNA sequencing using real and simulated data
Ernesto Aparicio-Puerta

Despite significant advances in the experimental side of small RNA sequencing, the statistical analysis of its data is still limited to differential expression methodologies designed for standard RNA-seq. However, small RNA-seq data violate several of the assumptions of these methods. Specifically, methods for mRNA-seq analysis assume approximate independence between feature counts, although the small total number of miRNAs and the presence of a small number of very highly expressed miRNAs result in a lack of independence between miRNA counts. Furthermore, this skewed distribution often results in a negative correlation between the top two miRNAs in each cell type, a phenomenon not observed with mRNAs. In extreme cases, upregulated miRNAs might falsely seem downregulated (or vice versa) because of these technical effects. Additionally, assessing the accuracy of differential expression methods represents a challenge because the correct outcome for a given estimation must be known a priori. To address these issues, we present a benchmark of several differential expression methods commonly employed on small RNA-seq data. Our benchmarking strategy is based on two independent datasets: a custom-generated ad hoc dataset of known RNA mixtures and simulated data derived from existing miRNA-seq studies. These ground truth datasets, designed to mirror the complete dynamic range of miRNA expression, offer a more realistic benchmark compared to previous efforts. This comparative assessment offers a clear insight into the reliability of differential expression methods when applied to small RNA-seq data, which we hope can guide miRNA researchers in selecting differential expression methods for their analyses.

Tomatis Auditorium
13:50
8min
Statistical modelling of microRNA-seq data
Seong-Hwan Jun

MicroRNAs (miRNAs) are pivotal in regulating gene expression and influencing disease progression. Despite their critical role as disease modulators, statistical analysis methods for miRNAs have not been as thoroughly developed as those for messenger RNAs (mRNAs). Commonly, techniques designed for mRNAs are repurposed for miRNA data without considering the unique characteristics of miRNAs. This study critically examines the assumptions of mRNA-based methods when applied to miRNAs. We challenge these assumptions, highlighting the competitive nature of miRNA expression. Our research introduces novel statistical methods and modelling strategies tailored for miRNA sequencing data. These approaches account for the distinctive sources of variability in miRNA data, including competition for expression, library size variations, and data sparsity. We demonstrate the efficacy of our models through validation on both microRNAome datasets and simulated data. Our parameter estimation relies on autograd, while inference employs Laplace's approximation to Bayesian posterior distribution. Our investigation not only questions prevailing practices in miRNA data analysis but also provides a foundation for more accurate and specific miRNA study by examination of sequence level data.

Tomatis Auditorium
14:00
8min
miRglmm: modeling isomiR-level counts improves estimation of miRNA-level differential expression and uncovers variable differential expression between isomiRs
Andrea Baran

Andrea M. Baran, Arun H. Patil, Marc K. Halushka, Matthew N. McCall

Background: microRNAs (miRNAs) can be characterized by small RNA sequencing and sequencing reads are subsequently aligned to known miRNAs. Typically, read counts of sequences that align to the same miRNA are summed to produce miRNA-level read counts. This aggregation discards information about different miRNA transcript isoforms, called isomiRs, whose use might more accurately determine biological miRNA abundance. The aggregated miRNA counts are then used for subsequent differential expression (DE) analyses using tools developed for mRNA-Seq data analysis. There are important differences between miRNA-Seq data and mRNA-Seq data that may make key assumptions of these methods invalid when applied to miRNA-seq data, necessitating the need for a method designed specifically for miRNA-Seq data that can utilize the more granular isomiR-level data.

Methods: We establish miRglmm, a differential expression (DE) method based on a negative binomial mixed-effects model of isomiR-level counts that accounts for dependencies due to reads coming from the same sample and/or from the same isomiR sequence. The isomiR random effect variances can be used to quantify variability in differential expression between isomiR sequences that align to the same miRNA, thereby facilitating detection of miRNAs with differential isomiR usage.

Results: Using synthetic benchmark data, we show that miRglmm provides the lowest mean squared error (MSE) among all DE tools while maintaining coverage at the nominal level. Using biological data to simulate miRNA with random isomiR variability in the group effect, we demonstrate that miRglmm provides markedly lower MSE and much better coverage than other DE tools, especially when significant isomiR variability is present. In real biological data, miRglmm provides fold change estimates that are similar to results from commonly used DE tools (intraclass correlation coefficients ≥ 0.8) and finds significant isomiR-level variability for most miRNA in our analyses.

Conclusions: In cases where significant isomiR variability exists, the loss of information due to aggregation of isomiR-level counts to miRNA-level counts is detrimental to the performance of commonly used DE tools. Our method, miRglmm, can account for this variability, and provides consistently high performance in estimating DE for miRNA, whether or not there is significant isomiR variability within miRNA.

Tomatis Auditorium
14:10
8min
Sobol4RV: sensitivity in random settings
Frederic Bertrand

Sensitivity analysis is commonly used to assess how changes in a model's input data affect its results, and to determine the extent to which changes in a set of model input parameters will affect the model's results.

Global sensitivity analysis, which is becoming increasingly widespread, attempts to quantify the uncertainty due to the uncertainty of input factors taken in isolation or in combination with others. Sobol's sensitivity analysis method is widely used as the classic approach to global sensitivity analysis.

Classical Sobol sensitivity indices assume that the distribution of model parameters is fully known for a given model, which is generally difficult to measure in practical problems.

In our context, these parameter distributions may depend on random parameters in several different ways, all associated with real modeling contexts that we will describe. The aim of this study is to determine in which of these contexts the use of these Sobol indices, either classically via a quantity of interest, or via a summary of the quantity of interest, remains relevant.

We created the R package Sobol4RV to deal with sensitivity analyses in random settings and will provide several examples of its use in biological settings.
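The classical first-order Sobol index this abstract builds on can be estimated with a pick-freeze Monte Carlo scheme. The sketch below is independent of Sobol4RV and uses a toy linear model whose analytic indices are known, so the estimate can be checked:

```python
# Hedged sketch of a classical first-order Sobol estimate (pick-freeze /
# correlation form), not the Sobol4RV implementation: S_i is estimated by
# Cov(Y, Y_i') / Var(Y), where Y_i' reuses X_i but redraws the other inputs.
import random

random.seed(7)

def model(x1, x2):
    return 4 * x1 + x2              # toy model; analytic S1 = 16/17, S2 = 1/17

n = 50000
x1  = [random.random() for _ in range(n)]
x2  = [random.random() for _ in range(n)]
x2b = [random.random() for _ in range(n)]    # redrawn copy of X2
y  = [model(a, b) for a, b in zip(x1, x2)]
y1 = [model(a, b) for a, b in zip(x1, x2b)]  # X1 frozen, X2 redrawn

my, my1 = sum(y) / n, sum(y1) / n
var_y = sum((v - my) ** 2 for v in y) / n
s1 = (sum(a * b for a, b in zip(y, y1)) / n - my * my1) / var_y
print(round(s1, 2))                 # close to 16/17 ≈ 0.94
```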

Tomatis Auditorium
14:20
15min
Break
Tomatis Auditorium
14:35
8min
PathSeeker: A Statistical Package for Enhanced Pathogen Identification and Characterization in RNA Sequencing Data
Mercedeh Movassagh

PathSeeker is a statistical R package tailored to improve pathogen identification and characterization from RNA sequencing (RNA-seq) data. Addressing challenges in datasets with numerous potential organism identifications, it leverages summary statistics, topic models, regressions, rank tests, and visualization techniques for precise organism detection and identification.
PathSeeker uses the Chan Zuckerberg ID (CZID) pipeline for initial organism identification, followed by an intricate process involving filtrations and statistical modeling. This is based on blank samples and initial library quantity, using a regression curve to differentiate true organisms from contaminants. A feature of PathSeeker is its delta approach, employed to minimize cross-contamination errors. This is supplemented by user-defined filters, allowing for positive sample identification based on exceeding median levels in blank controls or meeting specific statistical benchmarks set by the user.
Incorporating topic models, PathSeeker establishes baselines of pathogen expression across samples to discern true pathogenic signals in scenarios lacking controls. This enhances PathSeeker’s ability to identify authentic pathogenic signals in complex datasets.
Furthermore, PathSeeker encompasses an extensive suite of differential abundance analysis tools, adept at handling diverse data distributions and experimental conditions. It incorporates models based on negative binomial and log-normal assumptions, crucial for analyzing data with varying levels of dispersion and normality. For pairwise comparisons, the package employs the Wilcoxon test, ensuring robust analysis of two-condition scenarios. In more complex experimental designs involving three or more conditions, PathSeeker utilizes the Kruskal-Wallis test, adept at managing multi-group comparisons. This comprehensive approach to differential abundance analysis enables researchers to rigorously evaluate the presence and significance of pathogens under various experimental conditions, further strengthening the package’s capability in delivering insightful pathogen profiles from RNA-seq data.
PathSeeker's accuracy was confirmed using RNA-seq datasets of neonatal hydrocephalus and maternal placental infections for the detection of various viral, bacterial, and parasitic organisms. It addresses the challenge of noise reduction for pathogen identification using RNA-seq. For in-depth confirmation of PathSeeker results, polymerase chain reaction (PCR) and rapid diagnostic blood tests (RDTs) were utilized.
PathSeeker's commitment to transparency and accessibility will see its methodologies made available to the scientific and health communities via open-access R packages on GitHub and Bioconductor. This approach not only promotes collaborative research but also propels the field forward in the accurate interpretation of RNA-seq data in pathogen research in a user friendly manner.
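The blank-control filter described in this abstract (keeping organisms whose counts exceed the median level in blank controls) can be sketched as follows. This omits PathSeeker's regression modelling and delta approach, and all counts shown are invented for illustration:

```python
# Toy version of the blank-control filter described above (the real
# package adds regression modelling and a delta approach; this shows
# only the median rule). An organism is kept in a sample when its count
# exceeds the median of its counts across blank controls.
from statistics import median

blanks = {                        # hypothetical read counts in blank controls
    "E. coli": [50, 60, 40],
    "HHV-6":   [0, 0, 1],
}
sample = {"E. coli": 45, "HHV-6": 200}

def passes_blank_filter(sample_counts, blank_counts):
    return {org: count > median(blank_counts.get(org, [0]))
            for org, count in sample_counts.items()}

print(passes_blank_filter(sample, blanks))
# → {'E. coli': False, 'HHV-6': True}
```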

Tomatis Auditorium
14:45
8min
amR: an R package to predict and explore the top antimicrobial resistance features
Janani Ravi

Background and Rationale: Antibiotic misuse has led to the rapid evolution of antimicrobial resistance (AMR) among bacteria, including multidrug resistance in the ESKAPE pathogens. Identifying AMR in clinical isolates is critical for treatment, but AMR arises from many mechanisms spanning multiple molecular scales (e.g., genes, protein domains). Prior in silico research has leveraged machine learning (ML) with AMR-labeled bacterial genome sequence data to predict AMR and discover associated genes. However, few, if any, thoroughly explore cross-species and multi-drug AMR models, despite an expected overlap in AMR genes due to mechanisms such as horizontal gene transfer. To identify these mechanisms and predict resistance, we leverage abundant publicly available whole genome sequences and AMR phenotype data from across the ESKAPE pathogens to train supervised machine learning models. Our comprehensive ML framework predicts the top AMR features across ESKAPE pathogens and across drug classes spanning at least two molecular scales (genes, protein domains). We develop a companion R package, amR, to provide a programmatic interface (with optional Docker images) carrying all relevant ESKAPE pangenome data and metadata, along with ML models and top genomic features for further research, benchmarking, and experimental follow-up.

Approach: amR is an R package (installable via GitHub) including datasets, machine learning (ML) models, and top AMR genomic (gene/protein-domain) predictors. We start with all publicly available ESKAPE genomes, annotate them with PGAP, and construct pangenomes by/across species. These pangenomes are abstracted as gene presence/absence (and transformed gene→protein-domain) feature matrices with corresponding AMR resistance/susceptibility labels per genome (from BV-BRC) that serve as inputs to our ML models (e.g., logistic regression (LR), random forest (RF), and boosting algorithms). We trained each binary classification model to predict AMR phenotypes per bug (ESKAPE species), per drug (antibiotic, drug class), and across bugs and drugs. All amR feature matrices, labels, and models are loadable as Rdata objects. For each model, users can access the top model coefficients (i.e., for gene/protein-domain predictors) and the adjusted p-values of feature-wise hypothesis testing against the resistance phenotype (Fisher’s exact test with Benjamini–Hochberg correction). The amR package includes functions with roxygen2 documentation to reproduce all aforementioned steps, plus additional functions to summarize data availability by metadata (e.g., temporal, geographic, clinical vs. environmental isolate information), model performance, and multiscale genomic predictors (genes/domains) of AMR.

Results: The amR package carries 10k unique ESKAPE genomes with corresponding genomic metadata and AMR phenotype labels to train, validate, and test each prediction model. ML models built for each species-antibiotic combination resulted in consistently high median auPRCs ranging from 0.7–1. These classifiers are highly performant for AMR prediction across species and drugs, and consistently predict top features (genes, protein domains) known to be associated with AMR and horizontal gene transfer (e.g., tetK for tetracycline; mecA for methicillin-resistance; N-acetyltransferase domain for levofloxacin). Alongside known mechanisms, several highly ranked genes are not well-characterized for AMR and are prime targets for the discovery of novel AMR contributors. With inbuilt AMR classification models, thousands of genomes with metadata and phenotypes, their ranked predictor gene/domain sets, and custom data summarization/visualization functions, the amR package provides the first comprehensive, programmatic method to study AMR, starting with the notorious ESKAPE pathogens.
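The feature-wise testing step described above (Fisher's exact test with Benjamini-Hochberg correction) can be sketched from scratch. For simplicity this sketch uses the one-sided, upper-tail hypergeometric p-value rather than the two-sided Fisher test the package applies, and the example 2x2 table is made up:

```python
# Sketch of feature-wise testing with assumptions noted in the lead-in:
# an upper-tail hypergeometric (Fisher-type) p-value for gene presence
# vs. resistance, followed by Benjamini-Hochberg adjustment.
from math import comb

def hypergeom_upper_p(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] with fixed margins."""
    n, K, draws = a + b + c + d, a + b, a + c
    return sum(comb(K, x) * comb(n - K, draws - x)
               for x in range(a, min(K, draws) + 1)) / comb(n, draws)

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up, enforced monotone)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, prev = [0.0] * m, 1.0
    for rank, i in enumerate(reversed(order)):
        k = m - rank                      # rank of p-value, largest first
        prev = min(prev, pvals[i] * m / k)
        adj[i] = prev
    return adj

# hypothetical gene present in 3/3 resistant and 0/3 susceptible isolates
p = hypergeom_upper_p(3, 0, 0, 3)
print(p)                                  # → 0.05 (= 1/20)
print(bh_adjust([0.01, 0.04, 0.03, 0.5]))
```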

Tomatis Auditorium
14:55
8min
microgenomeR: an R workflow for integrating genomic metadata and bacterial phenotypes
Raymond Lesiyon

Background: Trait-based approaches in the fields of ecology and quantitative biology have been gaining traction. Investigating the bacterial trait space at a macro scale is crucial to understanding the niches bacteria occupy. Specifically, we seek to understand the traits that delineate pathogenic, or more broadly host-associated, bacteria. To tackle this problem, access to robust bacterial trait data is fundamental. Major issues with current data repositories (BacDive and bugPhyzz) include: 1) accessibility, setting a barrier to entry for biologists/ecologists lacking programming experience, and 2) the dispersal of different traits across multiple siloed repositories, requiring advanced data wrangling. We introduce the microgenomeR R package to tackle these issues by 1) readily presenting diverse trait data through R data files, 2) minimizing the necessary data wrangling for users with custom cleanup and summarization functions, and 3) enhancing trait data visualization with an interactive dashboard.

Approach: microgenomeR is an R package (with proposed installation via GitHub) encapsulating bacterial species metadata ranging from quantitative genomic traits to phenotypic traits. The microgenomeR package builds upon initial data integration and harmonization workflows (by Madin [2020] and bugphyzz), which incorporate ~30+ different bacterial datasets and merge them at the species and strain levels. The workflow was repurposed, first by updating the constituent bacterial datasets. Secondly, for each species, missing trait values were imputed using bugPhyzz, an R data package with a collection of regularly updated bacterial physiology traits. Where relevant, pathogenicity and host information was further added using a list of manually curated and text-mined bacterial host and pathogen data. The package exports the data as an R data file for ease of access. For advanced users, the package allows updating of trait values using bugPhyzz. Additionally, the package is accompanied by an R Shiny dashboard for quick exploratory data analysis.

Results: The microgenomeR package contains 22 traits for more than 15,000 bacterial species. The traits span 6 groups ranging from abiotic environmental traits to morphological traits. Further, ~2,000 species were designated as pathogens. For host information, ~4,000 species are assigned to 1,356 unique host species spanning 12 major groups ranging from mammals, to plants, to invertebrates.

Tomatis Auditorium
15:05
8min
Inclusive internships in genomic data science: Outreachy and Bioconductor
Svetlana Ugarcina Perovic

Access to the products and methods of modern genome-scale biology is inequitable, with many institutions and communities excluded owing to resource constraints and cultural and administrative barriers. In partnership with the Outreachy program, Bioconductor developers create internship opportunities in a number of different areas of software development and genomic data analysis. This talk describes the population of internship applicants, their application activities, the scope of mentoring activities, and the outcomes of awarded internships.

Tomatis Auditorium
15:15
8min
Statistical Methods for the Tissue Microenvironment of Multiplex Images in a Clinically Relevant Manner
Xinyue Cui

In the realm of cancer research, understanding the tumor microenvironment (TME) plays a key role in predicting tumor behavior and patient outcomes. Due to the complex nature of the TME, advanced imaging methods like Multiplexed Ion Beam Imaging by Time of Flight (MIBI-TOF) are necessary to gather spatial information. Our research presents an approach using spatial Latent Dirichlet Allocation (LDA) to analyze these data, comparing it with traditional LDA to showcase its effectiveness in capturing spatial variations within the TME. Our methodology involves a two-step process: initially using LDA to identify patterns in cell phenotype distributions, and then utilizing a modified linear model to address spatial differences among cells. This efficient method showcases the capabilities of spatial LDA in understanding TME complexity.

Tomatis Auditorium
15:30
15min
Break
Tomatis Auditorium
15:45
90min
BioC lightning talks (Day 1)

Lightning talks from the community.

Tomatis Auditorium
07:30
30min
Registration and Breakfast
Tomatis Auditorium
08:00
45min
Keynote: Sündüz Keleş, PhD

Keynote talk by Sündüz Keleş, PhD

Keynote
Tomatis Auditorium
08:45
15min
Break
Tomatis Auditorium
09:00
8min
Singlet: Fast and Interpretable Dimension Reduction of Single-cell Data
Zachary DeBruine

Non-negative Matrix Factorization (NMF) is a popular and interpretable method for dimension reduction. NMF can be used in the same analysis pipelines as Principal Component Analysis, but also provides direct insights into patterns of gene co-expression (biological processes) and context of cell similarities (soft clustering). However, NMF implementations have historically been too slow to scale to the size of datasets available today. We implemented very fast and scalable NMF in our "singlet" R package, which natively supports sparse matrices and can scale to datasets with tens of millions of cells. We trained an NMF model on 28 million human transcriptomes from the Chan Zuckerberg Initiative CellCensus dataset. Our embeddings are publicly available as CellCensus models for transfer learning. We also developed new data structures for single-cell data that outperform BPCells in compression by ~4x and exceed its performance, allowing, in theory, for in-core NMF of over 100 million single-cell transcriptomes on high-memory HPC nodes. Here we show how a standard Seurat object can make use of "singlet" NMF, discuss what insights can be derived from these data, share some use cases for our new CellCensus model, and show how our new data structures enable in-core analysis of single-cell datasets that would normally require distributed computing.
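The factorization itself can be illustrated with the classical Lee-Seung multiplicative updates on a tiny dense matrix. This is a pedagogical sketch, not singlet's implementation, which is heavily optimized for large sparse single-cell matrices:

```python
# Tiny pure-Python NMF with Lee-Seung multiplicative updates: V ≈ W H
# with W, H kept non-negative. Illustrative only; not singlet's algorithm.
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def frob_err(V, W, H):
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

def nmf(V, k, iters=200, eps=1e-9, seed=0):
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        WtV, WtWH = matmul(transpose(W), V), matmul(transpose(W), matmul(W, H))
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        VHt, WHHt = matmul(V, transpose(H)), matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H

V = [[5, 3, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [1, 0, 0, 4], [0, 1, 5, 4]]
W, H = nmf(V, k=2)
print(round(frob_err(V, W, H), 2))  # reconstruction error after 200 updates
```

The multiplicative updates are guaranteed not to increase the reconstruction error, which is why running more iterations never hurts the fit on the training matrix.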

Tomatis Auditorium
09:10
8min
The impact of package selection and versioning on single-cell RNA-seq analysis
Joseph Rich

Seurat and Scanpy are two of the most widely used tools for the analysis of single-cell RNA sequencing (scRNA-seq) data, and are generally thought to implement the standard analysis pipeline very similarly. However, we find that Seurat and Scanpy demonstrate drastic differences in output when comparing standard pipelines. The steps that differ by default include the highly variable gene selection algorithm, Principal Component Analysis (PCA) after scaling, the approximate k-nearest neighbors (KNN) graph algorithm, the shared nearest neighbor graph construction method, the clustering algorithm, the Uniform Manifold Approximation and Projection (UMAP) method, marker gene filtering, log-fold change calculation, and p-value calculation and adjustment. Only a portion of these steps can be made to act identically through intentional selection of function arguments. The degree of difference between packages can lead to divergent biological conclusions downstream, and essentially amounts to the differences seen by downsampling a dataset to as little as 4% of the original reads or 16% of the original cells. Additionally, the version of Seurat or Scanpy can have an impact on differential expression analysis, particularly during marker gene selection and, in the case of Seurat, the log-fold change calculation technique. The version of Cell Ranger used to generate the count matrix can impact all steps of analysis, including cell and gene filtering, especially when there is a difference in the setting of intron inclusion. Put together, our study makes a case for the importance of careful and consistent package selection and version control when conducting scRNA-seq data analysis in order to minimize technical noise.

Tomatis Auditorium
09:20
8min
scHiCcompare - differential analysis of single-cell Hi-C data
My Nguyen

Advisor: Mikhail Dozmorov

Changes in the three-dimensional (3D) structure of the genome are an established hallmark of cancer and developmental disorders. To comprehend these global 3D structures, techniques such as chromatin conformation capture (Hi-C) have been devised. A typical Hi-C experiment requires millions of cells and ultra-high sequencing depth, on the order of 1 billion reads per sample (bulk Hi-C). In contrast, single-cell Hi-C technologies allow for capturing 3D structures in individual cells, albeit with the trade-off of high data sparsity (a high proportion of zeros). Despite numerous methods for differential 3D analysis in bulk Hi-C data, differential analysis of single-cell Hi-C data remains underdeveloped. We propose to adapt our HiCcompare method for scHi-C data normalization and differential analysis. Briefly, clusters of scHi-C data will be imputed univariately by random forest, converted to pseudo-bulk Hi-C datasets, and analyzed as we described previously. Single-cell datasets of 14 cell types from the human prefrontal cortex, covering each chromosome, will be used to apply our method and detect chromosomal region differences between pairs of cell types. We will present results for scHiCcompare and benchmark methods (negative binomial model, t-test, Kolmogorov–Smirnov test, Wilcoxon signed-rank test, SnapHiC-D, and scHiCDiff) in terms of controlling false positives under the null hypothesis and detecting simulated differential chromatin contacts. Notably, scHiCcompare outperforms these methods across various metrics, including the Matthews Correlation Coefficient (MCC), F1 score, sensitivity, and specificity, with a smaller single-cell sample size, enhancing detection power.

Tomatis Auditorium
09:30
8min
iscream: Fast and memory efficient (sc)WGBS data handler
James Eapen

Whole-genome bisulfite sequencing (WGBS) of the 28 million CpG loci in the human genome produces large quantities of data. Reading and storing this data is memory-intensive and can be slow. Further, while bulk WGBS data covers the majority of CpGs, single-cell WGBS data typically does not and requires a sparse matrix representation.

iscream is designed to handle both bulk and single-cell WGBS methylation data from the BISCUIT and Bismark aligners. Its goal is to provide users a fast and memory efficient way to load WGBS data for analysis, whether full genomes or regions of interest. It aims to reduce the memory footprint and runtime of data loading by reading and manipulating the data in-place without deep copying. This is achieved using Rcpp for speed and fine-grained memory usage control. Rcpp also provides access to htslib for fast genomic region queries.

iscream outputs data as sparse or dense matrices that can be used to create BSseq objects for analysis. Users can aggregate single-cell methylation information across regions of interest for analysis with scMET. Future work will include support to use the arrow library so that large datasets can be lazily queried without having to load them into memory first.
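The case for a sparse representation can be made with a back-of-the-envelope sketch (illustrative only, not iscream's implementation; the per-cell call lists are invented). Storing only the covered (cell, locus) entries is far cheaper than a dense cells-by-loci matrix when most CpGs are unobserved per cell:

```python
# Illustrative sketch (not iscream's implementation): sparse storage of
# single-cell methylation calls. Hypothetical per-cell (cpg_index, beta) pairs.
cells = {
    "cell1": [(0, 1.0), (5, 0.0)],
    "cell2": [(5, 1.0), (9, 0.5)],
    "cell3": [(0, 0.0)],
}
n_loci = 10

# Dict-of-keys sparse matrix: only covered entries are stored.
sparse = {(c, i): beta for c, calls in cells.items() for i, beta in calls}

# A dense matrix would store every (cell, locus) pair, covered or not.
dense_entries = len(cells) * n_loci
stored = len(sparse)
coverage_fraction = stored / dense_entries
```

With real scWGBS data the covered fraction per cell is often small, so the savings scale with the 28 million CpG loci rather than with coverage.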

The package is still under development, but may be found at https://github.com/huishenlab/iscream.

Tomatis Auditorium
09:40
09:40
8min
vmrseq: Probabilistic Modeling of Single-cell Methylation Heterogeneity
Ning Shen

Single-cell DNA methylation measurements reveal genome-scale inter-cellular epigenetic heterogeneity, but extreme sparsity and noise challenge rigorous analysis. Previous methods to detect variably methylated regions (VMRs) have relied on predefined regions or sliding windows, and report regions that are insensitive to the level of heterogeneity present in the input. We present vmrseq, a statistical method that overcomes these challenges to detect VMRs with increased accuracy in synthetic benchmarks and improved feature selection in case studies. vmrseq also highlights context-dependent correlations between methylation and gene expression, supporting previous findings and facilitating novel hypotheses on epigenetic regulation. vmrseq is available at https://github.com/nshen7/vmrseq.
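For context, the baseline approach the abstract contrasts with can be sketched as a fixed sliding-window scan (illustrative only, not vmrseq; the toy beta matrix and variance cutoff are invented, and missingness is ignored for simplicity):

```python
# Baseline sketch (not vmrseq): a predefined sliding-window scan for variably
# methylated regions. Windows whose across-cell methylation variance exceeds
# a fixed cutoff are flagged as candidate VMRs.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Toy matrix: rows = cells, columns = CpG sites (beta values; real scWGBS
# data is extremely sparse, which this sketch ignores).
beta = [
    [0.9, 0.9, 0.1, 0.9, 0.9, 0.9],
    [0.9, 0.8, 0.9, 0.9, 0.9, 0.8],
    [0.8, 0.9, 0.1, 0.8, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.8, 0.9],
]

window = 2
candidates = []
for start in range(0, len(beta[0]), window):
    sites = range(start, start + window)
    per_cell = [sum(row[i] for i in sites) / window for row in beta]
    if variance(per_cell) > 0.02:
        candidates.append(start)
```

The fixed window and cutoff are exactly what makes this baseline insensitive to the heterogeneity level of the input, which vmrseq's probabilistic model addresses.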

Tomatis Auditorium
09:50
09:50
25min
Break
Tomatis Auditorium
10:15
10:15
45min
Keynote: Joshua Welch, PhD

Keynote talk by Joshua Welch, PhD

Keynote
Tomatis Auditorium
11:00
11:00
15min
Break
Tomatis Auditorium
11:15
11:15
8min
Cell type co-localization and cell type-specific microenvironment analysis on spatial transcriptomics data
Mengbo Li

Spatial transcriptomics technologies are now able to measure spatially resolved gene expression at unprecedented throughput and resolution. A central aim of analyzing spatial transcriptomics data is to decipher the organization of cells and tissues, spatially and temporally. Main tasks include the definition of anatomical regions such as tumors, the capture of cell-to-cell co-localization patterns, the analysis of cellular microenvironments across different regions, and the detection of spatially variable genes, to name but a few. Meanwhile, from a data point of view, one significant question is how to make use of the spatial coordinates of each measurement unit (cells or spots). We present scider, an R package for cell type co-localization and cell type-specific microenvironment analysis on spatial transcriptomics data. For each cell type, cell coordinates are summarized by a kernel-smoothed spatial density function, whereby cell type-specific regions of interest (ROIs) can be defined at each density level. We are then able to examine cell type co-localization and composition patterns within each ROI, which provides a description of the local cellular microenvironment for the cell type of interest. In addition, scider takes advantage of geometric operations from geospatial analysis, based on which ROIs can also be defined by areas between every two contour lines of the spatial density function. This allows us to perform differential expression analysis within each cell type across diverse tissue contexts, while accounting for the localization of other cell types. scider is publicly available at https://bioconductor.org/packages/release/bioc/html/scider.html.
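The density-level ROI idea can be sketched in a few lines (illustrative only, not scider's API; the cell coordinates, bandwidth, and density cutoff are invented): smooth one cell type's coordinates with a Gaussian kernel and keep the grid points above a chosen density level.

```python
# Illustrative sketch (not scider's API): a cell-type ROI as the set of grid
# points where the Gaussian-kernel-smoothed density of that cell type's
# coordinates exceeds a chosen level.
import math

cells = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0)]  # one outlying cell
bandwidth = 0.5

def density(x, y):
    return sum(
        math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * bandwidth ** 2))
        for cx, cy in cells
    ) / (len(cells) * 2 * math.pi * bandwidth ** 2)

# Evaluate on a coarse grid; the ROI at this level captures the dense cluster
# but not the isolated cell.
grid = [(x / 2, y / 2) for x in range(12) for y in range(12)]
roi = [(x, y) for x, y in grid if density(x, y) > 0.3]
```

Nesting several density levels yields the contour-band ROIs the abstract describes.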

Tomatis Auditorium
11:25
11:25
8min
Identification of spatial domains by smoothing for compositional analyses in spatial transcriptomics data
Lukas Weber

Spatial transcriptomics platforms enable the measurement of transcriptome-scale gene expression levels at spatial resolution, and have become widely applied to study spatial variation in cell type composition and gene expression patterns within tissue samples. Depending on the technological platform, the spatial resolution of the measurements may either be at molecular resolution or consist of pooled measurements from one or more cells per spatial location, and measurements may also be characterized by high levels of sparsity due to sampling variation. The identification of spatial domains consisting of tissue regions with relatively uniform cell type composition or mixtures and consistent gene expression signatures represents a key step during computational analysis workflows. Spatial domains may then be further investigated by applying tools for cell type compositional analyses or differential analyses. We have developed a new method, smoothclust, to identify spatial domains in spatial transcriptomics data in an interpretable and computationally scalable manner, based on spatial smoothing of gene expression values followed by unsupervised clustering. We have evaluated the method using data from several technological platforms and compared against existing and baseline methods. The method is available as an R package from Bioconductor and is integrated into Bioconductor-based analysis workflows for spatial transcriptomics data.
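The core smoothing-then-clustering idea can be illustrated with a toy one-dimensional example (a conceptual sketch, not the smoothclust implementation; the spot values, neighbourhood radius, and two-centroid clustering are all invented for illustration):

```python
# Conceptual sketch (not smoothclust itself): average each spot's expression
# over its spatial neighbours, then cluster the smoothed values. Smoothing
# lets a noisy spot inherit its region's signature.

def smooth(values, radius=1):
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def two_means(values, iters=10):
    # Tiny k-means with k=2; centroid 0 starts at the minimum, so label 0
    # deterministically denotes the low-expression domain.
    c0, c1 = min(values), max(values)
    labels = []
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, l in zip(values, labels) if l == 0]
        g1 = [v for v, l in zip(values, labels) if l == 1]
        if g0: c0 = sum(g0) / len(g0)
        if g1: c1 = sum(g1) / len(g1)
    return labels

# Spots 0-4 form a low-expression domain, 5-9 a high one; spot 2 is noisy.
expr = [0.1, 0.2, 2.0, 0.1, 0.2, 1.9, 2.1, 2.0, 1.8, 2.2]
raw_labels = two_means(expr)            # noisy spot 2 is mislabelled
smoothed_labels = two_means(smooth(expr))  # smoothing recovers the domains
```

The same principle, applied to full expression profiles over 2-D spot neighbourhoods, yields spatially coherent and interpretable domains.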

Tomatis Auditorium
11:35
11:35
8min
Scalable count-based models for unsupervised detection of spatially variable genes
Boyi Guo

Unsupervised feature selection methods are well sought after in the analysis of high-dimensional genomics data. The recent development of spatially resolved technologies poses novel computational challenges, including the identification and ranking of genes that vary spatially, i.e., spatially variable genes (SVGs). While many SVG methods have been proposed to model continuous normalized gene expression data, they are susceptible to biases introduced by normalization strategies and vulnerable to violations of the isotropy assumption, leading to erroneous findings. The few available count-based SVG methods, while theoretically sound, are computationally prohibitive and less practical for real-world application. To address these challenges, we propose a scalable approach that extends the generalized geoadditive framework to the analysis of spatially resolved transcriptomics data. Our method identifies genes whose expression exhibits spatial patterns and accounts for effect differences across pre-defined spatial domains when applicable. In addition, our method provides flexibility in modeling raw gene expression data, accommodating multiple count-based distributions including Poisson, Negative Binomial, and Tweedie. In simulation studies and real-world applications, we demonstrate that our proposed count-based models outperform state-of-the-art SVG methods.
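The gist of count-based SVG detection can be shown with a much-simplified sketch (illustrative only, not the proposed geoadditive method; the counts, regions, and the per-region-mean model are invented stand-ins for a spatial smooth): compare Poisson log-likelihoods of a spatially varying mean against a constant mean on the raw counts, with no normalization step.

```python
# Simplified sketch (not the proposed method): score a gene for spatial
# variability on raw counts by comparing Poisson log-likelihoods of a
# spatially varying mean (here, a per-region mean) against a constant mean.
import math

def pois_loglik(counts, means):
    return sum(y * math.log(m) - m - math.lgamma(y + 1)
               for y, m in zip(counts, means))

# Toy gene: counts in two spatial regions with clearly different levels.
region = [0] * 5 + [1] * 5
counts = [1, 2, 1, 0, 2, 8, 9, 10, 7, 11]

mu_const = [sum(counts) / len(counts)] * len(counts)
mu_by_region = [sum(c for c, r in zip(counts, region) if r == g) / region.count(g)
                for g in region]

# Likelihood-ratio-style statistic; large values suggest spatial variability.
stat = 2 * (pois_loglik(counts, mu_by_region) - pois_loglik(counts, mu_const))
```

The actual method replaces the per-region means with smooth spatial terms and supports Negative Binomial and Tweedie likelihoods for overdispersed counts.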

Tomatis Auditorium
11:45
11:45
8min
SpotSweeper: spatially-aware quality control for the removal of technical artifacts and local outliers in spatial transcriptomics
Michael Totty

Quality control (QC) is a crucial step to ensure the reliability and accuracy of the data obtained from sequencing experiments. Recent technological advances now allow for whole-transcriptome profiling with spatial resolution. Until now, standard QC procedures for spatially-resolved transcriptomics (SRT) data have adopted methods developed for single-cell RNA sequencing (scRNA-seq). We show here that QC methods developed for scRNA-seq are inappropriate for SRT. Additionally, SRT data are subject to large technical artifacts, such as hangnails or tissue folds, that arise from tissue processing errors unique to SRT. To address this, we have developed SpotSweeper, an R/Bioconductor package for the detection and removal of both local outliers and technical artifacts in SRT using spatially-aware methods. By comparing individual spots to their local neighborhood, we show that SpotSweeper avoids the bias present in global (i.e., whole-tissue) comparisons. Similarly, we demonstrate that technical artifacts can be classified with high accuracy using the local variance in mitochondrial ratio. Collectively, SpotSweeper is the first computational method developed specifically for SRT.
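Why local comparisons matter can be seen in a small sketch (illustrative only, not SpotSweeper's code; the 1-D positions, counts, and z-score rule are invented): a spot can look unremarkable against the whole tissue yet be a strong outlier against its spatial neighbours.

```python
# Illustrative sketch (not SpotSweeper's code): local vs global outlier QC.
def local_z(values, positions, idx, k=3):
    # z-score of spot idx relative to its k nearest spatial neighbours
    order = sorted(range(len(values)), key=lambda j: abs(positions[j] - positions[idx]))
    nbrs = [j for j in order if j != idx][:k]
    vals = [values[j] for j in nbrs]
    m = sum(vals) / k
    sd = (sum((v - m) ** 2 for v in vals) / k) ** 0.5
    return (values[idx] - m) / sd if sd else 0.0

# Two tissue regions with different baseline counts; spot 7 sits in the
# low-count region but has an intermediate count.
positions = list(range(10))
counts = [100, 105, 98, 102, 101, 10, 12, 50, 11, 9]

gm = sum(counts) / len(counts)
gsd = (sum((c - gm) ** 2 for c in counts) / len(counts)) ** 0.5
global_z7 = (counts[7] - gm) / gsd          # unremarkable globally
local_z7 = local_z(counts, positions, 7)    # extreme among its neighbours
```

A global QC threshold would keep spot 7; a spatially-aware one flags it.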

Tomatis Auditorium
11:55
11:55
8min
Using high-throughput spatial proteomics as a platform to elucidate protein relocalisation events during viral production
Charlotte Hutchings

The potential of recombinant adeno-associated viruses (rAAVs) as therapeutic DNA delivery agents has been established, and four such gene therapies are now commercially available. The most common way in which rAAVs are manufactured is through transient triple transfection of HEK293 cells, essentially using the cells as virus production factories. Unfortunately, low rAAV yield and recovery during this process currently limits the wider use of these vehicles. Improvements in the manufacturing of rAAVs to ensure the long-term success of rAAV gene therapies will require a better understanding of the molecular mechanism(s) by which rAAVs are produced inside HEK293 cells. Using high-throughput mass spectrometry-based spatial proteomics we have generated sub-cellular protein maps of control (non-producing) and rAAV-producing HEK293 cells. Global spatial maps were acquired using the well-established Localisation of Organellar Proteins by Isotope Tagging after Differential Centrifugation (LOPIT-DC) method. Data were analysed using dedicated Bioconductor proteomics packages, namely QFeatures, MSnbase, pRoloc and bandle. By exploiting the semi-supervised Bayesian methods available in pRoloc and bandle, we were able to systematically localise thousands of proteins per experiment to distinct subcellular organelles and protein complexes. Comparison of cellular protein localisation in control and rAAV-producing cells has helped to elucidate whether differential protein localisation of cellular or viral proteins could contribute to limitations in the rAAV production process.

Tomatis Auditorium
12:05
12:05
8min
Context is important! Identifying context aware spatial relationships with Kontextual.
Farhan Ameen

State-of-the-art technologies such as PhenoCycler, IMC, CosMx, Xenium, and others can deeply phenotype cells in their native environment, providing a high-throughput means to effectively quantify spatial relationships between diverse cell populations in the context of their native tissue environments. However, the experimental design choice of which spatial domains or regions of interest (ROIs) will be imaged can greatly impact the interpretation of spatial quantifications. That is, spatial relationships identified in one ROI may not be applicable in other ROIs. To address this challenge, we introduce Kontextual, a method which considers alternative contexts for the evaluation of spatial relationships. These contexts may represent landmarks, spatial domains, or groups of functionally similar cells which are consistent across ROIs. By modelling spatial relationships between cells relative to these contexts, Kontextual produces robust spatial quantifications that are not confounded by the ROI selected. We applied Kontextual to a MIBI-TOF and a Xenium breast cancer dataset and identified biologically relevant relationships which were consistent across ROIs. Furthermore, we compared Kontextual with other spatial and non-spatial features and observed that it was better suited for predicting patient prognosis. These results suggest Kontextual is a useful approach to overcome the challenges in analysing high-throughput spatial omics across patient cohorts.

Tomatis Auditorium
12:15
12:15
30min
BioC Awards Ceremony
Tomatis Auditorium
12:45
12:45
60min
Lunch
Tomatis Auditorium
13:45
13:45
45min
Keynote: Stephen Piccolo, PhD

Keynote talk by Stephen Piccolo, PhD

Keynote
Tomatis Auditorium
14:30
14:30
15min
Break
Tomatis Auditorium
14:45
14:45
45min
BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis
Vipul Singhal, Joseph Lee

A core property of solid tissue is the spatial arrangement of cell types into stereotypical spatial patterns. These cell types can be investigated with spatial omics technologies to reveal both their omics features (transcriptomes, proteomes, etc), and their spatial coordinates. Because a cell’s state can be influenced by interactions with other cells, it is informative to cluster cells using both these types of features. We present BANKSY, an algorithm with R and Python implementations that identifies cell types and tissue domains from spatially-resolved omics data. It does so by embedding cells in a product space of their own and neighbourhood omics features. In our tests, BANKSY revealed niche-dependent cell states in the mouse brain, and outperformed competing methods on domain segmentation and cell-typing benchmarks. BANKSY can also be used for quality control of spatial transcriptomics data and for spatially aware batch correction. Critically, it is substantially faster and more scalable than existing methods, enabling the processing of datasets with millions of cells. In R, BANKSY can be used in the Bioconductor ecosystem via the SingleCellExperiment class.
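The product-space embedding can be sketched simply (a conceptual sketch of the core idea, not BANKSY's implementation; the toy features, the k=1 neighbourhood, and the linear mixing weight are invented simplifications): concatenate each cell's own expression vector with the mean expression of its spatial neighbours, then cluster in that joint space.

```python
# Conceptual sketch of BANKSY's core idea (not its implementation): augment
# each cell's expression with its spatial neighbourhood's mean expression.

def neighbour_mean(features, coords, idx, k=1):
    order = sorted(
        (j for j in range(len(coords)) if j != idx),
        key=lambda j: (coords[j][0] - coords[idx][0]) ** 2
                    + (coords[j][1] - coords[idx][1]) ** 2,
    )
    nbrs = order[:k]
    return [sum(features[j][g] for j in nbrs) / k
            for g in range(len(features[idx]))]

# Two spatial niches, two genes per cell.
features = [[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 0.9]]
coords = [(0, 0), (0, 1), (5, 5), (5, 6)]

lam = 0.5  # mixing weight between own and neighbourhood features (invented)
embedding = [
    [(1 - lam) * x for x in features[i]]
    + [lam * x for x in neighbour_mean(features, coords, i)]
    for i in range(len(features))
]
# Each row now has 2 own-expression + 2 neighbourhood dimensions; any
# clustering run on `embedding` sees both cell state and niche.
```

Varying the mixing weight moves the clustering between pure cell typing and tissue-domain segmentation, which is the unification the abstract describes.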

Package Demo
Tomatis Auditorium
14:45
45min
Easy access, interactive exploration and analysis of public gene expression datasets with Phantasus
Alexey Sergushichev

Transcriptomic profiling has become a standard approach to quantifying a cell state, which has led to the accumulation of huge amounts of public gene expression data. However, reuse of these datasets is inhibited by the lack of a uniform format for deposited data: actually getting the data ready for analysis requires many steps, such as loading the gene expression values, feature and sample annotation, normalization, quality checks, and outlier filtering. Here we present Phantasus, a user-friendly web application for interactive gene expression analysis which provides streamlined access to almost 120,000 public gene expression datasets from Gene Expression Omnibus. Importantly, Phantasus supports both public microarray and RNA-seq datasets as quantified by the ARCHS4 and DEE2 projects. With Phantasus, researchers can quickly and easily perform quality checks and analysis through its intuitive and highly interactive JavaScript user interface integrated with an opencpu-based R backend. Phantasus is available at https://alserglab.wustl.edu/phantasus as well as the Bioconductor R package (https://bioconductor.org/packages/phantasus). We further provide a lightweight phantasusLite package (https://bioconductor.org/packages/devel/bioc/html/phantasusLite.html) featuring helper R functions for working with public datasets, in particular providing access to RNA-seq counts stored on a remote HSDS (Highly Scalable Data Service) server.

Package Demo
Room 3104-5
15:30
15:30
30min
Break
Tomatis Auditorium
16:00
16:00
45min
Long-read methylation data analysis with NanoMethViz and Bioconductor
Shian Su

In this workshop, we provide a Bioconductor analysis pipeline for DNA methylation. We highlight NanoMethViz, an R package for the analysis of DNA methylation using long-read sequencing data. DNA methylation is a critical epigenetic mechanism involving the addition of methyl groups to DNA, affecting gene expression without altering the genetic sequence. This process plays a pivotal role in development, health, and disease, making its study essential. Starting from modBAM files, which are currently the standard output of ONT-based modification calling pipelines, we will learn to perform exploratory data analysis to uncover high-level methylation patterns over genes and across samples. We then delve deeper to find differentially methylated regions (DMRs), and associate them with genes to potentially uncover features that are affected by epigenetic regulation. Using NanoMethViz we can plot the methylation signals in the discovered DMRs or other regions of interest to generate high-resolution plots of methylation profiles, as well as data from individual long reads. We will also cover the data querying features of NanoMethViz for more custom analyses on the raw data, as well as more advanced features of the package for methylation data analysis.

Package Demo
Tomatis Auditorium
16:00
45min
Unraveling Immunogenomic Diversity in Single-Cell Data
Ahmad Al Ajami

The human immune system is governed by a complex interplay of molecules encoded by highly diverse genetic loci. Immune genes such as B and T cell receptors, human leukocyte antigens (HLAs), and killer Ig-like receptors (KIRs) exhibit remarkable allelic diversity across populations. However, conventional single-cell analysis methods often overlook this diversity, leading to erroneous quantification of immune mediators and compromised inter-donor comparability.

To address these challenges and unlock deeper insights from single-cell studies, we present a comprehensive workflow comprising two software packages and one data package:

  1. scIGD (single-cell ImmunoGenomic Diversity): A Snakemake workflow designed to automate allele-typing processes for immune genes, with a focus on key targets like HLAs. In addition, it facilitates allele-specific quantification from single-cell RNA-sequencing (scRNA-seq) data using donor-specific references.

  2. SingleCellAlleleExperiment: This (soon-to-be-submitted) R/Bioconductor package maximizes the analytical potential of results obtained from scIGD. It offers a versatile multi-layer data structure, allowing representation of immune genes at various levels, from alleles to genes to functionally similar gene groups. This enables comprehensive analysis across different layers of immunologically-relevant annotation.

  3. scaeData: An R/ExperimentHub data package housing datasets processed by scIGD. These datasets can be utilized to perform exploratory and downstream analysis using the novel SingleCellAlleleExperiment data structure.
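The multi-layer idea behind SingleCellAlleleExperiment can be sketched with plain count aggregation (illustrative only, not the package's API; the allele names, counts, and mappings are invented): allele-level counts collapse to gene-level counts, which collapse again to functional-group counts.

```python
# Illustrative sketch (not the SingleCellAlleleExperiment API): collapsing
# allele-level counts to gene-level and functional-group-level layers.

allele_counts = {"HLA-A*01:01": 3, "HLA-A*02:01": 5,
                 "HLA-B*07:02": 2, "HLA-B*08:01": 4}
allele_to_gene = {"HLA-A*01:01": "HLA-A", "HLA-A*02:01": "HLA-A",
                  "HLA-B*07:02": "HLA-B", "HLA-B*08:01": "HLA-B"}
gene_to_group = {"HLA-A": "HLA class I", "HLA-B": "HLA class I"}

def collapse(counts, mapping):
    out = {}
    for key, n in counts.items():
        out[mapping[key]] = out.get(mapping[key], 0) + n
    return out

gene_counts = collapse(allele_counts, allele_to_gene)
group_counts = collapse(gene_counts, gene_to_group)
```

Keeping all three layers side by side is what lets an analysis move freely between allele-specific and gene- or group-level questions.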

Preliminary findings demonstrate accurate quantification of different HLA allele groups in (amplicon-based and whole-transcriptome-based) scRNA-seq datasets from diverse sources, including cancer patients and human atlas samples. This not only enhances the comparability of immune profiles across donors but also sheds light on population-specific susceptibilities to infections. Our work lays the groundwork for precise immunological analysis of multi-omics data, particularly in elucidating allele-specific interactions.

If accepted as a package demo, I intend to showcase all three tools, emphasizing the utilization of SingleCellAlleleExperiment and its functionalities on one of the example datasets available in scaeData, for exploratory and downstream analysis across the three layers offered by the data structure.

Package Demo
Room 3104-5
07:30
07:30
30min
Breakfast
Tomatis Auditorium
08:00
08:00
45min
Keynote: Sandra Safo, PhD

Keynote talk by Sandra Safo, PhD

Keynote
Tomatis Auditorium
08:45
08:45
15min
Break
Tomatis Auditorium
09:00
09:00
8min
Computational framework for inference of genetic ancestry from challenging human molecular data
Pascal Belleau

The correlation between the incidence of multiple human diseases and the ancestral background of patients has been well-documented in epidemiological data. This phenomenon suggests a link between the biology and genetics of the disease and genetic ancestry. Recent research in cancer genomics highlights genetic and phenotypic differences between tumors occurring in patient populations with differing genetic ancestries. We aim to facilitate the analysis of such data on a larger scale by performing genetic ancestry inference directly from various types of molecular data not typically used for this purpose.

Inference of genetic ancestry from molecular data other than germline DNA sequence presents two challenges. The first is data-specific, such as the uneven coverage of the genome by RNA-seq or ATAC-seq. The second challenge is cancer-specific: tumor genomes are replete with somatic alterations, including copy number variants, translocations, loss of heterozygosity, microsatellite instabilities, and structural variations.

To address these challenges, we developed a systematic computational framework for ancestry inference from molecular data, across assays and conditions. The resulting Bioconductor package, called Robust Ancestry Inference using Data Synthesis (RAIDS), optimizes and assesses the inference accuracy specifically for a given input molecular profile. RAIDS combines existing inferential algorithms with an adaptive procedure for inference parameter optimization and rigorous performance assessment, called "data synthesis". Briefly, a training set of molecular profiles is synthesized. Each synthetic profile consists of sequence reads from the input profile, into which sequence variants of an individual of known genetic ancestry are embedded. The current version of RAIDS enables inference from new molecular profiles originating in RNA-seq and ATAC-seq protocols, whole genomes and whole exomes, particularly those derived from cancer. For all such input, RAIDS infers global ancestry at the subcontinental level of resolution. The performance of RAIDS has been successfully validated using molecular data from large patient cohorts, including a major subset of The Cancer Genome Atlas (TCGA).
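The data-synthesis loop can be caricatured in a few lines (a toy sketch only, not RAIDS; the two populations, allele frequencies, and maximum-likelihood classifier are invented): generate profiles of known ancestry, run the inference, and use the resulting accuracy to judge the parameter choices.

```python
# Toy sketch of the "data synthesis" idea (not RAIDS): synthesize profiles of
# known ancestry, infer ancestry, and measure accuracy.
import math
import random

random.seed(1)

# Hypothetical reference allele frequencies at 50 SNPs for two populations.
n_snps = 50
freqs = {
    "POP1": [0.8 if i % 2 == 0 else 0.2 for i in range(n_snps)],
    "POP2": [0.2 if i % 2 == 0 else 0.8 for i in range(n_snps)],
}

def synthesize(pop):
    # Draw one allele per SNP from the known-ancestry population.
    return [1 if random.random() < f else 0 for f in freqs[pop]]

def infer(profile):
    # Assign the population maximizing the genotype log-likelihood.
    def loglik(pop):
        return sum(math.log(f if a else 1 - f)
                   for a, f in zip(profile, freqs[pop]))
    return max(freqs, key=loglik)

# Performance assessment on synthetic profiles of known ancestry.
truth = ["POP1", "POP2"] * 10
calls = [infer(synthesize(pop)) for pop in truth]
accuracy = sum(c == t for c, t in zip(calls, truth)) / len(truth)
```

In RAIDS the synthesis embeds real reads and known-ancestry variants rather than drawing from idealized frequencies, but the assess-by-synthesis logic is the same.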

Future releases of RAIDS will enable inference of ancestral admixtures and provide support for molecular profiles originating from additional protocols, such as whole-genome and reduced-representation bisulfite sequencing. RAIDS relies on the Bioconductor packages GenomicRanges, gdsfmt and SNPRelate.

In summary, an inferential framework embodied by RAIDS makes vast amounts of existing molecular data available for ancestry-oriented studies of various human conditions and molecular phenotypes, including cancer.

RAIDS is available at:
https://bioconductor.org/packages/RAIDS/

Tomatis Auditorium
09:10
09:10
8min
Exvar: An R Package for Gene Expression And Genetic Variation Data Analysis And Visualization
Hiba Ben Aribi

RNA sequencing data manipulation workflows are complex and require various skills and tools. This creates the need for user-friendly and integrated genomic data analysis and visualization tools.
We developed a novel R package using multiple CRAN and Bioconductor packages to perform gene expression analysis and genetic variant calling from RNA sequencing data. Multiple public datasets, available in the SRA database, were analyzed using the developed package to validate the pipeline for all supported species.
The developed R package, named "Exvar," includes multiple data analysis functions for preprocessing RNA sequencing FASTQ files, analyzing gene expression, and calling variants (SNPs, CNVs, and indels), as well as three data visualization functions integrating interactive Shiny apps for visualizing gene expression, SNP, and CNV data.
It can be used to analyze data from eight species (Homo sapiens, Mus musculus, Arabidopsis thaliana, Drosophila melanogaster, Danio rerio, Rattus norvegicus, Caenorhabditis elegans, and Saccharomyces cerevisiae).
The Exvar package is available in the project’s GitHub repository (https://github.com/omicscodeathon/Exvar).

Tomatis Auditorium
09:20
09:20
8min
Unraveling the Intricate Molecular Landscape and Potential Biomarkers in Lung Adenocarcinoma through Integrative Epigenomic and Transcriptomic Data Analysis
Arnab Mukherjee

Lung carcinoma is one of the most prevalent and life-threatening cancers globally, with tobacco smoking being the most significant cause of lung cancer deaths. Lung adenocarcinoma (LUAD) accounts for approximately 80-85% of reported lung cancer cases and unfolds in a sequential multistage pattern, gradually developing genetic and epigenetic alterations. Alterations in DNA methylation at CpG sites are associated with smoking-induced lung cancer. Smoking-related epigenetic alterations are involved in the modulation of multiple biological pathways. Numerous tumors exhibit atypical methylation patterns, which can involve either increased (hypermethylation) or decreased (hypomethylation) addition of methyl groups to cytosines. Demethylation of CpG sites is associated with the upregulation of oncogenes and the genomic instability observed in multiple solid tumors, including lung cancer. Hypermethylation, in contrast, is linked to the downregulation of genes and the silencing of tumor suppressors. Enhancers govern gene expression across great distances by looping DNA, bringing distant regulatory regions closer to their target gene promoters.
Therefore, we employed Illumina HM450k DNA methylation data of patients from The Cancer Genome Atlas (TCGA) to determine enhancers and link enhancer status with the expression of target genes to discover transcriptional targets using The Enhancer Linking by Methylation/Expression Relationship (ELMER) package of Bioconductor. In this study, we investigated a technique for predicting enhancer-target interactions by combining epigenomic and transcriptomic data from a substantial collection of primary tumor samples. This approach allowed us to identify target genes specifically regulated by enhancers with differential methylation patterns in LUAD and revealed the target genes of the differentially methylated sites and the enriched motifs modulating their expression in LUAD progression. The network-based approach aided in determining the hub genes playing a key role as central regulators of ribosome biogenesis, RNA processing, cell cycle regulation, and MMR pathways in LUAD pathogenesis.

Tomatis Auditorium
09:30
09:30
8min
Patterns: Deciphering Biological Networks with Patterned Heterogeneous (multiOmics) Measurements
Myriam Maumy

The Patterns package is a modelling tool dedicated to biological network modelling.

It is designed to work with patterned data. Famous examples of problems related to patterned data are:
* recovering signals in networks after a stimulation (cascade network reverse engineering),
* analysing periodic signals.

It allows for single or joint modelling of, for instance, genes and proteins.

It provides tools to select relevant actors to be used in the reverse engineering step, based on their differential effects (e.g., gene expression or protein abundance) or on their time-course profiles.

It performs reverse engineering based on the observed time course patterns of the actors.

It provides many inference functions dedicated to obtaining specific features in the inferred network, such as sparsity, robust links, high-confidence links, or links that are stable under resampling. These can be based on weighted or unweighted versions of lasso, spls, elastic net, stability selection, robust lasso, or selectboost, from the selectboost package.

Some simulation and prediction tools are also available for cascade networks.

Examples of use with microarray or RNA-Seq data are provided.
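For readers unfamiliar with the lasso-style selection mentioned above, a minimal sketch (illustrative only, not the Patterns implementation; the toy design matrix and coordinate-descent solver are invented) shows how soft-thresholding yields a sparse set of links from candidate regulators to a target actor:

```python
# Minimal sketch (not the Patterns implementation): lasso-style sparse link
# selection via coordinate descent with soft-thresholding. Columns of X are
# candidate regulators; y is the target actor's time-course profile.

def soft(z, t):
    # Soft-thresholding operator: shrink toward zero, clip small values to 0.
    return (z - t) if z > t else (z + t) if z < -t else 0.0

def lasso(X, y, lam, iters=200):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # partial residual excluding predictor j
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            norm = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft(rho, lam) / norm
    return beta

# Toy network: the target is driven by regulator 0 only; regulator 1 is noise.
X = [[1, 0.1], [2, -0.2], [3, 0.05], [4, 0.0], [5, -0.1]]
y = [2.0, 4.1, 5.9, 8.0, 10.1]   # roughly 2 * regulator 0
beta = lasso(X, y, lam=0.5)      # noise coefficient is shrunk exactly to 0
```

The sparsity induced by the penalty is what turns a dense regression into an interpretable network of robust links.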

Tomatis Auditorium
09:40
09:40
8min
Multi-omic analysis methods for identifying phenotypic plasticity
Kelly Street

Plasticity is a cell's ability to rapidly and reversibly alter its phenotype, typically without significant epigenetic remodeling. While plasticity is known to play an important role in many human systems, such as colon crypt maintenance, it may also help to explain how a single progenitor cell can give rise to a tumor with many different cellular phenotypes. We found evidence of phenotypic plasticity by examining DNA methylation profiles and single-cell RNA-Seq data from both normal colon crypts and colorectal cancers. Specifically, we found that genes with higher expression variability tended to have more conserved, or less variable, methylation profiles. However, many existing quantitative methods for the analysis of multi-omic data are designed to identify correlation or covariance between epigenetic features and gene expression. Such methods may not be well-calibrated to detect phenotypic plasticity, in which conservation, not variability, of the epigenome plays an important role.
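The reported relationship, higher expression variability paired with more conserved methylation, corresponds to a negative association between two per-gene variability measures. A toy sketch (illustrative only, not the authors' pipeline; the per-gene expression and methylation profiles are invented) makes this concrete with a Spearman correlation of the two variance vectors:

```python
# Illustrative sketch (not the authors' pipeline): relate per-gene expression
# variability to per-gene methylation variability via rank correlation.

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def rank(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(a, b):
    ra, rb = rank(a), rank(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra)
           * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

# Toy per-gene profiles: expression across cells, methylation across samples.
expr_var = [var(g) for g in [[1, 9, 2, 8], [5, 5, 6, 5],
                             [0, 10, 1, 9], [4, 5, 4, 5]]]
meth_var = [var(g) for g in [[0.5, 0.5, 0.5, 0.5], [0.1, 0.9, 0.2, 0.8],
                             [0.4, 0.5, 0.5, 0.4], [0.0, 0.9, 1.0, 0.1]]]

rho = spearman(expr_var, meth_var)  # negative: variable expression,
                                    # conserved methylation
```

A covariance-seeking multi-omic method would look for a positive association here, which is exactly why the abstract argues such methods are poorly calibrated for detecting plasticity.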

Tomatis Auditorium
09:50
09:50
25min
Break
Tomatis Auditorium
10:15
10:15
45min
Keynote: Luca Pinello, PhD

Keynote talk by Luca Pinello, PhD

Keynote
Tomatis Auditorium
11:00
11:00
15min
Break
Tomatis Auditorium
11:15
11:15
8min
Engineering Foundation Models of Single-cell Transcriptomics Data
Zachary DeBruine

Foundation models of single-cell transcriptomics data promise unprecedented capability to reason about any new information in the context of vast knowledge about gene co-expression and regulation patterns. While current work in this space largely uses transformer- or GPT-based architectures, we are building multimodal generative architectures that rely on mixture-of-experts variational autoencoders with generative adversarial feedback. Our approaches permit scalable integration of transcriptomes across any supervised metadata labels (e.g. organ, disease, assay, dataset id, etc.) and modalities (e.g. RNA, ADT, ATAC), but also across species without making any a priori assumptions about gene homology. Here we present our work to date on these new architectures involving models trained on ~40 million single-cell transcriptomes from human and mouse in the Chan Zuckerberg Initiative CellCensus, and also zebrafish data from several other datasets. These foundation models open new opportunities for highly-powered prediction tasks even on small sample sizes by leveraging a vast knowledge of transcriptomics.

Tomatis Auditorium
11:25
11:25
8min
Assigning treatment regimens to Irish patients in head and neck squamous cell carcinoma with large language models
Meghana Kshirsagar, Gauri Vaidya, Deleted User, Kaushal Bhavsar, Conor Ryan

Unleashing the potential of large language models (LLMs) in healthcare is a hot topic of research among computational scientists and bioinformatics researchers worldwide. Large language models have the potential to assist in various aspects of medicine given their capability to process complex concepts. The constant development of new treatment regimens is daunting for healthcare professionals, and these models allow professionals to stay current with the latest advancements, which improves patient care. CancerGPT [1], an LLM, employs zero-shot learning to extract knowledge from medical literature and then uses it to infer biological tasks. With approximately 124 million parameters, it achieves performance comparable to the much larger fine-tuned GPT-3 model, which has around 175 billion parameters. The study demonstrated that CancerGPT provided accurate responses for seven different biological tasks. However, the study also highlights the vulnerability of LLMs to hallucination, the generation of incorrect or nonsensical content. To address this limitation, our research seeks answers to the following questions. Can these models be trusted to responsibly recommend personalized treatment plans for cancer patients, thereby improving their health outcomes? Can they serve as reliable clinical prediction tools for tailoring treatment strategies? To answer these questions, we curate a dataset of treatment outcomes for monoclonal antibody treatment regimens in head and neck squamous cell carcinoma (HNSCC), the seventh most common cancer worldwide. The treatment regimens [2] include Cetuximab, Pembrolizumab, Durvalumab, and Nivolumab, prescribed under the National Cancer Control Programme (NCCP) and The Irish Society of Medical Oncology (ISMO), which list a total of thirteen different drug regimens for HNSCC.
The treatment outcomes are based on the RECIST [3] score for each of the four drug regimens. Ten clinical trials for each individual treatment regimen were incorporated into our study. We augment the prompt-engineering process by providing contextual information such as drug regimens for targeted therapy, regimen doses, targeted oncogenes, and clinical outcomes from the clinical trials used in our study. We use one-shot learning during the prompt-engineering phase. We test the LLM responses by providing contextual and non-contextual information, and evaluate them using true-positive and true-negative scores on ten patient profiles of similar and different cancer types on the same drug regimen. Preliminary results indicate that contextual information significantly reduces LLM hallucination. To make the models reliable and trustworthy, our study encompasses the age, gender, and genomic profile of patients. By overcoming hallucination and incorporating multimodal features, we demonstrate that LLMs hold the potential to recommend personalised treatments for HNSCC in Irish patients.

References:

[1] Li, T., Shetty, S., Kamath, A., Jaiswal, A., Jiang, X., Ding, Y., & Kim, Y. (2024). CancerGPT for few shot drug pair synergy prediction using large pretrained language models. In npj Digital Medicine (Vol. 7, Issue 1). Springer Science and Business Media LLC. https://doi.org/10.1038/s41746-024-01024-9

[2] https://www.hse.ie/eng/services/list/5/cancer/profinfo/chemoprotocols/headandneck/

[3] Nishino, M., Jagannathan, J. P., Ramaiya, N. H., & Van den Abbeele, A. D. (2010). Revised RECIST Guideline Version 1.1: What Oncologists Want to Know and What Radiologists Need to Know. In American Journal of Roentgenology (Vol. 195, Issue 2, pp. 281–289). American Roentgen Ray Society. https://doi.org/10.2214/ajr.09.4110

Tomatis Auditorium
11:35
11:35
8min
OmicsMLRepo: Ontology-leveraged metadata harmonization to improve AI/ML-readiness of omics data in Bioconductor
Sehyun Oh

Efforts to establish comprehensive biological data repositories have been significant at national and institutional levels. Despite the large volume of data collected from diverse studies, cross-study analysis across those repositories and joint modeling between omics and non-omics data remain largely limited due to the nature of non-omics metadata, such as its lack of standardization, high complexity, and heterogeneity. This lack of metadata harmonization also hinders the application and development of machine learning tools, which can play a pivotal role in managing and analyzing complex and high-dimensional multi-omics data.
To address this issue, we initiated the OmicsMLRepo project, harmonizing and standardizing metadata from omics data resources. This process involved the manual review of metadata schemas, the consolidation of similar or identical information, and the incorporation of ontologies. As a result, we have harmonized hundreds of studies on metagenomics and cancer genomics data, accessible through two R/Bioconductor packages, curatedMetagenomicData and cBioPortalData. Furthermore, we developed a software package, OmicsMLRepoR, allowing users to leverage these ontologies in metadata search. In summary, the OmicsMLRepo project simplifies the process of cross-study, multi-faceted data analyses through metadata harmonization and standardization, making omics data more AI/ML-ready.
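To illustrate the idea behind ontology-leveraged metadata search (this is a minimal sketch of the concept only, not the OmicsMLRepoR API; the toy ontology and sample records are invented):

```python
# Minimal sketch: a query term is expanded to all of its ontology
# descendants before matching against sample metadata, so a search for a
# broad term also retrieves samples annotated with narrower child terms.

# Toy parent -> children relation; all terms and samples are hypothetical.
ONTOLOGY = {
    "antimicrobial": ["antibiotic", "antifungal"],
    "antibiotic": ["penicillin", "vancomycin"],
}

def descendants(term):
    """Return the term plus all of its transitive descendants."""
    out = {term}
    for child in ONTOLOGY.get(term, []):
        out |= descendants(child)
    return out

def search(metadata, term):
    """Select samples whose 'treatment' field matches any expanded term."""
    terms = descendants(term)
    return [s for s, fields in metadata.items() if fields["treatment"] in terms]

metadata = {
    "sample1": {"treatment": "penicillin"},
    "sample2": {"treatment": "antifungal"},
    "sample3": {"treatment": "placebo"},
}

print(search(metadata, "antibiotic"))      # narrow query
print(search(metadata, "antimicrobial"))   # broad query finds both
```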

Tomatis Auditorium
11:45
11:45
8min
GeneSetCluster 2.0: an upgraded comprehensive toolset for summarizing and integrating gene-set analyses
Asier Ortega Legarreta

GeneSetCluster is a powerful toolset for clustering gene-sets based on shared genes, facilitating the identification of groups with similar gene content. In its latest iteration, Version 2.0, GeneSetCluster has evolved to address the growing complexities of gene-set analysis (GSA). It is particularly useful when handling multiple GSAs derived from various data types or contrasting studies. Version 2.0 incorporates innovative approaches to analysis, functional annotations, and enhanced visualization techniques. A significant addition is the development of a user-friendly Shiny interface for GeneSetCluster2. This interface is specifically tailored to assist clinicians and biologists with limited bioinformatics experience in their analyses. Moreover, we have meticulously implemented and documented the optimal interaction between the Shiny application and the R package. This enhancement is aimed at fostering efficient collaboration among multidisciplinary teams engaged in gene-set analysis.

Tomatis Auditorium
11:55
11:55
8min
Visualization of functional enrichment results into biological networks with Bioconductor enrichViewNet package
Astrid Deschênes

Functional enrichment analysis has emerged as a valuable bioinformatics approach to investigate large-scale genomic, proteomic and transcriptomic datasets. This method assesses the over-representation of functional terms within a query dataset, leveraging biological databases such as Reactome, KEGG, TRANSFAC, and GO. Several tools, including Enrichr, DAVID, g:Profiler and clusterProfiler, have been developed to facilitate this analysis.

The output of functional enrichment analyses often consists of extensive lists of significantly over-represented terms. To enhance biological interpretation, these terms are frequently visualized using representations such as UpSet plots, bar charts, and dot plots. In contrast, network-based visualization provides a distinct advantage over these visual representations as it highlights relationships and potentially uncovers underlying biologically-relevant patterns and clusters. In network-based visualization, the information is often represented as two major object types: nodes and edges. The nodes usually represent entities while edges represent the interaction between those entities.

The Bioconductor enrichViewNet package enables the visualization of enrichment results as biological network graphs. The package generates two types of network graphs: customizable gene-term networks and enrichment maps. The input format corresponds to the enrichment result format generated by the gprofiler2 package, an R client package for the g:Profiler web server. We chose the g:Profiler R client due to its fluid integration with the enrichViewNet package into an R pipeline, its popularity, as well as the continuous updating of its knowledge database.

The first type of network graph generated by the enrichViewNet package is the gene-term network. In gene-term networks, genes and functional terms are both represented as nodes, with edges connecting each gene to the functional terms it belongs to. The networks are automatically loaded into the Cytoscape software with the Bioconductor RCy3 package. These graphs allow for the rapid exploration of gene-term relationships and the identification of key functional groups, as well as the customization of network elements in Cytoscape.

Enrichment maps, on the other hand, represent enriched functional terms as nodes connected by edges when these terms share a minimal ratio of elements (as specified by the user). The enrichViewNet package incorporates functions from the Bioconductor enrichplot package to generate those maps. This approach allows for the visualization of intricate relationships between terms. Furthermore, results from more than one functional enrichment analysis can be visualized on the same network graph.
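The edge rule described above can be sketched as follows (this is an illustration of the general technique, not enrichViewNet's implementation; Jaccard overlap is used here as one common choice of ratio, and the terms and gene sets are invented):

```python
# Sketch: connect two enriched terms with an edge when their gene sets
# overlap above a user-specified minimal ratio (Jaccard index here).
from itertools import combinations

def enrichment_map_edges(term_genes, min_ratio=0.25):
    """Return (term_a, term_b, ratio) edges with Jaccard ratio >= min_ratio."""
    edges = []
    for a, b in combinations(term_genes, 2):
        inter = len(term_genes[a] & term_genes[b])
        union = len(term_genes[a] | term_genes[b])
        ratio = inter / union if union else 0.0
        if ratio >= min_ratio:
            edges.append((a, b, round(ratio, 2)))
    return edges

terms = {  # hypothetical enriched terms and their member genes
    "GO:apoptosis": {"TP53", "BAX", "CASP3"},
    "GO:cell_death": {"TP53", "BAX", "CASP8"},
    "KEGG:ribosome": {"RPL3", "RPS6"},
}
print(enrichment_map_edges(terms, min_ratio=0.25))
```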

To illustrate the benefits of enrichViewNet, we analyzed the protein-protein interactions of the inhibitory receptor, CLEC12A, identified in HEK293 cells by quantitative iTRAQ proteomics. The enrichViewNet package revealed a complex network of interactions that highlighted potential roles for CLEC12A in cellular processes not previously associated with this myeloid inhibitory receptor.

In conclusion, the enrichViewNet package is a flexible tool for the visualization of functional enrichment analysis results as biological network graphs. Its seamless integration into R pipelines, coupled with its ability to generate insightful network representations, makes it a valuable addition to the toolkit of biological researchers.

The enrichViewNet package is available on Bioconductor: https://bioconductor.org/packages/enrichViewNet

Tomatis Auditorium
12:05
12:05
8min
A scalable and flexible network-based approach to identify, diagnose, and resolve mislabeled samples in molecular data
Charles Deng

Mislabeled data can result in diminished statistical power and, more critically, introduce systematic biases that affect the accuracy of results. Here, we present a scalable network-based algorithm that identifies and resolves mislabels in any dataset containing multiple samples per individual and sufficient genetic information to genotype samples. In our approach, a label network linking samples with the same subject label and a genotype network linking samples with the same genotype are constructed. An ensemble of custom search algorithms then proposes mislabel corrections by resolving discrepancies between these two networks. Our approach can also use constraints from the experimental design to rule out improbable events, such as swaps between different data types, and therefore resolve mislabeling events that would otherwise be ambiguous. In over one million simulated datasets with up to 100,000 samples and 20% mislabeling, our algorithm corrects mislabels with a median accuracy of 99%. When constraints from experimental design are incorporated, our algorithm remains effective (median accuracy over 90%) even on datasets with 50% mislabeling. On the Mount Sinai COVID-19 Biobank, which spans 3,756 RNAseq, WGS, and genotype array samples across 696 individuals, our algorithm automatically identifies 190 mislabels. 186 (98%) of these suggested corrections matched those previously determined by manual review, 3 (1.6%) were previously unidentified, and 1 (0.5%) was determined to be incorrect. The algorithm’s scalability was further tested on 17,350 RNAseq samples across 948 individuals from GTEx, where the algorithm reported no discrepancies between the label and genotype networks, suggesting perfect sample labeling as expected from this heavily studied dataset. Our approach, along with its accompanying visualization tools, is implemented as an R package.
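A toy version of the label-versus-genotype comparison at the heart of this approach might look like the following (a deliberately simplified illustration, not the package's ensemble algorithm; the sample records are invented):

```python
# Sketch: samples that share a subject label should also share a genotype
# cluster. A sample whose genotype disagrees with the majority genotype
# of its labeled subject is flagged as a candidate mislabel.
from collections import defaultdict

def flag_mislabels(samples):
    """samples: {sample_id: (subject_label, genotype_cluster)}.
    Return sample ids whose genotype cluster differs from the majority
    genotype cluster among samples carrying the same subject label."""
    by_label = defaultdict(list)
    for sid, (label, geno) in samples.items():
        by_label[label].append((sid, geno))
    flagged = []
    for label, members in by_label.items():
        genos = [g for _, g in members]
        majority = max(set(genos), key=genos.count)
        flagged += [sid for sid, g in members if g != majority]
    return sorted(flagged)

samples = {  # hypothetical: S4 carries subjectA's label but a different genotype
    "S1": ("subjectA", "g1"), "S2": ("subjectA", "g1"),
    "S3": ("subjectB", "g2"), "S4": ("subjectA", "g2"),
}
print(flag_mislabels(samples))  # ['S4']
```

The real algorithm goes much further, resolving discrepancies between the two networks with an ensemble of search procedures and experimental-design constraints rather than a simple majority vote.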

Tomatis Auditorium
12:15
12:15
30min
BioC lightning talks (Day 3)

Lightning talks from community members.

Tomatis Auditorium
12:45
12:45
60min
Lunch
Tomatis Auditorium
13:45
13:45
45min
Cloud Methods Working Group
Erdal Cosgun

The Bioconductor Cloud Methods Working Group invites enthusiasts to join our session, where we delve into our collaborative efforts in enhancing the scalability, security, and strategic direction of Bioconductor resources in cloud environments. Throughout our journey, we have focused on pivotal areas including the integration of GitHub Actions for streamlined workflows, crafting a comprehensive strategic plan to guide our future endeavors, and fortifying the cloud security protocols surrounding Bioconductor Hubs datasets and related Virtual Machines. Our session will provide insights into the challenges we've faced, the solutions we've formulated, and the opportunities for further innovation by leveraging cloud AI technologies and collaboration within the Bioconductor community.

Tomatis Auditorium
13:45
45min
Live-fire reproducible research: htmlwidgets, observable, webR, and Bioconductor
Tim Triche

Over 30 years ago, the first binary of R ("an Open Source statistical package not unlike S") was deposited on statlib at CMU by Ross Ihaka and Robert Gentleman. As the description suggests, R began in part as a response to S-PLUS licensing costs, which were prohibitively expensive for student instruction. Just shy of 10 years later, "two nice guys from Harvard" renamed a package of microarray data management & analysis tools to "Bioconductor," and began work on a monograph to guide users through the burgeoning ecosystem of packages which had germinated around the R project.

Opinions vary on how best to introduce uncertainty and experimentation to students, but a minimum of inessential overhead and a maximum of interactivity serve as useful guidelines. Most fundamental statistical concepts are more easily grasped by simulation and visualization than by derivation, at least upon first encounter. A typical smartphone in 2024 has sufficient compute resources to perform cohort-scale resampling analyses of germline variants in childhood leukemia, for example.
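The kind of resampling analysis alluded to above can be made concrete in a few lines (a generic bootstrap sketch on synthetic data, included here only to illustrate the scale of compute involved; it is not drawn from the talk's materials):

```python
# Illustration: a percentile-bootstrap confidence interval for a mean,
# the sort of simulation that runs comfortably on commodity hardware.
import random
import statistics

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(200)]  # synthetic cohort

def bootstrap_ci(xs, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of xs."""
    means = sorted(
        statistics.fmean(random.choices(xs, k=len(xs)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_ci(data)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```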

In parallel, polished tools such as Quarto enable researchers, instructors, and students to produce documents, slides, websites, and books directly from executable code chunks. However, the output of these tools is not sufficient for teaching; interaction and exploration are vital to comprehending how the sausage is made. The adoption of reactive JavaScript frameworks, and WebAssembly-based compilation of existing R packages for interactive use via webR, allow instructors and developers to both "show" and "tell" students how to solve classes of problems, and to live-fire test understanding. Projects such as the NSF-supported Seeing Theory offer enticing reasons to pursue exactly this.

With the promise of nearly infinite portability and accessibility come new challenges, of course. JavaScript frameworks such as d3/observable are fast and multi-threaded, but passing objects between webR processes and JavaScript code is nontrivial even with recent updates to ease this. Moreover, ecosystems such as Bioconductor rely heavily upon compiled foundational packages such as S4Vectors and Rsamtools; implementing effective instruction in substantial research applications remains challenging. Last but certainly not least, security and versioning of the compiled WebAssembly packages are challenges that currently must be met by users via direct hosting. The Kanaverse (Lun and Kancherla 2023) offers additional perspectives on meeting these challenges, and we will briefly touch on the issue of balancing live versus cached or API-driven compute offloading.

I will present some of our efforts to address substantial research questions (assessing germline variant burdens in childhood cancers, phenotypic discordance in twin pairs, and projection of single-cell transcriptomes into compressed foundation-scale models) and first-year graduate education in experimental design using webR, htmlwidgets, observable, and Quarto, in hopes of stimulating discussion and improvements. We remain cautiously optimistic that our efforts to democratize atlas-scale insight and inference, along with more focused efforts to emphasize resampling and simulation in graduate and undergraduate statistical education, will help make statistical and quantitative thinking as broadly accessible as Ihaka and Gentleman intended.

Room 3104-5
14:30
14:30
15min
Break
Tomatis Auditorium
14:45
14:45
45min
Igniting full-length isoform and mutation analysis of single-cell RNA-seq data with FLAMES
Changqing Wang

Long-read single-cell RNA-sequencing (scRNA-seq) enables accurate determination of novel isoforms in order to assess transcript heterogeneity in health and diseases. In addition, single-nucleotide variants (SNVs) and small insertions and deletions (INDELs) can be quantified at the single-cell level to investigate cancer heterogeneity. The analysis of long-read scRNA-seq data is currently limited by the scarcity of relevant software. To fill this gap, we have developed the open-source FLAMES software, which covers all major aspects of long-read scRNA-seq data analysis from preprocessing through to differential analyses. FLAMES is a fully featured and flexible R/Bioconductor package that is integrated with standard Bioconductor containers and can support data generated using different protocols, including the emerging spatial transcriptomics protocols, and across multiple samples. In addition, the software collects and reports key quality metrics, supports the use of external packages for barcode demultiplexing and isoform discovery (e.g. flexiplex, bambu) and provides additional data visualisation functions to generate publication quality figures. Our enhanced FLAMES pipeline thus provides a complete beginning-to-end workflow for isoform-level analysis of data from long-read scRNA-seq experiments and is freely available from Bioconductor (FLAMES).

Package Demo
Room 3104-5
14:45
45min
Ontologies for Genomics: new approaches with Bioconductor's ontoProc
Vincent Carey

The new version of ontoProc includes interfaces to the owlready2 ontology processing system. This permits users to work directly with ontology content in OWL. We'll use the new features in discussion of applications of ontology to single cell genomics (with Cell Ontology), genome-wide association studies (with EFO), and software tool categorization and discovery (with EDAM).

Package Demo
Tomatis Auditorium
15:30
15:30
30min
Break
Tomatis Auditorium
16:00
16:00
45min
Identifying Tandem Duplication Events from Soft-Clipped Short Reads
Peter Huang

Tandem duplication events are a key source of novel genes over evolutionary time. For example, the loss of a sterile alpha motif from the duplicated p63 and p73 genes in teleosts yielded rapid evolution of the homologous p53 gene in vertebrates, suppressing viral transposable elements and insertional mutagenesis. (Ironically, further duplication of the p53 gene itself as retrogenes in pachyderms appears to diminish their risk of tumors.)
Whole-genome or whole-chromosome duplication events are typically embryonic lethal, “hopeful monsters” whose persistence faces long odds and, when achieved, usually yields speciation. The duplication and elaboration of individual genes or protein domains, by contrast, can often be tolerated at organismal or lineage levels.
Nowhere is this delicate balance more apparent than in acute leukemia, where loss of p53 is quite rare and mutational burdens quite low. The single most common short sequence variant (FLT3 internal tandem duplication, or FLT3-ITD) in acute leukemia is, as the name suggests, a tandem duplication event. In a landmark 2019 paper, Borrow and colleagues showed that terminal deoxynucleotidyl transferase activity at germline or occult regions of microhomology can explain the duplications and complex sub-variants seen in FLT3 via replication slippage. Detection of tandem duplication events in high-GC-content exons remains challenging, and the difficulty is compounded by paired-end short-read sequencing. Partly as a result of this challenge, the origins of a refractory subtype of myeloid leukemia characterized by tandem duplications in upstream binding transcription factor (UBTF) remained uncharacterized prior to a Herculean effort by Umeda and colleagues in 2022.
We previously developed the bamSliceR package to extract genomic aligned DNA and RNA reads from the NCI Genomic Data Commons (GDC), efficiently detecting mutations (e.g. SNVs and short indels) and allelic bias by estimating variant allele frequency (VAF) in DNA and RNA. Here we show that bamSliceR can be leveraged to detect the signatures of complex tandem duplication events such as those in UBTF, without local reassembly, by filtering for soft-clipped short reads in read pairs with negative insert sizes. By assessing SA flags, CIGAR strings, and insert size estimates in existing aligned reads, this procedure is fast, broadly applicable, and complements targeted approaches (e.g. km from Audemard et al) as well as graph-based tools such as gGnome. Cohort-scale traversal of controlled- and open-access read collections allows for substantial increases in sample size when hunting for “missing” variants in the so-called twilight zone (100-1000bp) for short read methods.
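The read-level signature described above can be sketched as follows (a minimal illustration of the filtering idea, not bamSliceR code; CIGAR strings are parsed directly rather than via an alignment library, and the example reads and the clip threshold are hypothetical):

```python
# Sketch: keep read pairs that carry soft-clipped bases (CIGAR 'S'
# operations) and a negative insert size (TLEN), a pattern consistent
# with a tandem duplication junction.
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clip_len(cigar):
    """Total soft-clipped bases in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")

def is_dup_candidate(cigar, tlen, min_clip=10):
    """Flag reads with >= min_clip soft-clipped bases and negative TLEN."""
    return tlen < 0 and soft_clip_len(cigar) >= min_clip

reads = [  # hypothetical (CIGAR, TLEN) pairs
    ("20S80M", -150),   # clipped and negative insert: candidate
    ("100M", -150),     # negative insert but no clipping
    ("15S85M", 200),    # clipped but positive insert
]
print([is_dup_candidate(c, t) for c, t in reads])  # [True, False, False]
```

In practice the SA (supplementary alignment) tag would be checked as well, as the abstract describes, and the filter would run over indexed BAM slices rather than literal strings.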
In contrast to the majority of short sequence variants (where variant annotation and pathogenicity estimation are most easily performed at the transcript level, supported by maximal exact matches of variant-containing read pairs), we find that tandem and partial tandem duplication events are readily screened against genomic reference coordinates. The variant allele frequency of read pairs bearing the signature of (often quite complex, partial or interrupted) tandem duplications is then computed against the predominant transcript, which is otherwise quite challenging owing to segmental overlap and clipping.
The transcriptome-first approach to variant annotation is particularly helpful in cohorts where most or all subjects have RNA sequencing data, with only a minority having exome or whole-genome results. However, the principle is quite general, and this approach provides a data-driven answer to the frequent clinical question of “against which transcript?” We show that transcriptome-aligned reads from both controlled and open-access sequence data can and should be indexed to provide maximum utility to researchers, clinicians, and patients.
We demonstrate the feasibility of this approach in the Leucegene cohort of acute myeloid leukemia patients (n = 452), where we screen for tandem duplications in UBTF as well as MYC, FLT3, KMT2A (MLL), and EGFR (a rare but recurrent target of tandem duplications in pediatric spindle cell tumors). We contend that human and mouse data, at a minimum, merit indexed access (by proxy, in the case of controlled access via GDC; and via direct index access, in the case of aligned reads such as those used by SRA in quantifying RNA copy number as of 2023), and we demonstrate the diagnostic yield to be expected from challenging variants in uncommon diseases when these indices are made available.

Package Demo
Room 3104-5
16:00
45min
scDiagnostics: diagnostic functions to assess the quality of cell type annotations in single-cell RNA-seq data
Anthony Christidis

Annotation transfer from a reference dataset for the cell type annotation of a new query single-cell RNA-sequencing (scRNA-seq) experiment has become an integral component of the typical analysis workflow. The approach provides a fast, automated, and reproducible alternative to the manual annotation of cell clusters based on marker gene expression. However, dataset imbalance and undiagnosed incompatibilities between query and reference datasets can lead to erroneous annotation and distort downstream applications. We present scDiagnostics, an R/Bioconductor package for the systematic evaluation of cell type assignments in scRNA-seq data. scDiagnostics offers a suite of diagnostic functions to assess whether both (query and reference) datasets are aligned, ensuring that annotations can be transferred reliably. scDiagnostics also provides functionality to assess annotation ambiguity, cluster heterogeneity, and marker gene alignment. The implemented functionality helps researchers to determine how accurately cells from a new scRNA-seq experiment can be assigned to known cell types.

Availability: The scDiagnostics package is available from GitHub under https://github.com/ccb-hms/scDiagnostics. A Bioconductor submission is currently in preparation and is planned for May 2024. This timeline should provide sufficient time for package review and inclusion in Bioconductor prior to the conference.

Package Demo
Tomatis Auditorium