BioC2024

Live-fire reproducible research: htmlwidgets, observable, webR, and Bioconductor
07-26, 13:45–14:30 (US/Eastern), Room 3104-5

Over 30 years ago, the first binary of R ("an Open Source statistical package not unlike S") was deposited on statlib at CMU by Ross Ihaka and Robert Gentleman. As the description suggests, R began in part as a response to S-PLUS licensing costs, which were prohibitive for student instruction. Just shy of 10 years later, "two nice guys from Harvard" renamed a package of microarray data management & analysis tools to "Bioconductor," and began work on a monograph to guide users through the burgeoning ecosystem of packages which had germinated around the R project.

Opinions vary on how best to introduce uncertainty and experimentation to students, but a minimum of inessential overhead and a maximum of interactivity serve as useful guidelines. Most fundamental statistical concepts are more easily grasped by simulation and visualization than by derivation, at least upon first encounter. A typical smartphone in 2024 has sufficient compute resources to perform cohort-scale resampling analyses of germline variants in childhood leukemia, for example.

In parallel, polished tools such as Quarto enable researchers, instructors, and students to produce documents, slides, websites, and books directly from executable code chunks. However, the output of these tools is not sufficient for teaching; interaction and exploration are vital to comprehending how the sausage is made. The adoption of reactive JavaScript frameworks, and WebAssembly-based compilation of existing R packages for interactive use via webR, allow instructors and developers to both "show" and "tell" students how to solve classes of problems, and to live-fire test understanding. Projects such as the NSF-supported Seeing Theory offer enticing reasons to pursue exactly this.

With the promise of nearly infinite portability and accessibility come new challenges, of course. JavaScript frameworks such as d3/observable are fast and multi-threaded, but passing objects between webR processes and JavaScript code is nontrivial even with recent updates to ease this. Moreover, ecosystems such as Bioconductor rely heavily upon compiled foundational packages such as S4Vectors and Rsamtools; implementing effective instruction in substantial research applications remains challenging. Last but certainly not least, security and versioning of the compiled WebAssembly packages are challenges that currently must be met by users via direct hosting. The Kanaverse (Lun and Kancherla 2023) offers additional perspectives on meeting these challenges, and we will briefly touch on the issue of balancing live versus cached or API-driven compute offloading.

I will present some of our efforts to address substantial research questions (assessing germline variant burdens in childhood cancers, phenotypic discordance in twin pairs, and projection of single-cell transcriptomes into compressed foundation-scale models) and first-year graduate education in experimental design using webR, htmlwidgets, observable, and Quarto, in hopes of stimulating discussion and improvements. We remain cautiously optimistic that our efforts to democratize atlas-scale insight and inference, along with more focused efforts to emphasize resampling and simulation in graduate and undergraduate statistical education, will help make statistical and quantitative thinking as broadly accessible as Ihaka and Gentleman intended.
