iscream: Fast and memory-efficient (sc)WGBS data handler
Whole-genome bisulfite sequencing (WGBS) of the 28 million CpG loci in the human genome produces large quantities of data. Reading and storing this data is memory-intensive and can be slow. Further, while bulk WGBS data covers the majority of CpGs, single-cell WGBS data typically does not, and so requires a sparse matrix representation.
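To make the sparsity point concrete, the following is a minimal sketch using hypothetical toy data. It is not iscream's internal format or API; the names `cells`, `n_cpgs`, and `to_coo` are assumptions made for illustration. It stores only the CpGs each cell actually covers as (row, column, value) triplets and compares that count with the size of the equivalent dense matrix.

```python
from typing import Dict, List, Tuple

# Toy illustration only: hypothetical data, not iscream's data structures.
# Each cell maps a CpG index to its observed methylation level.
cells: List[Dict[int, float]] = [
    {0: 1.0, 3: 0.0},
    {2: 0.5},
    {0: 0.0, 4: 1.0},
]
n_cpgs = 5  # number of CpG loci (rows) in this toy example


def to_coo(cells: List[Dict[int, float]]) -> List[Tuple[int, int, float]]:
    """Return (cpg_row, cell_col, value) triplets for covered CpGs only."""
    triplets = []
    for col, cell in enumerate(cells):
        for row, value in sorted(cell.items()):
            triplets.append((row, col, value))
    return triplets


triplets = to_coo(cells)
dense_entries = n_cpgs * len(cells)  # cells a dense matrix would allocate
sparse_entries = len(triplets)       # entries actually observed
print(f"dense: {dense_entries} entries, sparse: {sparse_entries} entries")
```

In this sketch an observed methylation level of 0 is kept as an explicit triplet, since an unmethylated CpG is a measurement and not a missing value.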
iscream is designed to handle both bulk and single-cell WGBS methylation data from the BISCUIT and Bismark aligners. Its goal is to give users a fast and memory-efficient way to load WGBS data for analysis, whether full genomes or regions of interest. It aims to reduce the memory footprint and runtime of data loading by reading and manipulating the data in place, without deep copying. This is achieved using Rcpp, which offers speed and fine-grained control over memory usage. Rcpp also provides access to htslib for fast genomic region queries.
iscream outputs data as sparse or dense matrices that can be used to create BSseq objects for analysis. Users can aggregate single-cell methylation information across regions of interest for analysis with scMET. Future work will include support for the arrow library so that large datasets can be queried lazily without first being loaded into memory.
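As a rough illustration of what aggregating methylation information across regions of interest involves, the sketch below sums methylated and total read counts for the CpGs that fall inside each region. It is an assumption-laden toy and not iscream's implementation or the input format scMET expects: the record layout, the inclusive region coordinates, and the function name `aggregate` are all hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical records for one cell on one chromosome, sorted by position:
# (position, methylated_reads, total_reads)
cell = [(100, 1, 1), (150, 0, 1), (420, 2, 2), (900, 0, 1)]

# Hypothetical regions of interest as inclusive (start, end) pairs.
regions = [(50, 200), (400, 500), (600, 700)]


def aggregate(cell, regions):
    """Return (methylated_reads, total_reads, n_cpgs) for each region."""
    positions = [pos for pos, _, _ in cell]
    summary = []
    for start, end in regions:
        # Binary search for the slice of CpGs with start <= position <= end.
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, end)
        meth = sum(m for _, m, _ in cell[lo:hi])
        cov = sum(c for _, _, c in cell[lo:hi])
        summary.append((meth, cov, hi - lo))
    return summary


summary = aggregate(cell, regions)
for (start, end), (meth, cov, n) in zip(regions, summary):
    print(f"{start}-{end}: {meth}/{cov} methylated reads over {n} CpGs")
```

Regions with no covered CpGs come back with zero coverage, which a downstream step would need to treat as missing and not as unmethylated.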
The package is still under development and is available at https://github.com/huishenlab/iscream.