A hierarchical Bayesian model for the identification of technical length variants in miRNA sequencing data
MicroRNAs (miRNAs) are small, single-stranded non-coding RNA molecules found in a range of organisms including plants, animals, and some viruses. MiRNAs regulate gene expression through imprecise base pairing of the miRNA molecule to a target messenger RNA (mRNA) molecule. Canonical synthesis of miRNAs begins with transcription of a primary miRNA (pri-miRNA) molecule, approximately 70 nucleotides in length. The pri-miRNA then goes through a series of processing steps, including cleavage by the Drosha and Dicer enzymes. MiRNA biosynthesis results in a mature, single-stranded miRNA molecule that is loaded onto an Argonaute (AGO) protein to form the RNA-induced silencing complex (RISC). Certain steps of the miRNA synthesis pathway, such as cleavage by Drosha and Dicer, can result in miRNA isoforms that differ from the canonical miRNA sequence in nucleotide sequence, length, or both. These isoforms, called isomiRs, may differ from the canonical sequence by as few as one or two nucleotides, yet they can have different mRNA targets and stability than the corresponding canonical miRNA. As the body of research demonstrating the role of isomiRs in disease grows, so does the need for differential expression analysis at a scale finer than the miRNA level. Unfortunately, errors during the amplification and sequencing processes can produce technical isomiRs that are identical in sequence to biological isomiRs, which makes resolving variation at this scale challenging. We present a novel algorithm for the identification and correction of technical miRNA length variants in miRNA sequencing data. The algorithm assumes that the transformed degradation rate of canonical miRNA sequences in a sample follows a hierarchical normal Bayesian model. The algorithm then draws from the posterior predictive distribution and constructs 95% posterior predictive intervals to determine whether the observed counts of degraded sequences are consistent with our error model.
We present the theory underlying the model and assess the performance of the model using an experimental benchmark data set.
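
As an illustration only, the interval check described above can be sketched in Python. This is not the implementation presented in this work: the empirical-logit transform, the flat priors on the population mean and standard deviation, the grid over the standard deviation, and all function names and counts are assumptions chosen to keep the example short.

```python
import numpy as np

def logit_rate(degraded, total):
    """Empirical logit of the degraded fraction and its approximate variance.
    The logit transform is an assumption; the source only says 'transformed'."""
    degraded = np.asarray(degraded, dtype=float)
    total = np.asarray(total, dtype=float)
    a = degraded + 0.5          # continuity correction avoids log(0)
    b = total - degraded + 0.5
    return np.log(a / b), 1.0 / a + 1.0 / b

def posterior_draws(y, s2, n_draws=4000, tau_max=5.0, rng=None):
    """Draw (mu, tau) for the model y_j ~ N(theta_j, s2_j), theta_j ~ N(mu, tau^2),
    with flat priors on mu and tau (assumed), using a grid over tau."""
    rng = np.random.default_rng(rng)
    tau_grid = np.linspace(1e-3, tau_max, 500)
    total_var = s2[None, :] + tau_grid[:, None] ** 2   # shape (grid, miRNAs)
    w = 1.0 / total_var
    v_mu = 1.0 / w.sum(axis=1)                         # Var(mu | tau, y)
    mu_hat = v_mu * (w * y[None, :]).sum(axis=1)       # E(mu | tau, y)
    resid2 = (y[None, :] - mu_hat[:, None]) ** 2
    log_post = (0.5 * np.log(v_mu)
                - 0.5 * np.log(total_var).sum(axis=1)
                - 0.5 * (w * resid2).sum(axis=1))      # log p(tau | y) + const
    p = np.exp(log_post - log_post.max())
    p /= p.sum()
    idx = rng.choice(len(tau_grid), size=n_draws, p=p)
    mu = rng.normal(mu_hat[idx], np.sqrt(v_mu[idx]))
    return mu, tau_grid[idx]

def predictive_interval(mu, tau, s2_new, level=0.95, rng=None):
    """Posterior predictive interval for the transformed rate of one more miRNA."""
    rng = np.random.default_rng(rng)
    theta = rng.normal(mu, tau)                  # new miRNA-level mean
    y_rep = rng.normal(theta, np.sqrt(s2_new))   # replicated observation
    lo, hi = np.quantile(y_rep, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Hypothetical counts for four canonical miRNAs (not data from this work).
y, s2 = logit_rate([10, 12, 9, 11], [1000, 1100, 950, 1050])
mu, tau = posterior_draws(y, s2, rng=0)
lo, hi = predictive_interval(mu, tau, s2_new=s2.mean(), rng=1)

# An observed transformed rate outside [lo, hi] would be flagged as
# inconsistent with the technical-error model in this sketch.
y_obs, _ = logit_rate(300, 1000)
is_consistent = bool(lo <= y_obs <= hi)
```

In this sketch, a sequence whose transformed rate falls inside the interval is treated as explainable by technical degradation alone, and one that falls outside is treated as a candidate biological isomiR.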