Assessing differential expression strategies for small RNA sequencing using real and simulated data
Despite significant advances in the experimental side of small RNA sequencing, the statistical analysis of its data still relies on differential expression methodologies designed for standard RNA-seq. However, small RNA-seq data violate several of the assumptions of these methods. Specifically, methods for mRNA-seq analysis assume approximate independence between feature counts, whereas the small total number of miRNAs and the presence of a few very highly expressed miRNAs make miRNA counts interdependent. Furthermore, this skewed distribution often results in a negative correlation between the top two miRNAs in each cell type, a phenomenon not observed with mRNAs. In extreme cases, upregulated miRNAs might falsely seem downregulated (or vice versa) because of these technical effects. Additionally, assessing the accuracy of differential expression methods is challenging because the true outcome of a given comparison must be known a priori. To address these issues, we present a benchmark of several differential expression methods commonly employed on small RNA-seq data. Our benchmarking strategy is based on two independent datasets: a custom-generated dataset of known RNA mixtures and simulated data derived from existing miRNA-seq studies. These ground truth datasets, designed to mirror the complete dynamic range of miRNA expression, offer a more realistic benchmark than previous efforts. This comparative assessment offers clear insight into the reliability of differential expression methods when applied to small RNA-seq, which we hope can guide miRNA researchers in selecting differential expression methods for their analyses.
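The lack of independence described above can be illustrated with a minimal sketch. It assumes a hypothetical library of four miRNAs (the names and abundances are invented for illustration and are not taken from the benchmark datasets) and uses simple counts-per-million scaling as a stand-in for library-size normalization; it is not one of the methods evaluated here. Only the dominant miRNA truly changes, yet the remaining miRNAs appear downregulated after scaling.

```python
import math


def cpm(counts):
    """Scale raw counts to counts per million of the library total."""
    total = sum(counts.values())
    return {name: 1e6 * c / total for name, c in counts.items()}


def log2_fold_change(treated, control):
    """Per-miRNA log2 ratio of treated over control values."""
    return {name: math.log2(treated[name] / control[name]) for name in control}


# Hypothetical true abundances (arbitrary units); miR-A dominates the library.
control = {"miR-A": 700_000, "miR-B": 200_000, "miR-C": 50_000, "miR-D": 50_000}

# Assumption: only miR-A truly changes (twofold up); the others are unchanged.
treated = dict(control)
treated["miR-A"] = 2 * control["miR-A"]

true_lfc = log2_fold_change(treated, control)
observed_lfc = log2_fold_change(cpm(treated), cpm(control))

for name in control:
    print(f"{name}: true {true_lfc[name]:+.2f}, observed {observed_lfc[name]:+.2f}")
```

Under these assumed numbers, the true log2 fold change of 1 for miR-A is compressed to roughly 0.23 after scaling, while the three unchanged miRNAs each show an apparent log2 fold change of about -0.77. The effect arises because a single abundant miRNA accounts for most of the library total, so its change shifts the scaled values of every other miRNA.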