BioC2024

Computational framework for inference of genetic ancestry from challenging human molecular data
07-26, 09:00–09:08 (US/Eastern), Tomatis Auditorium

The correlation between the incidence of multiple human diseases and the ancestral background of patients has been well documented in epidemiological data. This phenomenon suggests a link between the biology and genetics of the disease and genetic ancestry. Recent research in cancer genomics highlights genetic and phenotypic differences between tumors occurring in patient populations with differing genetic ancestries. We aim to facilitate the analysis of such data on a larger scale by performing genetic ancestry inference directly from various types of molecular data not typically used for this purpose.

Inference of genetic ancestry from molecular data other than germline DNA sequence presents two challenges. The first is data-specific, such as the uneven coverage of the genome by RNA-seq or ATAC-seq. The second challenge is cancer-specific: tumor genomes are replete with somatic alterations, including copy number variants, translocations, loss of heterozygosity, microsatellite instabilities, and structural variations.

To address these challenges, we developed a systematic computational framework for ancestry inference from molecular data, across assays and conditions. The resulting Bioconductor package, called Robust Ancestry Inference using Data Synthesis (RAIDS), optimizes and assesses the inference accuracy specifically for a given input molecular profile. RAIDS combines existing inferential algorithms with an adaptive procedure for inference parameter optimization and rigorous performance assessment, called "data synthesis". Briefly, a training set of molecular profiles is synthesized. Each synthetic profile consists of sequence reads from the input profile, into which sequence variants of an individual of known genetic ancestry are embedded. The current version of RAIDS enables inference from new molecular profiles originating in RNA-seq and ATAC-seq protocols, whole genomes and whole exomes, particularly those derived from cancer. For all such inputs, RAIDS infers global ancestry at the subcontinental level of resolution. The performance of RAIDS has been successfully validated using molecular data from large patient cohorts, including a major subset of The Cancer Genome Atlas (TCGA).
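The data synthesis idea described above can be illustrated with a minimal Python sketch. This is not the RAIDS implementation or its API (RAIDS is an R/Bioconductor package); the function name `synthesize_profile`, its parameters, and the binomial read-sampling model are assumptions made purely for illustration. The sketch keeps the per-SNP read depth of the input profile, so the synthetic profile inherits its uneven coverage, and draws allele counts according to the genotype of a reference individual of known ancestry. It ignores tumor-specific effects such as copy number changes and loss of heterozygosity.

```python
import random


def synthesize_profile(coverage, donor_genotypes, error_rate=0.001, seed=0):
    """Build one synthetic profile (hypothetical helper, not part of RAIDS).

    coverage:        {snp_id: read depth observed in the input profile}
    donor_genotypes: {snp_id: count of alternate alleles (0, 1 or 2)} for a
                     reference individual of known genetic ancestry
    Returns {snp_id: (ref_read_count, alt_read_count)}.
    """
    rng = random.Random(seed)
    synthetic = {}
    for snp_id, depth in coverage.items():
        if snp_id not in donor_genotypes:
            continue  # no known genotype to embed at this position
        alt_fraction = donor_genotypes[snp_id] / 2
        # Probability that a single read shows the alternate allele,
        # allowing for a simple symmetric sequencing error (assumed model).
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        # Draw each read independently; depth is taken from the input profile.
        alt = sum(rng.random() < p_alt for _ in range(depth))
        synthetic[snp_id] = (depth - alt, alt)
    return synthetic


# Toy input: depths mimic an unevenly covered profile (e.g. RNA-seq).
coverage = {"rs1": 30, "rs2": 0, "rs3": 12, "rs4": 8}
donor = {"rs1": 1, "rs2": 2, "rs3": 0}  # rs4 has no donor genotype
profile = synthesize_profile(coverage, donor, error_rate=0.0)
print(profile)
```

Repeating this over many reference individuals from each candidate population would yield a labeled training set with the same coverage pattern as the input profile, on which inference parameters could then be tuned and accuracy estimated.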

Future releases of RAIDS will enable inference of ancestral admixtures and provide support for molecular profiles originating from additional protocols, such as whole-genome and reduced-representation bisulfite-altered sequences. RAIDS relies on the Bioconductor packages GenomicRanges, gdsfmt, and SNPRelate.

In summary, the inferential framework embodied by RAIDS makes vast amounts of existing molecular data available for ancestry-oriented studies of various human conditions and molecular phenotypes, including cancer.

RAIDS is available at:
https://bioconductor.org/packages/RAIDS/