Abstract
Next-generation sequencing (NGS) technology generates millions of short reads, which provide valuable information about many aspects of cellular activity and biological function. A key step in NGS applications (e.g., RNA-Seq) is to map short reads to their correct locations within the source genome. While most reads map to a unique location, a significant proportion align to multiple genomic locations with equal or similar numbers of mismatches; these are called multireads. The ambiguity in mapping multireads may bias downstream analyses. Currently, most practitioners discard multireads, which loses valuable information, especially for genes with similar sequences. To refine the read mapping, we develop a Bayesian model that computes the posterior probability of mapping a multiread to each competing location. These probabilities are then used in downstream analyses, such as the quantification of gene expression. We show through simulation studies and RNA-Seq analysis of real data that the Bayesian method yields better mapping than the current leading methods. A C++ program is available for download and is being packaged into user-friendly software.
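As a rough illustration of the idea, the sketch below applies Bayes' rule to a single multiread. It is not the model from the paper: the prior (proportional to the number of uniquely mapped reads at each candidate location, plus a pseudocount), the independent per-base error likelihood, and all names and default values (`multiread_posterior`, `read_length`, `error_rate`, `pseudocount`) are assumptions made for this example only.

```python
def multiread_posterior(mismatches, unique_counts,
                        read_length=36, error_rate=0.01, pseudocount=1.0):
    """Toy posterior over candidate locations for one multiread.

    mismatches[i]    -- mismatches when the read is aligned to location i
    unique_counts[i] -- uniquely mapped reads observed at location i
    All modelling choices here are illustrative assumptions, not the
    published method.
    """
    if len(mismatches) != len(unique_counts):
        raise ValueError("mismatches and unique_counts must have equal length")
    if any(m < 0 or m > read_length for m in mismatches):
        raise ValueError("each mismatch count must lie in [0, read_length]")

    # Assumed prior: locations with more unique reads are more likely sources.
    priors = [count + pseudocount for count in unique_counts]

    # Assumed likelihood: each base is miscalled independently with
    # probability error_rate, so m mismatches out of read_length bases.
    likelihoods = [
        error_rate ** m * (1.0 - error_rate) ** (read_length - m)
        for m in mismatches
    ]

    # Bayes' rule: posterior is proportional to prior times likelihood.
    weights = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all candidate locations have zero weight")
    return [w / total for w in weights]


# Hypothetical multiread with two candidate locations and one mismatch at each.
posterior = multiread_posterior([1, 1], [90, 10])
print(posterior)
```

Under these assumptions, summing the posterior weights over all multireads, rather than discarding them, would give fractional read counts per location that could feed into an expression estimate.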
Original language | English (US) |
---|---|
Pages (from-to) | 1215-1224 |
Number of pages | 10 |
Journal | Biometrics |
Volume | 67 |
Issue number | 4 |
State | Published - Dec 2011 |
Keywords
- Data augmentation
- RNA-Seq
- Read alignment
- Short reads
- Solexa sequencing
- Transcriptome
ASJC Scopus subject areas
- Statistics and Probability
- General Biochemistry, Genetics and Molecular Biology
- General Immunology and Microbiology
- General Agricultural and Biological Sciences
- Applied Mathematics
MD Anderson CCSG core facilities
- Bioinformatics Shared Resource
- Biostatistics Resource Group