BM-map: Bayesian mapping of multireads for next-generation sequencing data

Yuan Ji; Yanxun Xu; Qiong Zhang; Kam Wah Tsui; Yuan Yuan; Clift Norris; Shoudan Liang; Han Liang

doi:10.1111/j.1541-0420.2011.01605.x

Bioinformatics & Computational Biology

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

Next-generation sequencing (NGS) technology generates millions of short reads, which provide valuable information for various aspects of cellular activities and biological functions. A key step in NGS applications (e.g., RNA-Seq) is to map short reads to correct genomic locations within the source genome. While most reads are mapped to a unique location, a significant proportion of reads align to multiple genomic locations with equal or similar numbers of mismatches; these are called multireads. The ambiguity in mapping the multireads may lead to bias in downstream analyses. Currently, most practitioners discard the multireads in their analysis, resulting in a loss of valuable information, especially for the genes with similar sequences. To refine the read mapping, we develop a Bayesian model that computes the posterior probability of mapping a multiread to each competing location. The probabilities are used for downstream analyses, such as the quantification of gene expression. We show through simulation studies and RNA-Seq analysis of real life data that the Bayesian method yields better mapping than the current leading methods. We provide a C++ program for downloading that is being packaged into a user-friendly software.
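As an illustration only of what "the posterior probability of mapping a multiread to each competing location" can mean, the following minimal Python sketch applies Bayes' rule to a single multiread. It is not the BM-map model from the paper. It assumes a toy likelihood in which each base is miscalled independently with a fixed error rate, and a prior weight per candidate location (for example, the number of uniquely mapped reads nearby). The function name `multiread_posterior`, its parameters, and the default values are hypothetical.

```python
import math


def multiread_posterior(mismatches, priors, read_length=36, error_rate=0.01):
    """Toy posterior over candidate locations for one multiread.

    mismatches: mismatch count of the read at each candidate location.
    priors: positive prior weight for each location (need not sum to 1).
    Assumption (not from the paper): bases are miscalled independently
    with probability error_rate, each wrong base being equally likely.
    """
    if len(mismatches) != len(priors):
        raise ValueError("mismatches and priors must have the same length")
    if any(p <= 0 for p in priors):
        raise ValueError("prior weights must be positive")

    log_weights = []
    for m, p in zip(mismatches, priors):
        # log-likelihood of observing m mismatches in read_length bases
        log_lik = m * math.log(error_rate / 3) + (read_length - m) * math.log(1 - error_rate)
        log_weights.append(math.log(p) + log_lik)

    # Normalize in log space to avoid underflow for long reads
    top = max(log_weights)
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two candidate locations with equal mismatches; the second has 3x the prior weight
    print(multiread_posterior([1, 1], [1.0, 3.0]))
```

Under these assumptions, locations with equal mismatch counts split the read in proportion to their prior weights, while an extra mismatch shifts most of the probability to the better-matching location. The resulting probabilities could then serve as fractional read counts in a downstream analysis such as expression quantification.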

Original language	English (US)
Pages (from-to)	1215-1224
Number of pages	10
Journal	Biometrics
Volume	67
Issue number	4
DOIs	https://doi.org/10.1111/j.1541-0420.2011.01605.x
State	Published - Dec 2011

Keywords

Data augmentation
RNA-Seq
Read alignment
Short reads
Solexa sequencing
Transcriptome

ASJC Scopus subject areas

Statistics and Probability
General Biochemistry, Genetics and Molecular Biology
General Immunology and Microbiology
General Agricultural and Biological Sciences
Applied Mathematics

MD Anderson CCSG core facilities

Bioinformatics Shared Resource
Biostatistics Resource Group

Cite this

@article{582d6395c57544c295ff3afc3f4b8ca8,
  title = "BM-map: Bayesian mapping of multireads for next-generation sequencing data",
  abstract = "Next-generation sequencing (NGS) technology generates millions of short reads, which provide valuable information for various aspects of cellular activities and biological functions. A key step in NGS applications (e.g., RNA-Seq) is to map short reads to correct genomic locations within the source genome. While most reads are mapped to a unique location, a significant proportion of reads align to multiple genomic locations with equal or similar numbers of mismatches; these are called multireads. The ambiguity in mapping the multireads may lead to bias in downstream analyses. Currently, most practitioners discard the multireads in their analysis, resulting in a loss of valuable information, especially for the genes with similar sequences. To refine the read mapping, we develop a Bayesian model that computes the posterior probability of mapping a multiread to each competing location. The probabilities are used for downstream analyses, such as the quantification of gene expression. We show through simulation studies and RNA-Seq analysis of real life data that the Bayesian method yields better mapping than the current leading methods. We provide a C++ program for downloading that is being packaged into a user-friendly software.",
  keywords = "Data augmentation, RNA-Seq, Read alignment, Short reads, Solexa sequencing, Transcriptome",
  author = "Yuan Ji and Yanxun Xu and Qiong Zhang and Tsui, {Kam Wah} and Yuan Yuan and Clift Norris and Shoudan Liang and Han Liang",
  year = "2011",
  month = dec,
  doi = "10.1111/j.1541-0420.2011.01605.x",
  language = "English (US)",
  volume = "67",
  pages = "1215--1224",
  journal = "Biometrics",
  issn = "0006-341X",
  publisher = "Wiley-Blackwell",
  number = "4",
}

TY - JOUR
T1 - BM-map
T2 - Bayesian mapping of multireads for next-generation sequencing data
AU - Ji, Yuan
AU - Xu, Yanxun
AU - Zhang, Qiong
AU - Tsui, Kam Wah
AU - Yuan, Yuan
AU - Norris, Clift
AU - Liang, Shoudan
AU - Liang, Han
PY - 2011/12
Y1 - 2011/12
N2 - Next-generation sequencing (NGS) technology generates millions of short reads, which provide valuable information for various aspects of cellular activities and biological functions. A key step in NGS applications (e.g., RNA-Seq) is to map short reads to correct genomic locations within the source genome. While most reads are mapped to a unique location, a significant proportion of reads align to multiple genomic locations with equal or similar numbers of mismatches; these are called multireads. The ambiguity in mapping the multireads may lead to bias in downstream analyses. Currently, most practitioners discard the multireads in their analysis, resulting in a loss of valuable information, especially for the genes with similar sequences. To refine the read mapping, we develop a Bayesian model that computes the posterior probability of mapping a multiread to each competing location. The probabilities are used for downstream analyses, such as the quantification of gene expression. We show through simulation studies and RNA-Seq analysis of real life data that the Bayesian method yields better mapping than the current leading methods. We provide a C++ program for downloading that is being packaged into a user-friendly software.
AB - Next-generation sequencing (NGS) technology generates millions of short reads, which provide valuable information for various aspects of cellular activities and biological functions. A key step in NGS applications (e.g., RNA-Seq) is to map short reads to correct genomic locations within the source genome. While most reads are mapped to a unique location, a significant proportion of reads align to multiple genomic locations with equal or similar numbers of mismatches; these are called multireads. The ambiguity in mapping the multireads may lead to bias in downstream analyses. Currently, most practitioners discard the multireads in their analysis, resulting in a loss of valuable information, especially for the genes with similar sequences. To refine the read mapping, we develop a Bayesian model that computes the posterior probability of mapping a multiread to each competing location. The probabilities are used for downstream analyses, such as the quantification of gene expression. We show through simulation studies and RNA-Seq analysis of real life data that the Bayesian method yields better mapping than the current leading methods. We provide a C++ program for downloading that is being packaged into a user-friendly software.
KW - Data augmentation
KW - RNA-Seq
KW - Read alignment
KW - Short reads
KW - Solexa sequencing
KW - Transcriptome
UR - http://www.scopus.com/inward/record.url?scp=83655163970&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83655163970&partnerID=8YFLogxK
U2 - 10.1111/j.1541-0420.2011.01605.x
DO - 10.1111/j.1541-0420.2011.01605.x
M3 - Article
C2 - 21517792
AN - SCOPUS:83655163970
SN - 0006-341X
VL - 67
SP - 1215
EP - 1224
JO - Biometrics
JF - Biometrics
IS - 4
ER -
