A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase

Paul Scheet; Matthew Stephens

doi:10.1086/502802

Research output: Contribution to journal › Article › peer-review

1492 Scopus citations

Abstract

We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
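The abstract's central idea, cluster memberships that change along the chromosome according to a hidden Markov model, can be illustrated with a small forward-backward pass. The sketch below is a simplified haploid toy version, not the fastPHASE implementation. The names `theta`, `alpha`, and `jump`, the transition form (keep the current cluster, or redraw it from the local cluster weights), and all numbers in the example are assumptions made for illustration.

```python
def forward_backward(hap, theta, alpha, jump):
    """Toy cluster HMM for one haplotype (illustrative assumptions throughout).

    hap   : alleles per marker (0, 1, or None when missing)
    theta : theta[m][k] = P(allele 1 | cluster k) at marker m
    alpha : alpha[m][k] = weight of cluster k at marker m (rows sum to 1)
    jump  : jump[m] = P(cluster is redrawn between markers m-1 and m); jump[0] unused
    Returns the likelihood of the observed alleles and the posterior
    cluster-membership probabilities at every marker.
    """
    M, K = len(hap), len(alpha[0])

    def emit(m, k):
        # A missing allele carries no information, so its emission is 1.
        if hap[m] is None:
            return 1.0
        return theta[m][k] if hap[m] == 1 else 1.0 - theta[m][k]

    # Forward pass: fwd[m][k] = P(alleles up to m, cluster k at m).
    fwd = [[alpha[0][k] * emit(0, k) for k in range(K)]]
    for m in range(1, M):
        total = sum(fwd[-1])
        fwd.append([
            ((1 - jump[m]) * fwd[-1][k] + jump[m] * alpha[m][k] * total) * emit(m, k)
            for k in range(K)
        ])
    likelihood = sum(fwd[-1])

    # Backward pass: bwd[m][k] = P(alleles after m | cluster k at m).
    bwd = [[1.0] * K for _ in range(M)]
    for m in range(M - 2, -1, -1):
        redraw = sum(alpha[m + 1][j] * emit(m + 1, j) * bwd[m + 1][j] for j in range(K))
        bwd[m] = [
            (1 - jump[m + 1]) * emit(m + 1, k) * bwd[m + 1][k] + jump[m + 1] * redraw
            for k in range(K)
        ]

    post = [[fwd[m][k] * bwd[m][k] / likelihood for k in range(K)] for m in range(M)]
    return likelihood, post


def impute_allele(m, post, theta):
    """P(allele 1) at a missing marker m, averaging over posterior clusters."""
    return sum(p * t for p, t in zip(post[m], theta[m]))


# Hypothetical example: two clusters, three markers, middle allele missing.
theta = [[0.99, 0.01]] * 3   # cluster 0 mostly carries allele 1, cluster 1 allele 0
alpha = [[0.5, 0.5]] * 3     # equal cluster weights at every marker
jump = [0.0, 0.01, 0.01]     # small chance of switching cluster between markers
hap = [1, None, 1]

likelihood, post = forward_backward(hap, theta, alpha, jump)
p_missing = impute_allele(1, post, theta)
```

Because the flanking alleles both match cluster 0 and the jump probability is small, the posterior at the missing marker concentrates on cluster 0, so `p_missing` comes out close to cluster 0's allele frequency. The cost of each pass is linear in the number of markers and in the number of clusters, which is one way to read the abstract's claim that the approach is practicable for large data sets.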

Original language	English (US)
Pages (from-to)	629-644
Number of pages	16
Journal	American Journal of Human Genetics
Volume	78
Issue number	4
DOIs	https://doi.org/10.1086/502802
State	Published - Apr 2006
Externally published	Yes

ASJC Scopus subject areas

Genetics
Genetics (clinical)

Access to Document

10.1086/502802

Cite this

@article{0d0280ffde794ac19baa20f37d7c87de,

title = "A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase",

abstract = "We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both {"}block-like{"} patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.",

author = "Paul Scheet and Matthew Stephens",

note = "Funding Information: We thank C. Geyer, for helpful discussions, and two anonymous referees, for comments on the submitted version. This research was supported by National Institutes of Health grant 1RO1HG/LM02585-01. ",

year = "2006",

month = apr,

doi = "10.1086/502802",

language = "English (US)",

volume = "78",

pages = "629--644",

journal = "American Journal of Human Genetics",

issn = "0002-9297",

publisher = "Cell Press",

number = "4",

}

TY - JOUR

T1 - A fast and flexible statistical model for large-scale population genotype data

T2 - Applications to inferring missing genotypes and haplotypic phase

AU - Scheet, Paul

AU - Stephens, Matthew

N1 - Funding Information: We thank C. Geyer, for helpful discussions, and two anonymous referees, for comments on the submitted version. This research was supported by National Institutes of Health grant 1RO1HG/LM02585-01.

PY - 2006/4

Y1 - 2006/4

AB - We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.

UR - http://www.scopus.com/inward/record.url?scp=33644974019&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33644974019&partnerID=8YFLogxK

U2 - 10.1086/502802

DO - 10.1086/502802

M3 - Article

C2 - 16532393

AN - SCOPUS:33644974019

SN - 0002-9297

VL - 78

SP - 629

EP - 644

JO - American Journal of Human Genetics

JF - American Journal of Human Genetics

IS - 4

ER -
