Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

Vlad Popovici; Weijie Chen; Brandon G. Gallas; Christos Hatzis; Weiwei Shi; Frank W. Samuelson; Yuri Nikolsky; Marina Tsyganova; Alex Ishkin; Tatiana Nikolskaya; Kenneth R. Hess; Vicente Valero; Daniel Booser; Mauro Delorenzi; Gabriel N. Hortobagyi; Leming Shi; W. Fraser Symmans; Lajos Pusztai

doi:10.1186/bcr2468

Research output: Contribution to journal › Article › peer-review

158 Scopus citations

Abstract

Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.

Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.

Results: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models.

Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
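The model counts quoted in the abstract follow from a simple cross product: five feature-selection methods combined with eight classifiers give 40 predictors per endpoint, and three endpoints give 120 models in total. The short Python sketch below only illustrates that enumeration. The abstract states counts, not names, so the labels `fs1`, `clf1`, `endpoint_A` and so on are hypothetical placeholders, and `enumerate_models` is an illustrative helper rather than anything from the study.

```python
from itertools import product

# Placeholder labels only: the abstract gives counts, not names.
FEATURE_SELECTORS = [f"fs{i}" for i in range(1, 6)]      # 5 univariate feature-selection methods
CLASSIFIERS = [f"clf{i}" for i in range(1, 9)]           # 8 classifiers
ENDPOINTS = ["endpoint_A", "endpoint_B", "endpoint_C"]   # 3 clinical endpoints


def enumerate_models(endpoints, selectors, classifiers):
    """Return every (endpoint, selector, classifier) combination as a tuple."""
    return list(product(endpoints, selectors, classifiers))


predictors_per_endpoint = len(FEATURE_SELECTORS) * len(CLASSIFIERS)
models = enumerate_models(ENDPOINTS, FEATURE_SELECTORS, CLASSIFIERS)

print(predictors_per_endpoint)  # 40
print(len(models))              # 120
```

Each tuple in `models` stands for one predictor that would be trained and then scored on the independent validation set. The training and scoring steps are deliberately left out, since the record does not describe them in enough detail to reproduce.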

Original language	English (US)
Article number	R5
Journal	Breast Cancer Research
Volume	12
Issue number	1
DOIs	https://doi.org/10.1186/bcr2468
State	Published - Jan 11 2010

ASJC Scopus subject areas

Oncology
Cancer Research

MD Anderson CCSG core facilities

Biostatistics Resource Group

Cite this

Popovici, V., Chen, W., Gallas, B. G., Hatzis, C., Shi, W., Samuelson, F. W., Nikolsky, Y., Tsyganova, M., Ishkin, A., Nikolskaya, T., Hess, K. R., Valero, V., Booser, D., Delorenzi, M., Hortobagyi, G. N., Shi, L., Symmans, W. F., & Pusztai, L. (2010). Effect of training-sample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Research, 12(1), Article R5. https://doi.org/10.1186/bcr2468

Popovici, V, Chen, W, Gallas, BG, Hatzis, C, Shi, W, Samuelson, FW, Nikolsky, Y, Tsyganova, M, Ishkin, A, Nikolskaya, T, Hess, KR, Valero, V, Booser, D, Delorenzi, M, Hortobagyi, GN, Shi, L, Symmans, WF & Pusztai, L 2010, 'Effect of training-sample size and classification difficulty on the accuracy of genomic predictors', Breast Cancer Research, vol. 12, no. 1, R5. https://doi.org/10.1186/bcr2468

@article{37260ec83a3943b89ac98fc60e3a7cc5,

title = "Effect of training-sample size and classification difficulty on the accuracy of genomic predictors",

abstract = "Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.",

author = "Vlad Popovici and Weijie Chen and Gallas, {Brandon G.} and Christos Hatzis and Weiwei Shi and Samuelson, {Frank W.} and Yuri Nikolsky and Marina Tsyganova and Alex Ishkin and Tatiana Nikolskaya and Hess, {Kenneth R.} and Vicente Valero and Daniel Booser and Mauro Delorenzi and Hortobagyi, {Gabriel N.} and Leming Shi and Symmans, {W. Fraser} and Lajos Pusztai",

note = "Funding Information: This research was supported by grants from the NCI R-01 program (LP), The Breast Cancer Research Foundation (LP and WFS), The MD Anderson Cancer Center Faculty Incentive Funds (WFS), and the Commonwealth Cancer Foundation (LP, WFS). VP and MD acknowledge the support of the Swiss National Science Foundation NCCR Molecular Oncology. Certain commercial materials and equipment are identified to specify experimental procedures adequately. In no case does such identification imply recommendation or endorsement by the FDA, nor does it imply that the items identified are necessarily the best available for the purpose. The views presented in this article do not necessarily reflect those of the U.S. Food and Drug Administration.",

year = "2010",

month = jan,

day = "11",

doi = "10.1186/bcr2468",

language = "English (US)",

volume = "12",

journal = "Breast Cancer Research",

issn = "1465-5411",

publisher = "BioMed Central Ltd.",

number = "1",

}

TY - JOUR

T1 - Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

AU - Popovici, Vlad

AU - Chen, Weijie

AU - Gallas, Brandon G.

AU - Hatzis, Christos

AU - Shi, Weiwei

AU - Samuelson, Frank W.

AU - Nikolsky, Yuri

AU - Tsyganova, Marina

AU - Ishkin, Alex

AU - Nikolskaya, Tatiana

AU - Hess, Kenneth R.

AU - Valero, Vicente

AU - Booser, Daniel

AU - Delorenzi, Mauro

AU - Hortobagyi, Gabriel N.

AU - Shi, Leming

AU - Symmans, W. Fraser

AU - Pusztai, Lajos

N1 - Funding Information: This research was supported by grants from the NCI R-01 program (LP), The Breast Cancer Research Foundation (LP and WFS), The MD Anderson Cancer Center Faculty Incentive Funds (WFS), and the Commonwealth Cancer Foundation (LP, WFS). VP and MD acknowledge the support of the Swiss National Science Foundation NCCR Molecular Oncology. Certain commercial materials and equipment are identified to specify experimental procedures adequately. In no case does such identification imply recommendation or endorsement by the FDA, nor does it imply that the items identified are necessarily the best available for the purpose. The views presented in this article do not necessarily reflect those of the U.S. Food and Drug Administration.

PY - 2010/1/11

Y1 - 2010/1/11

AB - Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.

UR - http://www.scopus.com/inward/record.url?scp=77954168391&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77954168391&partnerID=8YFLogxK

U2 - 10.1186/bcr2468

DO - 10.1186/bcr2468

M3 - Article

C2 - 20064235

AN - SCOPUS:77954168391

SN - 1465-5411

VL - 12

JO - Breast Cancer Research

JF - Breast Cancer Research

IS - 1

M1 - R5

ER -
