TY - GEN
T1 - Ensemble stump classifiers and gene expression signatures in lung cancer
AU - Frey, Lewis
AU - Edgerton, Mary
AU - Fisher, Douglas
AU - Levy, Shawn
PY - 2007
Y1 - 2007
N2 - Microarray data sets for cancer tumor tissue generally have very few samples, each sample having thousands of probes (i.e., continuous variables). The sparsity of samples makes it difficult for machine learning techniques to discover probes relevant to the classification of tumor tissue. By combining data from different platforms (i.e., data sources), data sparsity is reduced, but this typically requires normalizing data from the different platforms, which can be non-trivial. This paper proposes a variant on the idea of ensemble learners to circumvent the need for normalization. To facilitate comprehension, we build ensembles of very simple classifiers known as decision stumps, that is, decision trees of one test each. The Ensemble Stump Classifier (ESC) identifies an mRNA signature having three probes and high accuracy for distinguishing between adenocarcinoma and squamous cell carcinoma of the lung across four data sets. In terms of accuracy, ESC outperforms a decision tree classifier on all four data sets, outperforms ensemble decision trees on three data sets, and outperforms simple stump classifiers on two data sets.
AB - Microarray data sets for cancer tumor tissue generally have very few samples, each sample having thousands of probes (i.e., continuous variables). The sparsity of samples makes it difficult for machine learning techniques to discover probes relevant to the classification of tumor tissue. By combining data from different platforms (i.e., data sources), data sparsity is reduced, but this typically requires normalizing data from the different platforms, which can be non-trivial. This paper proposes a variant on the idea of ensemble learners to circumvent the need for normalization. To facilitate comprehension, we build ensembles of very simple classifiers known as decision stumps, that is, decision trees of one test each. The Ensemble Stump Classifier (ESC) identifies an mRNA signature having three probes and high accuracy for distinguishing between adenocarcinoma and squamous cell carcinoma of the lung across four data sets. In terms of accuracy, ESC outperforms a decision tree classifier on all four data sets, outperforms ensemble decision trees on three data sets, and outperforms simple stump classifiers on two data sets.
KW - decision trees
KW - ensembles
KW - microarray
KW - stumps
UR - http://www.scopus.com/inward/record.url?scp=35748968226&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35748968226&partnerID=8YFLogxK
M3 - Conference contribution
C2 - 17911916
AN - SCOPUS:35748968226
SN - 9781586037741
T3 - Studies in Health Technology and Informatics
SP - 1255
EP - 1259
BT - MEDINFO 2007 - Proceedings of the 12th World Congress on Health (Medical) Informatics
PB - IOS Press
T2 - 12th World Congress on Medical Informatics, MEDINFO 2007
Y2 - 20 August 2007 through 24 August 2007
ER -