TY - JOUR
T1 - Bigger data is better for molecular diagnosis tests based on decision trees
AU - Floares, Alexandru G.
AU - Calin, George A.
AU - Manolache, Florin B.
N1 - Funding Information:
This work was supported by the research grants UEFISCDI PN-II-PT-PCCA-2013-4-1959 INTELCOR and UEFISCDI PN-II-PT-PCCA-2011-3.1-1221 IntelUro, financed by the Romanian Ministry of Education and Scientific Research.
Publisher Copyright:
© Springer International Publishing Switzerland 2016.
PY - 2016
Y1 - 2016
N2 - Most molecular diagnosis tests are based on small studies with about twenty patients, and use classical statistics. The prevailing conception is that such studies can indeed yield accurate tests with just one or two predictors, especially when using informative molecules like microRNA in cancer diagnosis. We investigated the relationship between accuracy, the number of microRNA predictors, and the sample size of the dataset used in developing cancer diagnosis tests. The generalization capability of the tests was also investigated. One of the largest existing free breast cancer datasets was used for binary classification (cancer versus normal) with C5 and CART decision trees. The results show that diagnosis tests with a good compromise between accuracy and the number of predictors (related to costs) can be obtained with C5 or CART on a sample size of more than 100 patients. These tests generalize well.
AB - Most molecular diagnosis tests are based on small studies with about twenty patients, and use classical statistics. The prevailing conception is that such studies can indeed yield accurate tests with just one or two predictors, especially when using informative molecules like microRNA in cancer diagnosis. We investigated the relationship between accuracy, the number of microRNA predictors, and the sample size of the dataset used in developing cancer diagnosis tests. The generalization capability of the tests was also investigated. One of the largest existing free breast cancer datasets was used for binary classification (cancer versus normal) with C5 and CART decision trees. The results show that diagnosis tests with a good compromise between accuracy and the number of predictors (related to costs) can be obtained with C5 or CART on a sample size of more than 100 patients. These tests generalize well.
UR - http://www.scopus.com/inward/record.url?scp=85007557218&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85007557218&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-40973-3_29
DO - 10.1007/978-3-319-40973-3_29
M3 - Article
AN - SCOPUS:85007557218
SN - 0302-9743
VL - 9714 LNCS
SP - 288
EP - 295
JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER -