TY - GEN
T1 - Adapting support vector machines to predict translation initiation sites in the human genome
AU - Akbani, Rehan
AU - Kwek, Stephen
N1 - Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2005
Y1 - 2005
AB - This study is concerned with predicting Translation Initiation Sites (TIS) in the human genome that start with the nucleotide sequence ATG. This sequence occurs 104 million times in the entire genome. However, current estimates suggest that the human genome contains only about 30,000 TIS, giving an imbalance ratio of about 1:3500 for TIS ATG vs. non-TIS ATG sites. Algorithms designed using datasets with a low imbalance ratio may not be well suited to predicting TIS at the genomic level. In this paper, we modify the SVM algorithm so that it can handle a moderately high imbalance ratio. The F-measures obtained were: Linear Discriminant 0%, SVM with under-sampling 4.1%, SVM with over-sampling 8.2%, Neural Network 13.3%, Decision Tree 20%, and our approach 44%. This shows how poorly standard approaches perform at the genomic level due to the high imbalance ratio. Our approach more than doubles the F-measure of the best of these (44% vs. 20%).
UR - http://www.scopus.com/inward/record.url?scp=33749080543&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33749080543&partnerID=8YFLogxK
U2 - 10.1109/CSBW.2005.18
DO - 10.1109/CSBW.2005.18
M3 - Conference contribution
AN - SCOPUS:33749080543
SN - 0769524427
SN - 9780769524429
T3 - 2005 IEEE Computational Systems Bioinformatics Conference, Workshops and Poster Abstracts
SP - 143
EP - 145
BT - 2005 IEEE Computational Systems Bioinformatics Conference, Workshops and Poster Abstracts
T2 - 2005 IEEE Computational Systems Bioinformatics Conference, Workshops and Poster Abstracts
Y2 - 8 August 2005 through 11 August 2005
ER -