Applying support vector machines to imbalanced datasets

Rehan Akbani, Stephen Kwek, Nathalie Japkowicz

Research output: Contribution to journal (conference article, peer-reviewed)

902 Scopus citations

Abstract

Support Vector Machines (SVM) have been extensively studied and have shown remarkable success in many applications. However, the success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets, in which negative instances heavily outnumber the positive instances (e.g. in gene profiling and credit card fraud detection). This paper discusses the factors behind this failure and explains why the common strategy of undersampling the training data may not be the best choice for SVM. We then propose an algorithm for overcoming these problems, based on a variant of the SMOTE algorithm by Chawla et al. combined with Veropoulos et al.'s different error costs algorithm. We compare the performance of our algorithm against these two algorithms, along with undersampling and regular SVM, and show that our algorithm outperforms all of them.
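The sketch below is not the authors' exact method; it is a minimal illustration of the two ingredients named in the abstract, using scikit-learn's class-weighted SVC as a stand-in for Veropoulos et al.'s different error costs and imbalanced-learn's standard SMOTE as a stand-in for the paper's SMOTE variant. The dataset, class weights, and parameters are assumptions chosen for illustration only.

```python
# Sketch: combine SMOTE-style oversampling of the minority class with
# different error costs for an SVM on an imbalanced dataset.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced dataset: negatives heavily outnumber positives.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample the minority (positive) class in the training set only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Assign a higher misclassification cost to the positive class
# (the "different error costs" idea); the weight of 5 is illustrative.
svm = SVC(kernel="rbf", class_weight={0: 1, 1: 5})
svm.fit(X_res, y_res)

print(classification_report(y_test, svm.predict(X_test)))
```

Resampling is applied only to the training split so that the test set keeps its original class distribution, which is what the reported metrics should reflect.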

Original language: English (US)
Pages (from-to): 39-50
Number of pages: 12
Journal: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)
Volume: 3201
State: Published - 2004
Externally published: Yes
Event: 15th European Conference on Machine Learning, ECML 2004 - Pisa, Italy
Duration: Sep 20, 2004 - Sep 24, 2004

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
