On the computation of stochastic search variable selection in linear regression with UDFs

Mario Navas; Carlos Ordonez; Veerabhadran Baladandayuthapani

doi:10.1109/ICDM.2010.79

Biostatistics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations

Abstract

Computing Bayesian statistics with traditional techniques is extremely slow, especially when large data sets have to be exported from a relational DBMS. We propose algorithms for large-scale processing of stochastic search variable selection (SSVS) for linear regression that can work entirely inside a DBMS. The traditional SSVS algorithm requires multiple scans of the input data in order to compute a regression model. With our optimizations, SSVS can be done either in one scan over the input table, using sufficient statistics, for data with a large number of records, or in one scan per iteration for high-dimensional data. We consider storage layouts that efficiently exploit DBMS parallel processing of aggregate functions. Experimental results demonstrate the correctness, convergence and performance of our algorithms. Finally, the algorithms show good scalability for data with a very large number of records or a very high number of dimensions.
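The one-scan idea in the abstract can be illustrated with a minimal sketch. It is not the authors' UDF implementation and it does not run the SSVS Gibbs sampler; it only shows, under stated assumptions, how a single pass can accumulate summary statistics from which a least-squares model for any variable subset is later fitted without rescanning the data. The function names (`scan_sufficient_stats`, `solve`, `fit_subset`) and the choice of one augmented matrix `Q` are hypothetical, chosen for illustration.

```python
from typing import Iterable, List, Sequence, Tuple

Matrix = List[List[float]]


def scan_sufficient_stats(rows: Iterable[Sequence[float]]) -> Tuple[int, Matrix]:
    """Single pass over rows shaped (x_1, ..., x_d, y).

    Accumulates Q = sum of z z^T with z = (1, x_1, ..., x_d, y).
    Q then holds n, the linear sums, X^T X, X^T y and y^T y.
    Assumption: this augmented layout is for illustration only.
    """
    n = 0
    Q: Matrix = []
    for row in rows:
        z = [1.0] + [float(v) for v in row]
        if not Q:
            Q = [[0.0] * len(z) for _ in z]
        for a, za in enumerate(z):
            for b, zb in enumerate(z):
                Q[a][b] += za * zb
        n += 1
    return n, Q


def solve(A: Matrix, b: List[float]) -> List[float]:
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(A[i]) + [b[i]] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * m
    for i in reversed(range(m)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, m))
        w[i] = (M[i][m] - s) / M[i][i]
    return w


def fit_subset(Q: Matrix, subset: Sequence[int]) -> List[float]:
    """Least-squares coefficients (intercept first) for the chosen
    0-based input columns, computed from Q alone (no data rescan)."""
    idx = [0] + [j + 1 for j in subset]   # row/column 0 is the intercept
    y = len(Q) - 1                        # last row/column belongs to y
    A = [[Q[a][b] for b in idx] for a in idx]
    rhs = [Q[a][y] for a in idx]
    return solve(A, rhs)


if __name__ == "__main__":
    data = [(0, 1, 1), (1, 0, 3), (2, 1, 5), (3, 0, 7)]  # (x1, x2, y)
    n, Q = scan_sufficient_stats(data)
    print(n, fit_subset(Q, [0]), fit_subset(Q, [0, 1]))
```

Because `Q` has size (d+2) x (d+2) regardless of the number of records, candidate subsets can be refitted from it repeatedly. This is the property that makes a single scan attractive when records are many and dimensions are few.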

Original language	English (US)
Title of host publication	Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
Pages	941-946
Number of pages	6
DOIs	https://doi.org/10.1109/ICDM.2010.79
State	Published - 2010
Event	10th IEEE International Conference on Data Mining, ICDM 2010 - Sydney, NSW, Australia
Duration	Dec 14 2010 → Dec 17 2010

Publication series

Name	Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print)	1550-4786

Other

Other	10th IEEE International Conference on Data Mining, ICDM 2010
Country/Territory	Australia
City	Sydney, NSW
Period	12/14/10 → 12/17/10

Keywords

Bayesian statistics
UDF
Variable selection

ASJC Scopus subject areas

General Engineering

Cite this

On the computation of stochastic search variable selection in linear regression with UDFs. / Navas, Mario; Ordonez, Carlos; Baladandayuthapani, Veerabhadran.
Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010. 2010. p. 941-946 5694065 (Proceedings - IEEE International Conference on Data Mining, ICDM).

Navas, M, Ordonez, C & Baladandayuthapani, V 2010, On the computation of stochastic search variable selection in linear regression with UDFs. in Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010., 5694065, Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 941-946, 10th IEEE International Conference on Data Mining, ICDM 2010, Sydney, NSW, Australia, 12/14/10. https://doi.org/10.1109/ICDM.2010.79

@inproceedings{e59f7c3e9b4d40b5beef430a5003531d,

title = "On the computation of stochastic search variable selection in linear regression with UDFs",

abstract = "Computing Bayesian statistics with traditional techniques is extremely slow, especially when large data sets have to be exported from a relational DBMS. We propose algorithms for large-scale processing of stochastic search variable selection (SSVS) for linear regression that can work entirely inside a DBMS. The traditional SSVS algorithm requires multiple scans of the input data in order to compute a regression model. With our optimizations, SSVS can be done either in one scan over the input table, using sufficient statistics, for data with a large number of records, or in one scan per iteration for high-dimensional data. We consider storage layouts that efficiently exploit DBMS parallel processing of aggregate functions. Experimental results demonstrate the correctness, convergence and performance of our algorithms. Finally, the algorithms show good scalability for data with a very large number of records or a very high number of dimensions.",

keywords = "Bayesian statistics, UDF, Variable selection",

author = "Mario Navas and Carlos Ordonez and Veerabhadran Baladandayuthapani",

year = "2010",

doi = "10.1109/ICDM.2010.79",

language = "English (US)",

isbn = "9780769542560",

series = "Proceedings - IEEE International Conference on Data Mining, ICDM",

pages = "941--946",

booktitle = "Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010",

}

TY - GEN

T1 - On the computation of stochastic search variable selection in linear regression with UDFs

AU - Navas, Mario

AU - Ordonez, Carlos

AU - Baladandayuthapani, Veerabhadran

PY - 2010

Y1 - 2010

N2 - Computing Bayesian statistics with traditional techniques is extremely slow, especially when large data sets have to be exported from a relational DBMS. We propose algorithms for large-scale processing of stochastic search variable selection (SSVS) for linear regression that can work entirely inside a DBMS. The traditional SSVS algorithm requires multiple scans of the input data in order to compute a regression model. With our optimizations, SSVS can be done either in one scan over the input table, using sufficient statistics, for data with a large number of records, or in one scan per iteration for high-dimensional data. We consider storage layouts that efficiently exploit DBMS parallel processing of aggregate functions. Experimental results demonstrate the correctness, convergence and performance of our algorithms. Finally, the algorithms show good scalability for data with a very large number of records or a very high number of dimensions.

AB - Computing Bayesian statistics with traditional techniques is extremely slow, especially when large data sets have to be exported from a relational DBMS. We propose algorithms for large-scale processing of stochastic search variable selection (SSVS) for linear regression that can work entirely inside a DBMS. The traditional SSVS algorithm requires multiple scans of the input data in order to compute a regression model. With our optimizations, SSVS can be done either in one scan over the input table, using sufficient statistics, for data with a large number of records, or in one scan per iteration for high-dimensional data. We consider storage layouts that efficiently exploit DBMS parallel processing of aggregate functions. Experimental results demonstrate the correctness, convergence and performance of our algorithms. Finally, the algorithms show good scalability for data with a very large number of records or a very high number of dimensions.

KW - Bayesian statistics

KW - UDF

KW - Variable selection

UR - http://www.scopus.com/inward/record.url?scp=79951756403&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79951756403&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2010.79

DO - 10.1109/ICDM.2010.79

M3 - Conference contribution

AN - SCOPUS:79951756403

SN - 9780769542560

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 941

EP - 946

BT - Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010

T2 - 10th IEEE International Conference on Data Mining, ICDM 2010

Y2 - 14 December 2010 through 17 December 2010

ER -
