TY - GEN
T1 - On the computation of stochastic search variable selection in linear regression with UDFs
AU - Navas, Mario
AU - Ordonez, Carlos
AU - Baladandayuthapani, Veerabhadran
N1 - Copyright:
Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2010
Y1 - 2010
N2 - Computing Bayesian statistics with traditional techniques is extremely slow, especially when large data sets have to be exported from a relational DBMS. We propose algorithms for large-scale processing of stochastic search variable selection (SSVS) for linear regression that can work entirely inside a DBMS. The traditional SSVS algorithm requires multiple scans of the input data in order to compute a regression model. With our optimizations, SSVS can be done either in one scan over the input table, using sufficient statistics, for data with a large number of records, or in one scan per iteration for high-dimensional data. We consider storage layouts that efficiently exploit DBMS parallel processing of aggregate functions. Experimental results demonstrate the correctness, convergence and performance of our algorithms. Finally, the algorithms show good scalability for data with a very large number of records or a very high number of dimensions.
AB - Computing Bayesian statistics with traditional techniques is extremely slow, especially when large data sets have to be exported from a relational DBMS. We propose algorithms for large-scale processing of stochastic search variable selection (SSVS) for linear regression that can work entirely inside a DBMS. The traditional SSVS algorithm requires multiple scans of the input data in order to compute a regression model. With our optimizations, SSVS can be done either in one scan over the input table, using sufficient statistics, for data with a large number of records, or in one scan per iteration for high-dimensional data. We consider storage layouts that efficiently exploit DBMS parallel processing of aggregate functions. Experimental results demonstrate the correctness, convergence and performance of our algorithms. Finally, the algorithms show good scalability for data with a very large number of records or a very high number of dimensions.
KW - Bayesian statistics
KW - UDF
KW - Variable selection
UR - http://www.scopus.com/inward/record.url?scp=79951756403&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79951756403&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2010.79
DO - 10.1109/ICDM.2010.79
M3 - Conference contribution
AN - SCOPUS:79951756403
SN - 9780769542560
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 941
EP - 946
BT - Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
T2 - 10th IEEE International Conference on Data Mining, ICDM 2010
Y2 - 14 December 2010 through 17 December 2010
ER -