CTD: An information-theoretic algorithm to interpret sets of metabolomic and transcriptomic perturbations in the context of graphical models

Lillian R. Thistlethwaite; Varduhi Petrosyan; Xiqi Li; Marcus J. Miller; Sarah H. Elsea; Aleksandar Milosavljevic

doi:10.1371/JOURNAL.PCBI.1008550

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

We consider the following general family of algorithmic problems that arises in transcriptomics, metabolomics and other fields: given a weighted graph G and a subset of its nodes S, find subsets of S that show significant connectedness within G. A specific solution to this problem may be defined by devising a scoring function, the Maximum Clique problem being a classic example, where S includes all nodes in G and where the score is defined by the size of the largest subset of S fully connected within G. Major practical obstacles for the plethora of algorithms addressing this type of problem include computational efficiency and, particularly for more complex scores which take edge weights into account, the computational cost of permutation testing, a statistical procedure required to obtain a bound on the p-value for a connectedness score. To address these problems, we developed CTD, "Connect the Dots", a fast algorithm based on data compression that detects highly connected subsets within S. CTD provides information-theoretic upper bounds on p-values when S contains a small fraction of nodes in G without requiring computationally costly permutation testing. We apply the CTD algorithm to interpret multi-metabolite perturbations due to inborn errors of metabolism and multi-transcript perturbations associated with breast cancer in the context of disease-specific Gaussian Markov Random Field networks learned directly from respective molecular profiling data.
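The contrast the abstract draws between permutation testing and a compression-based bound can be illustrated with a small sketch. This is not the authors' implementation of CTD. The scoring function (sum of edge weights inside the subset), the uniform null code over k-subsets, and all function names below are assumptions made for illustration. The one general fact relied on is the standard compression argument: if a decodable code fixed in advance describes the observation in d fewer bits than the null model's code, the p-value under the null is at most 2^-d.

```python
import math
import random
from itertools import combinations


def connectedness_score(weights, subset):
    """Illustrative score: total weight of edges whose endpoints are both in subset.

    weights maps frozenset({u, v}) -> edge weight; missing pairs count as 0.
    """
    return sum(
        weights.get(frozenset(pair), 0.0)
        for pair in combinations(sorted(subset), 2)
    )


def permutation_p_value(weights, nodes, subset, n_perm=1000, seed=0):
    """Empirical p-value: compare the observed score with random same-size subsets.

    Cost grows with n_perm times the cost of one score evaluation, which is
    the expense the abstract refers to.
    """
    rng = random.Random(seed)
    observed = connectedness_score(weights, subset)
    k = len(subset)
    hits = sum(
        connectedness_score(weights, rng.sample(nodes, k)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def null_code_length_bits(n_nodes, k):
    """Bits needed to name one k-subset of n_nodes under a uniform null."""
    return math.log2(math.comb(n_nodes, k))


def compression_p_value_bound(null_bits, encoded_bits):
    """Upper bound on the p-value from d = null_bits - encoded_bits saved bits."""
    d = null_bits - encoded_bits
    return min(1.0, 2.0 ** (-d))
```

For example, under these assumptions, a subset that costs 26 bits under the null code but only 16 bits under a graph-aware code saves d = 10 bits, so `compression_p_value_bound(26.0, 16.0)` gives a bound of 2^-10 without drawing any random subsets. How CTD actually constructs its graph-aware encoding is described in the paper itself and is not reproduced here.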

Original language	English (US)
Article number	e1008550
Journal	PLoS Computational Biology
Volume	17
Issue number	1
DOIs	https://doi.org/10.1371/JOURNAL.PCBI.1008550
State	Published - Jan 29 2021
Externally published	Yes

ASJC Scopus subject areas

Ecology, Evolution, Behavior and Systematics
Modeling and Simulation
Ecology
Molecular Biology
Genetics
Cellular and Molecular Neuroscience
Computational Theory and Mathematics

Cite this

@article{7432785c916a41a3b5729e899a907e5c,

title = "CTD: An information-theoretic algorithm to interpret sets of metabolomic and transcriptomic perturbations in the context of graphical models",

abstract = "We consider the following general family of algorithmic problems that arises in transcriptomics, metabolomics and other fields: given a weighted graph G and a subset of its nodes S, find subsets of S that show significant connectedness within G. A specific solution to this problem may be defined by devising a scoring function, the Maximum Clique problem being a classic example, where S includes all nodes in G and where the score is defined by the size of the largest subset of S fully connected within G. Major practical obstacles for the plethora of algorithms addressing this type of problem include computational efficiency and, particularly for more complex scores which take edge weights into account, the computational cost of permutation testing, a statistical procedure required to obtain a bound on the p-value for a connectedness score. To address these problems, we developed CTD, {"}Connect the Dots{"}, a fast algorithm based on data compression that detects highly connected subsets within S. CTD provides information-theoretic upper bounds on p-values when S contains a small fraction of nodes in G without requiring computationally costly permutation testing. We apply the CTD algorithm to interpret multi-metabolite perturbations due to inborn errors of metabolism and multi-transcript perturbations associated with breast cancer in the context of disease-specific Gaussian Markov Random Field networks learned directly from respective molecular profiling data.",

author = "Thistlethwaite, {Lillian R.} and Petrosyan, Varduhi and Li, Xiqi and Miller, {Marcus J.} and Elsea, {Sarah H.} and Milosavljevic, Aleksandar",

note = "Funding Information: L.R.T. was supported by a training fellowship from the Gulf Coast Consortia, on the NLM Biomedical Informatics Training Program [Grant No. T15 LM007093]. A.M. was supported by the Henry and Emma Meyer Professorship in Molecular Genetics, grants U54-DA036134, U54-DA049098, and U41-HG009649. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publisher Copyright: {\textcopyright} 2021 Thistlethwaite et al.",

year = "2021",

month = jan,

day = "29",

doi = "10.1371/JOURNAL.PCBI.1008550",

language = "English (US)",

volume = "17",

journal = "PLoS Computational Biology",

issn = "1553-734X",

publisher = "Public Library of Science",

number = "1",

}

TY - JOUR

T1 - CTD: An information-theoretic algorithm to interpret sets of metabolomic and transcriptomic perturbations in the context of graphical models

AU - Thistlethwaite, Lillian R.

AU - Petrosyan, Varduhi

AU - Li, Xiqi

AU - Miller, Marcus J.

AU - Elsea, Sarah H.

AU - Milosavljevic, Aleksandar

N1 - Funding Information: L.R.T. was supported by a training fellowship from the Gulf Coast Consortia, on the NLM Biomedical Informatics Training Program [Grant No. T15 LM007093]. A.M. was supported by the Henry and Emma Meyer Professorship in Molecular Genetics, grants U54-DA036134, U54-DA049098, and U41-HG009649. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publisher Copyright: © 2021 Thistlethwaite et al.

PY - 2021/1/29

Y1 - 2021/1/29

AB - We consider the following general family of algorithmic problems that arises in transcriptomics, metabolomics and other fields: given a weighted graph G and a subset of its nodes S, find subsets of S that show significant connectedness within G. A specific solution to this problem may be defined by devising a scoring function, the Maximum Clique problem being a classic example, where S includes all nodes in G and where the score is defined by the size of the largest subset of S fully connected within G. Major practical obstacles for the plethora of algorithms addressing this type of problem include computational efficiency and, particularly for more complex scores which take edge weights into account, the computational cost of permutation testing, a statistical procedure required to obtain a bound on the p-value for a connectedness score. To address these problems, we developed CTD, "Connect the Dots", a fast algorithm based on data compression that detects highly connected subsets within S. CTD provides information-theoretic upper bounds on p-values when S contains a small fraction of nodes in G without requiring computationally costly permutation testing. We apply the CTD algorithm to interpret multi-metabolite perturbations due to inborn errors of metabolism and multi-transcript perturbations associated with breast cancer in the context of disease-specific Gaussian Markov Random Field networks learned directly from respective molecular profiling data.

UR - http://www.scopus.com/inward/record.url?scp=85101186510&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85101186510&partnerID=8YFLogxK

U2 - 10.1371/JOURNAL.PCBI.1008550

DO - 10.1371/JOURNAL.PCBI.1008550

M3 - Article

C2 - 33513132

AN - SCOPUS:85101186510

SN - 1553-734X

VL - 17

JO - PLoS Computational Biology

JF - PLoS Computational Biology

IS - 1

M1 - e1008550

ER -
