TY - JOUR
T1 - A content-based literature recommendation system for datasets to improve data reusability – A case study on Gene Expression Omnibus (GEO) datasets
AU - Patra, Braja Gopal
AU - Maroufy, Vahed
AU - Soltanalizadeh, Babak
AU - Deng, Nan
AU - Zheng, W. Jim
AU - Roberts, Kirk
AU - Wu, Hulin
N1 - Publisher Copyright: © 2020 Elsevier Inc.
PY - 2020/4
Y1 - 2020/4
N2 - Objective: The centrality of data to biomedical research is difficult to overstate, and the same is true of the biomedical literature's role in disseminating empirical findings on scientific questions investigated with such data. However, the connections between the literature and related datasets are often weak, which hampers scientists' ability to move easily between existing datasets and existing findings to derive new scientific hypotheses. This work aims to recommend relevant literature articles for datasets, with the ultimate goal of increasing the productivity of researchers. Our approach to literature recommendation for datasets is part of the dataset reusability platform developed at the University of Texas Health Science Center at Houston for datasets related to gene expression. This platform incorporates datasets from the Gene Expression Omnibus (GEO). An average of 34 datasets were added to GEO daily over the five years from 2014 to 2018, demonstrating the need for automatic methods to connect these datasets with relevant literature. The relevant literature for a given dataset may describe that dataset, provide a scientific finding based on that dataset, or even describe prior and related work on the dataset's topic that is of interest to users of the dataset. Materials and methods: We adopt an information retrieval paradigm for literature recommendation. In our experiments, distributional semantic features are created from the titles and abstracts of MEDLINE articles, and related articles are then identified for datasets in GEO. We evaluate multiple distributional methods, including TF-IDF, BM25, Latent Semantic Analysis, Latent Dirichlet Allocation, word2vec, and doc2vec. The most similar papers are recommended for each dataset using the cosine similarity between the dataset's vector representation and every paper's vector representation. We also propose several novel re-ranking and normalization methods over embeddings to improve the recommendations.
Results: The top-performing literature recommendation technique, based on BM25, achieved a strict precision at 10 of 0.8333 and a partial precision at 10 of 0.9000 in a manual evaluation of 36 datasets. Evaluation on a larger, automatically collected benchmark shows small but consistent gains from emphasizing the similarity of dataset and article titles. Conclusion: This work is a first step toward developing a literature recommendation tool by recommending relevant literature for datasets. We hope this will lead to a better data reuse experience.
KW - Cosine similarity
KW - Gene Expression Omnibus (GEO)
KW - Literature recommendation
KW - Re-ranking
KW - Vector space model
UR - http://www.scopus.com/inward/record.url?scp=85081692266&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85081692266&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2020.103399
DO - 10.1016/j.jbi.2020.103399
M3 - Article
C2 - 32151769
AN - SCOPUS:85081692266
SN - 1532-0464
VL - 104
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
M1 - 103399
ER -