TY - JOUR
T1 - A comprehensive comparison of supervised and unsupervised methods for cell type identification in single-cell RNA-seq
AU - Sun, Xiaobo
AU - Lin, Xiaochu
AU - Li, Ziyi
AU - Wu, Hao
N1 - Funding Information:
This work was supported by multiple funding sources. X.S. was supported by the National Natural Science Foundation of China (Grant No. 61773401). H.W. and Z.L. were partially supported by the National Institutes of Health (R01GM122083).
Publisher Copyright:
© 2022 The Author(s).
PY - 2022/3/1
Y1 - 2022/3/1
AB - Cell type identification is among the most important tasks in single-cell RNA-sequencing (scRNA-seq) analysis. Many in silico methods have been developed, and they can be roughly categorized as either supervised or unsupervised. In this study, we investigated the performance of 8 supervised and 10 unsupervised cell type identification methods using 14 public scRNA-seq datasets from different tissues, sequencing protocols and species. We investigated the impact of a number of factors, including the total number of cells, number of cell types, sequencing depth, batch effects, reference bias, cell population imbalance, unknown/novel cell types, and computational efficiency and scalability. Instead of merely comparing individual methods, we focused on how these factors affect supervised and unsupervised methods as general categories. We found that in most scenarios the supervised methods outperformed the unsupervised methods, except in the identification of unknown cell types. This advantage is most pronounced when the supervised methods use a reference dataset with high informational sufficiency, low complexity and high similarity to the query dataset. However, the advantage could be undermined by some of the undesirable dataset properties investigated in this study, which lead to uninformative and biased reference datasets. In these scenarios, unsupervised methods could be comparable to supervised methods. Our study not only explained the behavior of cell typing methods under different experimental settings but also provided a general guideline for choosing a method according to the scientific goal and dataset properties. Finally, our evaluation workflow is implemented as a modularized R pipeline that allows future evaluation of new methods. Availability: All source code is available at https://github.com/xsun28/scRNAIdent.
KW - cell type identification
KW - scRNA-seq
KW - supervised learning
KW - unsupervised clustering
UR - http://www.scopus.com/inward/record.url?scp=85127519460&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127519460&partnerID=8YFLogxK
U2 - 10.1093/bib/bbab567
DO - 10.1093/bib/bbab567
M3 - Article
C2 - 35021202
AN - SCOPUS:85127519460
VL - 23
JO - Briefings in Bioinformatics
JF - Briefings in Bioinformatics
SN - 1467-5463
IS - 2
M1 - bbab567
ER -