TY - JOUR
T1 - Expert-centered Evaluation of Deep Learning Algorithms for Brain Tumor Segmentation
AU - Hoebel, Katharina V.
AU - Bridge, Christopher P.
AU - Ahmed, Sara
AU - Akintola, Oluwatosin
AU - Chung, Caroline
AU - Huang, Raymond Y.
AU - Johnson, Jason M.
AU - Kim, Albert
AU - Ly, K. Ina
AU - Chang, Ken
AU - Patel, Jay
AU - Pinho, Marco
AU - Batchelor, Tracy T.
AU - Rosen, Bruce R.
AU - Gerstner, Elizabeth R.
AU - Kalpathy-Cramer, Jayashree
N1 - Publisher Copyright:
© RSNA, 2023.
PY - 2024
Y1 - 2024
N2 - Purpose: To present results from a literature survey on practices in deep learning segmentation algorithm evaluation and perform a study on expert quality perception of brain tumor segmentation. Materials and Methods: A total of 180 articles reporting on brain tumor segmentation algorithms were surveyed for the reported quality evaluation. Additionally, ratings of segmentation quality on a four-point scale were collected from medical professionals for 60 brain tumor segmentation cases. Results: Of the surveyed articles, Dice score, sensitivity, and Hausdorff distance were the most popular metrics to report segmentation performance. Notably, only 2.8% of the articles included clinical experts’ evaluation of segmentation quality. The experimental results revealed a low interrater agreement (Krippendorff α, 0.34) in experts’ segmentation quality perception. Furthermore, the correlations between the ratings and commonly used quantitative quality metrics were low (Kendall tau between Dice score and mean rating, 0.23; Kendall tau between Hausdorff distance and mean rating, 0.51), with large variability among the experts. Conclusion: The results demonstrate that quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences, and existing metrics do not capture the clinical perception of segmentation quality.
AB - Purpose: To present results from a literature survey on practices in deep learning segmentation algorithm evaluation and perform a study on expert quality perception of brain tumor segmentation. Materials and Methods: A total of 180 articles reporting on brain tumor segmentation algorithms were surveyed for the reported quality evaluation. Additionally, ratings of segmentation quality on a four-point scale were collected from medical professionals for 60 brain tumor segmentation cases. Results: Of the surveyed articles, Dice score, sensitivity, and Hausdorff distance were the most popular metrics to report segmentation performance. Notably, only 2.8% of the articles included clinical experts’ evaluation of segmentation quality. The experimental results revealed a low interrater agreement (Krippendorff α, 0.34) in experts’ segmentation quality perception. Furthermore, the correlations between the ratings and commonly used quantitative quality metrics were low (Kendall tau between Dice score and mean rating, 0.23; Kendall tau between Hausdorff distance and mean rating, 0.51), with large variability among the experts. Conclusion: The results demonstrate that quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences, and existing metrics do not capture the clinical perception of segmentation quality.
KW - Brain Tumor Segmentation
KW - Cancer
KW - Deep Learning Algorithms
KW - Glioblastoma
KW - Machine Learning
UR - http://www.scopus.com/inward/record.url?scp=85184467236&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184467236&partnerID=8YFLogxK
U2 - 10.1148/ryai.220231
DO - 10.1148/ryai.220231
M3 - Article
C2 - 38197800
AN - SCOPUS:85184467236
SN - 2638-6100
VL - 6
JO - Radiology: Artificial Intelligence
JF - Radiology: Artificial Intelligence
IS - 1
M1 - e220231
ER -