TY - GEN
T1 - Is this good enough? On expert perception of brain tumor segmentation quality
AU - Hoebel, Katharina
AU - Bridge, Christopher P.
AU - Ahmed, Sara
AU - Akintola, Oluwatosin
AU - Chung, Caroline
AU - Huang, Raymond
AU - Johnson, Jason
AU - Kim, Albert
AU - Ly, K. Ina
AU - Chang, Ken
AU - Patel, Jay
AU - Pinho, Marco
AU - Batchelor, Tracy T.
AU - Rosen, Bruce
AU - Gerstner, Elizabeth
AU - Kalpathy-Cramer, Jayashree
N1 - Publisher Copyright:
© 2022 SPIE. All rights reserved.
PY - 2022
Y1 - 2022
N2 - The performance of Deep Learning (DL) segmentation algorithms is routinely determined using quantitative metrics like the Dice score and Hausdorff distance. However, these metrics show a low concordance with human perception of segmentation quality. The successful collaboration of health care professionals with DL segmentation algorithms will require a detailed understanding of experts' assessment of segmentation quality. Here, we present the results of a study on expert quality perception of brain tumor segmentations of brain MR images generated by a DL segmentation algorithm. Eight expert medical professionals were asked to grade the quality of segmentations on a scale from 1 (worst) to 4 (best). To this end, we collected four ratings for a dataset of 60 cases. We observed a low inter-rater agreement among all raters (Krippendorff's alpha: 0.34), which is potentially a result of different internal cutoffs for the quality ratings. Several factors, including the volume of the segmentation and model uncertainty, were associated with high disagreement between raters. Furthermore, the correlations between the ratings and commonly used quantitative segmentation quality metrics ranged from no to moderate correlation. We conclude that, similar to the inter-rater variability observed for manual brain tumor segmentation, segmentation quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences. Clearer guidelines for quality evaluation could help to mitigate these differences. Importantly, existing technical metrics do not capture clinical perception of segmentation quality. A better understanding of expert quality perception is expected to support the design of more human-centered DL algorithms for integration into the clinical workflow.
AB - The performance of Deep Learning (DL) segmentation algorithms is routinely determined using quantitative metrics like the Dice score and Hausdorff distance. However, these metrics show a low concordance with human perception of segmentation quality. The successful collaboration of health care professionals with DL segmentation algorithms will require a detailed understanding of experts' assessment of segmentation quality. Here, we present the results of a study on expert quality perception of brain tumor segmentations of brain MR images generated by a DL segmentation algorithm. Eight expert medical professionals were asked to grade the quality of segmentations on a scale from 1 (worst) to 4 (best). To this end, we collected four ratings for a dataset of 60 cases. We observed a low inter-rater agreement among all raters (Krippendorff's alpha: 0.34), which is potentially a result of different internal cutoffs for the quality ratings. Several factors, including the volume of the segmentation and model uncertainty, were associated with high disagreement between raters. Furthermore, the correlations between the ratings and commonly used quantitative segmentation quality metrics ranged from no to moderate correlation. We conclude that, similar to the inter-rater variability observed for manual brain tumor segmentation, segmentation quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences. Clearer guidelines for quality evaluation could help to mitigate these differences. Importantly, existing technical metrics do not capture clinical perception of segmentation quality. A better understanding of expert quality perception is expected to support the design of more human-centered DL algorithms for integration into the clinical workflow.
KW - deep learning
KW - inter-rater variability
KW - quality assessment
KW - segmentation
UR - http://www.scopus.com/inward/record.url?scp=85131881240&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131881240&partnerID=8YFLogxK
U2 - 10.1117/12.2611810
DO - 10.1117/12.2611810
M3 - Conference contribution
AN - SCOPUS:85131881240
T3 - Progress in Biomedical Optics and Imaging - Proceedings of SPIE
BT - Medical Imaging 2022
A2 - Mello-Thoms, Claudia R.
A2 - Taylor-Phillips, Sian
PB - SPIE
T2 - Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment
Y2 - 21 March 2022 through 27 March 2022
ER -