TY - JOUR
T1 - Expert-centered Evaluation of Deep Learning Algorithms for Brain Tumor Segmentation
AU - Hoebel, Katharina V.
AU - Bridge, Christopher P.
AU - Ahmed, Sara
AU - Akintola, Oluwatosin
AU - Chung, Caroline
AU - Huang, Raymond Y.
AU - Johnson, Jason M.
AU - Kim, Albert
AU - Ly, K. Ina
AU - Chang, Ken
AU - Patel, Jay
AU - Pinho, Marco
AU - Batchelor, Tracy T.
AU - Rosen, Bruce R.
AU - Gerstner, Elizabeth R.
AU - Kalpathy-Cramer, Jayashree
N1 - Publisher Copyright:
© RSNA, 2023.
PY - 2024
Y1 - 2024
N2 - Purpose: To present results from a literature survey on practices in deep learning segmentation algorithm evaluation and perform a study on expert quality perception of brain tumor segmentation. Materials and Methods: A total of 180 articles reporting on brain tumor segmentation algorithms were surveyed for the reported quality evaluation. Additionally, ratings of segmentation quality on a four-point scale were collected from medical professionals for 60 brain tumor segmentation cases. Results: Of the surveyed articles, Dice score, sensitivity, and Hausdorff distance were the most popular metrics to report segmentation performance. Notably, only 2.8% of the articles included clinical experts’ evaluation of segmentation quality. The experimental results revealed a low interrater agreement (Krippendorff α, 0.34) in experts’ segmentation quality perception. Furthermore, the correlations between the ratings and commonly used quantitative quality metrics were low (Kendall tau between Dice score and mean rating, 0.23; Kendall tau between Hausdorff distance and mean rating, 0.51), with large variability among the experts. Conclusion: The results demonstrate that quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences, and existing metrics do not capture the clinical perception of segmentation quality.
AB - Purpose: To present results from a literature survey on practices in deep learning segmentation algorithm evaluation and perform a study on expert quality perception of brain tumor segmentation. Materials and Methods: A total of 180 articles reporting on brain tumor segmentation algorithms were surveyed for the reported quality evaluation. Additionally, ratings of segmentation quality on a four-point scale were collected from medical professionals for 60 brain tumor segmentation cases. Results: Of the surveyed articles, Dice score, sensitivity, and Hausdorff distance were the most popular metrics to report segmentation performance. Notably, only 2.8% of the articles included clinical experts’ evaluation of segmentation quality. The experimental results revealed a low interrater agreement (Krippendorff α, 0.34) in experts’ segmentation quality perception. Furthermore, the correlations between the ratings and commonly used quantitative quality metrics were low (Kendall tau between Dice score and mean rating, 0.23; Kendall tau between Hausdorff distance and mean rating, 0.51), with large variability among the experts. Conclusion: The results demonstrate that quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences, and existing metrics do not capture the clinical perception of segmentation quality.
KW - Brain Tumor Segmentation
KW - Cancer
KW - Deep Learning Algorithms
KW - Glioblastoma
KW - Machine Learning
UR - http://www.scopus.com/inward/record.url?scp=85184467236&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184467236&partnerID=8YFLogxK
U2 - 10.1148/ryai.220231
DO - 10.1148/ryai.220231
M3 - Article
C2 - 38197800
AN - SCOPUS:85184467236
SN - 2638-6100
VL - 6
JO - Radiology: Artificial Intelligence
JF - Radiology: Artificial Intelligence
IS - 1
M1 - e220231
ER -