Computer-aided segmentation on MRI for prostate radiotherapy, part II: Comparing human and computer observer populations and the influence of annotator variability on algorithm variability

Jeremiah W. Sanders; Henry Mok; Alexander N. Hanania; Aradhana M. Venkatesan; Chad Tang; Teresa L. Bruno; Howard D. Thames; Rajat J. Kudchadker; Steven J. Frank

doi:10.1016/j.radonc.2021.12.033

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Background and purpose: Comparing deep learning (DL) algorithms to human interobserver variability, one of the largest sources of noise in human-performed annotations, is necessary to inform the clinical application, use, and quality assurance of DL for prostate radiotherapy.

Materials and methods: One hundred fourteen DL algorithms were developed on 295 prostate MRIs to segment the prostate, external urinary sphincter (EUS), seminal vesicles (SV), rectum, and bladder. Fifty prostate MRIs of 25 patients undergoing MRI-based low-dose-rate prostate brachytherapy were acquired as an independent test set. Groups of DL algorithms were created based on the loss functions used to train them, and the spatial entropy (SE) of their predictions on the 50 test MRIs was computed. Five human observers contoured the 50 test MRIs, and SE maps of their contours were compared with those of the groups of DL algorithms. Additionally, similarity metrics were computed between DL algorithm predictions and consensus annotations of the 5 human observers’ contours of the 50 test MRIs.

Results: A DL algorithm yielded statistically significantly higher similarity metrics for the prostate than did the human observers (H) (prostate Matthews correlation coefficient, DL vs. H: planning, 0.931 vs. 0.903, p < 0.001; postimplant, 0.925 vs. 0.892, p < 0.001); the same was true for the 4 organs at risk. The SE maps revealed that the DL algorithms and human annotators were most variable in similar anatomical regions: the prostate-EUS, prostate-SV, prostate-rectum, and prostate-bladder junctions.

Conclusions: Annotation quality is an important consideration when developing, evaluating, and using DL algorithms clinically.
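The abstract names two quantities, a spatial entropy map over a population of observers and the Matthews correlation coefficient against a consensus annotation, without defining them. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes binary masks, a per-voxel binary Shannon entropy of the fraction of observers that label a voxel as foreground, and a simple majority vote as the consensus rule; the paper may use different definitions. The function names `spatial_entropy`, `majority_vote`, and `matthews_cc`, and the toy data, are hypothetical.

```python
import numpy as np


def spatial_entropy(masks):
    """Per-voxel binary Shannon entropy (in bits) across observers.

    masks: array of shape (n_observers, ...) holding 0/1 labels.
    Assumption: entropy of the foreground fraction at each voxel.
    """
    p = np.asarray(masks, dtype=float).mean(axis=0)  # foreground fraction
    q = 1.0 - p
    # log2(0) is undefined; np.where keeps those terms at 0 (0 * log 0 := 0).
    with np.errstate(divide="ignore", invalid="ignore"):
        term_p = np.where(p > 0, p * np.log2(p), 0.0)
        term_q = np.where(q > 0, q * np.log2(q), 0.0)
    return -(term_p + term_q)


def majority_vote(masks):
    """Consensus mask: foreground where more than half the observers agree.

    Assumption: majority voting stands in for the paper's consensus method.
    """
    masks = np.asarray(masks)
    return masks.sum(axis=0) * 2 > masks.shape[0]


def matthews_cc(pred, ref):
    """Matthews correlation coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool).ravel()
    ref = np.asarray(ref, dtype=bool).ravel()
    tp = float(np.sum(pred & ref))
    tn = float(np.sum(~pred & ~ref))
    fp = float(np.sum(pred & ~ref))
    fn = float(np.sum(~pred & ref))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


# Toy data: 5 observers labeling 4 voxels (purely illustrative).
masks = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
se_map = spatial_entropy(masks)          # high where observers disagree
consensus = majority_vote(masks)         # reference for similarity metrics
algorithm_mask = np.array([1, 1, 1, 0])  # hypothetical DL prediction
score = matthews_cc(algorithm_mask, consensus)
print(se_map, consensus, score)
```

Under these assumptions, voxels on which all observers agree have zero entropy, and the map peaks where the population splits evenly, which is how disagreement at organ junctions would show up.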

Original language	English (US)
Pages (from-to)	132-139
Number of pages	8
Journal	Radiotherapy and Oncology
Volume	169
DOIs	https://doi.org/10.1016/j.radonc.2021.12.033
State	Published - Apr 2022

Keywords

Annotation quality
Brachytherapy
Deep learning
MRI
Prostate
Radiation therapy
Segmentation

ASJC Scopus subject areas

Hematology
Oncology
Radiology, Nuclear Medicine and Imaging

Access to Document

10.1016/j.radonc.2021.12.033

Cite this

Sanders, J. W., Mok, H., Hanania, A. N., Venkatesan, A. M., Tang, C., Bruno, T. L., Thames, H. D., Kudchadker, R. J., & Frank, S. J. (2022). Computer-aided segmentation on MRI for prostate radiotherapy, part II: Comparing human and computer observer populations and the influence of annotator variability on algorithm variability. Radiotherapy and Oncology, 169, 132-139. https://doi.org/10.1016/j.radonc.2021.12.033

Computer-aided segmentation on MRI for prostate radiotherapy, part II: Comparing human and computer observer populations and the influence of annotator variability on algorithm variability. / Sanders, Jeremiah W.; Mok, Henry; Hanania, Alexander N. et al.
In: Radiotherapy and Oncology, Vol. 169, 04.2022, p. 132-139.

Sanders, JW, Mok, H, Hanania, AN, Venkatesan, AM, Tang, C, Bruno, TL, Thames, HD, Kudchadker, RJ & Frank, SJ 2022, 'Computer-aided segmentation on MRI for prostate radiotherapy, part II: Comparing human and computer observer populations and the influence of annotator variability on algorithm variability', Radiotherapy and Oncology, vol. 169, pp. 132-139. https://doi.org/10.1016/j.radonc.2021.12.033

@article{1c2323eb6b6f4b0391feb6df5eeedb4e,

title = "Computer-aided segmentation on MRI for prostate radiotherapy, part II: Comparing human and computer observer populations and the influence of annotator variability on algorithm variability",

abstract = "Background and purpose: Comparing deep learning (DL) algorithms to human interobserver variability, one of the largest sources of noise in human-performed annotations, is necessary to inform the clinical application, use, and quality assurance of DL for prostate radiotherapy. Materials and methods: One hundred fourteen DL algorithms were developed on 295 prostate MRIs to segment the prostate, external urinary sphincter (EUS), seminal vesicles (SV), rectum, and bladder. Fifty prostate MRIs of 25 patients undergoing MRI-based low-dose-rate prostate brachytherapy were acquired as an independent test set. Groups of DL algorithms were created based on the loss functions used to train them, and the spatial entropy (SE) of their predictions on the 50 test MRIs was computed. Five human observers contoured the 50 test MRIs, and SE maps of their contours were compared with those of the groups of DL algorithms. Additionally, similarity metrics were computed between DL algorithm predictions and consensus annotations of the 5 human observers{\textquoteright} contours of the 50 test MRIs. Results: A DL algorithm yielded statistically significantly higher similarity metrics for the prostate than did the human observers (H) (prostate Matthews correlation coefficient, DL vs. H: planning, 0.931 vs. 0.903, p < 0.001; postimplant, 0.925 vs. 0.892, p < 0.001); the same was true for the 4 organs at risk. The SE maps revealed that the DL algorithms and human annotators were most variable in similar anatomical regions: the prostate-EUS, prostate-SV, prostate-rectum, and prostate-bladder junctions. Conclusions: Annotation quality is an important consideration when developing, evaluating, and using DL algorithms clinically.",

keywords = "Annotation quality, Brachytherapy, Deep learning, MRI, Prostate, Radiation therapy, Segmentation",

author = "Sanders, {Jeremiah W.} and Henry Mok and Hanania, {Alexander N.} and Venkatesan, {Aradhana M.} and Chad Tang and Bruno, {Teresa L.} and Thames, {Howard D.} and Kudchadker, {Rajat J.} and Frank, {Steven J.}",

note = "Publisher Copyright: {\textcopyright} 2021 Elsevier B.V.",

year = "2022",

month = apr,

doi = "10.1016/j.radonc.2021.12.033",

language = "English (US)",

volume = "169",

pages = "132--139",

journal = "Radiotherapy and Oncology",

issn = "0167-8140",

publisher = "Elsevier Ireland Ltd",

}

TY - JOUR

T1 - Computer-aided segmentation on MRI for prostate radiotherapy, part II

T2 - Comparing human and computer observer populations and the influence of annotator variability on algorithm variability

AU - Sanders, Jeremiah W.

AU - Mok, Henry

AU - Hanania, Alexander N.

AU - Venkatesan, Aradhana M.

AU - Tang, Chad

AU - Bruno, Teresa L.

AU - Thames, Howard D.

AU - Kudchadker, Rajat J.

AU - Frank, Steven J.

PY - 2022/4

Y1 - 2022/4

N2 - Background and purpose: Comparing deep learning (DL) algorithms to human interobserver variability, one of the largest sources of noise in human-performed annotations, is necessary to inform the clinical application, use, and quality assurance of DL for prostate radiotherapy. Materials and methods: One hundred fourteen DL algorithms were developed on 295 prostate MRIs to segment the prostate, external urinary sphincter (EUS), seminal vesicles (SV), rectum, and bladder. Fifty prostate MRIs of 25 patients undergoing MRI-based low-dose-rate prostate brachytherapy were acquired as an independent test set. Groups of DL algorithms were created based on the loss functions used to train them, and the spatial entropy (SE) of their predictions on the 50 test MRIs was computed. Five human observers contoured the 50 test MRIs, and SE maps of their contours were compared with those of the groups of DL algorithms. Additionally, similarity metrics were computed between DL algorithm predictions and consensus annotations of the 5 human observers’ contours of the 50 test MRIs. Results: A DL algorithm yielded statistically significantly higher similarity metrics for the prostate than did the human observers (H) (prostate Matthews correlation coefficient, DL vs. H: planning, 0.931 vs. 0.903, p < 0.001; postimplant, 0.925 vs. 0.892, p < 0.001); the same was true for the 4 organs at risk. The SE maps revealed that the DL algorithms and human annotators were most variable in similar anatomical regions: the prostate-EUS, prostate-SV, prostate-rectum, and prostate-bladder junctions. Conclusions: Annotation quality is an important consideration when developing, evaluating, and using DL algorithms clinically.

KW - Annotation quality

KW - Brachytherapy

KW - Deep learning

KW - MRI

KW - Prostate

KW - Radiation therapy

KW - Segmentation

UR - http://www.scopus.com/inward/record.url?scp=85122958670&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85122958670&partnerID=8YFLogxK

U2 - 10.1016/j.radonc.2021.12.033

DO - 10.1016/j.radonc.2021.12.033

M3 - Article

C2 - 34979213

AN - SCOPUS:85122958670

SN - 0167-8140

VL - 169

SP - 132

EP - 139

JO - Radiotherapy and Oncology

JF - Radiotherapy and Oncology

ER -
