Comparing artificial intelligence algorithms to 157 German dermatologists: the melanoma classification benchmark

Titus J. Brinker, Achim Hekler, Axel Hauschild, Carola Berking, Bastian Schilling, Alexander H. Enk, Sebastian Haferkamp, Ante Karoglan, Christof von Kalle, Michael Weichenthal, Elke Sattler, Dirk Schadendorf, Maria R. Gaiser, Joachim Klode, Jochen S. Utikal

Research output: Contribution to journal › Article

12 Citations (Scopus)

Abstract

Background: Several recent publications have demonstrated the use of convolutional neural networks to classify images of melanoma on par with board-certified dermatologists. However, the absence of a public human benchmark restricts the comparability of these algorithms' performance and thereby technical progress in the field. Methods: An electronic questionnaire was sent to dermatologists at 12 German university hospitals. Each questionnaire comprised 100 dermoscopic and 100 clinical images (80 nevus images and 20 biopsy-verified melanoma images each), all open source. The questionnaire recorded factors such as years of experience in dermatology, skin checks performed, age, sex and rank within the university hospital or status as a resident physician. For each image, the dermatologists were asked to provide a management decision (treat/biopsy the lesion or reassure the patient). The main outcome measures were sensitivity, specificity and the receiver operating characteristics (ROC). Results: In total, 157 dermatologists assessed all 100 dermoscopic images with an overall sensitivity of 74.1%, a specificity of 60.0% and an ROC of 0.67 (range = 0.538–0.769); 145 dermatologists assessed all 100 clinical images with an overall sensitivity of 89.4%, a specificity of 64.4% and an ROC of 0.769 (range = 0.613–0.9). Results between the test sets were significantly different (P < 0.05), confirming the need for a standardised benchmark. Conclusions: We present the first public melanoma classification benchmark of both non-dermoscopic and dermoscopic images for comparing artificial intelligence algorithms with the diagnostic performance of 145 or 157 dermatologists. The Melanoma Classification Benchmark should be considered a reference standard for white-skinned Western populations in the field of binary algorithmic melanoma classification.
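The paper does not publish evaluation code, but the outcome measures it reports can be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors' implementation: it scores one reader's binary management decisions against the benchmark's ground truth (20 melanomas, 80 nevi per test set). For a single binary operating point, the ROC curve has one interior point, and its area reduces to (sensitivity + specificity) / 2, which is how a scalar "ROC" value can be attached to yes/no decisions. The toy inputs at the bottom are invented to roughly match the reported dermoscopic averages.

```python
# Hypothetical scoring sketch for one reader's decisions on the benchmark.
# Labels: 1 = biopsy-verified melanoma, 0 = nevus.
# Decisions: 1 = treat/biopsy the lesion, 0 = reassure the patient.

def score_reader(truth, decisions):
    """Return (sensitivity, specificity, auc) for binary decisions.

    With a single binary operating point, the ROC curve consists of one
    point, and the area under it reduces to (sensitivity + specificity) / 2.
    """
    tp = sum(1 for t, d in zip(truth, decisions) if t == 1 and d == 1)
    fn = sum(1 for t, d in zip(truth, decisions) if t == 1 and d == 0)
    tn = sum(1 for t, d in zip(truth, decisions) if t == 0 and d == 0)
    fp = sum(1 for t, d in zip(truth, decisions) if t == 0 and d == 1)
    sensitivity = tp / (tp + fn)   # fraction of melanomas sent to biopsy
    specificity = tn / (tn + fp)   # fraction of nevi correctly reassured
    auc = (sensitivity + specificity) / 2.0
    return sensitivity, specificity, auc

# Toy reader: 15 of 20 melanomas biopsied, 48 of 80 nevi reassured —
# close to the dermoscopic test-set averages reported in the abstract.
truth = [1] * 20 + [0] * 80
decisions = [1] * 15 + [0] * 5 + [0] * 48 + [1] * 32
sens, spec, auc = score_reader(truth, decisions)
# sens = 0.75, spec = 0.60, auc = 0.675
```

Averaging such per-reader scores over all 157 (dermoscopic) or 145 (clinical) respondents would yield the overall figures quoted in the Results.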

Original language: English (US)
Pages (from-to): 30-37
Number of pages: 8
Journal: European Journal of Cancer
Volume: 111
DOI: 10.1016/j.ejca.2018.12.016
ISSN: 0959-8049
State: Published - Apr 1 2019

Keywords

  • Artificial intelligence
  • Benchmark
  • Deep learning
  • Melanoma

ASJC Scopus subject areas

  • Oncology
  • Cancer Research

Cite this

Brinker, T. J., Hekler, A., Hauschild, A., Berking, C., Schilling, B., Enk, A. H., ... Utikal, J. S. (2019). Comparing artificial intelligence algorithms to 157 German dermatologists: the melanoma classification benchmark. European Journal of Cancer, 111, 30-37. https://doi.org/10.1016/j.ejca.2018.12.016

