Generative pretrained transformer 4: an innovative approach to facilitate value-based healthcare

Han Lyu; Zhixiang Wang; Jia Li; Jing Sun; Xinghao Wang; Pengling Ren; Linkun Cai; Zhenchang Wang; Max Wintermark

doi:10.1016/j.imed.2023.09.001

Neuroradiology

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Appropriate medical imaging is important for value-based care. We aim to evaluate the performance of generative pretrained transformer 4 (GPT-4), an innovative natural language processing model, in recommending appropriate medical imaging automatically in different clinical scenarios. Methods: Institutional Review Board (IRB) approval was not required because nonidentifiable data were used. As prompts, we used 112 questions from the American College of Radiology (ACR) Radiology-TEACHES Program, an open-source question-and-answer program to guide appropriate medical imaging. We included 69 free-text case vignettes and 43 simplified cases. For the performance evaluation of GPT-4 and GPT-3.5, we considered the recommendations of ACR guidelines as the gold standard, and three radiologists then analyzed the consistency of the responses from the GPT models with those of the ACR. We set a five-score criterion for the evaluation of the consistency. A paired t-test was applied to assess the statistical significance of the findings. Results: In free-text case vignettes, the accuracy of GPT-4 was 92.9%, whereas the accuracy of GPT-3.5 was 78.3%. GPT-4 provided more appropriate suggestions to reduce the overutilization of medical imaging than GPT-3.5 (t = 3.429, P = 0.001). In simplified scenarios, the accuracy of GPT-4 and GPT-3.5 was 66.5% and 60.0%, respectively. The difference was not statistically significant (t = 1.858, P = 0.070). GPT-4 was characterized by longer reaction times (27.1 s on average) and more extensive responses (137.1 words on average) than GPT-3.5. Conclusion: As an advanced tool for improving value-based healthcare in clinics, GPT-4 may guide appropriate medical imaging accurately and efficiently.
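The reported P values can be roughly cross-checked against the reported t statistics. The sketch below is not the authors' analysis code. It assumes the paired t-test was run across cases, so that the degrees of freedom are 68 for the 69 free-text vignettes and 42 for the 43 simplified cases; the abstract does not state this. The function names `t_pdf` and `two_sided_p` are hypothetical, and the tail probability is obtained by numerically integrating the Student t density with the standard library only.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t: float, df: int, steps: int = 20000) -> float:
    """Two-sided P value: 1 - 2 * integral of the density from 0 to |t|.

    Uses composite Simpson's rule; steps must be even.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    return 1.0 - 2.0 * (total * h / 3.0)


# Assumed degrees of freedom: number of cases minus one (not stated in the abstract).
p_free_text = two_sided_p(3.429, df=68)   # free-text case vignettes
p_simplified = two_sided_p(1.858, df=42)  # simplified cases

print(f"free-text:  P = {p_free_text:.4f}")
print(f"simplified: P = {p_simplified:.4f}")
```

Under these assumed degrees of freedom, both computed values should land close to the reported P = 0.001 and P = 0.070; a noticeable mismatch would suggest the test was paired on a different unit than the case.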

Original language	English (US)
Pages (from-to)	10-15
Number of pages	6
Journal	Intelligent Medicine
Volume	4
Issue number	1
DOIs	https://doi.org/10.1016/j.imed.2023.09.001
State	Published - Feb 2024

Keywords

Appropriateness
Generative pretrained transformer 4 model
Medical imaging
Natural language processing

ASJC Scopus subject areas

Medicine (miscellaneous)
Biomedical Engineering
Health Informatics
Artificial Intelligence

Cite this

@article{5cdac70f40384367b5db82261a3c5556,

title = "Generative pretrained transformer 4: an innovative approach to facilitate value-based healthcare",

abstract = "Objective: Appropriate medical imaging is important for value-based care. We aim to evaluate the performance of generative pretrained transformer 4 (GPT-4), an innovative natural language processing model, providing appropriate medical imaging automatically in different clinical scenarios. Methods: Institutional Review Boards (IRB) approval was not required due to the use of nonidentifiable data. Instead, we used 112 questions from the American College of Radiology (ACR) Radiology-TEACHES Program as prompts, which is an open-sourced question and answer program to guide appropriate medical imaging. We included 69 free-text case vignettes and 43 simplified cases. For the performance evaluation of GPT-4 and GPT-3.5, we considered the recommendations of ACR guidelines as the gold standard, and then three radiologists analyzed the consistency of the responses from the GPT models with those of the ACR. We set a five-score criterion for the evaluation of the consistency. A paired t-test was applied to assess the statistical significance of the findings. Results: For the performance of the GPT models in free-text case vignettes, the accuracy of GPT-4 was 92.9%, whereas the accuracy of GPT-3.5 was just 78.3%. GPT-4 can provide more appropriate suggestions to reduce the overutilization of medical imaging than GPT-3.5 (t = 3.429, P = 0.001). For the performance of the GPT models in simplified scenarios, the accuracy of GPT-4 and GPT-3.5 was 66.5% and 60.0%, respectively. The differences were not statistically significant (t = 1.858, P = 0.070). GPT-4 was characterized by longer reaction times (27.1 s in average) and extensive responses (137.1 words on average) than GPT-3.5. Conclusion: As an advanced tool for improving value-based healthcare in clinics, GPT-4 may guide appropriate medical imaging accurately and efficiently.",

keywords = "Appropriateness, Generative pretrained transformer 4 model, Medical imaging, Natural language processing",

author = "Han Lyu and Zhixiang Wang and Jia Li and Jing Sun and Xinghao Wang and Pengling Ren and Linkun Cai and Zhenchang Wang and Max Wintermark",

note = "Publisher Copyright: {\textcopyright} 2023",

year = "2024",

month = feb,

doi = "10.1016/j.imed.2023.09.001",

language = "English (US)",

volume = "4",

pages = "10--15",

journal = "Intelligent Medicine",

issn = "2096-9376",

publisher = "Elsevier BV",

number = "1",

}

TY - JOUR

T1 - Generative pretrained transformer 4

T2 - an innovative approach to facilitate value-based healthcare

AU - Lyu, Han

AU - Wang, Zhixiang

AU - Li, Jia

AU - Sun, Jing

AU - Wang, Xinghao

AU - Ren, Pengling

AU - Cai, Linkun

AU - Wang, Zhenchang

AU - Wintermark, Max

PY - 2024/2

Y1 - 2024/2

N2 - Objective: Appropriate medical imaging is important for value-based care. We aim to evaluate the performance of generative pretrained transformer 4 (GPT-4), an innovative natural language processing model, providing appropriate medical imaging automatically in different clinical scenarios. Methods: Institutional Review Boards (IRB) approval was not required due to the use of nonidentifiable data. Instead, we used 112 questions from the American College of Radiology (ACR) Radiology-TEACHES Program as prompts, which is an open-sourced question and answer program to guide appropriate medical imaging. We included 69 free-text case vignettes and 43 simplified cases. For the performance evaluation of GPT-4 and GPT-3.5, we considered the recommendations of ACR guidelines as the gold standard, and then three radiologists analyzed the consistency of the responses from the GPT models with those of the ACR. We set a five-score criterion for the evaluation of the consistency. A paired t-test was applied to assess the statistical significance of the findings. Results: For the performance of the GPT models in free-text case vignettes, the accuracy of GPT-4 was 92.9%, whereas the accuracy of GPT-3.5 was just 78.3%. GPT-4 can provide more appropriate suggestions to reduce the overutilization of medical imaging than GPT-3.5 (t = 3.429, P = 0.001). For the performance of the GPT models in simplified scenarios, the accuracy of GPT-4 and GPT-3.5 was 66.5% and 60.0%, respectively. The differences were not statistically significant (t = 1.858, P = 0.070). GPT-4 was characterized by longer reaction times (27.1 s in average) and extensive responses (137.1 words on average) than GPT-3.5. Conclusion: As an advanced tool for improving value-based healthcare in clinics, GPT-4 may guide appropriate medical imaging accurately and efficiently.

AB - Objective: Appropriate medical imaging is important for value-based care. We aim to evaluate the performance of generative pretrained transformer 4 (GPT-4), an innovative natural language processing model, providing appropriate medical imaging automatically in different clinical scenarios. Methods: Institutional Review Boards (IRB) approval was not required due to the use of nonidentifiable data. Instead, we used 112 questions from the American College of Radiology (ACR) Radiology-TEACHES Program as prompts, which is an open-sourced question and answer program to guide appropriate medical imaging. We included 69 free-text case vignettes and 43 simplified cases. For the performance evaluation of GPT-4 and GPT-3.5, we considered the recommendations of ACR guidelines as the gold standard, and then three radiologists analyzed the consistency of the responses from the GPT models with those of the ACR. We set a five-score criterion for the evaluation of the consistency. A paired t-test was applied to assess the statistical significance of the findings. Results: For the performance of the GPT models in free-text case vignettes, the accuracy of GPT-4 was 92.9%, whereas the accuracy of GPT-3.5 was just 78.3%. GPT-4 can provide more appropriate suggestions to reduce the overutilization of medical imaging than GPT-3.5 (t = 3.429, P = 0.001). For the performance of the GPT models in simplified scenarios, the accuracy of GPT-4 and GPT-3.5 was 66.5% and 60.0%, respectively. The differences were not statistically significant (t = 1.858, P = 0.070). GPT-4 was characterized by longer reaction times (27.1 s in average) and extensive responses (137.1 words on average) than GPT-3.5. Conclusion: As an advanced tool for improving value-based healthcare in clinics, GPT-4 may guide appropriate medical imaging accurately and efficiently.

KW - Appropriateness

KW - Generative pretrained transformer 4 model

KW - Medical imaging

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85188002623&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85188002623&partnerID=8YFLogxK

U2 - 10.1016/j.imed.2023.09.001

DO - 10.1016/j.imed.2023.09.001

M3 - Article

AN - SCOPUS:85188002623

SN - 2096-9376

VL - 4

SP - 10

EP - 15

JO - Intelligent Medicine

JF - Intelligent Medicine

IS - 1

ER -
