Generalizability of a Machine Learning Model for Improving Utilization of Parathyroid Hormone-Related Peptide Testing across Multiple Clinical Centers

He S. Yang; Weishen Pan; Yingheng Wang; Mark A. Zaydman; Nicholas C. Spies; Zhen Zhao; Theresa A. Guise; Qing H. Meng; Fei Wang

doi:10.1093/clinchem/hvad141

Laboratory Medicine

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

BACKGROUND: Measuring parathyroid hormone-related peptide (PTHrP) helps diagnose the humoral hypercalcemia of malignancy, but is often ordered for patients with low pretest probability, resulting in poor test utilization. Manual review of results to identify inappropriate PTHrP orders is a cumbersome process.

METHODS: Using a dataset of 1330 patients from a single institute, we developed a machine learning (ML) model to predict abnormal PTHrP results. We then evaluated the performance of the model on two external datasets. Different strategies (model transporting, retraining, rebuilding, and fine-tuning) were investigated to improve model generalizability. Maximum mean discrepancy (MMD) was adopted to quantify the shift of data distributions across different datasets.

RESULTS: The model achieved an area under the receiver operating characteristic curve (AUROC) of 0.936, and a specificity of 0.842 at 0.900 sensitivity in the development cohort. Directly transporting this model to two external datasets resulted in a deterioration of AUROC to 0.838 and 0.737, with the latter having a larger MMD corresponding to a greater data shift compared to the original dataset. Model rebuilding using site-specific data improved AUROC to 0.891 and 0.837 on the two sites, respectively. When external data is insufficient for retraining, a fine-tuning strategy also improved model utility.

CONCLUSIONS: ML offers promise to improve PTHrP test utilization while relieving the burden of manual review. Transporting a ready-made model to external datasets may lead to performance deterioration due to data distribution shift. Model retraining or rebuilding could improve generalizability when there are enough data, and model fine-tuning may be favorable when site-specific data is limited.
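The abstract names maximum mean discrepancy (MMD) as the measure of data shift but does not say which kernel or estimator was used. The following is a minimal sketch of one common choice, the biased squared-MMD estimate with a Gaussian (RBF) kernel. The kernel, the bandwidth `gamma`, the feature count, and the synthetic arrays `development`, `site_a`, and `site_b` are all assumptions made for illustration; none of them comes from the study.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        (a ** 2).sum(axis=1)[:, None]
        + (b ** 2).sum(axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(x, y, gamma):
    """Biased estimate of the squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


# Synthetic stand-ins for feature matrices (hypothetical, not study data):
# site_a is shifted slightly from the development cohort, site_b strongly.
rng = np.random.default_rng(0)
n_features = 5
development = rng.normal(0.0, 1.0, size=(300, n_features))
site_a = rng.normal(0.2, 1.0, size=(300, n_features))
site_b = rng.normal(1.0, 1.0, size=(300, n_features))

# Assumed bandwidth heuristic: one over the number of features.
gamma = 1.0 / n_features

print("MMD^2, development vs. site A:", mmd2(development, site_a, gamma))
print("MMD^2, development vs. site B:", mmd2(development, site_b, gamma))
```

Under these assumptions, the more strongly shifted sample yields the larger value, which mirrors the abstract's observation that the external dataset with the lower transported AUROC also had the larger MMD.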

Original language	English (US)
Pages (from-to)	1260-1269
Number of pages	10
Journal	Clinical Chemistry
Volume	69
Issue number	11
DOIs	https://doi.org/10.1093/clinchem/hvad141
State	Published - Nov 2 2023

ASJC Scopus subject areas

General Medicine

Cite this

@article{c608f006bbba4daf8902cc59698a674a,

title = "Generalizability of a Machine Learning Model for Improving Utilization of Parathyroid Hormone-Related Peptide Testing across Multiple Clinical Centers",

abstract = "BACKGROUND: Measuring parathyroid hormone-related peptide (PTHrP) helps diagnose the humoral hypercalcemia of malignancy, but is often ordered for patients with low pretest probability, resulting in poor test utilization. Manual review of results to identify inappropriate PTHrP orders is a cumbersome process. METHODS: Using a dataset of 1330 patients from a single institute, we developed a machine learning (ML) model to predict abnormal PTHrP results. We then evaluated the performance of the model on two external datasets. Different strategies (model transporting, retraining, rebuilding, and fine-tuning) were investigated to improve model generalizability. Maximum mean discrepancy (MMD) was adopted to quantify the shift of data distributions across different datasets. RESULTS: The model achieved an area under the receiver operating characteristic curve (AUROC) of 0.936, and a specificity of 0.842 at 0.900 sensitivity in the development cohort. Directly transporting this model to two external datasets resulted in a deterioration of AUROC to 0.838 and 0.737, with the latter having a larger MMD corresponding to a greater data shift compared to the original dataset. Model rebuilding using site-specific data improved AUROC to 0.891 and 0.837 on the two sites, respectively. When external data is insufficient for retraining, a fine-tuning strategy also improved model utility. CONCLUSIONS: ML offers promise to improve PTHrP test utilization while relieving the burden of manual review. Transporting a ready-made model to external datasets may lead to performance deterioration due to data distribution shift. Model retraining or rebuilding could improve generalizability when there are enough data, and model fine-tuning may be favorable when site-specific data is limited.",

author = "Yang, {He S.} and Weishen Pan and Yingheng Wang and Zaydman, {Mark A.} and Spies, {Nicholas C.} and Zhen Zhao and Guise, {Theresa A.} and Meng, {Qing H.} and Fei Wang",

year = "2023",

month = nov,

day = "2",

doi = "10.1093/clinchem/hvad141",

language = "English (US)",

volume = "69",

pages = "1260--1269",

journal = "Clinical Chemistry",

issn = "0009-9147",

publisher = "American Association for Clinical Chemistry Inc.",

number = "11",

}

TY - JOUR

T1 - Generalizability of a Machine Learning Model for Improving Utilization of Parathyroid Hormone-Related Peptide Testing across Multiple Clinical Centers

AU - Yang, He S.

AU - Pan, Weishen

AU - Wang, Yingheng

AU - Zaydman, Mark A.

AU - Spies, Nicholas C.

AU - Zhao, Zhen

AU - Guise, Theresa A.

AU - Meng, Qing H.

AU - Wang, Fei

PY - 2023/11/2

Y1 - 2023/11/2

N2 - BACKGROUND: Measuring parathyroid hormone-related peptide (PTHrP) helps diagnose the humoral hypercalcemia of malignancy, but is often ordered for patients with low pretest probability, resulting in poor test utilization. Manual review of results to identify inappropriate PTHrP orders is a cumbersome process. METHODS: Using a dataset of 1330 patients from a single institute, we developed a machine learning (ML) model to predict abnormal PTHrP results. We then evaluated the performance of the model on two external datasets. Different strategies (model transporting, retraining, rebuilding, and fine-tuning) were investigated to improve model generalizability. Maximum mean discrepancy (MMD) was adopted to quantify the shift of data distributions across different datasets. RESULTS: The model achieved an area under the receiver operating characteristic curve (AUROC) of 0.936, and a specificity of 0.842 at 0.900 sensitivity in the development cohort. Directly transporting this model to two external datasets resulted in a deterioration of AUROC to 0.838 and 0.737, with the latter having a larger MMD corresponding to a greater data shift compared to the original dataset. Model rebuilding using site-specific data improved AUROC to 0.891 and 0.837 on the two sites, respectively. When external data is insufficient for retraining, a fine-tuning strategy also improved model utility. CONCLUSIONS: ML offers promise to improve PTHrP test utilization while relieving the burden of manual review. Transporting a ready-made model to external datasets may lead to performance deterioration due to data distribution shift. Model retraining or rebuilding could improve generalizability when there are enough data, and model fine-tuning may be favorable when site-specific data is limited.

UR - http://www.scopus.com/inward/record.url?scp=85176002219&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85176002219&partnerID=8YFLogxK

U2 - 10.1093/clinchem/hvad141

DO - 10.1093/clinchem/hvad141

M3 - Article

C2 - 37738611

AN - SCOPUS:85176002219

SN - 0009-9147

VL - 69

SP - 1260

EP - 1269

JO - Clinical Chemistry

JF - Clinical Chemistry

IS - 11

ER -
