Automating the determination of prostate cancer risk strata from electronic medical records

Justin R. Gregg; Maximilian Lang; Lucy L. Wang; Matthew J. Resnick; Sandeep K. Jain; Jeremy L. Warner; Daniel A. Barocas

doi:10.1200/CCI.16.00045

Research output: Contribution to journal › Article › peer-review

14 Scopus citations

Abstract

Purpose: Risk stratification underlies system-wide efforts to promote the delivery of appropriate prostate cancer care. Although the elements of risk stratum are available in the electronic medical record, manual data collection is resource intensive. Therefore, we investigated the feasibility and accuracy of an automated data extraction method using natural language processing (NLP) to determine prostate cancer risk stratum.

Methods: Manually collected clinical stage, biopsy Gleason score, and preoperative prostate-specific antigen (PSA) values from our prospective prostatectomy database were used to categorize patients as low, intermediate, or high risk by D'Amico risk classification. NLP algorithms were developed to automate the extraction of the same data points from the electronic medical record, and risk strata were recalculated. The ability of NLP to identify elements sufficient to calculate risk (recall) was calculated, and the accuracy of NLP was compared with that of manually collected data using the weighted Cohen's κ statistic.

Results: Of the 2,352 patients with available data who underwent prostatectomy from 2010 to 2014, NLP identified sufficient elements to calculate risk for 1,833 (recall, 78%). NLP had a 91% raw agreement with manual risk stratification (κ = 0.92; 95% CI, 0.90 to 0.93). The κ statistics for PSA, Gleason score, and clinical stage extraction by NLP were 0.86, 0.91, and 0.89, respectively; 91.9% of extracted PSA values were within ± 1.0 ng/mL of the manually collected PSA levels.

Conclusion: NLP can achieve more than 90% accuracy on D'Amico risk stratification of localized prostate cancer, with adequate recall. This figure is comparable to other NLP tasks and illustrates the known tradeoff between recall and accuracy. Automating the collection of risk characteristics could be used to power real-time decision support tools and scale up quality measurement in cancer care.
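The risk-assignment and recall steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' NLP pipeline: the abstract does not state the cutoffs, so the thresholds below are the commonly published D'Amico criteria (an assumption here), and the function names, the normalized stage labels, and the rule of returning None when an element is missing are hypothetical choices made for illustration.

```python
from typing import Optional

# Clinical T stages in ascending order. These normalized labels are an
# assumption; real records would need cleaning before reaching this point.
_STAGE_ORDER = ["T1a", "T1b", "T1c", "T2a", "T2b", "T2c", "T3a", "T3b", "T4"]


def damico_risk(
    psa: Optional[float], gleason: Optional[int], stage: Optional[str]
) -> Optional[str]:
    """Assign a D'Amico stratum from PSA (ng/mL), biopsy Gleason score,
    and clinical stage. Returns None if any element is missing, which is
    how an extraction failure would count against recall.
    Raises ValueError for a stage label not in _STAGE_ORDER."""
    if psa is None or gleason is None or stage is None:
        return None
    rank = _STAGE_ORDER.index(stage)
    # High risk: PSA > 20, or Gleason >= 8, or stage T2c or above.
    if psa > 20 or gleason >= 8 or rank >= _STAGE_ORDER.index("T2c"):
        return "high"
    # Intermediate risk: PSA > 10 to 20, or Gleason 7, or stage T2b.
    if psa > 10 or gleason == 7 or stage == "T2b":
        return "intermediate"
    # Low risk: PSA <= 10, Gleason <= 6, and stage T2a or below.
    return "low"


def recall(n_classified: int, n_total: int) -> float:
    """Fraction of patients for whom all three elements were found."""
    return n_classified / n_total


if __name__ == "__main__":
    print(damico_risk(5.2, 6, "T1c"))
    print(damico_risk(12.0, 6, "T1c"))
    print(damico_risk(5.0, 8, "T2a"))
    print(damico_risk(None, 7, "T2a"))
    # Counts reported in the abstract: 1,833 of 2,352 patients.
    print(f"{recall(1833, 2352):.0%}")
```

With the counts reported in the abstract, recall(1833, 2352) rounds to the stated 78%. Returning None rather than guessing a stratum mirrors the tradeoff the abstract notes: patients with an unextracted element are left unclassified, which lowers recall but keeps them from being misassigned.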

Original language	English (US)
Pages (from-to)	1-8
Number of pages	8
Journal	JCO Clinical Cancer Informatics
Volume	2017
Issue number	1
DOIs	https://doi.org/10.1200/CCI.16.00045
State	Published - 2017
Externally published	Yes

ASJC Scopus subject areas

Oncology
Health Informatics
Cancer Research

Access to Document

10.1200/CCI.16.00045

Cite this

@article{4ca5ebcae2dd426cabc624621e61672c,

title = "Automating the determination of prostate cancer risk strata from electronic medical records",

abstract = "Purpose Risk stratification underlies system-wide efforts to promote the delivery of appropriate prostate cancer care. Although the elements of risk stratum are available in the electronic medical record, manual data collection is resource intensive. Therefore, we investigated the feasibility and accuracy of an automated data extraction method using natural language processing (NLP) to determine prostate cancer risk stratum. Methods Manually collected clinical stage, biopsy Gleason score, and preoperative prostate-specific antigen (PSA) values from our prospective prostatectomy database were used to categorize patients as low, intermediate, or high risk by D'Amico risk classification. NLP algorithms were developed to automate the extraction of the same data points from the electronic medical record, and risk strata were recalculated. The ability of NLP to identify elements sufficient to calculate risk (recall) was calculated, and the accuracy of NLP was compared with that of manually collected data using the weighted Cohen's {$\kappa$} statistic. Results Of the 2,352 patients with available data who underwent prostatectomy from 2010 to 2014, NLP identified sufficient elements to calculate risk for 1,833 (recall, 78%). NLP had a 91% raw agreement with manual risk stratification ({$\kappa$} = 0.92; 95% CI, 0.90 to 0.93). The {$\kappa$} statistics for PSA, Gleason score, and clinical stage extraction by NLP were 0.86, 0.91, and 0.89, respectively; 91.9% of extracted PSA values were within {$\pm$} 1.0 ng/mL of the manually collected PSA levels. Conclusion NLP can achieve more than 90% accuracy on D'Amico risk stratification of localized prostate cancer, with adequate recall. This figure is comparable to other NLP tasks and illustrates the known tradeoff between recall and accuracy. Automating the collection of risk characteristics could be used to power real-time decision support tools and scale up quality measurement in cancer care.",

author = "Gregg, {Justin R.} and Lang, Maximilian and Wang, {Lucy L.} and Resnick, {Matthew J.} and Jain, {Sandeep K.} and Warner, {Jeremy L.} and Barocas, {Daniel A.}",

note = "Publisher Copyright: {\textcopyright} 2018 American Society of Clinical Oncology.",

year = "2017",

doi = "10.1200/CCI.16.00045",

language = "English (US)",

volume = "2017",

pages = "1--8",

journal = "JCO Clinical Cancer Informatics",

issn = "2473-4276",

publisher = "American Society of Clinical Oncology",

number = "1",

}

TY - JOUR

T1 - Automating the determination of prostate cancer risk strata from electronic medical records

AU - Gregg, Justin R.

AU - Lang, Maximilian

AU - Wang, Lucy L.

AU - Resnick, Matthew J.

AU - Jain, Sandeep K.

AU - Warner, Jeremy L.

AU - Barocas, Daniel A.

PY - 2017

Y1 - 2017

N2 - Purpose Risk stratification underlies system-wide efforts to promote the delivery of appropriate prostate cancer care. Although the elements of risk stratum are available in the electronic medical record, manual data collection is resource intensive. Therefore, we investigated the feasibility and accuracy of an automated data extraction method using natural language processing (NLP) to determine prostate cancer risk stratum. Methods Manually collected clinical stage, biopsy Gleason score, and preoperative prostate-specific antigen (PSA) values from our prospective prostatectomy database were used to categorize patients as low, intermediate, or high risk by D'Amico risk classification. NLP algorithms were developed to automate the extraction of the same data points from the electronic medical record, and risk strata were recalculated. The ability of NLP to identify elements sufficient to calculate risk (recall) was calculated, and the accuracy of NLP was compared with that of manually collected data using the weighted Cohen's κ statistic. Results Of the 2,352 patients with available data who underwent prostatectomy from 2010 to 2014, NLP identified sufficient elements to calculate risk for 1,833 (recall, 78%). NLP had a 91% raw agreement with manual risk stratification (κ = 0.92; 95% CI, 0.90 to 0.93). The κ statistics for PSA, Gleason score, and clinical stage extraction by NLP were 0.86, 0.91, and 0.89, respectively; 91.9% of extracted PSA values were within ± 1.0 ng/mL of the manually collected PSA levels. Conclusion NLP can achieve more than 90% accuracy on D'Amico risk stratification of localized prostate cancer, with adequate recall. This figure is comparable to other NLP tasks and illustrates the known tradeoff between recall and accuracy. Automating the collection of risk characteristics could be used to power real-time decision support tools and scale up quality measurement in cancer care.

AB - Purpose Risk stratification underlies system-wide efforts to promote the delivery of appropriate prostate cancer care. Although the elements of risk stratum are available in the electronic medical record, manual data collection is resource intensive. Therefore, we investigated the feasibility and accuracy of an automated data extraction method using natural language processing (NLP) to determine prostate cancer risk stratum. Methods Manually collected clinical stage, biopsy Gleason score, and preoperative prostate-specific antigen (PSA) values from our prospective prostatectomy database were used to categorize patients as low, intermediate, or high risk by D'Amico risk classification. NLP algorithms were developed to automate the extraction of the same data points from the electronic medical record, and risk strata were recalculated. The ability of NLP to identify elements sufficient to calculate risk (recall) was calculated, and the accuracy of NLP was compared with that of manually collected data using the weighted Cohen's κ statistic. Results Of the 2,352 patients with available data who underwent prostatectomy from 2010 to 2014, NLP identified sufficient elements to calculate risk for 1,833 (recall, 78%). NLP had a 91% raw agreement with manual risk stratification (κ = 0.92; 95% CI, 0.90 to 0.93). The κ statistics for PSA, Gleason score, and clinical stage extraction by NLP were 0.86, 0.91, and 0.89, respectively; 91.9% of extracted PSA values were within ± 1.0 ng/mL of the manually collected PSA levels. Conclusion NLP can achieve more than 90% accuracy on D'Amico risk stratification of localized prostate cancer, with adequate recall. This figure is comparable to other NLP tasks and illustrates the known tradeoff between recall and accuracy. Automating the collection of risk characteristics could be used to power real-time decision support tools and scale up quality measurement in cancer care.

UR - http://www.scopus.com/inward/record.url?scp=85045186104&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045186104&partnerID=8YFLogxK

U2 - 10.1200/CCI.16.00045

DO - 10.1200/CCI.16.00045

M3 - Article

C2 - 29541700

AN - SCOPUS:85045186104

SN - 2473-4276

VL - 2017

SP - 1

EP - 8

JO - JCO Clinical Cancer Informatics

JF - JCO Clinical Cancer Informatics

IS - 1

ER -
