Development of a Structured Query Language and Natural Language Processing Algorithm to Identify Lung Nodules in a Cancer Centre

Benjamin Hunter; Sara Reis; Des Campbell; Sheila Matharu; Prashanthi Ratnakumar; Luca Mercuri; Sumeet Hindocha; Hardeep Kalsi; Erik Mayer; Ben Glampson; Emily J. Robinson; Bisan Al-Lazikani; Lisa Scerri; Susannah Bloch; Richard Lee

doi:10.3389/fmed.2021.748168

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Importance: The stratification of indeterminate lung nodules is a growing problem, but the burden of lung nodules on healthcare services is not well-described. Manual service evaluation and research cohort curation can be time-consuming and potentially improved by automation.

Objective: To automate lung nodule identification in a tertiary cancer centre.

Methods: This retrospective cohort study used Electronic Healthcare Records to identify CT reports generated between 31st October 2011 and 24th July 2020. A structured query language/natural language processing tool was developed to classify reports according to lung nodule status. Performance was externally validated. Sentences were used to train machine-learning classifiers to predict concerning nodule features in 2,000 patients.

Results: 14,586 patients with lung nodules were identified. The cancer types most commonly associated with lung nodules were lung (39%), neuro-endocrine (38%), skin (35%), colorectal (33%) and sarcoma (33%). Lung nodule patients had a greater proportion of metastatic diagnoses (45 vs. 23%, p < 0.001), a higher mean post-baseline scan number (6.56 vs. 1.93, p < 0.001), and a shorter mean scan interval (4.1 vs. 5.9 months, p < 0.001) than those without nodules. Inter-observer agreement for sentence classification was 0.94 internally and 0.98 externally. Sensitivity and specificity for nodule identification were 93 and 99% internally, and 100 and 100% at external validation, respectively. A linear-support vector machine model predicted concerning sentence features with 94% accuracy.

Conclusion: We have developed and validated an accurate tool for automated lung nodule identification that is valuable for service evaluation and research data acquisition.
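To make the reported metrics concrete, the following is a minimal illustrative sketch, not the authors' tool. It assumes a simple keyword rule with a basic negation check for flagging a CT report as mentioning a lung nodule, and shows how sensitivity and specificity can be computed against manual labels. The regular expressions, function names and example reports are all hypothetical; the abstract does not describe the actual rules used.

```python
import re

# Hypothetical patterns for illustration only; the real tool's rules are not given here.
NODULE = re.compile(r"\bnodul(?:e|es|ar)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(?:no|without|absence of)\b[^.]*\bnodul", re.IGNORECASE)


def has_nodule(report: str) -> bool:
    """Return True if any sentence mentions a nodule that is not negated."""
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        if NODULE.search(sentence) and not NEGATION.search(sentence):
            return True
    return False


def sensitivity_specificity(predicted, actual):
    """Compute (sensitivity, specificity) from paired boolean labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up report snippets paired with made-up manual labels.
    reports = [
        ("There is a 6 mm nodule in the right upper lobe.", True),
        ("No pulmonary nodules are seen.", False),
        ("Clear lungs. No pleural effusion.", False),
        ("Stable nodular opacity, left lower lobe.", True),
        ("Small pulmonary lesion, indeterminate.", True),  # missed by the keyword rule
    ]
    predicted = [has_nodule(text) for text, _ in reports]
    actual = [label for _, label in reports]
    print(sensitivity_specificity(predicted, actual))
```

Sensitivity is the fraction of manually labelled nodule reports that the rule flags, and specificity is the fraction of nodule-free reports that it correctly leaves unflagged. These are the quantities the abstract reports as 93 and 99% internally.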

Original language	English (US)
Article number	748168
Journal	Frontiers in Medicine
Volume	8
DOIs	https://doi.org/10.3389/fmed.2021.748168
State	Published - Nov 4 2021
Externally published	Yes

Keywords

informatics
lung nodule
machine learning
natural language processing (NLP)
structured query language (SQL)

ASJC Scopus subject areas

General Medicine

Cite this

Hunter, B., Reis, S., Campbell, D., Matharu, S., Ratnakumar, P., Mercuri, L., Hindocha, S., Kalsi, H., Mayer, E., Glampson, B., Robinson, E. J., Al-Lazikani, B., Scerri, L., Bloch, S., & Lee, R. (2021). Development of a Structured Query Language and Natural Language Processing Algorithm to Identify Lung Nodules in a Cancer Centre. Frontiers in Medicine, 8, Article 748168. https://doi.org/10.3389/fmed.2021.748168

Hunter, B, Reis, S, Campbell, D, Matharu, S, Ratnakumar, P, Mercuri, L, Hindocha, S, Kalsi, H, Mayer, E, Glampson, B, Robinson, EJ, Al-Lazikani, B, Scerri, L, Bloch, S & Lee, R 2021, 'Development of a Structured Query Language and Natural Language Processing Algorithm to Identify Lung Nodules in a Cancer Centre', Frontiers in Medicine, vol. 8, 748168. https://doi.org/10.3389/fmed.2021.748168

@article{61cd3ad69f9e4f98a02a92d7e8002f81,

title = "Development of a Structured Query Language and Natural Language Processing Algorithm to Identify Lung Nodules in a Cancer Centre",

abstract = "Importance: The stratification of indeterminate lung nodules is a growing problem, but the burden of lung nodules on healthcare services is not well-described. Manual service evaluation and research cohort curation can be time-consuming and potentially improved by automation. Objective: To automate lung nodule identification in a tertiary cancer centre. Methods: This retrospective cohort study used Electronic Healthcare Records to identify CT reports generated between 31st October 2011 and 24th July 2020. A structured query language/natural language processing tool was developed to classify reports according to lung nodule status. Performance was externally validated. Sentences were used to train machine-learning classifiers to predict concerning nodule features in 2,000 patients. Results: 14,586 patients with lung nodules were identified. The cancer types most commonly associated with lung nodules were lung (39%), neuro-endocrine (38%), skin (35%), colorectal (33%) and sarcoma (33%). Lung nodule patients had a greater proportion of metastatic diagnoses (45 vs. 23%, p < 0.001), a higher mean post-baseline scan number (6.56 vs. 1.93, p < 0.001), and a shorter mean scan interval (4.1 vs. 5.9 months, p < 0.001) than those without nodules. Inter-observer agreement for sentence classification was 0.94 internally and 0.98 externally. Sensitivity and specificity for nodule identification were 93 and 99% internally, and 100 and 100% at external validation, respectively. A linear-support vector machine model predicted concerning sentence features with 94% accuracy. Conclusion: We have developed and validated an accurate tool for automated lung nodule identification that is valuable for service evaluation and research data acquisition.",

keywords = "informatics, lung nodule, machine learning, natural language processing (NLP), structured query language (SQL)",

author = "Benjamin Hunter and Sara Reis and Des Campbell and Sheila Matharu and Prashanthi Ratnakumar and Luca Mercuri and Sumeet Hindocha and Hardeep Kalsi and Erik Mayer and Ben Glampson and Robinson, {Emily J.} and Bisan Al-Lazikani and Lisa Scerri and Susannah Bloch and Richard Lee",

note = "Publisher Copyright: {\textcopyright} 2021 Hunter, Reis, Campbell, Matharu, Ratnakumar, Mercuri, Hindocha, Kalsi, Mayer, Glampson, Robinson, Al-Lazikani, Scerri, Bloch and Lee.",

year = "2021",

month = nov,

day = "4",

doi = "10.3389/fmed.2021.748168",

language = "English (US)",

volume = "8",

journal = "Frontiers in Medicine",

issn = "2296-858X",

publisher = "Frontiers Media S. A.",

}

TY - JOUR

T1 - Development of a Structured Query Language and Natural Language Processing Algorithm to Identify Lung Nodules in a Cancer Centre

AU - Hunter, Benjamin

AU - Reis, Sara

AU - Campbell, Des

AU - Matharu, Sheila

AU - Ratnakumar, Prashanthi

AU - Mercuri, Luca

AU - Hindocha, Sumeet

AU - Kalsi, Hardeep

AU - Mayer, Erik

AU - Glampson, Ben

AU - Robinson, Emily J.

AU - Al-Lazikani, Bisan

AU - Scerri, Lisa

AU - Bloch, Susannah

AU - Lee, Richard

PY - 2021/11/4

Y1 - 2021/11/4

AB - Importance: The stratification of indeterminate lung nodules is a growing problem, but the burden of lung nodules on healthcare services is not well-described. Manual service evaluation and research cohort curation can be time-consuming and potentially improved by automation. Objective: To automate lung nodule identification in a tertiary cancer centre. Methods: This retrospective cohort study used Electronic Healthcare Records to identify CT reports generated between 31st October 2011 and 24th July 2020. A structured query language/natural language processing tool was developed to classify reports according to lung nodule status. Performance was externally validated. Sentences were used to train machine-learning classifiers to predict concerning nodule features in 2,000 patients. Results: 14,586 patients with lung nodules were identified. The cancer types most commonly associated with lung nodules were lung (39%), neuro-endocrine (38%), skin (35%), colorectal (33%) and sarcoma (33%). Lung nodule patients had a greater proportion of metastatic diagnoses (45 vs. 23%, p < 0.001), a higher mean post-baseline scan number (6.56 vs. 1.93, p < 0.001), and a shorter mean scan interval (4.1 vs. 5.9 months, p < 0.001) than those without nodules. Inter-observer agreement for sentence classification was 0.94 internally and 0.98 externally. Sensitivity and specificity for nodule identification were 93 and 99% internally, and 100 and 100% at external validation, respectively. A linear-support vector machine model predicted concerning sentence features with 94% accuracy. Conclusion: We have developed and validated an accurate tool for automated lung nodule identification that is valuable for service evaluation and research data acquisition.

KW - informatics

KW - lung nodule

KW - machine learning

KW - natural language processing (NLP)

KW - structured query language (SQL)

UR - http://www.scopus.com/inward/record.url?scp=85119412758&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85119412758&partnerID=8YFLogxK

U2 - 10.3389/fmed.2021.748168

DO - 10.3389/fmed.2021.748168

M3 - Article

C2 - 34805217

AN - SCOPUS:85119412758

SN - 2296-858X

VL - 8

JO - Frontiers in Medicine

JF - Frontiers in Medicine

M1 - 748168

ER -
