Optimized dual threshold entity resolution for electronic health record databases--training set size and active learning.

Erel Joffe; Michael J. Byrne; Phillip Reeder; Jorge R. Herskovic; Craig W. Johnson; Allison B. McCoy; Elmer V. Bernstam

Bioinformatics & Computational Biology

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Clinical databases may contain several records for a single patient. Multiple general entity-resolution algorithms have been developed to identify such duplicate records. To achieve optimal accuracy, algorithm parameters must be tuned to a particular dataset. The purpose of this study was to determine the required training set size for probabilistic, deterministic and Fuzzy Inference Engine (FIE) algorithms with parameters optimized using the particle swarm approach. Each algorithm classified potential duplicates into three categories: definite match, non-match and indeterminate (i.e., requires manual review). Training set sizes ranged from 2,000 to 10,000 randomly selected record-pairs. We also evaluated marginal uncertainty sampling for active learning. Optimization reduced manual review size (Deterministic from 11.6% to 2.5%; FIE from 49.6% to 1.9%; and Probabilistic from 10.5% to 3.5%). FIE classified 98.1% of the records correctly (precision=1.0). Best performance required training on all 10,000 randomly selected record-pairs. Active learning achieved comparable results with 3,000 records. Automated optimization is effective, and targeted sampling can reduce the required training set size.
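
The three-way classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0-to-1 similarity scores and the threshold values are all hypothetical, and the uncertainty-sampling helper simply assumes that "uncertain" means "closest to either threshold", which may differ from the marginal uncertainty sampling evaluated in the study.

```python
from typing import List, Sequence


def classify(score: float, lower: float, upper: float) -> str:
    """Dual-threshold decision on a similarity score (hypothetical scale).

    Scores at or above `upper` are definite matches, scores at or below
    `lower` are non-matches, and everything in between is indeterminate
    (sent to manual review).
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "indeterminate"


def review_fraction(scores: Sequence[float], lower: float, upper: float) -> float:
    """Fraction of record-pairs that fall between the two thresholds."""
    labels = [classify(s, lower, upper) for s in scores]
    return labels.count("indeterminate") / len(labels)


def most_uncertain(scores: Sequence[float], lower: float, upper: float, k: int) -> List[int]:
    """Indices of the k pairs whose scores lie closest to either threshold.

    Assumption: distance to the nearest threshold stands in for uncertainty.
    """
    def margin(i: int) -> float:
        return min(abs(scores[i] - lower), abs(scores[i] - upper))

    return sorted(range(len(scores)), key=margin)[:k]


# Made-up scores for five candidate record-pairs.
scores = [0.05, 0.2, 0.5, 0.79, 0.95]
labels = [classify(s, lower=0.3, upper=0.8) for s in scores]
```

Narrowing the gap between the two thresholds shrinks the manual review set but raises the risk of misclassification, which is why the thresholds are worth tuning per dataset rather than fixing by hand.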

Original language	English (US)
Pages (from-to)	721-730
Number of pages	10
Journal	AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
Volume	2013
State	Published - 2013

ASJC Scopus subject areas

General Medicine

MD Anderson CCSG core facilities

Bioinformatics Shared Resource

Cite this

@article{8371a0c342984a758ba7fcd7493d88a5,

title = "Optimized dual threshold entity resolution for electronic health record databases--training set size and active learning.",

abstract = "Clinical databases may contain several records for a single patient. Multiple general entity-resolution algorithms have been developed to identify such duplicate records. To achieve optimal accuracy, algorithm parameters must be tuned to a particular dataset. The purpose of this study was to determine the required training set size for probabilistic, deterministic and Fuzzy Inference Engine (FIE) algorithms with parameters optimized using the particle swarm approach. Each algorithm classified potential duplicates into three categories: definite match, non-match and indeterminate (i.e., requires manual review). Training set sizes ranged from 2,000 to 10,000 randomly selected record-pairs. We also evaluated marginal uncertainty sampling for active learning. Optimization reduced manual review size (Deterministic from 11.6% to 2.5%; FIE from 49.6% to 1.9%; and Probabilistic from 10.5% to 3.5%). FIE classified 98.1% of the records correctly (precision=1.0). Best performance required training on all 10,000 randomly selected record-pairs. Active learning achieved comparable results with 3,000 records. Automated optimization is effective, and targeted sampling can reduce the required training set size.",

author = "Erel Joffe and Byrne, {Michael J.} and Phillip Reeder and Herskovic, {Jorge R.} and Johnson, {Craig W.} and McCoy, {Allison B.} and Bernstam, {Elmer V.}",

year = "2013",

language = "English (US)",

volume = "2013",

pages = "721--730",

journal = "AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium",

issn = "1559-4076",

publisher = "American Medical Informatics Association",

}

TY - JOUR

T1 - Optimized dual threshold entity resolution for electronic health record databases--training set size and active learning.

AU - Joffe, Erel

AU - Byrne, Michael J.

AU - Reeder, Phillip

AU - Herskovic, Jorge R.

AU - Johnson, Craig W.

AU - McCoy, Allison B.

AU - Bernstam, Elmer V.

PY - 2013

Y1 - 2013

N2 - Clinical databases may contain several records for a single patient. Multiple general entity-resolution algorithms have been developed to identify such duplicate records. To achieve optimal accuracy, algorithm parameters must be tuned to a particular dataset. The purpose of this study was to determine the required training set size for probabilistic, deterministic and Fuzzy Inference Engine (FIE) algorithms with parameters optimized using the particle swarm approach. Each algorithm classified potential duplicates into three categories: definite match, non-match and indeterminate (i.e., requires manual review). Training set sizes ranged from 2,000 to 10,000 randomly selected record-pairs. We also evaluated marginal uncertainty sampling for active learning. Optimization reduced manual review size (Deterministic from 11.6% to 2.5%; FIE from 49.6% to 1.9%; and Probabilistic from 10.5% to 3.5%). FIE classified 98.1% of the records correctly (precision=1.0). Best performance required training on all 10,000 randomly selected record-pairs. Active learning achieved comparable results with 3,000 records. Automated optimization is effective, and targeted sampling can reduce the required training set size.

AB - Clinical databases may contain several records for a single patient. Multiple general entity-resolution algorithms have been developed to identify such duplicate records. To achieve optimal accuracy, algorithm parameters must be tuned to a particular dataset. The purpose of this study was to determine the required training set size for probabilistic, deterministic and Fuzzy Inference Engine (FIE) algorithms with parameters optimized using the particle swarm approach. Each algorithm classified potential duplicates into three categories: definite match, non-match and indeterminate (i.e., requires manual review). Training set sizes ranged from 2,000 to 10,000 randomly selected record-pairs. We also evaluated marginal uncertainty sampling for active learning. Optimization reduced manual review size (Deterministic from 11.6% to 2.5%; FIE from 49.6% to 1.9%; and Probabilistic from 10.5% to 3.5%). FIE classified 98.1% of the records correctly (precision=1.0). Best performance required training on all 10,000 randomly selected record-pairs. Active learning achieved comparable results with 3,000 records. Automated optimization is effective, and targeted sampling can reduce the required training set size.

UR - http://www.scopus.com/inward/record.url?scp=84901248604&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84901248604&partnerID=8YFLogxK

M3 - Article

C2 - 24551372

AN - SCOPUS:84901248604

SN - 1559-4076

VL - 2013

SP - 721

EP - 730

JO - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium

JF - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium

ER -
