2014
DOI: 10.1016/j.jbi.2014.01.014

Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research

Abstract: Objective: The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset with realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus with Data Use Agreement we would like to encourage other Computational Linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifica…

citations
Cited by 40 publications
(35 citation statements)
references
References 17 publications
“…A broader overview of de-id systems can be found in the recent review article by Meystre et al (2010), in which the authors describe 18 de-identification systems built between 1995 and 2010. Here, we focus on three recent tools: MIST, the MITRE Identification Scrubber Toolkit (Aberdeen et al, 2010), BoB, the “best of breed” tool from the Veteran's Health Administration (Ferrández et al, 2012), and an in-house tool from Cincinnati Children's Hospital Medical Center (Deleger et al, 2014). …”
Section: Related Work (mentioning)
confidence: 99%
“…The CCHMC system also utilizes pre-processing in the form of an in-house and the TreeTagger part-of-speech processor, and post-processing in the form of rules that identify email addresses, match names to an external lexicon, and capture any names that the CRF module missed. When tested on the 2006 i2b2 corpus, with training data from other corpora, the system achieved precision, recall, and F1 of 0.9682, 0.9342, and 0.9509, respectively (Deleger et al, 2014). …”
Section: Related Work (mentioning)
confidence: 99%
“…Deleger et al (2014) recently created a corpus of 3,503 de-identified medical records of 22 different types, including discharge summaries, progress notes, and referrals. In all, their corpus contains 30,815 instances of PHI and is available upon request.…”
Section: Related Work (mentioning)
confidence: 99%
“…The case of the Electronic Medical Records and Genomics (eMERGE) Network is a clear example of a project spanning a long period, focusing its first efforts and deliverables on building the logic and standards to organize data and the institutions collecting and sharing them (Lemke et al, 2010; Deleger et al, 2014; Jiang et al, 2015). The primary goal of eMERGE is to combine biorepositories with EHR systems for genomic discovery and the implementation of genomics in medical practice.…”
Section: Electronic Health Records (mentioning)
confidence: 99%