Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research

Deléger, Louise; Lingren, Todd; Ni, Yizhao; Kaiser, Megan; Stoutenborough, Laura; Marsolo, Keith; Kouril, Michal; Molnar, Katalin; Solti, Imre

doi:10.1016/j.jbi.2014.01.014

Cited by 40 publications

(35 citation statements)

References 17 publications

“…A broader overview of de-id systems can be found in the recent review article by Meystre et al (2010), in which the authors describe 18 de-identification systems built between 1995 and 2010. Here, we focus on three recent tools: MIST, the MITRE Identification Scrubber Toolkit (Aberdeen et al, 2010), BoB, the “best of breed” tool from the Veteran's Health Administration (Ferrández et al, 2012), and an in-house tool from Cincinnati Children's Hospital Medical Center (Deleger et al, 2014). …”

Section: Related Work (mentioning)

confidence: 99%

“…The CCHMC system also utilizes pre-processing in the form of an in-house and the TreeTagger 3 part of speech processor, and post-processing in the form of rules that identify email addresses, match names to an external lexicon, and capture any names that the CRF module missed. When tested on the 2006 i2b2 corpus, with training data from other corpora, the system achieved precision, recall, and F1 of 0.9682, 0.9342, and 0.9509, respectively (Deleger et al, 2014). …”

Section: Related Work (mentioning)

confidence: 99%

Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1

Stubbs; Kotfila; Uzuner. 2015. Journal of Biomedical Informatics.

The 2014 i2b2/UTHealth Natural Language Processing (NLP) shared task featured four tracks. The first of these was the de-identification track focused on identifying protected health information (PHI) in longitudinal clinical narratives. The longitudinal nature of clinical narratives calls particular attention to details of information that, while benign on their own in separate records, can lead to identification of patients in combination in longitudinal records. Accordingly, the 2014 de-identification track addressed a broader set of entities and PHI than covered by the Health Insurance Portability and Accountability Act – the focus of the de-identification shared task that was organized in 2006. Ten teams tackled the 2014 de-identification task and submitted 22 system outputs for evaluation. Each team was evaluated on their best performing system output. Three of the 10 systems achieved F1 scores over .90, and seven of the top 10 scored over .75. The most successful systems combined conditional random fields and hand-written rules. Our findings indicate that automated systems can be very effective for this task, but that de-identification is not yet a solved problem.

“…Deleger et al, (2014) recently created a corpus of 3,503 de-identified medical records of 22 different types, including discharge summaries, progress notes, and referrals. In all, their corpus contains 30,815 instances of PHI and is available upon request.…”

Section: Related Work (mentioning)

confidence: 99%

Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus

Stubbs; Uzuner. 2015. Journal of Biomedical Informatics.

The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1,304 longitudinal medical records describing 296 patients. This corpus was de-identified under a broad interpretation of the HIPAA guidelines using double-annotation followed by arbitration, rounds of sanity checking, and proof reading. The average token-based F1 measure for the annotators compared to the gold standard was 0.927. The resulting annotations were used both to de-identify the data and to set the gold standard for the de-identification track of the 2014 i2b2/UTHealth shared task. All annotated private health information were replaced with realistic surrogates automatically and then read over and corrected manually. The resulting corpus is the first of its kind made available for de-identification research. This corpus was first used for the 2014 i2b2/UTHealth shared task, during which the systems achieved a mean F-measure of 0.872 and a maximum F-measure of 0.964 using entity-based micro-averaged evaluations.

“…The case of the Electronic Medical Records and Genomics (eMERGE) Network is a clear example of project spanning a long period, focusing the first efforts and deliverables on building the logic and standards to organize data and the institutions collecting and sharing them (Lemke et al, 2010;Deleger et al, 2014;Jiang et al, 2015). The primary goal of the eMERGE is to combine biorepositories with EHR systems aimed at genomic discovery and implementation of genomic in the medical practice.…”

Section: Electronic Health Records (mentioning)

confidence: 99%

Big Data: Challenge and Opportunity for Translational and Industrial Research in Healthcare

Rossi; Grifantini. 2018. Front. Digit. Humanit.

Research and innovation are constant imperatives for the healthcare sector: medicine, biology and biotechnology support it, and more recently computational and data-driven disciplines gained relevance to handle the massive amount of data this sector is and will be generating. To be effective in translational and healthcare industrial research, big data in the life science domain need to be organized, well annotated, catalogued, correlated and integrated: the biggest the data silos at hand, the stronger the need for organization and tidiness. The degree of such organization marks the transition from data to knowledge for strategic decision making. Medicine is supported by observations and data and for certain aspects medicine is becoming a data science supported by clinicians. While medicine defines itself as personalized, quantified (precision med) or in high-definition, clinicians should be prepared to deal with a world in which Internet of People paraphrases the Internet of Things paradigm. Integrated use of electronic health records (EHRs) and quantitative data (both clinical and molecular) is a key process to develop precision medicine. Health records collection was originally designed for patient care and billing and/or insurance purposes. The digitization of health records facilitates and opens up new possibilities for science and research and they should be now collected and managed with this aim in mind. More data and the ability to efficiently handle them is a significant advantage not only for clinicians and life science researchers, but for drugs producers too. In an industrial sector spending increasing efforts on drug repurposing, attention to efficient methods to unwind the intricacies of the hugely complex reality of human physiology, such as network based methods and physical chemistry computational methods, became of paramount importance. 
Finally, the main pillars of industrial R&D processes for vaccines, include initial discovery, early-late pre clinics, pre-industrialization, clinical phases and finally registration-commercialization. The passage from one step to another is regulated by stringent pass/fail criteria. Bottlenecks of the R&D process are often represented by animal and human studies, which could be rationalized by surrogate in vitro assays as well as by predictive molecular and cellular signatures and models.
