Abstract: A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g. PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format, such as a relational database or RDF triplestore, that can be used more effectively for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic systems for relation extraction in different domains. However, comparative…
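To make the recoding step concrete, here is a minimal sketch of serialising extracted relations as RDF N-Triples; the namespace URI and relation names are illustrative assumptions, not part of any particular system:

```python
# Minimal sketch (hypothetical namespace and relation names): recoding
# extracted relations as N-Triples lines for loading into a triplestore.

EX = "http://example.org/"

def to_ntriple(subject, predicate, obj):
    """Serialise one extracted relation as a single N-Triples line."""
    return f"<{EX}{subject}> <{EX}{predicate}> <{EX}{obj}> ."

relations = [
    ("PersonW", "worksFor", "OrganisationX"),
    ("GeneY", "encodes", "ProteinZ"),
]

for s, p, o in relations:
    print(to_ntriple(s, p, o))
```

The resulting lines can be loaded directly by most triplestores, after which the relations become queryable with SPARQL rather than free-text search.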
“…the ANNIE NER tagger which is part of GATE [5]. Relation extraction is often included as a subtask in text mining applications [6] with approaches to it ranging from rule-based through supervised to unsupervised machine learning.…”
Background: With improvements to text mining technology and the availability of large unstructured Electronic Healthcare Record (EHR) datasets, it is now possible to extract structured information from the raw text contained within EHRs with reasonably high accuracy. We describe a text mining system for classifying radiologists' reports of CT and MRI brain scans, assigning labels that indicate the occurrence and type of stroke, as well as other observations. Our system, the Edinburgh Information Extraction for Radiology reports (EdIE-R) system, was developed and tested on a collection of radiology reports. The work reported in this paper is based on 1168 radiology reports from the Edinburgh Stroke Study (ESS), a hospital-based register of stroke and transient ischaemic attack patients. We manually created annotations for this data in parallel with developing the rule-based EdIE-R system to identify phenotype information related to stroke in radiology reports. This process was iterative, and domain expert feedback was considered at each iteration to adapt and tune the EdIE-R text mining system, which identifies entities, negation and relations between entities in each report and determines report-level labels (phenotypes).

Results: The inter-annotator agreement (IAA) for all types of annotation is high: 96.96 for entities, 96.46 for negation, 95.84 for relations and 94.02 for labels. The equivalent system scores on the blind test set are equally high: 95.49 for entities, 94.41 for negation, 98.27 for relations and 96.39 for labels against the first annotator, and 96.86, 96.01, 96.53 and 92.61, respectively, against the second annotator.

Conclusion: Automated reading of such EHR data at such high levels of accuracy opens up avenues for population health monitoring and audit, and can provide a resource for epidemiological studies. We are in the process of validating EdIE-R in separate larger cohorts in NHS England and Scotland.
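The pipeline just described (entities, negation, relations, report-level labels) is rule-based. As a purely illustrative sketch of one such rule, here is a toy negation-scoping function; the trigger list, whitespace tokenisation and window size are assumptions for illustration, not EdIE-R's actual rules:

```python
import re

# Toy rule-based negation scoping: an entity is marked negated if a
# trigger word occurs within `window` tokens before it. Trigger words
# and window size are illustrative assumptions.

NEG_TRIGGERS = re.compile(r"\b(no|without|absence of|negative for)\b", re.I)

def negated_entities(sentence, entities, window=5):
    """Return {entity: bool} marking which entity mentions fall inside
    a preceding negation trigger's window."""
    tokens = sentence.lower().split()
    results = {}
    for ent in entities:
        ent_tokens = ent.lower().split()
        for i in range(len(tokens) - len(ent_tokens) + 1):
            if tokens[i:i + len(ent_tokens)] == ent_tokens:
                context = " ".join(tokens[max(0, i - window):i])
                results[ent] = bool(NEG_TRIGGERS.search(context))
    return results

print(negated_entities("No evidence of acute infarct", ["infarct"]))
```

Real systems additionally handle punctuation, coordination ("no X or Y") and scope-terminating words, which this sketch deliberately omits.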
The manually annotated ESS corpus will be available for research purposes on application.
“…ADE-EXT (Adverse Drug Effect corpus, extended) (Gurulingappa et al, 2012) consists of MEDLINE case reports, annotated with drugs and conditions (e.g., diseases, signs and symptoms), along with untyped relationships between them. reACE (Edinburgh Regularized Automatic Content Extraction) (Hachey et al, 2012) consists of English broadcast news and newswire annotated with organization, person, fvw (facility, vehicle or weapon) and gpl (geographical, political or location) entities, along with relationships between them. Relationships are classified into five types: general-affiliation, organisation-affiliation, part-whole, personal-social and agent-artifact.…”
Section: Source Corpora
“…First, we show that, compared to a baseline Convolutional Neural Network (CNN)-based model, a syntax-based model (i.e., the TreeLSTM model) can better benefit from a TL strategy, even with very dissimilar additional source data. We conduct our experiments with two biomedical RE tasks and relatively small associated corpora, SNPPhenA (Bokharaeian et al, 2017) and EU-ADR (van Mulligen et al, 2012), as target corpora and three larger RE corpora, SemEval 2013 DDI (Herrero-Zazo et al, 2013), ADE-EXT (Gurulingappa et al, 2012) and reACE (Hachey et al, 2012), as source corpora. Second, we propose a syntax-based analysis, using both quantitative criteria and qualitative observations, to better understand the role of syntactic features in the TL behavior.…”
This work explores the detection of individuals' risk of type 2 diabetes mellitus (T2DM) directly from their social media (Twitter) activity. Our approach extends a deep learning architecture with several contributions: following previous observations that language use differs by gender, it captures and uses gender information through domain adaptation; it captures recency of posts under the hypothesis that more recent posts are more representative of an individual's current risk status; and, lastly, it demonstrates that in this scenario, where activity factors are sparsely represented in the data, a bag-of-words neural network model using custom dictionaries of food and activity words performs better than other neural sequence models. Our best model, which incorporates all these contributions, achieves a risk-detection F1 of 41.9, considerably higher than the baseline rate (36.9).
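The dictionary-based bag-of-words idea can be sketched as follows; the lexicons, feature layout and function name are illustrative assumptions, not the authors' actual model:

```python
# Illustrative sketch: bag-of-words features restricted to custom
# food/activity lexicons, the kind of compact representation that can
# outperform sequence models when such cues are sparse. The word lists
# are toy examples, not the paper's dictionaries.

FOOD_WORDS = {"pizza", "salad", "soda", "burger", "rice"}
ACTIVITY_WORDS = {"run", "gym", "walk", "yoga", "cycling"}

def bow_features(posts):
    """Count lexicon hits across a user's posts; returns a fixed-size
    feature vector [food_count, activity_count]."""
    food = activity = 0
    for post in posts:
        for token in post.lower().split():
            if token in FOOD_WORDS:
                food += 1
            elif token in ACTIVITY_WORDS:
                activity += 1
    return [food, activity]

print(bow_features(["had pizza and soda again", "skipped the gym today"]))
```

In the paper's setting, a vector like this would be fed into a small feed-forward classifier rather than a recurrent or convolutional sequence model.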
“…Relation extraction approaches can be classified in various ways: knowledge engineering approaches (e.g., rule-based, linguistic-based), learning approaches (e.g., statistical, machine learning, bootstrapping) and hybrid ones; for a general review of relation extraction techniques see [Hac09].…”
Section: Relationship Detection In Novels
In prose literature, complex dynamics of interpersonal relationships can often be observed between the different characters. Traditionally, node-link diagrams are used to depict the social network of a novel. However, static graphs can only visualize the overall social network structure, not the development of the network over the course of the story, while dynamic graphs suffer from many sudden changes between different portions of the overall social network. In this paper we explore means to show the relationships between the characters of a plot and, at the same time, their development over the course of a novel. Based on a careful exploration of the design space, we suggest a new visualization technique called Fingerprint Matrices. A case study exemplifies the usage of Fingerprint Matrices and shows that they are an effective means to analyze prose literature with respect to the development of relationships between the different characters.
Motivation: Literature can be studied in a number of different ways and from many perspectives, but text analysis will surely always make up a central component of literature studies. Our work aims at supporting literature scholars in this task by providing them with visualizations that make patterns or trends with respect to a certain text property easy to perceive. Specifically, the approach presented in this paper concentrates on the development of social networks in prose literature that represent the relationships between characters in a novel. The visualization of such networks can reveal inherent patterns, like subgroups of characters that interact with each other. However, the relationships in a novel are usually not static but develop during the plot. Social networks build up gradually, and some acquaintances may only be important for part of the story. The goal of our work is to enable literature scholars to dig deeper and explore a novel in terms of where in the plot certain protagonists are related to each other. This way, a deeper understanding of the structure of the novel with respect to co-occurrences of characters becomes possible and more details of the story line are revealed. The basic idea of the paper is to visualize pairwise relations between characters in so-called co-occurrence glyphs that can be considered a fingerprint of the dynamics between two protagonists. The fingerprints are arranged in an adjacency matrix to get the overall picture of the storyline.

The paper is structured as follows: First, some background information about the applied natural language processing techniques as well as related work is given in Section 2. This is followed by a careful consideration of the design space in Section 3, which motivates the visualization technique that is introduced in Section 4. Section 5 explains how to read Fingerprint Matrices.
This is further exemplified in the case studies in Section 6 in which a modern English novel and a Swedish novel of the 19th century are analyzed. The paper concludes with a summary an...
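The co-occurrence data behind such fingerprint glyphs can be sketched as follows; the character names, the segmentation into fixed text chunks and the match-by-substring test are illustrative simplifications of the NLP pipeline the paper relies on:

```python
from itertools import combinations
from collections import defaultdict

# Illustrative sketch: for each pair of characters, record a 0/1 vector
# of whether both appear in each text segment. This per-pair vector is
# the raw data a fingerprint-matrix cell would visualise.

def cooccurrence_by_segment(segments, characters):
    """Return {(charA, charB): [0/1 per segment]} for all character pairs."""
    fingerprints = defaultdict(list)
    for segment in segments:
        text = segment.lower()
        present = {c for c in characters if c.lower() in text}
        for a, b in combinations(sorted(characters), 2):
            fingerprints[(a, b)].append(1 if {a, b} <= present else 0)
    return dict(fingerprints)

segments = ["Anna met Ben at the station.",
            "Ben travelled alone.",
            "Anna wrote to Ben and Clara."]
print(cooccurrence_by_segment(segments, ["Anna", "Ben", "Clara"]))
```

Arranging these per-pair vectors into an adjacency matrix, one glyph per cell, gives the overall picture of how each relationship develops across the plot.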