Identification of the certainty of events is an important text mining problem. In particular, biomedical texts report medical conditions or findings that might be factual, hedged or negated. Identification of negation and its scope over a term of interest determines whether a finding is reported and is a challenging task. Not much work has been performed for Spanish in this domain. In this work we introduce different algorithms developed to determine if a term of interest is under the scope of negation in radiology reports written in Spanish. The methods include syntactic techniques based in rules derived from PoS tagging patterns, constituent tree patterns and dependency tree patterns, and an adaption of NegEx, a well known rule-based negation detection algorithm (Chapman et al., 2001a). All methods outperform a simple dictionary lookup algorithm developed as baseline. NegEx and the PoS tagging pattern method obtain the best results with 0.92 F1.
Radiology reports express the results of a radiology study and contain information about anatomical entities, findings, measures and impressions of the medical doctor. The use of information extraction techniques can help physicians to access this information in order to understand data and to infer further knowledge.Supervised machine learning methods are very popular to address information extraction, but are usually domain and language dependent. To train new classification models, annotated data is required. Moreover, annotated data is also required as an evaluation resource of information extraction algorithms. However, one major drawback of processing clinical data is the low availability of annotated datasets. For this reason we performed a manual annotation of radiology reports written in Spanish. This paper presents the corpus, the annotation schema, the annotation guidelines and further insight of the data.
Named Entity Recognition in the clinical domain and in languages different from English has the difficulty of the absence of complete dictionaries, the informality of texts, the polysemy of terms, the lack of accordance in the boundaries of an entity, the scarcity of corpora and of other resources available. We present a Named Entity Recognition method for poorly resourced languages. The method was tested with Spanish radiology reports and compared with a conditional random fields system.
In this paper we present an annotated corpus which can be used for training and testing algorithms to automatically extract information about diseases outbreaks from news and health reports. We also propose initial approaches to extract information from it. The corpus has been constructed with two main tasks in mind. The first one, to extract entities about outbreaks such as disease, host, location among others. The second one, to retrieve relations among entities, for instance, in such geographic location fifteen cases of a given disease were reported. Overall, our goal is to offer resources and tools to perform an automated analysis so as to support early detection of disease outbreaks and therefore diminish their spreading.
In recent years, hate speech has gained relevance in social networks and other virtual media because of its intensity and its relationship with violent acts against members of protected groups. Due to the huge amounts of user-generated contents, a great effort has been made to develop automatic tools to aid the analysis and moderation of this speech, at least in its most threatening forms. One of the limitations of current approaches to automatic hate speech detection is the lack of context; most studies and resources focus on isolated messages, without considering any type of conversational context or even the topic being discussed. This severely restricts the available information for determining whether a post on a social network should be tagged as hateful or not. In this work, we assess the impact of adding contextual information to the hate speech detection task. In particular, we study a Twitter subdomain consisting of replies to posts by news outlets, which provides a natural environment for contextualized hate speech detection. We collected a novel corpus in the Rioplatense dialectal variety of Spanish focusing on hate speech associated with the COVID-19 pandemic, and manually annotated it using carefully designed guidelines. Our classification experiments using state-of-the-art transformer-based machine learning techniques show evidence that adding contextual information improves the performance of hate speech detection for two proposed tasks (binary and multilabel prediction), increasing their Macro F1 by 4.2 and 5.5 points, respectively. These results highlight the importance of exploiting contextual information for the task of hate speech detection. We make our code, models, and corpus available for further research.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.