Clinical Named Entity Recognition (CNER) aims to recognize named entities such as body parts, diseases, and symptoms in Electronic Health Records (EHRs), which can benefit many intelligent biomedical systems. In recent years, increasing attention has been paid to end-to-end CNER with recurrent neural networks (RNNs), especially long short-term memory networks (LSTMs). However, capturing long-range dependencies remains a great challenge for RNNs. Moreover, Chinese presents additional challenges: it uses logograms instead of an alphabet, its words are often ambiguous, and it has no explicit word boundaries. In this work, we present a BiLSTM-CRF model with a self-attention mechanism (Att-BiLSTM-CRF) for the Chinese CNER task, which aims to address these problems. The self-attention mechanism can learn long-range dependencies by establishing a direct connection between every pair of characters. To learn more semantic information about Chinese characters, we propose a novel fine-grained character-level representation method. We also introduce part-of-speech (POS) labeling information into our model to capture the semantic information of the input sentence. We evaluate performance on the CCKS-2017 Shared Task 2 dataset, and the experimental results indicate that our model outperforms other state-of-the-art methods.

INDEX TERMS Natural language processing, named entity recognition, neural networks, self-attention.
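The key property claimed for self-attention, that every character connects directly to every other character regardless of distance, can be illustrated with a minimal sketch. This is plain scaled dot-product attention over raw character vectors; the paper's actual model presumably uses learned query/key/value projections inside a BiLSTM-CRF, so treat this as an assumption-laden toy, not the authors' implementation.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of
    character vectors x with shape (seq_len, d).

    Each output position is a weighted mix of ALL input positions,
    so the path length between any two characters is 1; this is how
    self-attention sidesteps the long-range dependency problem of RNNs.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # pairwise similarities (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x                               # context-aware vector per character

# toy example: 5 "characters" with 4-dim embeddings (random stand-ins)
rng = np.random.default_rng(0)
chars = rng.normal(size=(5, 4))
out = self_attention(chars)
print(out.shape)  # one context vector per character: (5, 4)
```

Note that the attention weight matrix is dense: character 1 can attend to character 50 with the same single step as to character 2, which is the "direct connection" the abstract refers to.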
Motivation
A variety of biological applications have been developed on top of next-generation genome sequencing technologies, and alignment is the first step once the sequencing reads are obtained. In recent years, many software tools have been developed to efficiently and accurately align short reads to the reference genome. However, many reads still cannot be mapped to the reference genome because they exceed the allowable number of mismatches. Moreover, besides the unmapped reads, reads with low mapping qualities are also excluded from downstream analyses, such as variant calling. If we can take advantage of the confident segments of these reads, not only can the alignment rate be improved, but more information will also be provided for downstream analysis.

Results
This paper proposes a method, called RAUR (Re-align the Unmapped Reads), to re-align reads that cannot be mapped by alignment tools. First, it uses the base quality scores (reported by the sequencer) to identify the most confident and informative segments of the unmapped reads, controlling the number of possible mismatches in the alignment. Then, combined with an alignment tool, RAUR re-aligns these segments. We ran RAUR on both simulated data and real data with different read lengths. The results show that many reads that fail to be aligned by the most popular alignment tools (BWA and Bowtie2) can be correctly re-aligned by RAUR, with similar precision. Even compared with BWA-MEM and the local mode of Bowtie2, which perform local alignment on long reads to improve the alignment rate, RAUR shows advantages in alignment rate and precision in some cases. Therefore, the trimming strategy used in RAUR is useful for improving the alignment rate of alignment tools for next-generation genome sequencing.

Availability
All source code is available at http://netlab.csu.edu.cn/bioinformatics/RAUR.html.
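The idea of extracting the most confident segment of a read from its base quality scores can be sketched with a simple maximum-sum window scan. The threshold value and the Kadane-style scoring here are illustrative assumptions, not RAUR's published procedure, which additionally bounds the number of allowed mismatches.

```python
def best_segment(quals, threshold=20):
    """Return the (start, end) half-open interval of the contiguous
    read segment that maximizes the sum of (q - threshold) over
    Phred quality scores `quals` (Kadane-style scan).

    Bases below the threshold penalize a segment, so the result is
    the run most dominated by confident base calls. Illustrative
    heuristic only; RAUR's exact trimming criterion may differ.
    """
    best_sum = cur_sum = 0
    best = (0, 0)
    start = 0
    for i, q in enumerate(quals):
        cur_sum += q - threshold
        if cur_sum <= 0:
            # segment so far is net low-quality: restart after position i
            cur_sum, start = 0, i + 1
        elif cur_sum > best_sum:
            best_sum, best = cur_sum, (start, i + 1)
    return best

# toy read: confident prefix, one noisy base, then a low-quality tail
quals = [35, 36, 8, 34, 35, 37, 36, 9, 7, 10]
print(best_segment(quals))  # (0, 7): keeps the confident prefix, trims the tail
```

A trimmed segment like this could then be handed back to a standard aligner, which is the overall two-phase strategy the abstract describes.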
Recently, the Web has been rapidly "deepened" by many searchable online databases, where data are hidden behind query forms. For modeling and integrating Web databases, the very first challenge is to understand what a query interface says, that is, what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some "concerted structure" by sharing common building blocks. From this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit across different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar, and thus turns their semantic understanding into a parsing problem. Such a paradigm enables principled solutions both for declaratively representing common patterns, via a derived grammar, and for systematically interpreting query forms, via a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax: it must be derived, and it is secondary to the input. At the heart of our form extractor, we therefore develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach: it achieves above 85% accuracy in extracting query conditions across random sources.