A Neural Multi-digraph Model for Chinese NER with Gazetteers

Ding, Ruixue; Xie, Pengjun; Zhang, Xiaoyan; Lu, Wei; Li, Linlin; Si, Luo

doi:10.18653/v1/p19-1141

Cited by 113 publications

(67 citation statements)

References 17 publications


“…Zhang et al [26] investigate a lattice network which explicitly leverages word and word sequence information, and achieve F1-score of 58.79%. Our proposed model has a significant improvement in the named entities, which improves 1.96% compared with Ding et al [5]. And overall performance is significantly better than other models.…”

Section: Comparison With Previous Work (mentioning)

confidence: 61%

“…[14] add gazetteer-enhanced sub-tagger on hybrid semi-Markov CRF architecture and observe some promising results. And [5] also propose a neural multi-digraph model with the information of gazetteers.…”

Section: Related Work (mentioning)

confidence: 99%

“…And Cao et al [2] also use the information of CWS for NER. Zhang et al [24] and Ding et al [5] add additional features, and the latter achieve 94.4% F1-score. Zhu et al [28] investigate a Convolution Attention Network to capture the information from adjacent characters and sentence contexts, which achieves F1-score of 92.97%.…”

Section: Comparison With Previous Work (mentioning)

confidence: 99%


A Mixed Semantic Features Model for Chinese NER with Characters and Words

Chang, Jiang, et al. 2020

Lecture Notes in Computer Science


Named Entity Recognition (NER) is an essential part of many natural language processing (NLP) tasks. Existing Chinese NER methods are mostly based on word segmentation or use character sequences as input. However, a single-granularity representation suffers from out-of-vocabulary and word segmentation errors, and the semantic content it captures is relatively simple. In this paper, we introduce the self-attention mechanism into the BiLSTM-CRF neural network structure for Chinese named entity recognition with two embeddings. Unlike other models, our method combines character and word features at the sequence level, and the attention mechanism computes similarity over the full sequence consisting of characters and words. The character semantic information and the structure of words work together to improve the accuracy of word boundary segmentation and solve the problem of long-phrase combination. We validate our model on the MSRA and Weibo corpora, and experiments demonstrate that our model can significantly improve performance on the Chinese NER task.


“…During the last few years, researchers have improved the F1 score on this dataset using various techniques, ranging from using Bi-LSTM and CRF, a model similar to the design of Lample et al, and achieved 40.42% (Lin et al, 2017), incorporating the Transfer Learning (TL) approach that achieved an F1 score of 40.78%, to the model of Aguilar et al (2017) who boosted their model with an extra feature extracted from an external data resource, a gazetteer, and they scored an entity and surface F1 scores of 41.86% and 40.24%, respectively. Several researchers incorporated gazetteers to capture further features from the input text (Mishra and Diesner, 2016;Aguilar et al, 2017;Štravs and Zupančič, 2019;Dey and Prukayastha, 2013;Ding et al, 2019). Nevertheless, its usage is considered a limitation due to the difficulties of building and maintaining it up-to-date to cope with new terms and entities.…”

Section: Named Entity Recognition (mentioning)

confidence: 99%

“…Several researchers have used gazetteers to capture additional features from the input text (Mishra and Diesner, 2016;Aguilar et al, 2017;Štravs and Zupančič, 2019;Dey and Prukayastha, 2013;Ding et al, 2019). However, their use could be considered a limitation because of the difficulty of building them and keeping them up to date to cope with new terms and entities.…”

Section: Text Mining (unclassified)

Supervised machine learning for classification mining and ranking of illegal web contents = Aprendizaje automático supervisado para la clasificación, extracción de conocimiento y ordenación de contenidos web ilegales

Nabki, Wesam


In this thesis, we propose new algorithms, methods, and datasets that can be used to classify, mine information from, and rank web domains or similar resources containing text. Motivated by our joint work with INCIBE, we focus our efforts on detecting web resources whose content could indicate illegal activities. Most of these textual web pages are hosted in a darknet, and, because of that, we centered our analysis on The Onion Router (Tor) Darknet, based on the common belief that this net hosts plenty of criminal activities. Additionally, we addressed the same problem in Online Notepad Services (ONS), in particular the Pastebin service.

Several of the contributions that we present here are already incorporated in tools developed by INCIBE that help Spanish Law Enforcement Agencies (LEAs) to monitor the contents of the Tor Darknet. Our work relies on the application of machine learning, both classical and deep, using supervised learning most of the time. This approach required the creation of different datasets. We named the first of them Darknet Usage Text Addresses (DUTA); it contained 6,831 labeled samples distributed over 26 classes. Later, we extended this dataset to 10,367 samples, naming it DUTA-10K.

Using DUTA, we evaluated the combination of two text representation techniques with three well-known classifiers to categorize the Tor domains. The combination of TF-IDF word representation with Logistic Regression achieved a 93.7% macro F1 score on a subset of DUTA where eight categories of illegal activities were selected. To classify Pastebin contents, we use Active Learning to select and label only the most informative samples, reducing in this way the cost of building a labeled dataset. Our design requires three cascaded classifiers, the last of which determines whether a sample belongs to one of six categories related to criminal activities, obtaining an average class recall of 95.24% as binary and 80.33% as multiclass.

To enrich the information that we provide to LEAs, we first develop a semi-automatic algorithm to identify emerging products in Tor marketplaces. Using Graph Theory, we build a Products Correlations Graph (PCG), in which the nodes are the markets' products and the edges reflect the simultaneous offering of two products in the same market. Our algorithm decomposes the PCG using the k-shell algorithm and analyzes the connectivity of the products in the core-shell. We apply this method to drug Hidden Services (HS) in DUTA, finding that MDMA and Ecstasy were the most emerging drug products during the analyzed period. Second, we used Named Entity Recognition (NER) to recognize rare and emerging named entities in noisy user-generated text. We overcome the need for gazetteers when incorporating external resources into neural network architectures by presenting a novel feature, which we named Local Distance Neighbor (LDN), thereby obtaining the state-of-the-art F1 score on three categories of the W-NUT-2017 dataset: Group, Person, and Product. Furthermore, we present an application of NER...
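The Products Correlations Graph step mentioned in the abstract above can be illustrated with a small sketch. It assumes a hypothetical mapping from market names to the sets of products each market offers (the names `m1`, `A`, and so on are invented for illustration), builds an undirected graph with an edge between any two products offered in the same market, and assigns each product a shell index using the standard k-core peeling procedure. The function names `build_pcg` and `k_shell_indices` are also hypothetical; this is not the thesis's implementation, only a minimal instance of k-shell decomposition.

```python
from itertools import combinations


def build_pcg(market_products):
    """Build an undirected product graph as {product: set(neighbors)}.

    Two products are connected if at least one market offers both.
    """
    graph = {}
    for products in market_products.values():
        for product in products:
            graph.setdefault(product, set())
        for a, b in combinations(sorted(products), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def k_shell_indices(graph):
    """Return {node: shell index} by peeling low-degree nodes (k-core peeling)."""
    remaining = {node: set(nbrs) for node, nbrs in graph.items()}
    shell = {}
    k = 0
    while remaining:
        # The shell index never decreases as peeling proceeds.
        k = max(k, min(len(nbrs) for nbrs in remaining.values()))
        queue = [n for n, nbrs in remaining.items() if len(nbrs) <= k]
        while queue:
            node = queue.pop()
            if node not in remaining:
                continue  # already peeled via a duplicate queue entry
            shell[node] = k
            for nbr in remaining.pop(node):
                if nbr in remaining:
                    remaining[nbr].discard(node)
                    if len(remaining[nbr]) <= k:
                        queue.append(nbr)
    return shell


# Hypothetical toy data: three markets and the products they list.
markets = {
    "m1": {"A", "B", "C"},
    "m2": {"A", "B", "D"},
    "m3": {"D", "E"},
}
shells = k_shell_indices(build_pcg(markets))
core_value = max(shells.values())
core_products = sorted(p for p, s in shells.items() if s == core_value)
print(shells)
print(core_products)
```

In this toy graph, the products in the highest shell are the ones most densely co-offered across markets, which is the kind of connectivity the abstract says is analyzed in the core-shell.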


Chinese Named Entity Recognition: Applications and Challenges

Ren, Yao, et al. 2021

MDATA: A New Knowledge Representation Model
