Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language

Das, Arjun; Ganguly, Debasis; Garain, Utpal

doi:10.1145/3015467

Cited by 41 publications

(25 citation statements)

References 16 publications


“…Information-technology-based extraction of information entities is commonly performed with the Named Entity Recognition (NER) method. In previous studies, NER has been used to extract information from business cards [2], tutorial videos [3], article text [4], social media posts [5]-[7], and entity information in BTS data records [8]. Today, various available tools can be used to apply this method.…”

Section: Introduction (unclassified)

“…However, the performance of NER tools on Indonesian-language posters still suffers from accuracy problems [9]-[11] because Indonesian is a low-resource language, as are Bengali [4] and Chinese [12]. One reason is that Indonesian is not an international language such as English or French, which already have text corpora with vocabularies of millions of words available on the Internet [9].…”

Section: Introduction (unclassified)


Combining the NER-OCR methods to improve information retrieval efficiency in the Indonesian posters

Rosidy, Akhriza, Husni

2020

Jurnal Teknologi dan Sistem Komputer


Event organizers in Indonesia often use websites to disseminate information about these events through digital posters. However, manually processing for transferring information from posters to websites is constrained by time efficiency, given the increasing number of posters uploaded. Also, information retrieval methods, such as Named Entity Recognition (NER) for Indonesian posters, are still rarely discussed in the literature. In contrast, the NER method application to Indonesian corpus is challenged by accuracy improvement because Indonesian is a low-resource language that causes a lack of corpus availability as a reference. This study proposes a solution to improve the efficiency of information extraction time from digital posters. The proposed solution is a combination of the NER method with the Optical Character Recognition (OCR) method to recognize text on posters developed with the support of relevant training data corpus to improve accuracy. The experimental results show that the system can increase time efficiency by 94 % with 82-92 % accuracy for several extracted information entities from 50 testing digital posters.



“…According to Reference [28], "this is an important task because its performance directly affects the quality of many succeeding NLP applications such as information extraction". Its application recently gained popularity for processing semi-structured knowledge bases regarding entity disambiguation/mapping [29][30][31] and extracting/retrieving information [32] or for analyzing content generated on social media [33][34][35].…”

Section: Natural Language Processing Approaches (mentioning)

confidence: 99%

Hit or Miss? Evaluating the Potential of a Research Niche: A Case Study in the Field of Virtual Quality Management

et al. 2019


When knowledge is developed fast, as it is the case so often nowadays, one of the main difficulties in initiating new research in any field is to identify the domain's specific state-of-the-art and trends. In this context, to evaluate the potential of a research niche by assisting the literature review process and to add a new and modern large-scale and automated dimension to it, the paper proposes a methodology that uses "Latent Semantic Analysis" (LSA) for identifying trends, focused within the knowledge space created at the intersection of three sustainability-related methodologies/concepts: "virtual Quality Management" (vQM), "Industry 4.0", and "Product Life-Cycle" (PLC). The LSA was applied to a significant number of scientific papers published around these concepts to generate ontology charts that describe the knowledge structure of each by the frequency, position, and causal relation of associated notions. These notions are combined for defining the common high-density knowledge zone from where new technological solutions are expected to emerge throughout the PLC. The authors propose the concept of the knowledge space, which is characterized through specific descriptors with their own evaluation scales, obtained by processing the emerging information as identified by a combination of classic and innovative techniques. The results are validated through an investigation that surveys a relevant number of general managers, specialists, and consultants in the field of quality in the automotive sector from Romania. This practical demonstration follows each step of the theoretical approach and yields results that prove the capability of the method to contribute to the understanding and elucidation of the scientific area to which it is applied. Once validated, the method could be transferred to fields with similar characteristics. 
Even if their creators endowed them with a clear meaning at an incipient stage, when they become more popular in an emerging area, these concepts are quickly surrounded by a large amount of new knowledge that is developed with an amazing speed, enriching and enlarging their initial sphere. The "virtual Quality Management" (vQM) concept could be a significant example for the circumstances described previously. It is born through a semantic operation, joining two established and mature concepts: "virtual" and "QM", thus it is representative for an area which is in a period of high dynamic development and of interest for companies preoccupied with sustainability from the perspective of operations management and organizational culture. In this context in which the amount of information relating to new concepts quickly reaches unmanageable levels, regardless of the field, solutions that can analyze extended documentation with the purpose of disambiguating information and capturing the essentials, thus creating knowledge, become the focus of attention and gain in importance. Traditional solutions for that purpose lay in the literature review process, trying to collect, select, filter, and struc...


“…An example of this approach is [8], in which the authors apply Word2vec to generate clusters of words with similar contexts. This approach shows better results than a classical CRF for languages with a small volume of annotated corpora (for example, Bengali).…”

Section: Information extraction using neural network models (unclassified)
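
The idea in the statement above, grouping words with similar contexts into clusters and giving the cluster identity to a sequence labeller, can be sketched roughly as follows. This is a minimal illustration and not the cited authors' implementation: the 2-D vectors in `EMBEDDINGS`, the helper names `kmeans`, `build_clusters`, and `token_features`, and the choice of k-means as the clustering step are all assumptions made for the example. Real vectors would come from a Word2vec model trained on unlabeled text, and the resulting feature dictionaries would be passed to a CRF toolkit.

```python
import math

# Hypothetical 2-D "embeddings" for illustration only; in practice these
# would come from a Word2vec model trained on unlabeled text.
EMBEDDINGS = {
    "kolkata": (0.90, 0.10),
    "monday": (0.10, 0.90),
    "dhaka": (0.80, 0.20),
    "friday": (0.20, 0.80),
    "delhi": (0.85, 0.15),
    "sunday": (0.15, 0.85),
}


def kmeans(points, k, iterations=10):
    """Plain k-means. Centroids start at the first k points so runs are repeatable."""
    centroids = list(points[:k])
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment


def build_clusters(embeddings, k):
    """Map every word in the vocabulary to a cluster index."""
    words = list(embeddings)
    labels = kmeans([embeddings[w] for w in words], k)
    return dict(zip(words, labels))


def token_features(token, clusters):
    """Feature dictionary for one token; unseen words get the cluster 'UNK'."""
    word = token.lower()
    cluster = clusters.get(word)
    return {
        "word": word,
        "cluster": "UNK" if cluster is None else f"c{cluster}",
    }


CLUSTERS = build_clusters(EMBEDDINGS, k=2)
print(token_features("Dhaka", CLUSTERS))
```

Because the cluster feature is shared by every word in a group, a labeller that saw only "kolkata" tagged as a location in training can still receive a useful signal for "dhaka", which is one plausible reason such features help when annotated data is scarce.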

Information extraction using neural language models for the case of online job listings analysis

et al. 2018


In this article we discuss the approach to information extraction (IE) using neural language models. We provide a detailed overview of modern IE methods: both supervised and unsupervised. The proposed method allows to achieve a high quality solution to the problem of analyzing the relevant labor market requirements without the need for a time-consuming labelling procedure. In this experiment, professional standards act as a knowledge base of the labor domain. Comparing the descriptions of work actions and requirements from professional standards with the elements of job listings, we extract four entity types. The approach is based on the classification of vector representations of texts, generated using various neural language models: averaged word2vec, SIF-weighted averaged word2vec, TF-IDF-weighted averaged word2vec, paragraph2vec. Experimentally, the best quality was shown by the averaged word2vec (CBOW) model.
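
Two of the text representations named in this abstract, the plain average of word vectors and the TF-IDF-weighted average, could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's code: the vectors in `VECTORS`, the two-document `CORPUS`, the smoothed IDF formula, and the function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors; a real system would load trained word2vec vectors.
VECTORS = {
    "develop": (1.0, 0.0),
    "software": (0.8, 0.2),
    "manage": (0.0, 1.0),
    "team": (0.2, 0.8),
    "the": (0.5, 0.5),
}
DIMS = len(next(iter(VECTORS.values())))


def idf_weights(corpus):
    """Smoothed inverse document frequency: log(N / df) + 1 for each token."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def average_vector(tokens, weights=None):
    """Mean of the vectors of known tokens.

    With weights=None every occurrence counts equally. With IDF weights, each
    occurrence contributes its IDF value, so a repeated token effectively
    receives a tf * idf weight.
    """
    total = [0.0] * DIMS
    norm = 0.0
    for tok in tokens:
        if tok not in VECTORS:
            continue  # out-of-vocabulary tokens are skipped
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        total = [t + w * x for t, x in zip(total, VECTORS[tok])]
        norm += w
    return tuple(t / norm for t in total) if norm else tuple(total)


CORPUS = [["develop", "the", "software"], ["manage", "the", "team"]]
IDF = idf_weights(CORPUS)

plain = average_vector(["the", "develop"])
weighted = average_vector(["the", "develop"], weights=IDF)
print(plain, weighted)
```

In this toy corpus "the" occurs in every document and so gets the lowest IDF weight, which pulls the weighted vector toward the more informative word "develop". Either vector could then be passed to any standard classifier.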

