Survey of Named Entity Recognition Techniques for Various Indian Regional Languages

Kale, Shrutika; Govilkar, Sharvari

doi:10.5120/ijca2017913621

Cited by 11 publications

(5 citation statements)

References 8 publications

“…But unstructured big data bring more challenges to these techniques such as scalability, automatic semantic labeling, selection of appropriate techniques for the task and requirements of user, data annotation. 25,30,34,35,57,100,111 Hence, the emergence of advanced learning-based approaches with rule-based will improve the performance of IE systems for the huge volume and variety of big data. Optimal feature extraction and selection: Feature extraction and transformation from unstructured data are more critical for data analysis as compared to structured data due to the heterogeneity and multidimensionality of unstructured documents. Features like bag-of-words, orthographic features, lexical features, and gazetteer-related features can be extracted from the text for learning-based approaches 130 that improves the data analysis process.…”

Section: Results mentioning

confidence: 99%
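
The feature families named in the statement above (bag-of-words, orthographic, lexical, and gazetteer-related features) can be illustrated with a minimal sketch. It is not taken from the cited papers: the feature names, the toy gazetteer, and the example sentence are all hypothetical, and the capitalization cues only make sense for Latin-script text.

```python
import re

# Hypothetical toy gazetteer; a real system would load curated name lists.
GAZETTEER = {"mumbai", "pune", "delhi"}


def token_features(tokens, i):
    """Build an illustrative feature dict for the token at position i."""
    tok = tokens[i]
    return {
        # Lexical features: the surface form and its affixes.
        "lower": tok.lower(),
        "prefix2": tok[:2].lower(),
        "suffix2": tok[-2:].lower(),
        # Orthographic features: shape cues (Latin-script text only).
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": bool(re.search(r"\d", tok)),
        # Gazetteer feature: membership in a known-name list.
        "in_gazetteer": tok.lower() in GAZETTEER,
        # Context features: neighbouring tokens, with boundary markers.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def bag_of_words(tokens):
    """Count lowercased tokens, ignoring order."""
    counts = {}
    for tok in tokens:
        key = tok.lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


tokens = "Sharvari visited Pune in 2017".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
bow = bag_of_words(tokens)
```

Per-token dictionaries of this kind are the usual input to sequence labellers such as CRFs, while the bag-of-words counts describe the whole sentence or document.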

Section: Limitations of Existing IE Techniques for Unstructured Data mentioning

confidence: 99%

“…29 It is found that machine learning approaches the best suit for NER techniques for various Indian regional languages such as Hindi, Marathi, Bengali, Punjabi, Malayalam, Bengali, Kannada, Telugu, Tamil, Urdu, and Oriya, while HMM and CRF give best results considering their limitations. 30 IE from human language text is different for each language. But IE is easier for rich morphological languages like Russian and English.…”

mentioning

confidence: 99%

Limitations of information extraction methods and techniques for heterogeneous unstructured big data

Adnan

Akbar

2019

International Journal of Engineering Business Management

During the recent era of big data, a huge volume of unstructured data are being produced in various forms of audio, video, images, text, and animation. Effective use of these unstructured big data is a laborious and tedious task. Information extraction (IE) systems help to extract useful information from this large variety of unstructured data. Several techniques and methods have been presented for IE from unstructured data. However, numerous studies conducted on IE from a variety of unstructured data are limited to single data types such as text, image, audio, or video. This article reviews the existing IE techniques along with its subtasks, limitations, and challenges for the variety of unstructured data highlighting the impact of unstructured big data on IE techniques. To the best of our knowledge, there is no comprehensive study conducted to investigate the limitations of existing IE techniques for the variety of unstructured big data. The objective of the structured review presented in this article is twofold. First, it presents the overview of IE techniques from a variety of unstructured data such as text, image, audio, and video at one platform. Second, it investigates the limitations of these existing IE techniques due to the heterogeneity, dimensionality, and volume of unstructured big data. The review finds that advanced techniques for IE, particularly for multifaceted unstructured big data sets, are the utmost requirement of the organizations to manage big data and derive strategic information. Further, potential solutions are also presented to improve the unstructured big data IE systems for future research. These solutions will help to increase the efficiency and effectiveness of the data analytics process in terms of context-aware analytics systems, data-driven decision-making, and knowledge management.

“…Named entity recognition (NER) is a fundamental task of information extraction, which seeks to discover elements in a text and assign them to predefined categories (Kale and Govilkar, 2017). Abundant articles have studied the Chinese NER models in general domains which can only retrieve common information such as organizations, persons and addresses (Liu et al , 2018).…”

Section: Introduction mentioning

confidence: 99%

Identifying business information through deep learning: analyzing the tender documents of an Internet-based logistics bidding platform

2023

DTA

Purpose: The tender documents, an essential data source for internet-based logistics tendering platforms, incorporate massive fine-grained data, ranging from information on tenderee, shipping location and shipping items. Automated information extraction in this area is, however, under-researched, making the extraction process a time- and effort-consuming one. For Chinese logistics tender entities, in particular, existing named entity recognition (NER) solutions are mostly unsuitable as they involve domain-specific terminologies and possess different semantic features.

Design/methodology/approach: To tackle this problem, a novel lattice long short-term memory (LSTM) model, combining a variant contextual feature representation and a conditional random field (CRF) layer, is proposed in this paper for identifying valuable entities from logistic tender documents. Instead of traditional word embedding, the proposed model uses the pretrained Bidirectional Encoder Representations from Transformers (BERT) model as input to augment the contextual feature representation. Subsequently, with the Lattice-LSTM model, the information of characters and words is effectively utilized to avoid error segmentation.

Findings: The proposed model is then verified by the Chinese logistic tender named entity corpus. Moreover, the results suggest that the proposed model excels in the logistics tender corpus over other mainstream NER models. The proposed model underpins the automatic extraction of logistics tender information, enabling logistic companies to perceive the ever-changing market trends and make far-sighted logistic decisions.

Originality/value: (1) A practical model for logistic tender NER is proposed in the manuscript. By employing and fine-tuning BERT into the downstream task with a small amount of data, the experiment results show that the model has a better performance than other existing models. This is the first study, to the best of the authors' knowledge, to extract named entities from Chinese logistic tender documents. (2) A real logistic tender corpus for practical use is constructed and a program of the model for online-processing real logistic tender documents is developed in this work. The authors believe that the model will facilitate logistic companies in converting unstructured documents to structured data and further perceive the ever-changing market trends to make far-sighted logistic decisions.

“…The work is very mature and the functionality comes out of the box with NLP libraries like NLTK [5] and spacy [10]. In contrast, limited work is done in the Indic languages like Hindi and Marathi [14]. [25] addresses the problems faced by Indian languages like the presence of abbreviations, ambiguities in named entity categories, different dialects, spelling variations and the presence of foreign words.…”

Section: Introduction mentioning

confidence: 99%

L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models

Patil,

Ranade,

Sabane

et al. 2022

Preprint

Named Entity Recognition (NER) is a basic NLP task and finds major applications in conversational and search systems. It helps us identify key entities in a sentence used for the downstream application. NER or similar slot filling systems for popular languages have been heavily used in commercial applications. In this work, we focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state. Marathi is a low resource language and still lacks useful NER resources. We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi. We also describe the manual annotation guidelines followed during the process. In the end, we benchmark the dataset on different CNN, LSTM, and Transformer based models like mBERT, XLM-RoBERTa, IndicBERT, MahaBERT, etc. The MahaBERT provides the best performance among all the models. The data and models are available at https://github.com/l3cubepune/MarathiNLP
