Smoothed n-gram based models for tweet language identification: A case study of the Brazilian and European Portuguese national varieties

Castro, Dayvid; Souza, Ellen; Vitório, Douglas; Santos, Diego P. dos; Oliveira, Adriano L. I.

doi:10.1016/j.asoc.2017.05.065

Cited by 21 publications

(10 citation statements)

References 9 publications

“…An n-gram language model predicts the probability of a given n-gram within any sequence of words in the language. It is widely used in text mining [15,16], including in the legal domain [19]. An n-gram is a contiguous sequence of n items from a given sequence of text.…”

Section: Language Model (mentioning)

confidence: 99%
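
The definition quoted above can be illustrated with a minimal Python sketch: it extracts contiguous n-grams from a token list and estimates a bigram probability. The add-one (Laplace) smoothing used here is an assumption chosen for brevity, not necessarily the smoothing method of the cited works, and the function names and sample sentence are hypothetical.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous window of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def add_one_bigram_prob(tokens, prev, word):
    """Estimate P(word | prev) with add-one (Laplace) smoothing.

    Assumption: add-one smoothing is used only for illustration.
    """
    vocab = set(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    # Count each token as a context only when another token follows it.
    context_counts = Counter(tokens[:-1])
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


# Hypothetical sample sentence, split into word tokens.
tokens = "the bill was read and the bill was approved".split()
print(ngrams(tokens, 2)[:3])
print(add_one_bigram_prob(tokens, "the", "bill"))      # seen bigram
print(add_one_bigram_prob(tokens, "the", "approved"))  # unseen bigram, still non-zero
```

Smoothing matters because an unseen n-gram would otherwise receive probability zero, which is a common situation for short texts such as tweets.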

An Information Retrieval Pipeline for Legislative Documents from the Brazilian Chamber of Deputies

Souza

Vitório

Moriyama

et al. 2021

Frontiers in Artificial Intelligence and Applications

Self Cite

This work investigates information retrieval methods to address the existing difficulties on the Preliminary Search, part of the law making process from the Brazilian Chamber of Deputies. For such, different preprocessing approaches, stemmers, language models, and BM25 variants were compared. Two legislative corpora from Chamber were used to build and validate the pipeline. All texts were converted to lowercase and had stopwords, accentuation, and punctuation removed. Words were represented by their stem combined with word unigram and bigram language models. Retrieving the bill that was originated from a specific job request, the BM25L with Savoy stemmer reached a R@20 of 0.7356. After removing queries with inconsistencies or which made reference exclusively to attachments, to other job requests, or to bills, the R@20 increased to 0.94.
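
The preprocessing and evaluation steps named in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the stopword list is a tiny hypothetical sample, stemming and BM25 ranking are omitted, and recall at k is computed for the single-relevant-document case the abstract describes (one bill per job request).

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stopword list (unaccented forms).
STOPWORDS = {"de", "da", "do", "dos", "das", "a", "o", "e"}


def normalize(text):
    """Lowercase, strip accents and ASCII punctuation, drop stopwords."""
    text = text.lower()
    # Decompose accented characters, then discard the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def unigrams_and_bigrams(tokens):
    """Return word unigrams followed by space-joined word bigrams."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def recall_at_k(ranked_ids, relevant_id, k):
    """Return 1.0 if the single relevant document is in the top k, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


tokens = normalize("Câmara dos Deputados: projeto de lei.")
print(unigrams_and_bigrams(tokens))
print(recall_at_k(["b3", "b1", "b7"], "b1", k=2))
```

Averaging `recall_at_k` over all queries gives a corpus-level R@k figure of the kind reported in the abstract.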

“…Performance evaluation of all the five algorithms named GD, GDM, GDA, GDX, and LM is carried out using a confusion matrix as shown in Table 2 (Dhaoui et al, 2017; Castro et al, 2017; Moraes et al, 2013) for binary datasets. The effectiveness of all the five algorithms used for updating the parameters of ANN is measured using precision, recall, f-score, accuracy, training time, and MSE as performance metrics.…”

Section: Performance Measures (mentioning)

confidence: 99%

“…Further, statistical analysis of all the five algorithms is performed using Wilcoxon signed-rank test. (Castro et al, 2017) using the confusion matrix presented in Table 2.…”

Section: Performance Measures (mentioning)

confidence: 99%

“…The supervised machine learning models are being widely used and have shown to be very effective for automation of sentiment classification of such a massive amount of data (Dhaoui et al, 2017; Melville et al, 2009). The supervised machine learning classifiers such as Naïve Bayes (Liu et al, 2017b; Ye et al, 2009; Goel et al, 2016), logistic regression (Qasem et al, 2015; Castro et al, 2017; Mungra et al, in press), support vector machine (SVM) (Castro et al, 2017; Tellez et al, 2017; Mungra et al, in press), artificial neural network (ANN) (Moraes et al, 2013; Liu et al, 2017b; Mungra et al, in press), decision tree (Liu et al, 2017b; Mungra et al, in press), and random forest (Aziz et al, 2017; Mungra et al, in press) have been effective in the task of sentiment classification (Liu et al, 2017b; Ye et al, 2009; Goel et al, 2016). ANN is a simple, robust, and a very popularly used classifier for classifying a piece of text into positive or negative class.…”

Section: Introduction (mentioning)

confidence: 99%

Sentiment Analysis: An empirical comparison between various training algorithms for Artificial Neural Network

Thakkar

Mungra

Agrawal

2020

IJICA

The proliferated increase in the commercial benefits of sentiment analysis accumulated a huge interest in the domain of sentiment classification. Sentiment analysis categorises a given text into positive or negative class. In this paper, we present an empirical comparison between different training algorithms gradient descent (GD), gradient descent with momentum backpropagation (GDM), gradient descent adaptive learning rate backpropagation (GDA), gradient descent with momentum and adaptive learning rate backpropagation (GDX), and Levenberg-Marquardt backpropagation (LM), used for training the neural network for the domain of sentiment classification. The performance of all the methods is compared and evaluated using three balanced binary datasets from various domains with different features using various performance metrics such as accuracy, precision, recall, f-score, mean squared error, and training time. The experiments are performed five times with different random seed values using 10-fold cross-validation. The results indicate that GDX and LM outperform other methods in terms of classification accuracy.

“…In addition, there are now research efforts where the authors try to solve a specific language searching problem [6]-[13], but there is no complete software architecture easily customizable for different search applications. In [6] author's give one optimization of the method proposed in [2] where selection of the similarity measure is performed using the principles of redundancy and fault tolerance, in [7] is described one search engine using MySQL as one of cheap option, work [8] presents one architecture which uses different semantic web technologies and builds one prototype of semantic web mashup possibility, paper [9] proposes one novel Italian Sign Language Multi Word Net using process of integration the Multi Word Net lexical database and the Italian Sign Language, paper [10] describes a novel LInSTSS approach which is suitable for using to create a software tool which is capable to determine the semantic similarity of two presented no large texts, in paper [11], authors propose the use of smoothed n-gram language models to classify tweets as a typical short texts from Twitter in both Portuguese languages - Brazilian and European variants, paper [12] deals with the software architecture which establishing electronic services for searching and presentation in an information system on scientific activities of the Ministry of Education, Science and Technological Development of the Republic of Serbia and work [13] has objective to give a lexicon based algorithm which is able to perform different natural language identification using minimal training data in the obligatory process of machine learning because this step is often the first step in many natural language processing tasks which is normally necessary to make in the shortest possible time. Therefore, we have a strong motive for designing the SEFRA framework - hybrid solution based on existing Web services and technologies (framework source code is available at: https://bitbucket.org/mjovanov/pretraga/).…”

Section: Related Work (mentioning)

confidence: 99%

SEFRA - Web-based Framework Customizable for Serbian Language Search Applications

2019

APH

This paper presents SEFRA, a web-based framework for searching Web content written in Serbian. SEFRA is an easily customizable hybrid solution that can be a platform for new search applications and/or a service for already existing ones. The proposed architecture solves the problems of indexing, searching and displaying search results adjusted for Serbian. It unifies several web technologies and services into one product suitable for use in the Western Balkan's countries for helping e-Government citizens' services and other public-sector services, private company administration, solving specific search problems for academic institutions and scientific literature publishers, etc. The proposed solution uses advanced Serbian language services accessible over the Web. It is also implementable for any other language where the target language morphology service exists. In other words, architecture is also customizable in this direction. It should be noted that the proposed architecture is optimized from both backend and web front-end perspective. The source code can be pulled from https://bitbucket.org/mjovanov/pretraga/. The one application of the proposed architecture is experimentally demonstrated through the search of crime law documents of Serbia. The experimental usage of this implementation shows that the problem of search relevance is well-solved and easily customizable.
