Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects 2014
DOI: 10.3115/v1/w14-5314

Using Maximum Entropy Models to Discriminate between Similar Languages and Varieties

Abstract: DSLRAE is a hierarchical classifier for similar written languages and varieties based on maximum-entropy (maxent) classifiers. In the first level, the text is classified into a language group using a simple token-based maxent classifier. At the second level, a group-specific maxent classifier is applied to classify the text as one of the languages or varieties within the previously identified group. For each group of languages, the classifier uses a different kind and combination of knowledge-poor features: to…
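The two-level design described in the abstract can be sketched in plain Python: a token-based maxent (multinomial logistic regression) classifier first picks the language group, then a group-specific maxent classifier, here using character n-gram features as one knowledge-poor option, picks the variety. All data, labels, and hyperparameters below are illustrative rather than the paper's, and regularization (e.g. a Gaussian prior) is omitted for brevity.

```python
import math
from collections import defaultdict

def features(text, char_ngrams=False):
    # Knowledge-poor features: whitespace tokens, or character 1-3-grams.
    if char_ngrams:
        return [text[i:i + n] for n in (1, 2, 3)
                for i in range(len(text) - n + 1)]
    return text.split()

def train_maxent(data, char_ngrams=False, epochs=200, lr=0.5):
    # Multinomial logistic regression ("maximum entropy") trained by
    # plain gradient ascent on the log-likelihood; no prior/regularizer.
    labels = sorted({y for _, y in data})
    w = {y: defaultdict(float) for y in labels}
    for _ in range(epochs):
        for text, gold in data:
            feats = features(text, char_ngrams)
            scores = {y: sum(w[y][f] for f in feats) for y in labels}
            z = sum(math.exp(s) for s in scores.values())
            for y in labels:
                grad = (1.0 if y == gold else 0.0) - math.exp(scores[y]) / z
                for f in feats:
                    w[y][f] += lr * grad
    def predict(text):
        feats = features(text, char_ngrams)
        return max(labels, key=lambda y: sum(w[y][f] for f in feats))
    return predict

# Toy (text, group, variety) examples -- purely illustrative.
train = [
    ("o governo anunciou novas medidas hoje", "pt", "pt-BR"),
    ("o governo anunciou as medidas ontem", "pt", "pt-PT"),
    ("el gobierno anunció nuevas medidas hoy", "es", "es-ES"),
    ("el gobierno anunció las medidas ayer", "es", "es-AR"),
]

# Level 1: token-based group classifier over the whole training set.
group_clf = train_maxent([(t, g) for t, g, _ in train])

# Level 2: one group-specific classifier per group, on character n-grams.
lang_clfs = {
    g: train_maxent([(t, v) for t, gg, v in train if gg == g],
                    char_ngrams=True)
    for g in {g for _, g, _ in train}
}

def classify(text):
    group = group_clf(text)
    return group, lang_clfs[group](text)
```

Routing through a group classifier first keeps each second-level model small and lets each group use the feature set that best separates its members.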


Cited by 13 publications (10 citation statements) | References 10 publications
“…Logistic Regression (LR) Chen and Maison (2003) used a logistic regression ("LR") model (also commonly referred to as "maximum entropy" within NLP), smoothed with a Gaussian prior. Porta and Sancho (2014) defined LR for character-based features as follows:…”
Section: Entropy (mentioning, confidence: 99%)
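The snippet above truncates the definition it quotes. For context, the standard multinomial logistic-regression (maxent) model over features \(f_i(x)\), with the Gaussian prior the citation mentions, has this general form (a generic textbook formulation, not necessarily the exact one given in Porta and Sancho (2014)):

```latex
P(y \mid x) = \frac{\exp\big(\textstyle\sum_i w_{y,i}\, f_i(x)\big)}
                   {\sum_{y'} \exp\big(\textstyle\sum_i w_{y',i}\, f_i(x)\big)},
\qquad
\mathcal{L}(w) = \sum_{(x,y)} \log P(y \mid x) \;-\; \frac{1}{2\sigma^2} \sum_{y,i} w_{y,i}^2
```

The second term is the Gaussian (L2) prior with variance \(\sigma^2\), which smooths the weights against sparse character-based features.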
“…An important challenge has been the development of methods to measure the distance between very similar languages or variants and for short texts, where more precision is required, such as in Porta and Sancho (2014); Purver (2014) and Goutte, Léger, Malmasi, and Zampieri (2016).…”
Section: Corpus-driven Methodologies (mentioning, confidence: 99%)
“…In the four editions of the DSL shared task a variety of computational methods have been tested. This includes Maximum Entropy (Porta and Sancho, 2014), Prediction by Partial Matching (PPM) (Bobicev, 2015), language model perplexity (Gamallo et al., 2017), SVMs (Purver, 2014), Convolutional Neural Networks (CNNs) (Belinkov and Glass, 2016), word-based back-off models (Jauhiainen et al., 2015; Jauhiainen et al., 2016), and classifier ensembles, the approach we apply in this paper.…”
Section: Related Work (mentioning, confidence: 99%)