Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing

Mukherjee, Abhinav; Ravi, Anirudh; Datta, Kaustav

doi:10.1145/2824864.2824873

Cited by 9 publications

(6 citation statements)

References 2 publications


“…Gambäck (2013, 2014) used the language tags of the previous three words with an SVM. Mukherjee et al (2014) used language labels of surrounding words with NB. King et al (2015) used the language probabilities of the previous word to determine weights for languages.…”

Section: Statistics Of Words Van Der Lee and Bosch

mentioning

confidence: 99%

Automatic Language Identification in Texts: A Survey

Jauhiainen, Lui, Zampieri et al. 2019

JAIR

102


Language identification ("LI") is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

LI as a task predates computational methods: the earliest interest in the area was motivated by the needs of translators, and simple manual methods were developed to quickly identify documents in specific languages. The earliest known work to describe a functional LI program for text is by Mustonen (1965), a statistician, who used multiple discriminant analysis to teach a computer how to distinguish, at the word level, between English, Swedish and Finnish. Mustonen compiled a list of linguistically-motivated character-based features, and trained his language identifier on 300 words for each of the three target languages. The training procedure created two discriminant functions, which were tested with 100 words for each language. The experiment resulted in 76% of the words being correctly classified; even by current standards this percentage would be seen as acceptable given the small amount of training material, although the composition of training and test data is not clear, making the experiment unreproducible.

In the early 1970s, Nakamura (1971) considered the problem of automatic LI. According to Rau (1974) and the available abstract of Nakamura's article, his language identifier was able to distinguish between 25 languages written with the Latin alphabet. As features, the method used the occurrence rates of characters and words in each language. From the abstract it seems that, in addition to the frequencies, he used some binary presence/absence features of particular characters or words, based on manual LI. Rau (1974) wrote his master's thesis "Language Identification by Statistical Analysis" for the Naval Postgraduate School at Monterey, California. The continued interest and the need to use LI of text in military intelligence settings is evidenced by the recent articles of, for example, Rafidha Rehiman et al. (2013), Rowe et al. (2013), and Voss et al. (2014). As features for LI, Rau (1974) used, e.g., the relative frequencies of characters and character bigrams. With a majority vote classifier ensemble of seven classifiers using Kolmogorov-Smirnov's Test of Goodness of Fit and Yule's characteristic (K), he managed...


“…For the reasons explained in Section 5.2, we are unable to directly compare to the systems that participated in the FIRE 2014 shared task. The best reported F-score results on the deromanization of transliterated search subtask were 7.3% for Bengali (Gupta et al, 2014a) and 30.4% for Hindi (Mukherjee et al, 2014). We attribute the superior results of our system to its ability to handle spelling variations found in romanized codemixed texts.…”

Section: Sequence Prediction

mentioning

confidence: 78%

Joint Approach to Deromanization of Code-mixed Texts

Riyadh, Kondrak 2019

Proceedings of the Sixth Workshop On


The conversion of romanized texts back to the native scripts is a challenging task because of the inconsistent romanization conventions and non-standard language use. This problem is compounded by code-mixing, i.e., using words from more than one language within the same discourse. In this paper, we propose a novel approach for handling these two problems together in a single system. Our approach combines three components: language identification, back-transliteration, and sequence prediction. The results of our experiments on Bengali and Hindi datasets establish the state of the art for the task of deromanization of code-mixed texts.


“…They fine-tuned their system for those languages and performed very well in the respective language tracks. Two teams (Asterish [33] and BITS-Lipyantaran [27]) used Google transliteration API for Hindi, and they achieved the highest TF scores. The teams which used machine learning on token-based and n-gram features have higher labeling accuracy than the teams which only relied on dictionaries and rules.…”

Section: Submissions

mentioning

confidence: 99%

“…In Ad hoc@MSIR'14, we received seven runs and we observed that the two runs from BITS-Lipyantaran [27] perform best across all the metrics. Table 9 presents the results of the seven runs received.…”

Section: Submissions

mentioning

confidence: 99%

MSIR@FIRE: A Comprehensive Report from 2013 to 2016

et al. 2020


India is a nation of geographical and cultural diversity where over 1600 dialects are spoken. With technological advancement, internet penetration, and cheaper access to mobile data, India has recently seen a sudden growth in internet users. These Indian internet users generate content either in English or in other vernacular Indian languages. To develop technological solutions for the content generated by Indian users in Indian languages, the Forum for Information Retrieval Evaluation (FIRE) was established and held for the first time in 2008. Although Indian languages are written using indigenous scripts, websites and user-generated content (such as tweets and blogs) in these languages are often written using Roman script due to various socio-cultural and technological reasons. A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. The MSIR track was first introduced in 2013 at FIRE. Its aim was to systematically formalize several research problems that one must solve to tackle code mixing in Web search for users of many languages around the world, develop related data sets and test benches, and most importantly, build a research community focusing on this important problem that has received very little attention. This document is a comprehensive report on the 4 years of the MSIR track evaluated at FIRE between 2013 and 2016.

