A weakly supervised multivariate approach to the study of language variation

Diwersy, Sascha; Evert, Stefan; Neumann, Stella

doi:10.1515/9783110317558.174

Cited by 24 publications

(10 citation statements)

References 0 publications


“…LI for closely-related languages, language varieties, and dialects has been studied for Malay-Indonesian (Ranaivo-Malançon, 2006), Indian languages (Murthy and Kumar, 2006), South Slavic languages (Ljubešić et al, 2007; Tiedemann and Ljubešić, 2012; Kranjcić, 2014, 2015), Serbo-Croatian dialects (Zecevic and Vujicic-Stankovic, 2013), English varieties (Lui and Cook, 2013; Simaki et al, 2017), Dutch-Flemish (van der Lee and Bosch, 2017), Dutch dialects (including a temporal dimension) (Trieschnigg et al, 2012), German Dialects (Hollenstein and Aepli, 2015), Mainland-Singaporean-Taiwanese Chinese (Huang and Lee, 2008), Portuguese varieties (Zampieri and Gebre, 2012), Spanish varieties (Maier and Gómez-Rodríguez, 2014), French varieties (Mokhov, 2010a,b; Diwersy et al, 2014), languages of the Iberian Peninsula, Romanian dialects (Ciobanu and Dinu, 2016), and Arabic dialects (Zaidan and Callison-Burch, 2014; Tillmann et al, 2014; Sadat et al, 2014b; Wray, 2018), the last of which we discuss in more detail in this section. As to off-the-shelf tools which can identify closely-related languages, Zampieri and Gebre (2014) released a LI system trained to identify 27 languages, including 10 language varieties.…”

Section: Similar Languages Language Varieties and Dialects (mentioning)

confidence: 99%

“…by substituting named entities or content words by placeholders), or at a higher level of abstraction, using POS tags or other morphosyntactic information (Lui et al, 2014b; Bestgen, 2017), or even adversarial machine learning to modify the learned representations to remove such artefacts (Li et al, 2018). Finally, an interesting research direction could be to combine work on closely-related languages with the analysis of regional or dialectal differences in language use (Peirsman et al, 2010; Anstein, 2013; Doyle, 2014; Diwersy et al, 2014).…”

Section: Similar Languages Language Varieties and Dialects (mentioning)

confidence: 99%


Automatic Language Identification in Texts: A Survey

Jauhiainen, Lui, Zampieri et al. 2019. JAIR

102


Language identification ("LI") is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

LI as a task predates computational methods; the earliest interest in the area was motivated by the needs of translators, and simple manual methods were developed to quickly identify documents in specific languages. The earliest known work to describe a functional LI program for text is by Mustonen (1965), a statistician, who used multiple discriminant analysis to teach a computer how to distinguish, at the word level, between English, Swedish and Finnish. Mustonen compiled a list of linguistically-motivated character-based features, and trained his language identifier on 300 words for each of the three target languages. The training procedure created two discriminant functions, which were tested with 100 words for each language. The experiment resulted in 76% of the words being correctly classified; even by current standards this percentage would be seen as acceptable given the small amount of training material, although the composition of training and test data is not clear, making the experiment unreproducible.

In the early 1970s, Nakamura (1971) considered the problem of automatic LI. According to Rau (1974) and the available abstract of Nakamura's article, his language identifier was able to distinguish between 25 languages written with the Latin alphabet. As features, the method used the occurrence rates of characters and words in each language. From the abstract it seems that, in addition to the frequencies, he used some binary presence/absence features of particular characters or words, based on manual LI. Rau (1974) wrote his master's thesis "Language Identification by Statistical Analysis" for the Naval Postgraduate School at Monterey, California. The continued interest and the need to use LI of text in military intelligence settings is evidenced by the recent articles of, for example, Rafidha Rehiman et al. (2013), Rowe et al. (2013), and Voss et al. (2014). As features for LI, Rau (1974) used, e.g., the relative frequencies of characters and character bigrams. With a majority vote classifier ensemble of seven classifiers using Kolmogorov-Smirnov's Test of Goodness of Fit and Yule's characteristic (K), he managed...
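The abstract above notes that Rau (1974) used the relative frequencies of characters and character bigrams as features. The following is a minimal sketch of that kind of feature only, not a reconstruction of Rau's procedure; the function name `char_bigram_rel_freqs` and the lowercasing step are assumptions made for illustration.

```python
from collections import Counter


def char_bigram_rel_freqs(text):
    """Map each overlapping character bigram to its relative frequency.

    Hypothetical helper: lowercasing is an illustrative choice, not
    something stated in the abstract.
    """
    text = text.lower()
    # Overlapping bigrams: "banana" -> ba, an, na, an, na
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    total = len(bigrams)
    if total == 0:
        return {}
    counts = Counter(bigrams)
    return {bigram: n / total for bigram, n in counts.items()}


freqs = char_bigram_rel_freqs("banana")
print(sorted(freqs.items()))
```

Profiles of this kind, computed per language from training text, could then be compared against the profile of an unseen document by whatever classifier one chooses.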


“…There have been studies that went beyond lexical features in an attempt to capture some of the abstract systemic differences between similar languages using linguistically motivated features. This includes the use of semi-delexicalized text representations in which named entities or content words are replaced by placeholders, or fully de-lexicalized representations using POS tags and other morphosyntactic information (Zampieri, Gebre, and Diwersy 2013; Diwersy, Evert, and Neumann 2014; Bestgen 2017).…”

Section: Language and Dialect Identification (mentioning)

confidence: 99%
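The semi-delexicalized representation described in the statement above can be illustrated with a small sketch. This is an assumption-laden example, not the method of any cited paper: the function `delexicalize`, the placeholder string `NE`, and the hand-supplied entity list are all hypothetical, and a real system would obtain named entities from a tagger rather than a fixed set.

```python
def delexicalize(tokens, named_entities, placeholder="NE"):
    """Replace tokens that match a known named entity with a placeholder.

    Hypothetical helper: matching is case-insensitive and works on
    single tokens only, which is a simplification.
    """
    known = {entity.lower() for entity in named_entities}
    return [placeholder if tok.lower() in known else tok for tok in tokens]


sentence = ["Le", "Sénégal", "gagne", "le", "match"]
print(delexicalize(sentence, {"Sénégal"}))
```

Masking country names and similar proper nouns in this way keeps a classifier from relying on them, so that whatever it learns is more likely to reflect differences in language use.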

“…Language identification was studied for closely related languages such as Malay-Indonesian (Ranaivo-Malançon 2006), South Slavic languages (Ljubešić, Mikelić, and Boras 2007; Tiedemann and Ljubešić 2012), and languages of the Iberian Peninsula (Zubiaga et al 2014). It was also applied to national varieties of English (Lui and Cook 2013; Simaki et al 2017), French (Mokhov 2010; Diwersy et al 2014), Chinese (Huang and Lee 2008), and Portuguese (Zampieri and Gebre 2012; Zampieri et al 2016), as well as to dialects of Romanian (Ciobanu and Dinu 2016), Arabic (Elfardy and Diab 2013; Zaidan and Callison-Burch 2014; Tillmann, Al-Onaizan, and Mansour 2014; Sadat, Kazemi, and Farzindar 2014; Wray 2018), and German (Hollenstein and Aepli 2015). The VarDial shared tasks included the languages in the DSLCC, as well as Chinese varieties, Dutch and Flemish, dialects of Arabic, Romanian, and German, and many others.…”

Section: Language and Dialect Identification (mentioning)

confidence: 99%

Natural language processing for similar languages, varieties, and dialects: A survey

Zampieri, Nakov, Scherrer 2020. Nat. Lang. Eng.


There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.


“…One possible confounding factor is the topicality of the training data: if the data for each variety is drawn from different datasets, it is possible that a classifier will simply learn the topical differences between datasets. Diwersy et al (2014) carried out a study of colligations in French varieties, where the variation in the grammatical function of noun lemmas was studied across French-language newspapers from six countries. In their initial analysis they found that the characteristic features of each country included the name of the country and other country-specific proper nouns, which resulted in near 100% classification accuracy but do not provide any insight into national varieties from a linguistic perspective.…”

Section: De-lexicalized Text Representation For DSL (mentioning)

confidence: 99%

Exploring Methods and Resources for Discriminating Similar Languages

Lui, Letcher, Adams et al. 2014

Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects


The Discriminating between Similar Languages (DSL) shared task at VarDial challenged participants to build an automatic language identification system to discriminate between 13 languages in 6 groups of highly-similar languages (or national varieties of the same language). In this paper, we describe the submissions made by team UniMelb-NLP, which took part in both the closed and open categories. We present the text representations and modeling techniques used, including cross-lingual POS tagging as well as fine-grained tags extracted from a deep grammar of English, and discuss additional data we collected for the open submissions, utilizing custom-built web corpora based on top-level domains as well as existing corpora.

