Proceedings of the 28th Annual ACM Symposium on Applied Computing 2013
DOI: 10.1145/2480362.2480535

Determining language variant in microblog messages

Cited by 11 publications (7 citation statements)
References 11 publications
“…The POS-tag patterns were then used by a neural network to indicate whether the sentences were written in Malay or not. Laboreiro et al (2013) used Jspell to detect differences in the grammar of Portuguese variants. Bekavac et al (2014) used a syntactic grammar to recognize verb-da-verb constructions, which are characteristic of the Serbian language.…”
Section: Statistics Of Words Van Der Lee and Bosch (mentioning)
confidence: 99%
“…However, Brazil has ∼20 times the population of Portugal, which may explain Brazil having a greater number of publications in the research area. Therefore, if we analyse data from Twitter, one of the most widely used micro‐blogs for text mining, choosing a random tweet in Portuguese, there is a 95% chance of it originating in Brazil [29]. Facebook and Twitter are important sources of UGC; however, the former is less used in text mining as it often contains non‐text data, e.g.…”
Section: Discussion (mentioning)
confidence: 99%
“…Therefore, if we analyse data from Twitter, one of the most widely used micro-blogs for text mining, choosing a random tweet in Portuguese, there is a 95% chance of it originating in Brazil [29]. Facebook and Twitter are important sources of UGC; however, the former is less used in text mining as it often contains non-text data, e.g.…”
Others: Co-training algorithm, behaviour knowledge space, blocking doubtful predictions, conditional random fields, Dijkstra's shortest-path algorithm, entropy maximisation, expectation-maximisation, filtered space saving algorithm, Hill climbing algorithm, Hoeffding adaptive trees, active classifier, incremental lazy associative classifier, incremental model maintenance, JRip, Kaldor-Hicks-efficient selective sampling, linear regression, LocalMaxs algorithm, logic programming, logistic regression, map-reduce paradigm, noise-contrastive estimation, OneR classification algorithm, Online rule extraction, Pareto-efficient selective sampling, structural risk minimisation.
Section: Main Findings (mentioning)
confidence: 99%
“…Laboreiro et al [35] used a Bayesian classifier to distinguish between European and Brazilian variants of tweets written in Portuguese language, achieving 95% accuracy. Winkelmolen and Mascardi [72] also describe a Bayesian classifier that performs well on very short texts and made experiments on film subtitles in 22 languages.…”
Section: Tweets/short Messages (mentioning)
confidence: 99%
“…However, the emergence of social media and the chatspeak employed by its users has brought about new previously unseen issues that need to be studied in order to deal with these kinds of texts. Three key issues posited in the literature [63,24,69] and that, as of today, cannot be considered solved include: (i) distinguishing similar languages [76], (ii) dealing with multilingual documents [43], and (iii) language identification for short texts [6,10,35,20,70,52]. The shared task organized at TweetLID has considered these three unresolved issues, and has enabled participants to compare the performance of their systems in these situations.…”
Section: Challenges (mentioning)
confidence: 99%