Native Language Identification with User Generated Content

Goldin, Gili; Rabinovich, Ella; Wintner, Shuly

doi:10.18653/v1/d18-1395

Cited by 18 publications

(39 citation statements)

References 27 publications

(18 reference statements)

“…We create a dataset of sentences from comments by users who self-identify as being from L1 English countries, as well as a set of comments by users who self-identify as being from Russia. These datasets are constructed using similar methodology to recent work in native language identification [13]. This test is used to demonstrate the tendency of each model to generate more false positives when considering English comments written by users who speak Russian as a first language, as opposed to English native speakers.…”

Section: Methods (mentioning)

confidence: 99%
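
The comparison described in this statement can be sketched as a difference in false positive rates between the two groups of users. This is a minimal sketch, assuming every comment in both sets is benign, so any flagged comment counts as a false positive. The function name and the prediction lists are hypothetical and the numbers are made up for illustration; they are not results from the citing paper.

```python
def false_positive_rate(predictions):
    """Share of benign comments flagged as suspect (True = flagged)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(1 for flagged in predictions if flagged) / len(predictions)


# Made-up predictions for illustration only; not results from the paper.
english_l1 = [False, False, True, False]
russian_l1 = [True, False, True, False]

# A positive gap means the model flags Russian-L1 writers more often.
gap = false_positive_rate(russian_l1) - false_positive_rate(english_l1)
print(gap)
```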

Towards Ethical Content-Based Detection Of Online Influence Campaigns

Crothers

Japkowicz

Viktor

2019

2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP)

The detection of clandestine efforts to influence users in online communities is a challenging problem with significant active development. We demonstrate that features derived from the text of user comments are useful for identifying suspect activity, but lead to increased erroneous identifications (false positive classifications) when keywords over-represented in past influence campaigns are present. Drawing on research in native language identification (NLI), we use "named entity masking" (NEM) to create sentence features robust to this shortcoming, while maintaining comparable classification accuracy. We demonstrate that while NEM consistently reduces false positives when key named entities are mentioned, both masked and unmasked models exhibit increased false positive rates on English sentences by Russian native speakers, raising ethical considerations that should be addressed in future research.

Index Terms: influence campaign detection, native language identification, algorithmic bias, natural language processing, bidirectional encoder representations from transformers (BERT).
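
The idea behind named entity masking can be illustrated with a short sketch. The abstract does not say how entities are detected, so this example assumes they are supplied as a plain list of strings; a real pipeline would likely use a named entity recognizer. The function name and the `[ENT]` placeholder are hypothetical.

```python
import re


def mask_named_entities(sentence, entities, placeholder="[ENT]"):
    """Replace each listed entity mention with a placeholder token."""
    if not entities:
        return sentence
    # Try longer names first so "New York City" is matched as a whole
    # rather than as "New York" followed by a leftover "City".
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(placeholder, sentence)


masked = mask_named_entities("Putin said Russia will respond.", ["Russia", "Putin"])
print(masked)  # [ENT] said [ENT] will respond.
```

With the entity names removed, a classifier trained on the masked text cannot rely on those keywords alone.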

“…Reddit has been the data source for past work on Native-Language Identification (NLI) on sophisticated second-language speakers [11] [13]. This work entailed the creation of datasets of Reddit comments from users of a variety of different languages by looking for self-identified "flair" in European subreddits.…”

Section: Corpus III: Augmented L2 Reddit Dataset (mentioning)

confidence: 99%

Towards Ethical Content-Based Detection Of Online Influence Campaigns

Crothers

Japkowicz

Viktor

2019

2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP)

“…Linear classifier with content-independent features (LR). Replicating Goldin et al (2018), we trained a logistic regression classifier with three types of features: function words, POS trigrams, and sentence length, all of which are reflective of the style of writing. We deliberately avoided using content features (e.g., word frequencies).…”

Section: Baselines (mentioning)

confidence: 99%
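
The three feature types named in the baseline statement above (function words, POS trigrams, sentence length) can be sketched as a feature extractor. This is a minimal sketch, not the cited authors' code. It assumes the input is already POS-tagged, the function-word list is a tiny hypothetical stand-in for a real one, and the feature names are invented for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list; real lists are much longer.
FUNCTION_WORDS = ("the", "a", "of", "in", "to", "and", "that", "is")


def style_features(tagged_sentence):
    """Map one POS-tagged sentence [(token, tag), ...] to a feature dict."""
    tokens = [tok.lower() for tok, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    n = len(tokens)

    # Sentence length in tokens.
    features = {"sent_len": float(n)}

    # Relative frequency of each function word.
    counts = Counter(tokens)
    for fw in FUNCTION_WORDS:
        features["fw=" + fw] = counts[fw] / n if n else 0.0

    # Counts of consecutive POS tag triples.
    for i in range(n - 2):
        key = "pos3=" + "_".join(tags[i:i + 3])
        features[key] = features.get(key, 0.0) + 1.0
    return features


tagged = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]
print(style_features(tagged)["sent_len"])  # 6.0
```

None of these features records which content words appear, which matches the statement's point about avoiding content features. Dicts like these could then be vectorized and passed to any logistic regression implementation, for example scikit-learn's `DictVectorizer` followed by `LogisticRegression`.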

Topics to Avoid: Demoting Latent Confounds in Text Classification

Kumar

Wintner

Smith

et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

Self Cite

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author's native language is Swedish). We propose a method that represents the latent topical confounds and a model which "unlearns" confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.

“…This work considers the problem of learning to compare users on social media. A related task which has received considerably more attention is predicting user attributes (Han et al, 2014; Sap et al, 2014; Dredze et al, 2013; Culotta et al, 2015; Volkova et al, 2015; Goldin et al, 2018). The inferred user attributes have proven useful for social science and public health research (Mislove et al, 2011; Morgan-Lopez et al, 2017).…”

Section: Related Work (mentioning)

confidence: 99%

Learning Invariant Representations of Social Media Users

Andrews

Bishop

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

The evolution of social media users' behavior over time complicates user-level comparison tasks such as verification, classification, clustering, and ranking. As a result, naïve approaches may fail to generalize to new users or even to future observations of previously known users. In this paper, we propose a novel procedure to learn a mapping from short episodes of user activity on social media to a vector space in which the distance between points captures the similarity of the corresponding users' invariant features. We fit the model by optimizing a surrogate metric learning objective over a large corpus of unlabeled social media content. Once learned, the mapping may be applied to users not seen at training time and enables efficient comparisons of users in the resulting vector space. We present a comprehensive evaluation to validate the benefits of the proposed approach using data from Reddit, Twitter, and Wikipedia.
