Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science 2019
DOI: 10.18653/v1/w19-2103

Tweet Classification without the Tweet: An Empirical Examination of User versus Document Attributes

Abstract: NLP naturally puts a primary focus on leveraging document language, occasionally considering user attributes as supplemental. However, as we tackle more social scientific tasks, it is possible user attributes might be of primary importance and the document supplemental. Here, we systematically investigate the predictive power of user-level features alone versus document-level features for document-level tasks. We first show user attributes can sometimes carry more task-related information than the document itself…
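The comparison the abstract describes can be sketched concretely: fit one classifier that only sees the tweet text and another that only sees user-level attributes, then score both on the same held-out documents. The snippet below is a minimal illustration on synthetic data, not the authors' pipeline; the feature choices (TF-IDF for the document, a fixed-length user-attribute vector) and the logistic regression models are assumptions made for the example.

```python
# Sketch: document-only vs. user-only classifiers for a document-level task.
# All data here is synthetic; feature choices are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 200 synthetic tweets with a binary label; each tweet also comes with a
# 20-dimensional user-attribute vector (e.g., features from the author's history).
docs, user_feats, labels = [], [], []
for _ in range(200):
    y = int(rng.integers(0, 2))
    topic = "policy" if y else "weather"
    docs.append(f"talking about {topic} today")
    user_feats.append(rng.normal(loc=y, scale=1.0, size=20))  # user signal correlates with label
    labels.append(y)
user_feats = np.array(user_feats)
labels = np.array(labels)

idx_train, idx_test = train_test_split(np.arange(len(docs)), test_size=0.3, random_state=0)

# Document-only model: TF-IDF over the labeled tweet text.
vec = TfidfVectorizer()
X_doc_train = vec.fit_transform([docs[i] for i in idx_train])
X_doc_test = vec.transform([docs[i] for i in idx_test])
doc_clf = LogisticRegression(max_iter=1000).fit(X_doc_train, labels[idx_train])

# User-only model: user attributes alone, with no access to the labeled tweet.
usr_clf = LogisticRegression(max_iter=1000).fit(user_feats[idx_train], labels[idx_train])

print("document-only weighted F1:",
      f1_score(labels[idx_test], doc_clf.predict(X_doc_test), average="weighted"))
print("user-only weighted F1:",
      f1_score(labels[idx_test], usr_clf.predict(user_feats[idx_test]), average="weighted"))
```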

Cited by 18 publications (19 citation statements); references 32 publications.
“…We also include the top participant from the shared task, Zarrella and Marsh (2016), which uses a different F1 score as defined for the shared task, referred to here as SemEval F1. Lastly, we compare our results to the approach of Lynn et al. (2019), from whom we received the extended history dataset, which uses the labeled tweet and a list of accounts the author follows. However, they only report the weighted-F1 score for their best-performing model.…”
Section: Results (mentioning, confidence 99%)
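The two metrics contrasted in the statement above are easy to confuse. As a toy illustration (made-up gold and predicted labels, not data from either paper): the SemEval-2016 Task 6 stance metric averages the F1 of only the FAVOR and AGAINST classes, whereas weighted F1 averages the F1 of all classes weighted by their support, so the two can diverge noticeably when the NONE class is frequent.

```python
# Toy contrast of the SemEval-style stance F1 and weighted F1 (labels invented).
from sklearn.metrics import f1_score

LABELS = ["FAVOR", "AGAINST", "NONE"]
gold = ["FAVOR", "AGAINST", "NONE", "AGAINST", "FAVOR", "NONE", "AGAINST"]
pred = ["FAVOR", "AGAINST", "AGAINST", "AGAINST", "NONE", "NONE", "FAVOR"]

# Per-class F1, in the order given by LABELS.
per_class = f1_score(gold, pred, labels=LABELS, average=None)

# SemEval-style stance F1: mean of F1(FAVOR) and F1(AGAINST); NONE is not averaged in.
semeval_f1 = (per_class[0] + per_class[1]) / 2

# Weighted F1: every class contributes, weighted by its support in the gold labels.
weighted_f1 = f1_score(gold, pred, average="weighted")

print(dict(zip(LABELS, per_class.round(3))))
print(f"SemEval F1:  {semeval_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```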
“…In total we have 3,021 instances, with a split of 1,658 train, 418 dev, and 945 test across all targets. The original 2016 shared task had 4,100 instances; however, due to accounts or messages being deleted over time, we were unable to replicate the complete original dataset and instead used the smaller version available from Lynn et al. (2019)…”
Section: Appendix A (mentioning, confidence 99%)
“…Our work is aligned with a growing set of methods to embed language processing within the social and human contexts in which they are applied (Lynn et al., 2019). Most similar is the work on language generation or dialog agents (i.e.…”
Section: Related Work (mentioning, confidence 99%)
“…Such tasks present an interesting challenge for the NLP community to model the people behind the language rather than the language itself, and the social scientific community has begun to see success of such approaches as an alternative or supplement to standard psychological assessment techniques like questionnaires (Kern et al., 2016; Eichstaedt et al., 2018). Generally, such work is helping to embed NLP in a greater social and human context (Hovy and Spruit, 2016; Lynn et al., 2019).…”
Section: Introduction (mentioning, confidence 99%)