Proceedings of the 24th International Conference on World Wide Web 2015
DOI: 10.1145/2736277.2741141

User Review Sites as a Resource for Large-Scale Sociolinguistic Studies

Abstract: Sociolinguistic studies investigate the relation between language and extra-linguistic variables. This requires both representative text data and the associated socio-economic meta-data of the subjects. Traditionally, sociolinguistic studies use small samples of hand-curated data and meta-data. This can lead to exaggerated or false conclusions. Using social media data offers a large-scale source of language data, but usually lacks reliable socio-economic meta-data. Our research aims to remedy both problems by e…

Cited by 71 publications (63 citation statements)
References 19 publications (22 reference statements)
“…The training data includes three harmonized data-sets: STREUSLE 2.1 (Schneider and Smith, 2015), Ritter and Lowlands Twitter dataset (Johannsen et al, 2014). The test set also consists of three sources: online reviews from the TrustPilot corpus (Hovy et al, 2015), tweets from the Tweebank corpus (Kong et al, 2014) and TED talk transcripts (Cettolo et al, 2012;Neubig et al, 2014). All datasets use the 17 Universal POS categories and the extended BIO scheme from Schneider and Smith, 2015.…”
Section: Data (mentioning)
confidence: 99%
“…For example, data consumers such as service providers and business partners, use textual data to study customers' behaviors, track users' responses to products, advertise more efficiently, and provide personalized services to users according to their needs. Textual data has been used in many tasks such as sentiment analysis, part-of-speech tagging and information extraction and retrieval [31]. Textual data thus has tremendous usages by various data consumers and have become one of the profitable resources for data publisher [1,51].…”
Section: Introduction (mentioning)
confidence: 99%
“…Beyond we verify the presence of status homophily in the Twitter social network our results may inform novel methods to infer socioeconomic status of people from the way they use language. Furthermore, our work, rooted within the web content analysis line of research [19], extends the usual focus on aggregated textual features (like document frequency metrics or embedding methods) to specific linguistic markers, thus enabling sociolinguistics knowledge to inform the data collection process.…”
Section: Introduction (mentioning)
confidence: 99%