User Review Sites as a Resource for Large-Scale Sociolinguistic Studies

Hovy, Dirk; Johannsen, Anders; Søgaard, Anders

doi:10.1145/2736277.2741141

Cited by 71 publications

(63 citation statements)

References 19 publications

(22 reference statements)

“…The training data includes three harmonized data-sets: STREUSLE 2.1 (Schneider and Smith, 2015), Ritter and Lowlands Twitter dataset (Johannsen et al, 2014). The test set also consists of three sources: online reviews from the TrustPilot corpus (Hovy et al, 2015), tweets from the Tweebank corpus (Kong et al, 2014) and TED talk transcripts (Cettolo et al, 2012; Neubig et al, 2014). All datasets use the 17 Universal POS categories and the extended BIO scheme from Schneider and Smith, 2015.…”

Section: Data (mentioning)

confidence: 99%

ICL-HD at SemEval-2016 Task 10: Improving the Detection of Minimal Semantic Units and their Meanings with an Ontology and Word Embeddings

Kirilin, Krauss, Versley. 2016

Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

This paper presents our system submitted for SemEval 2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM; Schneider, Hovy, et al., 2016). We extend AMALGrAM (Schneider and Smith, 2015) by tapping two additional information sources. The first information source uses a semantic knowledge base (YAGO3; Suchanek et al., 2007) to improve supersense tagging (SST) for named entities. The second information source employs word embeddings (GloVe; Pennington et al., 2014) to capture fine-grained latent semantics and therefore improving the supersense identification for both nouns and verbs. We conduct a detailed evaluation and error analysis for our features and come to the conclusion that both our extensions lead to an improved detection for SST.

“…For example, data consumers such as service providers and business partners, use textual data to study customers' behaviors, track users' responses to products, advertise more efficiently, and provide personalized services to users according to their needs. Textual data has been used in many tasks such as sentiment analysis, part-of-speech tagging and information extraction and retrieval [31]. Textual data thus has tremendous usages by various data consumers and have become one of the profitable resources for data publisher [1,51].…”

Section: Introduction (mentioning)

confidence: 99%

Privacy Preserving Text Representation Learning

Beigi, Shu, Guo, et al. 2019

Proceedings of the 30th ACM Conference on Hypertext and Social Media

Online users generate tremendous amounts of textual information by participating in different activities, such as writing reviews and sharing tweets. This textual data provides opportunities for researchers and business partners to study and understand individuals. However, this user-generated textual data not only can reveal the identity of the user but also may contain individual's private information (e.g., age, location, gender). Hence, "you are what you write" as the saying goes. Publishing the textual data thus compromises the privacy of individuals who provided it. The need arises for data publishers to protect people's privacy by anonymizing the data before publishing it. It is challenging to design effective anonymization techniques for textual information which minimizes the chances of re-identification and does not contain users' sensitive information (high privacy) while retaining the semantic meaning of the data for given tasks (high utility). In this paper, we study this problem and propose a novel double privacy preserving text representation learning framework, DPText, which learns a textual representation that (1) is differentially private, (2) does not contain private information and (3) retains high utility for the given task. Evaluating on two natural language processing tasks, i.e., sentiment analysis and part of speech tagging, we show the effectiveness of this approach in terms of preserving both privacy and utility.

“…Beyond we verify the presence of status homophily in the Twitter social network our results may inform novel methods to infer socioeconomic status of people from the way they use language. Furthermore, our work, rooted within the web content analysis line of research [19], extends the usual focus on aggregated textual features (like document frequency metrics or embedding methods) to specific linguistic markers, thus enabling sociolinguistics knowledge to inform the data collection process.…”

Section: Introduction (mentioning)

confidence: 99%

Socioeconomic Dependencies of Linguistic Patterns in Twitter

Abitbol, Karsai, Magué, et al. 2018

Proceedings of the 2018 World Wide Web Conference on World Wide Web - WWW '18

Our usage of language is not solely reliant on cognition but is arguably determined by myriad external factors leading to a global variability of linguistic patterns. This issue, which lies at the core of sociolinguistics and is backed by many small-scale studies on face-to-face communication, is addressed here by constructing a dataset combining the largest French Twitter corpus to date with detailed socioeconomic maps obtained from national census in France. We show how key linguistic variables measured in individual Twitter streams depend on factors like socioeconomic status, location, time, and the social network of individuals. We found that (i) people of higher socioeconomic status, active to a greater degree during the daytime, use a more standard language; (ii) the southern part of the country is more prone to use more standard language than the northern one, while locally the used variety or dialect is determined by the spatial distribution of socioeconomic status; and (iii) individuals connected in the social network are closer linguistically than disconnected ones, even after the effects of status homophily have been removed. Our results inform sociolinguistic theory and may inspire novel learning methods for the inference of socioeconomic status of people from the way they tweet.
