Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media 2017
DOI: 10.18653/v1/W17-1106

A Twitter Corpus and Benchmark Resources for German Sentiment Analysis

Abstract: In this paper we present SB10k, a new corpus for sentiment analysis with approx. 10,000 German tweets. We use this new corpus and two existing corpora to provide state-of-the-art benchmarks for sentiment analysis in German: we implemented a CNN (based on the winning system of SemEval-2016) and a feature-based SVM and compare their performance on all three corpora. For the CNN, we also created German word embeddings trained on 300M tweets. These word embeddings were then optimized for sentiment analysis using d…
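A common way to pre-train such a CNN or tune tweet embeddings for sentiment is distant supervision: tweets are weakly labeled by the emoticons they contain, yielding a large noisy training set. The sketch below illustrates that general idea only; the emoticon lists, label names, and example tweets are illustrative assumptions, not the paper's actual pipeline.

# Hedged sketch of distant supervision for sentiment: derive noisy labels
# from emoticons so a model can be pre-trained on millions of unlabeled
# tweets. All specifics here are illustrative assumptions.

POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(", ";(")

def weak_label(tweet: str):
    """Return a noisy sentiment label derived from emoticons, or None."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # ambiguous or no emoticon: skip this tweet

print(weak_label("Tolles Spiel heute :)"))   # positive
print(weak_label("Schon wieder Regen :("))   # negative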

Cited by 53 publications (31 citation statements) · References 19 publications
“…Given the computational complexity of identifying particle-verb combinations when the particle appears at a distance, it is highly likely that for split particle verbs, the base verb of the verb-particle combination is processed as if it were a simple verb (e.g., werfe, wirfst, wirft, werfen, and werft: 1st, 2nd, and 3rd person singular and plural present, respectively). As a consequence, the semantic similarity of simple verbs and particle verbs computed from the word embeddings provided by Cieliebak et al. (2017) and Deriu et al. (2017) is in all likelihood larger than it should be. Not all words in the experiment are in this database, but for six words we were able to replace the infinitive by a related form (einpassen → reinpassen, verqualmen → verqualmt, fortlaufen → fortlaufend, bestürzen → bestürzend, verfinstern → verfinstert, beschneien → beschneites).…”
Section: Semantic Vectors From Tweets
confidence: 90%
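The concern in this passage, inflated similarity between simple verbs and their particle verbs, is easy to make concrete. Below is a minimal sketch using gensim; the embedding filename is a hypothetical stand-in, and the assumption that the distributed vectors are in standard word2vec text format may not hold for the actual download.

# Hedged sketch: probe a pre-trained German tweet embedding space with
# gensim (4.x API assumed). The filename is hypothetical.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "german_tweet_embeddings.txt", binary=False
)

# Check vocabulary first: the quote notes some infinitives are missing and
# must be replaced by a related attested form (fortlaufen -> fortlaufend).
for word in ("laufen", "fortlaufen", "fortlaufend"):
    print(word, word in vectors.key_to_index)

if "laufen" in vectors.key_to_index and "fortlaufend" in vectors.key_to_index:
    # If split occurrences ("laufen ... fort") were indexed as plain
    # "laufen" during training, the particle verb's vector inherits
    # simple-verb contexts, so this cosine similarity is likely
    # overestimated, as the quoted passage argues.
    print(vectors.similarity("laufen", "fortlaufend"))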
“…As LDL-based semantic vectors for German are currently under construction, we fell back on the word embeddings (semantic vectors) provided at http://www.spinningbytes.com/resources/wordembeddings/ (Cieliebak et al., 2017; Deriu et al., 2017). These embeddings (obtained with word2vec; Mikolov et al., 2013) are 300-dimensional vectors derived from a 50 million word corpus of German tweets.…”
Section: Semantic Vectors From Tweets
confidence: 99%
“…To test whether the above results are specific to the NDL-based semantic vectors that we used, a separate analysis was carried out by using a different algorithm for constructing semantic vectors, applied to a different language register. We downloaded the word embeddings from https://www.spinningbytes.com/resources/wordembeddings/ (Cieliebak et al., 2017; Deriu et al., 2017). These embeddings are 200-dimensional vectors, which were trained with Word2Vec on 200 million tweets. A summary of the GAM fitted to the acoustic durations is provided in Table 9.…”
Section: Appendix: LDL With Tweet-Based Word2Vec Embeddings
confidence: 99%
“…We built 100-dimensional word embeddings from CODE ALLTAG XL (Krieg-Holz et al., 2016) using WORD2VEC (Mikolov et al., 2013) for all words occurring at least 3 times in CODE ALLTAG XL. Furthermore, we employed WORD2VEC word embeddings from Reimers et al. (2014) with a minimum word frequency of 5 and 100 dimensions (UKP), 300-dimensional FASTTEXT word embeddings from SPINNINGBYTES (Cieliebak et al., 2017) trained on German tweets (TWITTER) and, finally, FASTTEXT word embeddings (Grave et al., 2018) based on COMMON CRAWL and WIKIPEDIA (FASTTEXT). We also tried to utilize embeddings generated from the German TWITTER HATESPEECH corpora from Ross et al. (2016) and Wiegand et al. (2018b) under the assumption that they might contain a large number of rough and vulgar words.…”
Section: Regression Models
confidence: 99%
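The first step this quote describes, building 100-dimensional WORD2VEC embeddings over all words occurring at least 3 times, maps directly onto standard tooling. A minimal sketch with gensim follows; the corpus filename and whitespace tokenization are hypothetical stand-ins, not the cited authors' actual pipeline.

# Hedged sketch of the kind of training run the quote describes:
# 100-dimensional word2vec embeddings, keeping only words that occur at
# least 3 times. Corpus file and tokenization are hypothetical.
from gensim.models import Word2Vec

with open("code_alltag_xl.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]  # naive whitespace tokens

model = Word2Vec(
    sentences,
    vector_size=100,  # 100-dimensional embeddings, as in the quote
    min_count=3,      # drop words occurring fewer than 3 times
    workers=4,
)
model.wv.save("code_alltag_xl_100d.kv")  # keep just the vectors for reuse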