Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d16-1099

The Effects of Data Size and Frequency Range on Distributional Semantic Models

Abstract: This paper investigates the effects of data size and frequency range on distributional semantic models. We compare the performance of a number of representative models for several test settings over data of varying sizes, and over test items of various frequency. Our results show that neural network-based models underperform when the data is small, and that the most reliable model over data of varying sizes and frequency ranges is the inverted factorized model.
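As a hedged illustration of the kind of comparison the abstract describes, the sketch below trains a count-based SVD model (LSA-style) and a Skip-gram model on the same tiny corpus and compares one similarity score. It assumes gensim and scikit-learn; the corpus, window, and dimensionality are toy placeholders, and the paper's inverted factorized model is not reproduced here.

```python
# Minimal sketch (not the paper's exact pipeline): count-based SVD vs.
# Skip-gram under a "small data" condition. All data and hyperparameters
# are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for a small training set (< 1M words).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
] * 100

# --- Count-based model: word-word co-occurrence matrix + truncated SVD ---
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if i != j:
                counts[idx[w], idx[s[j]]] += 1

svd = TruncatedSVD(n_components=5, random_state=0)
count_vecs = svd.fit_transform(counts)

# --- Prediction-based model: Skip-gram (sg=1) ---
sg = Word2Vec(sentences, sg=1, vector_size=10, window=2,
              min_count=1, epochs=20, seed=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare how each model relates "cat" and "dog" after small-data training.
print("SVD       cat~dog:", cos(count_vecs[idx["cat"]], count_vecs[idx["dog"]]))
print("Skip-gram cat~dog:", cos(sg.wv["cat"], sg.wv["dog"]))
```

On real data one would score each model against a relatedness benchmark rather than a single word pair; the point of the sketch is only the two-pipeline setup.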

Cited by 71 publications (61 citation statements)
References 12 publications
“…Given that semantic word representations have not been extensively tested and tuned on small datasets such as dream collections, we started by analyzing the models' performance and parameter dependencies on two semantic tests when trained on small datasets. In accordance with Sahlgren and Lenci (2016), we found that LSA outperforms the Skip-gram model when trained on corpora smaller than 1 million words.…”
Section: ukWaC (supporting)
Confidence: 86%
“…While Skip-gram tends to produce better embeddings than LSA when trained on larger corpora, when trained on smaller corpora Skip-gram's performance is considerably lower than LSA's. In accordance with Sahlgren and Lenci's (2016) results, the corpus-size threshold below which LSA outperforms Skip-gram is around one million words.…”
Section: Corpus Size Analysis in Semantic Tests (supporting)
Confidence: 84%
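The threshold claim above comes from sweeping corpus size. A minimal sketch of such a sweep, assuming gensim and using toy data and a single word pair in place of the citing paper's actual benchmarks (the count-based side would follow the SVD pattern shown earlier):

```python
# Hedged sketch of a corpus-size sweep: train Skip-gram on nested
# subsamples of increasing size and track one similarity score.
# Corpus, sizes, and the scored pair are illustrative stand-ins.
from gensim.models import Word2Vec

# Stand-in for a real tokenized corpus loaded elsewhere.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]] * 5000

def subsample(sentences, n_tokens):
    """Take whole sentences until roughly n_tokens words are accumulated."""
    out, total = [], 0
    for s in sentences:
        if total >= n_tokens:
            break
        out.append(s)
        total += len(s)
    return out

# Sweep sizes (toy scale; the reported threshold sits near 1M words).
for size in (1_000, 10_000, 50_000):
    sub = subsample(corpus, size)
    sg = Word2Vec(sub, sg=1, vector_size=50, window=2,
                  min_count=1, epochs=5, seed=0)
    print(size, "tokens -> cat~dog:", sg.wv.similarity("cat", "dog"))
```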
“…Few studies, except Sahlgren and Lenci (2016), have considered this setup in detail. We evaluate one word-based and two character-based embedding models on word relatedness tasks for English and German.…”
Section: Introduction (mentioning)
Confidence: 99%