Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d16-1099

The Effects of Data Size and Frequency Range on Distributional Semantic Models

Abstract: This paper investigates the effects of data size and frequency range on distributional semantic models. We compare the performance of a number of representative models for several test settings over data of varying sizes, and over test items of various frequency. Our results show that neural network-based models underperform when the data is small, and that the most reliable model over data of varying sizes and frequency ranges is the inverted factorized model.
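As a hedged illustration of the kind of comparison the abstract describes, the sketch below trains a count-based SVD model (LSA-style) and a Skip-gram model on the same tiny corpus and compares one similarity score. It assumes gensim and scikit-learn; the corpus, window, and dimensionality are toy placeholders, and the paper's inverted factorized model is not reproduced here.

```python
# Minimal sketch (not the paper's exact pipeline): count-based SVD vs.
# Skip-gram under a "small data" condition. All data and hyperparameters
# are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for a small training set (< 1M words).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
] * 100

# --- Count-based model: word-word co-occurrence matrix + truncated SVD ---
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if i != j:
                counts[idx[w], idx[s[j]]] += 1

svd = TruncatedSVD(n_components=5, random_state=0)
count_vecs = svd.fit_transform(counts)

# --- Prediction-based model: Skip-gram (sg=1) ---
sg = Word2Vec(sentences, sg=1, vector_size=10, window=2,
              min_count=1, epochs=20, seed=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare how each model relates "cat" and "dog" after small-data training.
print("SVD       cat~dog:", cos(count_vecs[idx["cat"]], count_vecs[idx["dog"]]))
print("Skip-gram cat~dog:", cos(sg.wv["cat"], sg.wv["dog"]))
```

On real data one would score each model against a relatedness benchmark rather than a single word pair; the point of the sketch is only the two-pipeline setup.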

Cited by 71 publications (61 citation statements)
References 12 publications
“…Given that semantic word representations have not been extensively tested and tuned on small datasets such as dream collections, we started by analyzing the models' performance and parameter dependencies on two semantic tests when trained on small datasets. In accordance with Sahlgren and Lenci (2016), we found that LSA outperforms the Skip-gram model when trained on corpora smaller than 1 million words.…”
Section: ukWaC (supporting)
Confidence: 86%
“…While Skip-gram tends to produce better embeddings than LSA when trained on larger corpora, when trained on smaller corpora Skip-gram's performance is considerably lower than LSA's. In accordance with Sahlgren and Lenci's (2016) results, the corpus-size threshold below which LSA outperforms Skip-gram is around one million words.…”
Section: Corpus Size Analysis in Semantic Tests (supporting)
Confidence: 84%
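The threshold claim above comes from sweeping corpus size. A minimal sketch of such a sweep, assuming gensim and using toy data and a single word pair in place of the citing paper's actual benchmarks (the count-based side would follow the SVD pattern shown earlier):

```python
# Hedged sketch of a corpus-size sweep: train Skip-gram on nested
# subsamples of increasing size and track one similarity score.
# Corpus, sizes, and the scored pair are illustrative stand-ins.
from gensim.models import Word2Vec

# Stand-in for a real tokenized corpus loaded elsewhere.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]] * 5000

def subsample(sentences, n_tokens):
    """Take whole sentences until roughly n_tokens words are accumulated."""
    out, total = [], 0
    for s in sentences:
        if total >= n_tokens:
            break
        out.append(s)
        total += len(s)
    return out

# Sweep sizes (toy scale; the reported threshold sits near 1M words).
for size in (1_000, 10_000, 50_000):
    sub = subsample(corpus, size)
    sg = Word2Vec(sub, sg=1, vector_size=50, window=2,
                  min_count=1, epochs=5, seed=0)
    print(size, "tokens -> cat~dog:", sg.wv.similarity("cat", "dog"))
```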
“…Few studies, except Sahlgren and Lenci (2016), have considered this setup in detail. We evaluate one word-based and two character-based embedding models on word relatedness tasks for English and German.…”
Section: Introduction (mentioning)
Confidence: 99%