2018
DOI: 10.31234/osf.io/g9j83
Preprint

Semantic representations extracted from large language corpora predict high-level human judgment in seven diverse behavioral domains

Abstract: Recent advances in machine learning, combined with the increased availability of large natural language datasets, have made it possible to uncover semantic representations that characterize what people know about and associate with a wide range of objects and concepts. In this paper, we examine the power of word embeddings, a popular approach for uncovering semantic representations, for studying high-level human judgment. Word embeddings are typically applied to linguistic and semantic tasks; however, we show t…

Citations: cited by 6 publications (12 citation statements)
References: 21 publications (31 reference statements)
“…For example, among the 4436 words in the feature norms of Buchanan et al. (2019), participants listed on average 15 features per word. Similarly, Richie, Zou, and Bhatia (2018) showed that predictive accuracy of models regressing numerical semantic judgments about words (e.g., tastiness of foods) onto high-dimensional word embedding-based vector representations of words only started to plateau with about ten principal components (within, e.g., the embeddings for the set of foods), suggesting that people were making their semantic judgments on the basis of many (≫2) dimensions of the Fig. 1.…”
Section: The Spatial Arrangement Methods for Measuring Similarity (mentioning)
confidence: 92%
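As a concrete illustration of the analysis summarized in this excerpt, the sketch below regresses per-word judgment ratings onto the leading principal components of word embeddings and tracks cross-validated fit as components are added. The embeddings and ratings are random placeholders, and the use of ridge regression with 5-fold cross-validation is an assumption made for illustration, not the cited authors' exact pipeline.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: one 300-d embedding per food word and one mean
# human "tastiness" rating per word (random placeholders, not real data).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 300))   # 60 foods x 300 dimensions
ratings = rng.normal(size=60)             # one mean judgment per food

# Cross-validated R^2 as a function of the number of principal
# components retained within this category's embeddings.
for k in (1, 2, 5, 10, 20, 40):
    pcs = PCA(n_components=k).fit_transform(embeddings)
    r2 = cross_val_score(RidgeCV(), pcs, ratings, cv=5, scoring="r2").mean()
    print(f"{k:>2} components: mean CV R^2 = {r2:.3f}")

With real embeddings and ratings, the reported fit would be expected to rise and then level off at roughly the dimensionality the excerpt describes.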
“…The CSLB norms used in Sommerauer and Fokkens's (2018) study contained only perceptual, taxonomic, and other properties, mainly for concrete concepts. By contrast, most of the properties analyzed in Grand et al. (2018) and Richie et al. (2019) were abstract. Therefore, in these studies, concrete and abstract properties were not compared together, and it thus remains unclear which properties can be captured better than others by DSMs.…”
Section: Related Work (mentioning)
confidence: 94%
“…Using GloVe vectors, they revealed that abstract properties, such as gender and danger, were significantly predicted across all categories, and even perceptual properties, such as size, were captured for some relevant categories. Richie et al. (2019) also showed that 14 properties, most of which are abstract (e.g., competent and sincere), were predicted by skip‐gram vectors.…”
Section: Related Work (mentioning)
confidence: 99%
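To make the per-property prediction setup concrete, here is a minimal leave-one-out sketch that scores how well embeddings predict ratings for several properties. The property names, ratings, and embeddings are hypothetical placeholders, and the choice of ridge regression scored by a predicted-observed correlation is an assumption rather than the exact method of the cited studies.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(50, 300))   # placeholder skip-gram-style vectors
properties = {                            # hypothetical mean ratings per word
    "competent": rng.normal(size=50),
    "sincere": rng.normal(size=50),
    "dangerous": rng.normal(size=50),
}

# Leave-one-out predictions for each property, scored by the correlation
# between predicted and observed ratings.
for name, y in properties.items():
    preds = np.empty_like(y)
    for train, test in LeaveOneOut().split(embeddings):
        model = Ridge(alpha=1.0).fit(embeddings[train], y[train])
        preds[test] = model.predict(embeddings[test])
    r = np.corrcoef(preds, y)[0, 1]
    print(f"{name}: leave-one-out r = {r:.2f}")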
“…(vehicles not in the testing set). By contrast, prior work using projection techniques to predict feature ratings from embedding spaces (Grand et al., 2018; Richie et al., 2019) has used adjectives as endpoints, ignoring the potential influence of domain-level semantic context on similarity judgments (e.g., "size" was defined as a vector from "small," "tiny," "minuscule" to "large," "huge," "giant," regardless of semantic context). However, as we argued above, feature ratings may be impacted by semantic context much as, and perhaps for the same reasons as, similarity judgments.…”
Section: Experiments 2: Contextual Projection Captures Reliable Infor… (mentioning)
confidence: 99%
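The adjective-endpoint projection described in this excerpt can be sketched directly: average the vectors of each antonym pole, take their difference as a feature axis (e.g., "size"), and project each concept embedding onto that axis. The word list and the random lookup table below are placeholders standing in for real GloVe or skip-gram vectors.

import numpy as np

# Placeholder lookup: word -> 300-d vector; values are random stand-ins
# for pretrained embeddings such as GloVe.
rng = np.random.default_rng(2)
vocab = ["small", "tiny", "minuscule", "large", "huge", "giant",
         "mouse", "elephant", "car"]
vectors = {w: rng.normal(size=300) for w in vocab}

def feature_axis(neg_words, pos_words):
    """Axis = mean(positive-pole vectors) - mean(negative-pole vectors), unit length."""
    neg = np.mean([vectors[w] for w in neg_words], axis=0)
    pos = np.mean([vectors[w] for w in pos_words], axis=0)
    d = pos - neg
    return d / np.linalg.norm(d)

size_axis = feature_axis(["small", "tiny", "minuscule"], ["large", "huge", "giant"])

# Projection score: dot product of each concept vector with the feature axis.
for concept in ["mouse", "elephant", "car"]:
    score = vectors[concept] @ size_axis
    print(f"{concept}: size projection = {score:+.2f}")

As the excerpt points out, this construction fixes the axis once for all domains; a context-sensitive variant would rebuild the endpoints within each semantic category.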