This chapter applies a language design perspective to the lexicon. It reviews and synthesizes a body of work in cognitive science and linguistics that uses ideas from computer science, specifically information theory, to explore how structural features of lexicons can be explained by principles of efficient communication. It pays particular attention to four major properties of lexicons. The first is the structure of word frequency distributions, particularly the Zipfian structure of these distributions and the way that individual semantic spaces are carved up so as to be maximally efficient. The second is the relationship between word frequency and properties like word length and phonotactic probability. The third concerns lexical arbitrariness: the extent to which word forms contain information about their meanings. Finally, the chapter considers how lexicons are structured for child language learning.
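The Zipfian structure of word frequency distributions mentioned above can be illustrated in a few lines of Python. This is a toy example of my own, not code from the chapter: Zipf's law says a word's frequency is roughly inversely proportional to its frequency rank, f(r) ∝ 1/r^a with a ≈ 1 in natural language.

```python
from collections import Counter

# Toy illustration (not from the chapter): count word frequencies in a tiny
# text and sort them by rank to inspect the rank-frequency relationship.
text = (
    "the cat sat on the mat and the dog sat on the log "
    "the cat and the dog saw the mat and the log"
).split()

ranked = Counter(text).most_common()  # (word, frequency), sorted by frequency

# Under an exact Zipfian distribution with a = 1, rank * frequency would be
# constant; real text only approximates this.
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"rank {rank}: {word!r} occurs {freq}x (rank*freq = {rank * freq})")
```

In a corpus of any realistic size, the same loop shows the characteristic long tail: a handful of very frequent function words followed by many words that occur only once or twice.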
This paper presents evidence of a linguistic focus effect on coreference resolution in broad-coverage human sentence processing. While previous work has explored the role of prominence in coreference resolution (Almor, 1999; Foraker and McElree, 2007), these studies used constructed stimuli with specific syntactic patterns (e.g., cleft constructions) that could carry idiosyncratic frequency confounds. This paper tests whether the effect generalizes to coreference resolution in a broad-coverage setting. In particular, the current work proposes several new estimators of prominence appropriate for broad-coverage sentence processing and evaluates them as predictors of reading behavior in the Natural Stories corpus (Futrell, Gibson, Tily, Vishnevetsky, Piantadosi, and Fedorenko, in prep), a collection of "constructed-natural" narratives read by a large number of subjects. Results show a strong facilitation effect for one of these predictors on exploratory data and confirm that it generalizes to held-out data. These results provide broad-coverage support for the hypothesis that coreference resolution is easier when the target entity is focused by discourse properties, resulting in faster reading times.
We describe a mechanism for automatically estimating frequencies of verb subcategorization frames in a large corpus. A tagged corpus is first partially parsed to identify noun phrases and then a regular grammar is used to estimate the appropriate subcategorization frame for each verb token in the corpus. In an experiment involving the identification of six fixed subcategorization frames, our current system showed more than 80% accuracy. In addition, a new statistical method enables the system to learn patterns of errors based on a set of training samples and substantially improves the accuracy of the frequency estimation.
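The pipeline this abstract describes (chunk noun phrases, then match a regular grammar against the material following each verb) might be sketched roughly as follows. The frame inventory, chunk labels, and patterns below are hypothetical placeholders for illustration, not the paper's actual grammar:

```python
import re

# Hypothetical sketch of the described pipeline: the corpus is assumed to be
# already chunked, so the material after each verb token is a sequence of
# chunk/tag labels such as "NP" (noun phrase) or "P" (preposition). A small
# regular grammar over that label sequence assigns a subcategorization frame.
FRAME_PATTERNS = [
    ("NP-NP", re.compile(r"^NP NP")),    # ditransitive: gave [NP] [NP]
    ("NP-PP", re.compile(r"^NP P NP")),  # V NP PP: put [NP] [on NP]
    ("NP",    re.compile(r"^NP")),       # simple transitive
    ("INTR",  re.compile(r"^$")),        # no complement at all
]

def guess_frame(labels_after_verb):
    """Return the first frame whose pattern matches the label sequence."""
    seq = " ".join(labels_after_verb)
    for frame, pattern in FRAME_PATTERNS:
        if pattern.match(seq):
            return frame
    return "UNKNOWN"
```

For example, `guess_frame(["NP", "P", "NP"])` returns `"NP-PP"`. Note that pattern order matters (more specific frames must precede more general ones), and token-level guesses like these are inherently noisy, which is why the abstract's statistical error-learning step is needed to correct the aggregate frequency estimates.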