The influence of word familiarity on word recognition is well established in the literature. Familiarity can be measured in a number of ways, typically in the form of written frequency, subjective ratings, or age of acquisition. In general, words that are more familiar are recognized more rapidly than words that are less familiar (e.g., word frequency: Alegre & Gordon, 1999; Connine & Mullennix, 1990; age of acquisition: Dewhurst, Hitch, & Barry, 1998; Gerhand & Barry, 1998). When other factors such as word length are held constant, high frequency words or words acquired at an earlier age are recognized faster than low frequency words or words acquired at a later age. It should be noted that the earliest words acquired are typically learned through conversation. Moreover, throughout a lifetime, most individuals (presumably) encounter words more often in speech than in text, which highlights the importance of suitable spoken counts for analyzing speech-based word familiarity. Although written frequency counts are readily available (most notably, Francis & Kučera, 1982; Kučera & Francis, 1967), few (if any) spoken counts exist for American English. The present paper reports the construction of a 1.6 million word spoken frequency database tagged for speaker attributes such as gender and age.
The use of spoken word frequency counts is conspicuously absent in the literature, presumably due to a lack of appropriate frequency counts for American English. The most notable spoken frequency database is based on a British English corpus of 190,000 words (including 10,630 different words) that were recorded without the direct knowledge of the speakers (Brown, 1984).
Given the inherent difficulty of speech transcription for the purpose of generating spoken counts, the Brown (1984) corpus is commendable, although still considerably smaller in scope than typical written frequency databases (e.g., Kučera & Francis, 1967, collected over 1 million words representative of over 40,000 different words). The discrepancy in scope between spoken and written counts, in conjunction with the absence of a large-scale spoken frequency database in American English, motivated the construction of a new spoken frequency database.
Spoken English Corpus
Our spoken frequency counts were derived from the Michigan Corpus of Academic Spoken English (MICASE). The corpus is available online and includes 152 transcriptions of lectures, meetings, advisement sessions, public addresses, and other educational conversations recorded at the University of Michigan (Simpson, Swales, & Briggs, 2002). On average, each of the 152 transcriptions contains approximately 11,000 words spoken by students, faculty, and other staff members in a variety of academic fields. The speakers varied in age and gender; the majority were educated native speakers of American English, with a small percentage of nonnative speakers. In total, the transcripts derive from approximately 190 hours of recordings made between 1997 and 2001. Further info...