Background: Advances in natural language processing (NLP) and computational linguistics have facilitated major improvements on traditional readability formulas that aim at predicting the overall difficulty of a text. Recent studies have identified several types of linguistic features that are theoretically motivated and that predict human judgments of text readability better than traditional readability formulas such as Flesch-Kincaid. The purpose of this study is to develop new readability models using advanced NLP tools to measure both text comprehension and reading speed. Methods: This study used crowdsourcing techniques to collect human judgments of text comprehension and reading speed across a variety of topic domains (science, technology, and history). Linguistic features taken from state-of-the-art NLP tools were used to develop models explaining human judgments of text comprehension and reading speed. The accuracy of these models was then compared with that of classic readability formulas. Results: The results indicated that models employing linguistic features more theoretically related to text comprehension and reading speed outperform classic readability models. Conclusions: This study developed new readability formulas based on advanced NLP tools for both text comprehension and reading speed. These formulas, based on linguistic features that better represent theoretical and behavioural accounts of the reading process, significantly outperformed classic readability formulas.
This paper reports on an approximate or partial replication of a study by Salsbury, Crossley & McNamara (2011) that examined the longitudinal development of a number of core lexical features related to word imageability, concreteness, familiarity, and meaningfulness in a spoken corpus of English second language (L2) learners. Salsbury et al. found no developmental growth patterns for word familiarity but strong growth patterns for word concreteness, imageability, and meaningfulness as a function of time such that L2 learners began to produce more sophisticated words. Salsbury et al. were the first to formally identify this relation between English proficiency and lexical sophistication, and a large number of studies investigating lexical proficiency have cited this article as a foundational study. There were, however, a number of limitations to the Salsbury et al. (2011) study that make it appropriate for replication. First, the sample size was relatively small (six learners sampled six times over the course of a year). In addition, the study did not control for a number of factors important in L2 acquisition studies (e.g., age, proficiency level, gender) and used a statistical technique that averaged group means and did not properly account for individual participant variation. This replication study addresses these limitations, and its findings reflect those reported by Salsbury et al., providing support for the notion that developing L2 lexicons move from the production of words with stronger links to core lexical items to words with weaker links to core lexical items over time. Implications for language learning and teaching are discussed.
A number of longitudinal studies of L2 production have reported frequency effects wherein learners produce more frequent words as a function of time. The current study investigated the spoken output of English L2 learners over a four-month period using both native and non-native English speaker frequency norms for both word types and word tokens. The study also controlled for individual differences such as first language distance, English proficiency, gender, and age. Results demonstrated that lower level L2 learners produced more infrequent tokens at the beginning of the study and that high intermediate learners, when compared to advanced learners, produced more infrequent tokens at the beginning of the study and more frequent tokens toward the end of the study. Main effects were also reported for proficiency level, age, and language distance. These results provide further evidence that L2 production may not follow expected frequency trends (i.e., that more infrequent tokens are produced as a function of time).