Purpose
Corpus analyses of spontaneous language fragments of varying length provide useful insights into the language change caused by brain damage, such as that caused by some forms of dementia. Sample size is an important experimental parameter to consider when designing studies that analyze spontaneous language, because sample length influences the confidence levels of the analyses. Machine learning approaches often favor using as much language as is available, whereas language evaluation in a clinical setting is often based on truncated samples to minimize annotation labor and to limit any discomfort for participants. This article investigates, using Bayesian estimation of machine-learned models, what the ideal text length should be to minimize model uncertainty.
Method
We use the Stanford parser to extract linguistic variables and train a statistical model to distinguish samples by speakers with no brain damage from samples by speakers with probable Alzheimer's disease. We compare the results to previously published models that used CLAN for linguistic analysis.
Results
The uncertainty around six individual variables and its relation to sample length are reported. A model that includes the linguistic variables, the same model used in all three experiments, predicts group membership better than a model without them. One variable (concept density) is more informative when measured using the Stanford tools than when measured using CLAN.
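As a rough illustration of why uncertainty around a measured variable depends on sample length, the following minimal sketch estimates the width of a 95% credible interval for a simple proportion at several sample lengths. It is not the model used in this article: the Beta-Binomial setup, the uniform prior, the 20% rate, and the function name `credible_interval_width` are all assumptions chosen for illustration. It shows only that the interval narrows with diminishing returns as samples grow; it does not reproduce the reported optimum.

```python
import random

def credible_interval_width(successes, trials, draws=20000, level=0.95, seed=0):
    """Width of the central credible interval for a proportion.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + successes, 1 + trials - successes). The interval is
    estimated from Monte Carlo draws of that posterior.
    """
    rng = random.Random(seed)
    a = 1 + successes
    b = 1 + trials - successes
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return samples[hi_idx] - samples[lo_idx]

# Hypothetical variable: the share of tokens in a sample that belong to
# some category, assumed here to be 20% regardless of sample length.
RATE = 0.20
widths = {}
for n_words in (100, 300, 700, 1500):
    widths[n_words] = credible_interval_width(round(RATE * n_words), n_words)
    print(n_words, round(widths[n_words], 3))
```

In this toy setting each increase in length buys a smaller reduction in interval width, which is the general trade-off between annotation effort and certainty that motivates looking for an optimal sample length.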
Conclusion
For our corpus of German speech, the optimal sample length is found to be around 700 words. Longer samples do not provide more information.
Abstract. Disorders of language and/or communicative abilities in neurodegenerative diseases are a common phenomenon. Over the past few decades, there has been a growing interest in language performance connected to these diseases. To date, studies in the field of language impairments in Alzheimer’s disease (AD), Parkinson’s disease (PD), and frontotemporal lobar degeneration (FTLD) have focused mainly on particular aspects of language processing in the isolated disease or on comparing certain language tasks in two neurodegenerative diseases. To enable a better understanding and comparison of the underlying linguistic deficits in all three disorders, this paper focuses on phonological, semantic, and grammatical processing in each of the disorders. A review of the literature on language processing deficits reveals that phonological, semantic, and grammatical processing is impaired in the early stages of AD, PD, and FTLD, and that the underlying deficits are sometimes linguistic in nature. Language disorders, however, may also reflect cognitive deficits, such as short-term verbal memory impairments, attention deficits, and reduced switching capacities, all of which have an impact on language processing.