The present paper discusses the benefits and challenges of token-based typology, which takes into account the frequencies of words and constructions in language use. This approach makes it possible to introduce new criteria for language classification, which would be difficult or impossible to achieve with the traditional, type-based approach. This point is illustrated by several quantitative studies of word order variation, which can be measured as entropy at different levels of granularity. I argue that this variation can be explained by general functional mechanisms and pressures, which manifest themselves in language use, such as optimization of processing (including avoidance of ambiguity) and grammaticalization of predictable units occurring in chunks. The case studies are based on multilingual corpora, which have been parsed using the Universal Dependencies annotation scheme.
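To make the entropy measure concrete, here is a minimal sketch of how the Shannon entropy of a word order alternation could be computed from token counts. The function name `word_order_entropy`, the labels `"VO"` and `"OV"`, and the toy counts are illustrative assumptions, not data or code from the paper. In practice the labels would be derived from a parsed corpus, and the level of granularity would depend on whether tokens are pooled over a whole dependency relation or split by individual lexemes.

```python
import math
from collections import Counter


def word_order_entropy(orders):
    """Return the Shannon entropy (in bits) of a sequence of order labels.

    Each element of `orders` is the order observed in one corpus token,
    e.g. "VO" or "OV". Zero means the order is fully rigid; for two
    competing orders, 1 bit means they are equally frequent.
    """
    counts = Counter(orders)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    # H = sum over orders of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical toy samples, not corpus data.
rigid = ["VO"] * 100                 # only one order is attested
mixed = ["VO"] * 50 + ["OV"] * 50    # both orders equally frequent
skewed = ["VO"] * 90 + ["OV"] * 10   # one order strongly preferred

print(word_order_entropy(rigid))   # 0.0
print(word_order_entropy(mixed))   # 1.0
print(word_order_entropy(skewed))  # about 0.47
```

Because the measure is computed from token frequencies rather than from a single dominant order per language, it places languages on a continuous scale, which is the kind of criterion a type-based classification cannot easily provide.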
This chapter discusses popular non-parametric methods in corpus linguistics: conditional inference trees and conditional random forests. These methods, which allow the researcher to model and interpret the relationships between a numeric or categorical response variable and various predictors, are particularly attractive in 'tricky' situations, when the use of parametric methods (in particular, regression models) can be problematic, for example, in the situations of 'small n, large p', complex interactions, non-linearity and correlated predictors. For illustration, the chapter discusses a case study of T and V politeness forms in Russian based on a corpus of film subtitles.
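The core step of a conditional inference tree, choosing the predictor to split on by means of a significance test, can be illustrated with a simplified sketch. It assumes categorical predictors only and swaps in a Monte Carlo permutation test on the Pearson chi-square statistic with a Bonferroni correction; the framework used by the actual method is more general. The predictor names, the toy data and all function names are hypothetical and are not taken from the chapter's case study.

```python
import random
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat


def permutation_p_value(xs, ys, n_perm=2000, rng=None):
    """Monte Carlo p-value: share of shuffles at least as extreme as observed."""
    rng = rng or random.Random(0)
    observed = chi_square(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if chi_square(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def select_split_variable(predictors, response, alpha=0.05, n_perm=2000, seed=0):
    """Return the predictor with the smallest Bonferroni-adjusted p-value.

    Returns None as the selected name if no adjusted p-value is below
    alpha, which corresponds to the stopping criterion of the tree.
    """
    rng = random.Random(seed)
    m = len(predictors)
    adjusted = {
        name: min(1.0, m * permutation_p_value(values, response, n_perm, rng))
        for name, values in predictors.items()
    }
    best = min(adjusted, key=adjusted.get)
    return (best if adjusted[best] < alpha else None), adjusted


# Hypothetical toy data: 40 contexts with a T or V response.
response = ["V"] * 20 + ["T"] * 20
predictors = {
    "hearer_is_stranger": ["yes"] * 20 + ["no"] * 20,  # tracks the response
    "scene_indoors": ["yes", "no"] * 20,               # unrelated to it
}

selected, p_values = select_split_variable(predictors, response)
print(selected)  # hearer_is_stranger
```

A full tree would then search for the best split within the selected predictor and repeat the procedure in each resulting subset. Because selection rests on p-values rather than on impurity reduction, predictors with many levels are not automatically favoured, which is one reason the method is attractive for small samples with many predictors.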
The present study investigates the cross-linguistic differences in the use of so-called T/V forms (e.g. French tu and vous, German du and Sie, Russian ty and vy) in ten European languages from different language families and genera. The constraints on the choice between T and V represent an elusive object of investigation because they depend on a large number of subtle contextual features and social distinctions, which should be cross-linguistically matched. Film subtitles in different languages offer a convenient solution because the situations of communication between film characters can serve as comparative concepts. I selected more than two hundred contexts that contain the pronouns you and yourself in the original English versions and coded them for fifteen contextual variables that describe the Speaker and the Hearer, their relationships and different situational properties. The creators of subtitles in the other languages have to choose between T and V when translating from English, where the T/V distinction is not expressed grammatically. On the basis of these situations translated into ten languages, I perform multivariate analyses using the method of conditional inference trees in order to identify the most relevant contextual variables that constrain the T/V variation in each language.