Teaching a foreign language in a non-linguistic college or university should be professionally oriented, which raises the question of how to select the relevant vocabulary of the professional discourse under study. Modern text corpora are too general in subject matter and time span. Therefore, a specially compiled collection of texts can serve the purpose of selecting the vocabulary. In the case of the Chinese language, the task is complicated by the lack of word segmentation in such texts. Since most words in Chinese are written with two characters, it is assumed that one method applicable in this situation is a comprehensive frequency analysis of two-character text sequences (character bigrams). The analysis of frequent bigrams has shown that 70% of the most frequent lexical units are representative of the discourse, including 11% that are out-of-vocabulary. The remaining bigrams pertain to syntactic constructions, including structurally incomplete ones, and to fragments of longer lexical units. Thus, a high frequency of character co-occurrence can, with a rather high probability (p > 0.7), be considered an indicator of lexicality when identifying representative vocabulary in an unsegmented thematic collection of texts in Chinese.
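The frequency analysis of character bigrams described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the function name `char_bigram_counts`, the choice to restrict counting to the basic CJK Unified Ideographs range, the decision not to count bigrams across punctuation, and the two sample sentences are all hypothetical.

```python
import re
from collections import Counter

# Assumption: only characters in the basic CJK Unified Ideographs block are
# counted, and bigrams are never formed across punctuation, digits or spaces.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def char_bigram_counts(texts):
    """Count overlapping two-character sequences in unsegmented Chinese text."""
    counts = Counter()
    for text in texts:
        for run in HAN_RUN.findall(text):
            # Slide a window of width 2 over each uninterrupted run of characters.
            counts.update(run[i:i + 2] for i in range(len(run) - 1))
    return counts


# Hypothetical sample sentences, used only to exercise the function.
sample = ["海军舰艇编队举行演习。", "海军航空兵参加演习"]
counts = char_bigram_counts(sample)

# Print the most frequent bigrams first.
for bigram, freq in counts.most_common(5):
    print(bigram, freq)
```

In a real collection the top of such a list would mix lexical units with fragments of longer words and pieces of syntactic constructions, which is why the abstract reports lexicality as a probability rather than a certainty.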
Researchers studying professional discourse now have the opportunity to create collections of texts and apply linguistic analysis software tools to them. However, when it comes to Chinese discourse, automatic word segmentation of texts is not reliable. One way to extract lexical units from Chinese texts is to apply statistical association measures for collocations to Chinese character bigrams. The purpose of this work is to conduct a comparative analysis of seven statistical measures for collocations as a means of extracting two-syllable lexical units (binomes) from an unsegmented Chinese character text. The subject of the analysis is the lexical, grammatical and frequency characteristics of the bigrams that receive the highest values of the statistical measures. Comparing them makes it possible to draw conclusions about the features of the statistical measures, in particular about which measure best suits which linguistic task. The linguistic material of the study was a collection of 560 military-related news texts in Chinese totalling more than 720 thousand characters. The results show that the statistical measures considered can be divided into three groups according to the characteristics of the bigrams receiving the highest values. The first group includes the measures MI, MS and logDice, which give priority to rare bigrams whose components have limited combinability, such as the Chinese two-syllable single-morpheme words “lianmianzi”. These measures do not extract terms well, but can be used to search for phraseologically related components. The measures of the second group, t-score and log-likelihood, are frequency-oriented and similar to frequency analysis, but they cope with non-lexical bigrams better; log-likelihood also somewhat lowers the rank of numerals and pronouns and is the best at picking out the typical vocabulary of professional discourse. The third group includes the measures MI3 and MI.log-f, which average the opposite approaches of the first two groups. The MI3 measure is considered the most universal one; it could be used to compare different corpora or collections of texts. It is concluded that applying statistical association measures to Chinese character bigrams is possible and appropriate, provided that the specifics of each measure are matched to the research task.
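The contrast between the three groups of measures can be illustrated with a small Python sketch. The abstract does not give the formulas it used, so the definitions below are an assumption: they follow the forms commonly found in corpus-tool documentation (MI, MI3, MI.log-f, t-score, logDice and minimum sensitivity). Log-likelihood is omitted for brevity. The function name `association_scores` and all counts are hypothetical.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Association scores for one bigram.

    f_xy: frequency of the bigram; f_x, f_y: frequencies of its first and
    second character; n: collection size. Formulas are commonly used
    definitions, assumed here rather than taken from the study itself.
    """
    expected = f_x * f_y / n            # co-occurrence expected by chance
    mi = math.log2(f_xy / expected)     # pointwise mutual information
    return {
        "MI": mi,
        "MI3": math.log2(f_xy ** 3 / expected),
        "MI.log-f": mi * math.log(f_xy + 1),   # natural log assumed
        "t-score": (f_xy - expected) / math.sqrt(f_xy),
        "logDice": 14 + math.log2(2 * f_xy / (f_x + f_y)),
        "MS": min(f_xy / f_x, f_xy / f_y),     # minimum sensitivity
    }


# Hypothetical counts: a rare bigram whose characters occur only together,
# and a frequent bigram whose characters also combine freely with others.
rare = association_scores(f_xy=8, f_x=8, f_y=8, n=1_000_000)
frequent = association_scores(f_xy=2000, f_x=20000, f_y=10000, n=1_000_000)

for name in rare:
    print(f"{name:9s} rare={rare[name]:8.2f} frequent={frequent[name]:8.2f}")
```

With these made-up counts, MI, MS and logDice score the rare, tightly bound pair higher, t-score strongly favours the frequent pair, and MI3 sits between the two tendencies, which mirrors the grouping described in the abstract.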