A study on Chinese register characteristics based on regression analysis and text clustering

Hou, Renkui; Huang, Chu‐Ren; Liu, Hongchao

doi:10.1515/cllt-2016-0062

Cited by 9 publications

(9 citation statements)

References 13 publications

“…This result suggests that robust identification technology of similar and related languages must also take into consideration other dimensions of textual variations such as gender, genre, or register. In fact, in a series of study we did take register into consideration (Hou et al 2017, Hou, Huang and Liu 2019). In addition to show that fitted power function model between linguistic units and their constituents is an effective tool for classification of similar languages, we also showed that single feature works better with binary classification.…”

Section: Results (mentioning)

confidence: 84%

“…Altmann (2005, 2007) demonstrated an effective way to model discrete phenomenon using continuous models and vice versa. Hou, Huang, and Liu (2019) showed that the sentence/clause length in Chinese texts can also be fitted by Formula (1) in variations of the Chinese language based on data from Mainland China.…”

Section: Language As Complex Self-adaptive System (mentioning)

confidence: 99%

“…Hou et al (2014) examined sentence length distribution in terms of words and word length distribution in terms of Chinese characters in order to distinguish formal written style and daily informal style based on text clustering. Hou, Huang, and Liu (2019) modeled the sentence length using Formula (1) and proved that the fitted parameters of the sentence length distribution can differentiate different register texts in Mainland variety of Chinese.…”

Section: You (mentioning)

confidence: 99%

“…The clause length distribution in Chinese Mandarin can be fitted by Formula (1) from Hou, Huang and Liu (2019). Then, the texts can be represented by the fitted parameters, a, b, and c, of the clause length distribution.…”

Section: Fitted Parameters Of the Clause Length Distribution (mentioning)

confidence: 99%

Classification of regional and genre varieties of Chinese: A correspondence analysis approach based on comparable balanced corpora

Hou; Huang (2020). Nat. Lang. Eng.

Self Cite

This paper proposes a robust text classification and correspondence analysis approach to identification of similar languages. In particular, we propose to use the readily available information of clauses and word length distribution to model similar languages. The modeling and classification are based on the hypothesis that languages are self-adaptive complex systems and hence can be classified by dynamic features describing the system, especially in terms of distributional relations of constituents of a system. For similar languages whose grammatical differences are often subtle, classification based on dynamic system features should be more effective. To test this hypothesis, we considered both regional and genre varieties of Mandarin Chinese for classification. The data are extracted from two comparable balanced corpora to minimize possible confounding factors. The two corpora are the Sinica Corpus from Taiwan and the Lancaster Corpus of Mandarin Chinese from Mainland China, and the two genres are reportage and review. Our text classification and correspondence analysis results show that the linguistically felicitous two-level constituency model combining power functions between word and clauses effectively classifies the two varieties of Chinese for both genres. In addition, we found that genres do have compounding effect on classification of regional varieties. In particular, reportage in two varieties is more likely to be classified than review, corroborating the complex system view of language variations. That is, language variations and changes typically do not take place evenly across the board for the complete language system. This further enhances our hypothesis that dynamic complex system features, such as the power functions captured by the Menzerath–Altmann law, provide effective models in classifications of similar languages.

“…According to the approach of many Chinese treebanks (e.g., Chen et al 1996 for Sinica TreeBank, Huang and Chen 2017) and the analysis of sentence length distribution in quantitative linguistics (Hou, Huang, and Liu 2017), all segments between commas, semicolons, colons, periods, exclamation marks, and question marks that express pauses in utterances are marked as sentences. Actually, the sentences that are identified by this definition are clauses (Hou et al 2017) and conform to the definitions that rely on pauses and intonation changes in the utterances.…”

Section: Results (mentioning)

confidence: 99%

Robust stylometric analysis and author attribution based on tones and rimes

Hou; Huang (2019). Nat. Lang. Eng.

Self Cite

In this article, we propose an innovative and robust approach to stylometric analysis without annotation and leveraging lexical and sub-lexical information. In particular, we propose to leverage the phonological information of tones and rimes in Mandarin Chinese automatically extracted from unannotated texts. The texts from different authors were represented by tones, tone motifs, and word length motifs as well as rimes and rime motifs. Support vector machines and random forests were used to establish the text classification model for authorship attribution. From the results of the experiments, we conclude that the combination of bigrams of rimes, word-final rimes, and segment-final rimes can discriminate the texts from different authors effectively when using random forests to establish the classification model. This robust approach can in principle be applied to other languages with established phonological inventory of onset and rimes.
