Automatic Classification of Text Complexity

Santucci, Valentino; Santarelli, Filippo; Forti, Luciana; Spina, Stefania

doi:10.3390/app10207285

Cited by 19 publications

(10 citation statements)

References 48 publications


“…The results have shown the effectiveness of these features and the SVM classifier. Similar results can be found in the research articles by Szügyi et al (2019) for texts in the German language and Santucci et al (2020) where the authors achieved the best results for the Italian language using a set of linguistic features in conjunction with the Random Forest classifier. Lyashevskaya et al (2021) showed the effectiveness of linguistic features for the task of complexity assessment of the texts written by Russian learners of English.…”

Section: Related Work (supporting)

confidence: 87%

A hybrid model of complexity estimation: Evidence from Russian legal texts

Blinova, Tarasov

2022

Front. Artif. Intell.


This article proposes a hybrid model for the estimation of the complexity of legal documents in Russian. The model consists of two main modules: linguistic feature extractor and a transformer-based neural encoder. The set of linguistic metrics includes both non-specific metrics traditionally used to predict complexity, as well as style-specific metrics developed in order to deal with the peculiarities of official texts. The model was trained on a dataset constructed from text sequences from Russian textbooks. Training data were collected on either subjects related to the topic of legal documents such as Jurisprudence, Economics, Social Sciences, or subjects characterized by the use of general languages such as Literature, History, and Culturology. The final set of materials used contain 48 thousand selected text blocks having various subjects and level-of-complexity identifiers. We have tested the baseline fine-tuned BERT model, models trained on linguistic features, and models trained on features in combination with BERT predictions. The scores show that a hybrid approach to complexity estimation can provide high-quality results in terms of different metrics. The model has been tested on three sets of legal documents.


“…Somewhat more rarely than SVM, decision trees and random forests are used to classify texts; the essence of the latter method is the use of a large number of decision trees, which together have good predictive power. In Kauchak et al (2014) and Santucci et al (2020), random forests perform better than other models.…”

Section: Machine Learning and Natural Language Processing Methods For Assessing The Readability Of Texts (mentioning)

confidence: 95%

Assessment of the Clarity of Bank of Russia Monetary Policy Communication by Neural Network Approach

Evstigneeva, Sidorovskiy

2021

RJMF


Inflation targeting requires clear and transparent central bank’s communication. Analysts and market participants understand it as a broad list of information disclosed by the central bank. The general public understands it rather as the ability of a central bank to speak and explain its decisions in a plain language. In recent decades, monetary authorities in many countries have made significant progress in this direction. However, there has been no research on the quality of communication for the Bank of Russia. This paper aims to create a tool for automated evaluation of the readability of the Bank of Russia’s monetary policy communication, taking into account the available experience of linguistic and textual analysis, including machine learning methods, as well as to provide recommendations for its improvement. This can contribute to improving the effectiveness of the Bank of Russia communication on monetary policy, which is vital for its credibility, anchoring inflation expectations, and predictability of the regulator’s decisions.


“…The task of CEFR classification itself however, seems to have received fewer attention. Among the studies that address this problem for various languages are Santucci et al (2020) (Italian), Hancke and Meurers (2013) (German), Vajjala and Lõo (2014) (Estonian) and Volodina et al (2016) (Swedish). Earlier work on English (our language of interest) is represented by Tack et al (2017), who create their own annotated corpus and experiment with automated classification using several classification algorithms.…”

Section: Related Work (mentioning)

confidence: 99%

Mitigating Learnerese Effects for CEFR Classification

Jalota, Bourgonje, Sas et al.

2022

Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)


The role of an author's L1 in SLA can be challenging for automated CEFR classification, in that texts from different L1 groups may be too heterogeneous to combine them as training data. We experiment with recent debiasing approaches by attempting to devoid textual representations of L1 features. This results in a more homogeneous group when aggregating CEFR-annotated texts from different L1 groups, leading to better classification performance. Using iterative null-space projection, we marginally improve classification performance for a linear classifier by 1 point. An MLP (e.g. non-linear) classifier remains unaffected by this procedure. We discuss possible directions of future work to attempt to increase this performance gain.
