Trained on a large corpus, pretrained models (PTMs) can capture different levels of concepts in context and hence generate universal language representations, which greatly benefit downstream natural language processing (NLP) tasks. In recent years, PTMs have been widely used in most NLP applications, especially for high-resource languages such as English and Chinese. However, the scarcity of resources has hindered progress on PTMs for low-resource languages. This work presents transformer-based PTMs for the Khmer language for the first time. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization. The dataset for the latter task is self-constructed. Experiments demonstrate the effectiveness of the Khmer models. In addition, we find that current Khmer word segmentation technology does not improve performance. We aim to release our models and datasets to the community in the hope of facilitating the future development of Khmer NLP applications.
As a fundamental task in natural language processing (NLP), Chinese Grammatical Error Correction (CGEC) [1–3] has gradually received widespread attention and become a research hotspot. However, one obvious deficiency of existing CGEC evaluation systems is that the scores assigned to the same error correction model are significantly influenced by the Chinese word segmentation (CWS) results or by the choice of language model. For a fair evaluation, these metrics should be independent of both the CWS results and the language model. To this end, we propose three novel evaluation metrics for CGEC in two dimensions: reference-based and reference-less. Furthermore, building on these three metrics, we construct a new evaluation metric that can comprehensively evaluate a CGEC model from multiple dimensions. We thoroughly evaluate and analyze the reasonableness and validity of the proposed metrics, and we expect them to become a new standard for CGEC.