The lack of big parallel data is present for the Kazakh language. This problem seriously impairs the quality of machine translation from and into Kazakh. This article considers the neural machine translation of the Kazakh language on the basis of synthetic corpora. The Kazakh language belongs to the Turkic languages, which are characterised by rich morphology. Neural machine translation of natural languages requires large training data. The article will show the model for the creation of synthetic corpora, namely the generation of sentences based on complete suffixes for the Kazakh language. The novelty of this approach of the synthetic corpora generation for the Kazakh language is the generation of sentences on the basis of the complete system of suffixes of the Kazakh language. By using generated synthetic corpora we are improving the translation quality in neural machine translation of Kazakh-English and Kazakh-Russian pairs.
The problems of machine translation are constantly arising. While the most advanced translation platforms, such as Google and Yandex, allow for high-quality translations of languages with simple grammatical structures, more morphologically rich languages still suffer from the translation of complex sentences, and translation services leave many structural errors. This study focused on designing the rules for the grammatical structures of complex sentences in the Kazakh language, which has a difficult grammar with many rules. First, the types of complex sentences in the Kazakh language were thoroughly observed with the use of templates from the FuzzyWuzzy library. Then, the correction of complex sentences was completed with parallel corpora. The sentences were translated into English and Russian by existing machine translation systems. Therefore, the grammar of both Kazakh–English and Kazakh–Russian language pairs was considered. They both used the rules specifically designed for the post-editing steps. Finally, the performance of the developed algorithm was evaluated for an accuracy score for each pair of languages. This approach was then proposed for use in other corpora generation, post-editing, and analysis systems in future works.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.