This article explores in depth various sandhi (joining) rules in Kerala’s Malayalam language, which play a vital role in framing of the inflected and agglutinated forms of words and their compounds. It discusses significant progress in a scientific method to generate a specific annotated data set of Malayalam words that would be useful in many Natural Language Processing tasks which involve Malayalam preprocessing. The article discusses the results and issues encountered in developing this word-splitting tool for Malayalam, mainly in the context of improving the alignments between parallel texts that form a core resource in the Machine Translation task.
This paper underlines a methodology for translating text from English into the Dravidian language, Malayalam using statistical models. By using a monolingual Malayalam corpus and a bilingual English/Malayalam corpus in the training phase, the machine automatically generates Malayalam translations of English sentences. This paper also discusses a technique to improve the alignment model by incorporating the parts of speech information into the bilingual corpus. Removing the insignificant alignments from the sentence pairs by this approach has ensured better training results. Pre-processing techniques like suffix separation from the Malayalam corpus and stop word elimination from the bilingual corpus also proved to be effective in training. Various handcrafted rules designed for the suffix separation process which can be used as a guideline in implementing suffix separation in Malayalam language are also presented in this paper. The structural difference between the English Malayalam pair is resolved in the decoder by applying the order conversion rules. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations and the results are verified with F measure, BLEU and WER evaluation metrics.
Text classification is a task of automatic classification of text into one of the predefined categories. The problem of text classification has been widely studied in different communities like natural language processing, data mining and information retrieval. Text classification is an important constituent in many information management tasks like topic identification, spam filtering, email routing, language identification, genre classification, readability assessment etc. The performance of text classification improves notably when phrase patterns are used. The use of phrase patterns helps in capturing non-local behaviours and thus helps in the improvement of text classification task. Phrase structure extraction is the first step to continue with the phrase pattern identification. In this survey, detailed study of phrase structure learning methods have been carried out. This will enable future work in several NLP tasks, which uses syntactic information from phrase structure like grammar checkers, question answering, information extraction, machine translation, text classification. The paper also provides different levels of classification and detailed comparison of the phrase structure learning methods.
Statistical Machine Translation (SMT) is a preferred Machine Translation approach to convert text in a specific language into another by automatically learning translations using a parallel corpus. SMT has been successful in producing quality translations in many foreign languages, but there are only a few works attempted in South Indian languages. The paper discusses on experiments conducted with SMT for Malayalam language and analyzes how the methods defined for SMT in foreign languages affect a Dravidian language, Malayalam. The baseline SMT model does not work for Malayalam due to its unique characteristics like agglutinative nature and morphological richness. Hence, the challenge is to identify where precisely the SMT model has to be modified such that it adapts the challenges of the language peculiarity into the baseline model and give better translations for English to Malayalam translation. The alignments between English and Malayalam sentence pairs, subjected to the training process in SMT, plays a crucial role in producing quality output translation. Therefore, this work focuses on improving the translation model of SMT by refining the alignments between English-Malayalam sentence pairs. The phrase alignment algorithms align the verb and noun phrases in the sentence pairs and develop a new set of alignments for the English-Malayalam sentence pairs. These alignment sets refines the alignments formed from Giza++ produced as a result of EM training algorithm. The improved Phrase Based SMT model trained using these refined alignments resulted in better translation quality, as indicated by the AER and BLUE scores.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.