Spell checker is the application, which helps in finding the spelling errors in a given text. Applications like word processors, mails, search engines, speech recognition and social media forums need these kinds of spell checking tools to increase the correctness of the system. Spell checking is completely implemented in languages such as English, French, and Chinese. But as far as Indian regional languages is concerned, very few works have been carried out, that too partially. Tamil is one such Indian regional language, which requires a fully implemented spell checking application as many people started using this language in social media platforms like Facebook and Twitter. Spelling errors fall on different categories in Tamil language, which involves Sandhi errors, Homophone errors (Mayangoli), and misspelt words error. To tackle all these errors, a new ensemble approach is proposed in this paper. The proposed approach consists of Levenshtein's edit distance algorithm, rule-based algorithm, Soundex algorithm along with LSTM (Long Short Term Memory) model. We have used a special feature called combine character splitting of Tamil alphabets for feeding the LSTM model to improve the performance of the system. Proposed system produced an accuracy of 95.67%, which is approved by the Tamil scholar.
This paper addresses the problem of part of speech (POS) tagging for the Tamil language, which is low resourced and agglutinative. POS tagging is the process of assigning syntactic categories for the words in a sentence. This is the preliminary step for many of the Natural Language Processing (NLP) tasks. For this work, various sequential deep learning models such as recurrent neural network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and Bi-directional Long Short-Term Memory (Bi-LSTM) were used at the word level. For evaluating the model, the performance metrics such as precision, recall, F1-score and accuracy were used. Further, a tag set of 32 tags and 225 000 tagged Tamil words was utilized for training. To find the appropriate hidden state, the hidden states were varied as 4, 16, 32 and 64, and the models were trained. The experiments indicated that the increase in hidden state improves the performance of the model. Among all the combinations, Bi-LSTM with 64 hidden states displayed the best accuracy (94%). For Tamil POS tagging, this is the initial attempt to be carried out using a deep learning model.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.