Artificial Neural Networks have proved their efficiency in a large number of research domains. In this paper, we have applied Artificial Neural Networks on Arabic text to prove correct language modeling, text generation, and missing text prediction. In one hand, we have adapted Recurrent Neural Networks architectures to model Arabic language in order to generate correct Arabic sequences. In the other hand, Convolutional Neural Networks have been parameterized, basing on some specific features of Arabic, to predict missing text in Arabic documents. We have demonstrated the power of our adapted models in generating and predicting correct Arabic text comparing to the standard model. The model had been trained and tested on known free Arabic datasets. Results have been promising with sufficient accuracy.
Missing text prediction is one of the major concerns of Natural Language Processing deep learning community's attention. However, the majority of text prediction related research is performed in other languages but not Arabic. In this paper, we take a first step in training a deep learning language model on Arabic language. Our contribution is the prediction of missing text from text documents while applying Convolutional Neural Networks (CNN) on Arabic Language Models. We have built CNN-based Language Models responding to specific settings in relation with Arabic language. We have prepared our dataset of a large quantity of text documents freely downloaded from Arab World Books, Hindawi foundation, and Shamela datasets. To calculate the accuracy of prediction, we have compared documents with complete text and same documents with missing text. We realized training, validation and test steps at three different stages aiming to increase the performance of prediction. The model had been trained at first stage on documents of the same author, then at the second stage, it had been trained on documents of the same dataset, and finally, at the third stage, the model had been trained on all document confused. Steps of training, validation and test have been repeated many times by changing each time the author, dataset, and the combination author-dataset, respectively. Also we have used the technique of enlarging training data by feeding the CNN-model each time by a larger quantity of text. The model gave a high performance of Arabic text prediction using Convolutional Neural Networks with an accuracy that have reached 97.8% in best case.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.