With the increasing availability of electronic documents and the rapid growth of the World Wide Web, automatic document categorization has become a key method for organizing information and for knowledge discovery. Proper classification of e-documents, online news, blogs, e-mails, and digital libraries requires text mining, machine learning, and natural language processing techniques to extract meaningful knowledge. The aim of this paper is to highlight the important techniques and methodologies employed in text document classification, while also raising awareness of some of the interesting challenges that remain to be solved, focusing mainly on text representation and machine learning techniques. This paper provides a review of the theory and methods of document classification and text mining, focusing on the existing literature.
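As a rough illustration of the text representation step mentioned above, the following sketch builds TF-IDF weighted bag-of-words vectors in plain Python. It is not taken from the reviewed paper: the function name `tfidf_vectors`, the whitespace tokenizer, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Toy documents, invented for illustration only.
docs = [
    "stocks fall on market news",
    "team wins the football match",
    "market rallies as stocks rise",
]
vectors = tfidf_vectors(docs)
```

Terms that occur in fewer documents receive higher weights, so a term unique to one document ("football") outweighs one shared by two documents ("stocks"). Vectors of this kind could then be passed to whichever classifier is being studied.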
In Urdu, part-of-speech (POS) tagging is a challenging task because the language is morphologically rich, both inflectionally and derivationally. Verbs are generally considered more highly inflected than nouns in Urdu. POS tagging is used as a preliminary linguistic text analysis step in diverse natural language processing domains such as speech processing, information extraction, machine translation, and others. It is a task that first identifies the appropriate syntactic category for each word in running text and then assigns the predicted syntactic tag to the words concerned. The current work is an extension of our previous work. Previously, we presented a conditional random field (CRF)-based POS tagger with both language-dependent and language-independent feature sets. In the current study, we offer: 1) an implementation of both machine learning and deep learning models for the Urdu POS tagging task with a well-balanced language-independent feature set and 2) an account of the diverse challenges that make Urdu POS tagging difficult. In this research, we demonstrate the effectiveness of machine learning and deep learning models for the Urdu POS tagging task. Empirically, we have evaluated the performance of all models on two benchmark datasets. The core models evaluated in this study are CRF, support vector machine (SVM), two variants of the deep recurrent neural network (DRNN), and a variant of the n-gram Markov model, the bigram hidden Markov model (HMM). The two DRNN variants evaluated are a forward long short-term memory (LSTM)-RNN and an LSTM-RNN with CRF output. INDEX TERMS Urdu, part of speech (POS), conditional random field (CRF), support vector machine (SVM), recurrent neural network (RNN), hidden Markov model (HMM).
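To make the bigram HMM baseline named above more concrete, here is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the authors' implementation: the function names, the add-one smoothing, and the tiny English toy corpus (standing in for a real tagged Urdu dataset) are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted words
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = list(emissions)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unseen words
    # best[i][t] = (log score of best path ending in tag t at position i, backpointer)
    best = [{}]
    for t in tags:
        score = (log_prob(transitions["<s>"], t, n_tags)
                 + log_prob(emissions[t], words[0], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = log_prob(emissions[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_tags) + emit, p)
                for p in tags
            )
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy corpus, invented for illustration; a real experiment would use tagged Urdu text.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_bigram_hmm(corpus)
predicted = viterbi(["a", "cat", "barks"], *model)
print(predicted)  # ['DET', 'NOUN', 'VERB']
```

Because each tag depends only on the previous tag and the current word, a model of this kind is cheap to train from counts, which is one reason it is a common baseline against CRF and LSTM taggers.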
Sustained efforts have long been made to enable machines to understand human language. Nowadays such activities fall under the broad umbrella of machine comprehension. The results are encouraging owing to recent advances in the field of machine learning. Deep learning promises to bring even better results but requires expensive and resource-hungry hardware. In this paper, we demonstrate the use of deep learning in the context of machine comprehension on non-GPU machines. Our results suggest that good algorithmic insight and a detailed understanding of the dataset can help achieve meaningful results through deep learning even on non-GPU machines.
Opinion target identification is an important task in opinion mining. Several approaches have been employed for this task, which can be broadly divided into two major categories: supervised and unsupervised. Supervised approaches require training data, which needs manual annotation and is mostly domain dependent. Unsupervised techniques are the most widely used owing to two main advantages: they are domain independent and need no training data. This study presents a review of state-of-the-art unsupervised approaches for opinion target identification, motivated by their potential applications in opinion mining from web documents. The study also compares the existing approaches, which may be helpful for future research on opinion mining and feature extraction.