The rapid growth of electronic documents is causing problems such as unstructured data, which requires more time and effort to search for a relevant document. Text Document Classification (TDC) is of great significance in information processing and retrieval, where unstructured documents are organized into predefined classes. Urdu is a favored research language among South Asian languages because of its complex morphology, unique features, and lack of linguistic resources such as standard datasets. Compared to short-text tasks such as sentiment analysis, long-text classification requires more time and effort because of the large vocabulary, greater noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite major limitations of ML models, such as their reliance on manually engineered features, they remain the favored methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multipurpose, multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracy scores of 95.4%, 91.8%, and 93.3% on the medium, large, and small datasets, respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing. INDEX TERMS: Convolutional neural network, deep learning, machine learning, natural language processing, text document classification, Urdu text classification.
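The single-layer, multi-size-filter design described above follows the familiar text-CNN pattern: filters of several widths slide over a sequence of token embeddings, and each filter's responses are max-pooled into one feature. A minimal pure-Python sketch of that idea is below; the dimensions, weights, and filter sizes are illustrative toy values, not the paper's actual SMFCNN configuration.

```python
# Toy sketch of multi-size-filter convolution over token embeddings,
# followed by 1-max pooling (the text-CNN pattern SMFCNN resembles).
# All numbers here are illustrative, not the paper's parameters.

def conv1d_valid(seq, filt):
    """Slide a filter of shape (k, d) over seq of shape (n, d); return n-k+1 scores."""
    k, d = len(filt), len(filt[0])
    out = []
    for i in range(len(seq) - k + 1):
        s = 0.0
        for j in range(k):
            for c in range(d):
                s += seq[i + j][c] * filt[j][c]
        out.append(s)
    return out

def multi_filter_features(seq, filters):
    """Apply each filter, then 1-max pooling; concatenate pooled values."""
    return [max(conv1d_valid(seq, f)) for f in filters]

# Five tokens with 2-dim embeddings; one size-2 and one size-3 filter.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # bigram-width filter
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # trigram-width filter
]
print(multi_filter_features(seq, filters))  # → [2.0, 2.0]
```

The pooled vector (one value per filter) would then feed a classifier layer over the six document classes.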
In recent years, unethical behavior in the cyber environment has increasingly come to light. The presence of offensive language on social media platforms, and the automatic detection of such language, is becoming a major challenge in modern society. The complexity of natural language constructs makes this task even more challenging. Until now, most research has focused on resource-rich languages like English. Roman Urdu and Urdu are two scripts for writing the Urdu language on social media: the Roman script uses English-language characters, while the Urdu script uses Urdu-language characters. Urdu and Hindi are similar languages that differ only in their writing script, but the Roman scripts of both languages are the same. This study addresses the detection of offensive language in user comments written in the resource-poor Urdu language. We propose the first offensive-language dataset for Urdu, containing user-generated comments from social media. We use individual and combined n-gram techniques to extract features at the character and word levels. We apply seventeen classifiers from seven machine learning techniques to detect offensive language in both Urdu and Roman Urdu text comments. Experiments show that regression-based models using character n-grams perform best for processing the Urdu language. Character-level tri-grams outperform the other word and character n-grams. LogitBoost and SimpleLogistic outperform the other models, achieving F-measure values of 99.2% and 95.9% on the Roman Urdu and Urdu datasets, respectively. Our dataset is publicly available on GitHub for future research.
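The character-level n-gram features mentioned above are straightforward to compute: every overlapping substring of length n becomes a feature, counted per comment. A minimal sketch (the example string and helper names are illustrative, not from the paper's pipeline):

```python
# Minimal character-level n-gram extraction, the kind of feature
# the study feeds to its classifiers; example text is illustrative.
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of the string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(text, n):
    """Bag-of-n-grams feature vector as a Counter."""
    return Counter(char_ngrams(text, n))

# Character tri-grams of a short Roman Urdu token.
print(char_ngrams("salam", 3))  # → ['sal', 'ala', 'lam']
```

Character tri-grams like these capture subword patterns, which helps with the spelling variation common in user-generated Roman Urdu.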
Human detection in crowded scenes is a core component of crowd-safety analysis, such as emergency warning and security monitoring platforms. Although existing anchor-free methods have fast inference speed, they are not well suited to object detection in crowded scenes because they cannot predict well-refined object bounding boxes. This work proposes an end-to-end anchor-free network, the Multidimensional Weighted Cross-Attention Network (MANet), which performs real-time human detection in crowded scenes. Specifically, a Double-flow Weighted Feature Cascade Module (DW-FCM) is used in the extractor to highlight the contribution of features at different levels. A Triplet Cross Attention Module (TCAM) is used in the detector head to strengthen the association among multi-dimensional features, further improving the discrimination of human boundary features at a fine-grained level. Moreover, an Adaptively Opposite Thrust Mapping (AOTM) ground-truth annotation strategy is proposed to correct erroneous mappings and reduce wasted training iterations. These strategies effectively alleviate the inability of existing anchor-free networks to correctly distinguish and locate individual humans in crowded scenes. Compared with anchor-based detection methods, there is no need to set anchor parameters manually, and the detection speed satisfies real-time requirements. Finally, extensive comparative experiments on the CrowdHuman and WIDER FACE datasets demonstrate that the improved strategy achieves state-of-the-art results among anchor-free methods.
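The difficulty the abstract describes stems partly from standard detection post-processing: greedy suppression of overlapping boxes (NMS) tends to merge or drop heavily overlapping people. The sketch below illustrates that background problem with generic IoU and greedy-NMS code; it is not the paper's MANet modules, and the boxes are made-up examples.

```python
# Generic IoU and greedy suppression (NMS) sketch, showing why heavily
# overlapping humans are hard for detectors; illustrative, not MANet.

def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def greedy_nms(boxes, scores, thresh=0.5):
    """Keep highest-scoring boxes, dropping any box overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in kept):
            kept.append(i)
    return kept

# Two heavily overlapping people plus one isolated person: greedy NMS
# keeps only one of the overlapping pair, losing a true detection.
boxes = [(0, 0, 10, 20), (2, 0, 12, 20), (30, 0, 40, 20)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, thresh=0.4))  # → [0, 2]
```

Avoiding such missed detections without hand-tuned anchor or threshold parameters is the motivation for the finer-grained boundary features described above.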