Experimental Evaluation of Deep Learning models for Marathi Text Classification

Kulkarni, Atharva; Mandhane, Meet; Likhitkar, Manali; Kshirsagar, Gayatri; Jagdale, Jayashree; Joshi, Raviraj

doi:10.48550/arxiv.2101.04899

Cited by 5 publications

(6 citation statements)

References 13 publications

“…In this work, we provide a comparative view of different families of algorithms on a range of datasets. Similar comparison of deep learning approaches on different datasets and languages have been studied in [14,13,9,32,10,17].…”

Section: Introduction (mentioning)

confidence: 88%

Comparative Study of Long Document Classification

Wagh, Khandve, Joshi et al. 2021

Preprint

Self Cite

The amount of information stored in the form of documents on the internet has been increasing rapidly. Thus it has become a necessity to organize and maintain these documents in an optimum manner. Text classification algorithms study the complex relationships between words in a text and try to interpret the semantics of the document. These algorithms have evolved significantly in the past few years. There has been a lot of progress from simple machine learning algorithms to transformer-based architectures. However, existing literature has analyzed different approaches on different datasets, making it difficult to compare the performance of machine learning algorithms. In this work, we revisit long document classification using standard machine learning approaches. We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets. We present an exhaustive comparison of different algorithms on a range of long document datasets. We re-iterate that long document classification is a simpler task and even basic algorithms perform competitively with BERT-based approaches on most of the datasets. The BERT-based models perform consistently well on all the datasets and can be blindly used for the document classification task when the computation cost is not a concern. In the shallow-model category, we suggest the raw BiLSTM + Max architecture, which performs decently across all the datasets. An even simpler GloVe + Attention bag-of-words model can be utilized for simpler use cases. The importance of using sophisticated models is clearly visible in the IMDB sentiment dataset, which is a comparatively harder task.

“…We are using common deep learning text classification approaches for the task of Hate speech detection [19]. The models are used directly for binary classification tasks whereas a hierarchical approach is used for multi-labeled fine-grained classification.…”

Section: Model Architectures (mentioning)

confidence: 99%

Hate and Offensive Speech Detection in Hindi and Marathi

Velankar, Patil, Gore et al. 2021

Preprint

Self Cite

Sentiment analysis is the most basic NLP task to determine the polarity of text data. There has been a significant amount of work in the area of multilingual text as well. Still, hate and offensive speech detection faces a challenge due to inadequate availability of data, especially for Indian languages like Hindi and Marathi. In this work, we consider hate and offensive speech detection in Hindi and Marathi texts. The problem is formulated as a text classification task using state-of-the-art deep learning approaches. We explore different deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa. The basic models based on CNN and LSTM are augmented with FastText word embeddings. We use the HASOC 2021 Hindi and Marathi hate speech datasets to compare these algorithms. The Marathi dataset consists of binary labels and the Hindi dataset consists of binary as well as more fine-grained labels. We show that the transformer-based models perform the best and even the basic models along with FastText embeddings give a competitive performance. Moreover, with normal hyper-parameter tuning, the basic models perform better than BERT-based models on the fine-grained Hindi dataset.

“…For conducting baseline experiments on our dataset, hashtags, mentions, and special symbols were removed during preprocessing. We used some of the widely used text classification architectures for sentiment classification (Kulkarni et al, 2021;Kowsari et al, 2019;Kim, 2014;Sun et al, 2019). The text is tokenized as words or sub-words and passed to the algorithms mentioned • CNN: The initial embedding layer outputs word embeddings of size 300.…”

Section: Experimentations (mentioning)

confidence: 99%

“…Marathi is an Indian language spoken by around 83 million people and ranks as the third most spoken language in India. But surprisingly, there is no significant work or resource for the task of sentiment analysis in Marathi (Kulkarni et al, 2021). A sentiment analysis dataset curated by IIT-Bombay is available, but it has a very small size consisting of only 150 samples (Balamurali et al, 2012).…”

Section: Introduction (mentioning)

confidence: 99%

L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset

Kulkarni, Mandhane, Likhitkar et al. 2021

Preprint

Self Cite

Sentiment analysis is one of the most fundamental tasks in Natural Language Processing. Popular languages like English, Arabic, Russian, Mandarin, and also Indian languages such as Hindi, Bengali, Tamil have seen a significant amount of work in this area. However, the Marathi language, which is the third most popular language in India, still lags behind due to the absence of proper datasets. In this paper, we present the first major publicly available Marathi Sentiment Analysis Dataset, L3CubeMahaSent. It is curated using tweets extracted from various Maharashtrian personalities' Twitter accounts. Our dataset consists of ∼16,000 distinct tweets classified in three broad classes viz. positive, negative, and neutral. We also present the guidelines using which we annotated the tweets. Finally, we present the statistics of our dataset and baseline classification results using CNN, LSTM, ULMFiT, and BERT-based deep learning models.
