Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), 2021
DOI: 10.18653/v1/2021.wnut-1.3

Detecting Depression in Thai Blog Posts: a Dataset and a Baseline

Abstract: We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled from expert-verified cases of depression in several online blogs. We experiment with two different LSTM-based models and two different BERT-based models. We achieve a 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a good baseline for future research on the same corpus. Furthermore, we identify a need for Thai embeddings that have been trained on a more varied corpus than Wikipedia…
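As a rough illustration of the kind of baseline described in the abstract, the sketch below fine-tunes a pretrained Thai BERT checkpoint for binary depression detection with Hugging Face Transformers. The checkpoint name (WangchanBERTa), hyperparameters, and toy in-memory data are assumptions for illustration only, not the authors' exact configuration.

```python
# Hedged sketch: fine-tuning a Thai BERT model for binary depression detection.
# Model name, hyperparameters, and data are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

MODEL_NAME = "airesearch/wangchanberta-base-att-spm-uncased"  # assumed Thai BERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy in-memory examples; the real corpus would be loaded from the released dataset.
train = Dataset.from_dict({
    "text": ["ตัวอย่างข้อความจากบล็อก ...", "another blog post ..."],
    "label": [1, 0],  # 1 = depression, 0 = control
})

def tokenize(batch):
    # Pad to a fixed length so the default collator can batch the examples.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train = train.map(tokenize, batched=True)

args = TrainingArguments(output_dir="thai-depression-baseline",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=train).train()
```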

Cited by 6 publications (3 citation statements) | References 22 publications (13 reference statements)
“…As the model we are basing our work on (i.e., multilingual BERT) is trained on a generic encyclopedia corpus (Wikipedia) and has little exposure to Islamic and Qur'anic concepts, we continue training the multilingual BERT model to adapt it to the domain of the task here. In our previous research (Hämäläinen et al., 2021a; Hämäläinen et al., 2021b), we have found that BERT-based models tend to work better if their training data has included text of a similar domain as the downstream task the model is fine-tuned for. Therefore, we believe that domain adaptation is beneficial in this case as well.…”
Section: Domain Adaptation (mentioning)
confidence: 99%
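The domain adaptation this citing work describes is continued masked-language-model pretraining of multilingual BERT on in-domain text before fine-tuning. A minimal sketch of that recipe, with a hypothetical corpus file and illustrative hyperparameters, might look like this:

```python
# Hedged sketch of domain-adaptive continued pretraining of multilingual BERT
# with masked language modelling. The corpus path and hyperparameters are
# assumptions, not the cited authors' exact setup.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Plain-text in-domain corpus, one document per line; the file name is hypothetical.
corpus = load_dataset("text", data_files={"train": "in_domain_corpus.txt"})["train"]
corpus = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256),
                    batched=True, remove_columns=["text"])

# The collator handles dynamic padding and random masking of 15% of tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="mbert-domain-adapted",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
# The adapted checkpoint in "mbert-domain-adapted" can then be fine-tuned on the downstream task.
```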
“…These metrics were defined using Eqs. (3)–(6), respectively. This study proposed a novel stacking ensemble strategy consisting of five main stages, as denoted in Fig.…”
Section: Evaluation Metrics (mentioning)
confidence: 99%
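The metrics referenced as Eqs. (3)–(6) are not reproduced in the snippet; assuming they are the usual accuracy, precision, recall, and F1, a small scikit-learn sketch of computing them for a stacking ensemble could look like the following. The base learners and synthetic data are arbitrary stand-ins, not the cited study's five-stage pipeline.

```python
# Illustrative sketch: a stacking ensemble evaluated with accuracy, precision,
# recall, and F1. Base learners and data are assumptions for demonstration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary-classification data standing in for a real text feature matrix.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", LinearSVC(random_state=0))],
    final_estimator=LogisticRegression(),
)
ensemble.fit(X_tr, y_tr)
pred = ensemble.predict(X_te)

precision, recall, f1, _ = precision_recall_fscore_support(y_te, pred, average="binary")
print(f"accuracy={accuracy_score(y_te, pred):.3f} "
      f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```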
“…The remaining datasets that have appeared in studies have been kept private, and their details have not been disclosed. Moreover, many of them contain a very small number of trainable examples, such as the Thai sentence Wiki dataset [18], which comprises 600 samples, and the Thai depression dataset with only 944 samples [5].…”
Section: Introduction (mentioning)
confidence: 99%