Compilation, Analysis and Application of a Comprehensive Bangla Corpus KUMono

Akther, Aysha; Islam, Md. Shymon; Sultana, Hafsa; Rahman, Aowabin; Saha, Sujana; Alam, Kazi Masudul; Debnath, Rameswar

doi:10.1109/access.2022.3195236

Cited by 6 publications (1 citation statement)

References 25 publications (33 reference statements)

“…Lexical metrics, such as the Total Word Count and Unique Word Count, offer insights into the textual richness and the diversity of the vocabulary. These metrics are crucial for evaluating the potential of datasets to provide varied linguistic input necessary for training robust NLP models [39–41]. By examining the total count of words alongside the count of unique words, the assessment not only evaluates the volume of linguistic content available but also its variety, which is indicative of the potential complexity and nuance that NLP models must grapple with.…”

Section: Results (mentioning)

confidence: 99%

Creating and validating the Fine-Grained Question Subjectivity Dataset (FQSD): A new benchmark for enhanced automatic subjective question answering systems

Babaali, Fatemi, Nematbakhsh. 2024. PLoS ONE

In the domain of question subjectivity classification, there exists a need for detailed datasets that can foster advancements in Automatic Subjective Question Answering (ASQA) systems. Addressing the prevailing research gaps, this paper introduces the Fine-Grained Question Subjectivity Dataset (FQSD), which comprises 10,000 questions. The dataset distinguishes between subjective and objective questions and offers additional categorizations such as Subjective-types (Target, Attitude, Reason, Yes/No, None) and Comparison-form (Single, Comparative). Annotation reliability was confirmed via robust evaluation techniques, yielding a Fleiss’s Kappa score of 0.76 and Pearson correlation values up to 0.80 among three annotators. We benchmarked FQSD against existing datasets such as (Yu, Zha, and Chua 2012), SubjQA (Bjerva 2020), and ConvEx-DS (Hernandez-Bocanegra 2021). Our dataset excelled in scale, linguistic diversity, and syntactic complexity, establishing a new standard for future research. We employed visual methodologies to provide a nuanced understanding of the dataset and its classes. Utilizing transformer-based models like BERT, XLNET, and RoBERTa for validation, RoBERTa achieved an outstanding F1-score of 97%, confirming the dataset’s efficacy for the advanced subjectivity classification task. Furthermore, we utilized Local Interpretable Model-agnostic Explanations (LIME) to elucidate model decision-making, ensuring transparent and reliable model predictions in subjectivity classification tasks.
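The two lexical metrics named in the citation statement above, Total Word Count and Unique Word Count, can be illustrated with a minimal sketch. This is not the method of either paper: the function name `lexical_metrics`, the whitespace tokenization, the punctuation stripping, and the extra type-token ratio are all assumptions made for illustration.

```python
import string

# Assumption: strip ASCII punctuation plus the Bangla danda from token edges.
STRIP_CHARS = string.punctuation + "।"


def lexical_metrics(text):
    """Return total word count, unique word count, and type-token ratio.

    Hypothetical helper: tokens are whitespace-separated, lowercased, and
    stripped of leading/trailing punctuation. Real corpora usually need a
    language-aware tokenizer.
    """
    tokens = [t.strip(STRIP_CHARS) for t in text.lower().split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation

    total = len(tokens)        # Total Word Count (volume of content)
    unique = len(set(tokens))  # Unique Word Count (vocabulary variety)
    # Type-token ratio is an added convenience, not mentioned in the source.
    ratio = unique / total if total else 0.0

    return {
        "total_words": total,
        "unique_words": unique,
        "type_token_ratio": ratio,
    }


if __name__ == "__main__":
    print(lexical_metrics("The cat saw the other cat."))
```

Whitespace splitting is used here instead of a `\w+` regular expression because Python's `\w` does not match Bangla dependent vowel signs, which would split Bangla words into fragments.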
