The confusion matrix is a standard tool for evaluating the performance of a classification algorithm. While the computation of the confusion matrix for multi-class classification follows a well-established procedure, the common approach to computing the confusion matrix for multi-label classification suffers from the ambiguity of the one-vs-rest strategy and ignores the possibility that predictions may be partially correct, which in turn leads to inaccuracies in the derived evaluation metrics. Only recently have two approaches been proposed for calculating a confusion matrix that accounts for the specifics of multi-label classification. In this work, a new method for calculating evaluation metrics for multi-label classification is proposed. The method is based on two confusion matrices combined into a confusion tensor, and it builds on insights into the shortcomings of the two existing approaches for calculating the multi-label confusion matrix, whose main drawback is their inability to compute precision and recall exactly. The Multi-Label Confusion Tensor was tested on synthetic and real data and compared with existing methods for calculating the multi-label confusion matrix. The source code and the data used to test the methodology are publicly available.
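As context for the critique above, the following is a minimal sketch of the common one-vs-rest approach to the multi-label confusion matrix: each label is treated as an independent binary problem and gets its own 2x2 matrix in the `[[TN, FP], [FN, TP]]` layout (the same convention used by `sklearn.metrics.multilabel_confusion_matrix`). This illustrates the baseline the paper argues against, not the proposed confusion tensor; all names below are illustrative.

```python
def one_vs_rest_confusion_matrices(y_true, y_pred):
    """Return one [[TN, FP], [FN, TP]] matrix per label.

    y_true, y_pred: lists of equal-length binary indicator vectors,
    one vector per sample, one entry per label.
    """
    n_labels = len(y_true[0])
    matrices = []
    for label in range(n_labels):
        tn = fp = fn = tp = 0
        for t_vec, p_vec in zip(y_true, y_pred):
            t, p = t_vec[label], p_vec[label]
            if t == 1 and p == 1:
                tp += 1        # label present and predicted
            elif t == 0 and p == 0:
                tn += 1        # label absent and not predicted
            elif t == 0 and p == 1:
                fp += 1        # label predicted but absent
            else:
                fn += 1        # label present but missed
        matrices.append([[tn, fp], [fn, tp]])
    return matrices

# Example: two samples, three labels. The second sample is predicted
# perfectly; the first is only partially correct, yet the per-label
# matrices carry no notion of "partially correct" for a sample.
Y_true = [[1, 0, 1],
          [0, 1, 0]]
Y_pred = [[1, 1, 0],
          [0, 1, 0]]
cms = one_vs_rest_confusion_matrices(Y_true, Y_pred)
```

Because the counts are accumulated independently per label, a prediction that is partially correct for a sample is split across unrelated binary matrices, which is the ambiguity the abstract refers to.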
INDEX TERMS Multi-label classification, confusion matrix, classification performance, machine learning