Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions

Tekumalla, Ramya; Banda, Juan M.

doi:10.1007/s00521-021-06614-2

Cited by 10 publications

(5 citation statements)

References 24 publications


“…Leveraging weak supervision has demonstrated potential in the realm of social media mining [3,28,29]. A dataset acquired without manual curation using a weak labelling heuristic detailed in the respective section is termed as 'bronze-standard dataset'.…”

Section: Bronze Standard Dataset (mentioning)

confidence: 99%

Towards Robust Urdu Aspect-based Sentiment Analysis through Weakly-Supervised Annotation Framework

Maqsood, Latif, Salman et al. 2024

Preprint


Aspect-Based Sentiment Analysis (ABSA) is pivotal for diverse applications but faces significant hurdles in under-resourced languages like Urdu, primarily due to the absence of a comprehensive, annotated benchmark corpus. This study tackles this gap by introducing a novel Weakly Supervised technique to construct a benchmark dataset tailored for Urdu ABSA, addressing public availability, domain coverage, and annotation comprehensiveness. Our dataset encompasses detailed annotations across all ABSA dimensions i.e. aspect, opinion, sentiment polarity and category. Through a comparative analysis involving Large Language Models (LLMs), human annotations, and pre-trained models based on expertly curated datasets, we demonstrate the dataset’s complexity and the nuanced nature of ABSA in Urdu, as reflected in the challenging outcomes of ABSA subtasks using a basic LSTM approach. This research not only advances Urdu ABSA techniques but also illuminates the broader challenges of Opinion Mining in under-resourced languages, setting a precedent for future work in this critical area.



“…We retrained the models adding the biocreative validation dataset and finally obtained the predictions on the test data. We filtered all the positive predictions and extracted the spans of the medication term using a medication dictionary (47). The SMMT_NER utility from the Social Media Mining Toolkit (48) was utilized for identifying the spans of the medication.…”

Section: Systems (mentioning)

confidence: 99%

Automatic Extraction of Medication Mentions from Tweets—Overview of the BioCreative VII Shared Task 3 Competition

et al. 2023


This study presents the outcomes of the shared task competition BioCreative VII (Task 3) focusing on the extraction of medication names from a Twitter user’s publicly available tweets (the user’s ‘timeline’). In general, detecting health-related tweets is notoriously challenging for natural language processing tools. The main challenge, aside from the informality of the language used, is that people tweet about any and all topics, and most of their tweets are not related to health. Thus, finding those tweets in a user’s timeline that mention specific health-related concepts such as medications requires addressing extreme imbalance. Task 3 called for detecting tweets in a user’s timeline that mention a medication name and, for each detected mention, extracting its span. The organizers made available a corpus consisting of 182 049 tweets publicly posted by 212 Twitter users with all medication mentions manually annotated. The corpus exhibits the natural distribution of positive tweets, with only 442 tweets (0.2%) mentioning a medication. This task was an opportunity for participants to evaluate methods that are robust to class imbalance beyond the simple lexical match. A total of 65 teams registered, and 16 teams submitted a system run. This study summarizes the corpus created by the organizers and the approaches taken by the participating teams for this challenge. The corpus is freely available at https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-3/. The methods and the results of the competing systems are analyzed with a focus on the approaches taken for learning from class-imbalanced data.
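The dictionary-based span extraction mentioned in the citation statement above can be illustrated with a minimal sketch. This is not the SMMT_NER utility or its API; the function name `find_medication_spans` and the tiny lexicon are hypothetical, and the approach (case-insensitive whole-word matching against a term list) is only an assumption about how a simple lexical matcher might work.

```python
import re
from typing import List, Tuple


def find_medication_spans(text: str, lexicon: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for each lexicon term found in text.

    Hypothetical helper for illustration; not part of any toolkit.
    """
    # Deduplicate and put longer terms first so that multi-word names
    # are preferred over shorter terms they contain.
    terms = sorted({t.lower() for t in lexicon if t}, key=len, reverse=True)
    if not terms:
        return []
    # Whole-word, case-insensitive alternation over the escaped terms.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # Made-up example tweet and lexicon, for demonstration only.
    tweet = "Took Advil and some ibuprofen today"
    print(find_medication_spans(tweet, ["advil", "ibuprofen"]))
```

A matcher like this only finds exact dictionary terms, which is why the shared task described above asked for methods that go beyond simple lexical match.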


“…Obtaining a considerable amount of high-quality label data costs a lot. Therefore, the strong dependence on labelled data hinders the application of the deep learning model, which is the bottleneck of supervised learning [24]. In natural language processing, it is difficult to obtain high-quality labelled texts.…”

Section: Related Work (mentioning)

confidence: 99%

A semi-supervised short text sentiment classification method based on improved Bert model from unlabelled data

Zou, Wang 2023

J Big Data


Short text information has considerable commercial value and immeasurable social value. Natural language processing and short text sentiment analysis technology can organize and analyze short text information on the Internet. Natural language processing tasks such as sentiment classification have achieved satisfactory performance under a supervised learning framework. However, traditional supervised learning relies on large-scale and high-quality manual labels and obtaining high-quality label data costs a lot. Therefore, the strong dependence on label data hinders the application of the deep learning model to a large extent, which is the bottleneck of supervised learning. At the same time, short text datasets such as product reviews have an imbalance in the distribution of data samples. To solve the above problems, this paper proposes a method to predict label data according to semi-supervised learning mode and implements the MixMatchNL data enhancement method. Meanwhile, the Bert pre-training model is updated. The cross-entropy loss function in the model is improved to the Focal Loss function to alleviate the data imbalance in short text datasets. Experimental results based on public datasets indicate the proposed model has improved the accuracy of short text sentiment recognition compared with the previous update and other state-of-the-art models.
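The abstract above says cross-entropy is replaced with Focal Loss to alleviate class imbalance. A minimal sketch of the commonly used focal loss formulation, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), is given below for a single example. The default values `gamma=2.0` and `alpha=1.0` are assumptions chosen for illustration, and this is not necessarily the exact variant or configuration used in the cited paper.

```python
import math
from typing import Sequence


def focal_loss(probs: Sequence[float], target: int,
               gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example, given predicted class probabilities.

    gamma down-weights well-classified examples; gamma = 0 with alpha = 1
    reduces to ordinary cross-entropy.
    """
    # Clamp to avoid log(0) when the model assigns zero probability.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Cross-entropy expressed as the gamma = 0 special case."""
    return focal_loss(probs, target, gamma=0.0, alpha=1.0)


if __name__ == "__main__":
    easy = [0.9, 0.1]   # confident, correct prediction for class 0
    hard = [0.3, 0.7]   # poor prediction for class 0
    for name, p in (("easy", easy), ("hard", hard)):
        print(name, cross_entropy(p, 0), focal_loss(p, 0))
```

With `gamma > 0`, the easy example's loss shrinks by a much larger factor than the hard example's, so training signal concentrates on hard, often minority-class, examples.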

