Social scientists have shown that up to 50% of the comments posted to a news article have no relation to its journalistic content. In this study we propose a classification algorithm that categorizes user comments posted to a news article based on their alignment with its content. The alignment matches user comments to an article based on similarity of content, entities in discussion, and topics. We propose BERTAC, a BERT-based approach that jointly learns article-comment embeddings and infers the relevance class of comments. We introduce an ordinal classification loss that penalizes the difference between the predicted and true labels, and we conduct a thorough study of the influence of the proposed loss on the learning process. The results on five representative news outlets show that our approach can learn the comment class with up to 36% average accuracy improvement compared to the baselines, and up to 25% compared to BA-BC, a variant of our approach that consists of two models aimed at capturing disjointly the formal language of news articles and the informal language of comments. We also conduct a user study of human labeling performance to understand the difficulty of the classification task. The user agreement on comment-article alignment is "moderate" per Krippendorff's alpha, which suggests that the classification task is difficult.

Keywords: Text mining • Text classification • Online news • News comments • Relevancy • Understanding user-generated text

J. Alshehri and M. Stanojevic contributed equally.
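The abstract describes an ordinal classification loss that penalizes predictions in proportion to their distance from the true label. One common way to formulate such a loss is as the expected absolute distance between the predicted class distribution and the true ordinal label; the sketch below illustrates that idea in plain Python. This is a hypothetical formulation for illustration only, not necessarily the exact loss used in the paper.

```python
def ordinal_loss(probs, true_label):
    """Expected absolute distance between a predicted distribution over
    ordered relevance classes and the true label index.
    (Illustrative formulation; the paper's exact loss may differ.)"""
    return sum(p * abs(k - true_label) for k, p in enumerate(probs))

# Unlike plain cross-entropy, this loss penalizes a prediction more
# the farther its probability mass sits from the true class.
near = ordinal_loss([0.1, 0.8, 0.1], 1)  # mass on the true class -> 0.2
far = ordinal_loss([0.8, 0.1, 0.1], 1)   # mass one class away    -> 0.9
```

In practice such a term would be combined with (or substituted for) the standard cross-entropy when training the classifier head.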
Disagreement among text annotators as part of a human (expert) labeling process produces noisy labels, which affect the performance of supervised learning algorithms for natural language processing. Using only high-agreement annotations introduces another challenge: the data imbalance problem. We study this challenge within the problem of relating user comments to the content of a news article. We show that traditional techniques for learning from imbalanced data, such as oversampling, using weighted loss functions, or assigning weak labels via crowdsourcing, may not be sufficient for modeling complex temporal relationships between news articles and user comments. In this study, we propose a framework for aligning comments and articles (1) from imbalanced news data characterized by (2) different degrees of annotator agreement, under (3) a constrained budget for human labeling and computing resources. Within the framework, we propose a semi-automatic labeling solution based on Human-AI collaboration. We compare our proposed technique with traditional data imbalance handling techniques and with synthetic data generation on the article-comment alignment problem, where the goal is to determine a category for an article-comment pair that represents how relevant the comment is to the article. Finding an effective and efficient solution is essential because it is time-consuming and prohibitively costly to manually label a sufficiently large number of article-comment pairs based on the semantic understanding of an article and its comments. We find that the Human-AI collaboration outperforms all alternative techniques by 17% in article-comment alignment accuracy. When there is no time or budget to re-label some article-comment pairs, we find that synonym augmentation is a reasonable alternative. We also provide a detailed analysis of the effect of humans in the loop and of the use of unlabeled data.
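One of the traditional imbalance-handling baselines mentioned above is a weighted loss function, where each class's contribution to the loss is scaled inversely to its frequency. The sketch below shows the standard "balanced" weighting heuristic in plain Python; it is a generic baseline for illustration, not the paper's specific configuration.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency,
    so rare classes contribute more to a weighted loss.
    weight(c) = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# A minority class ("unrelated" here, a hypothetical label name)
# receives a larger weight than the majority class.
weights = inverse_frequency_weights(
    ["related", "related", "related", "unrelated"]
)
```

These weights would typically be passed to the training loss so that misclassifying minority-class pairs is penalized more heavily; the abstract's point is that this alone may not suffice for the article-comment alignment task.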