Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding

Sarwar, Talha Bin; Noor, Noorhuzaimi Mohd; Miah, M. Saef Ullah

doi:10.7717/peerj-cs.1024

Cited by 12 publications

(6 citation statements)

References 37 publications

“…The most significant choice to make when applying our proposed methodology is the keyword extraction technique. We reviewed recent studies that addressed the problem of keyword extraction, focusing on those that compared the performance of state of the art techniques on the gold-standard keyword extraction datasets ( Sarwar, Noor & Miah, 2022 ; Piskorski et al, 2021 ; Miah et al, 2021 ; Papagiannopoulou & Tsoumakas, 2020 ). We also checked the methods that were reported by the recent techniques as effective baselines.…”

Section: Methods (mentioning)

confidence: 99%

Unsupervised query reduction for efficient yet effective news background linking

Essam, Elsayed. 2023. PeerJ Computer Science

In this article, we study efficient techniques to tackle the news background linking problem, in which an online reader seeks background knowledge about a given article to better understand its context. Recently, this problem attracted many researchers, especially in the Text Retrieval Conference (TREC) community. Surprisingly, the most effective method to date uses the entire input news article as a search query in an ad-hoc retrieval approach to retrieve the background links. In a scenario where the lookup for background links is performed online, this method becomes inefficient, especially if the search scope is big such as the Web, due to the relatively long generated query, which results in a long response time. In this work, we evaluate different unsupervised approaches for reducing the input news article to a much shorter, hence efficient, search query, while maintaining the retrieval effectiveness. We conducted several experiments using the Washington Post dataset, released specifically for the news background linking problem. Our results show that a simple statistical analysis of the article using a recent keyword extraction technique reaches an average of 6.2× speedup in query response time over the full article approach, with no significant difference in effectiveness. Moreover, we found that further reduction of the search terms can be achieved by eliminating relatively low TF-IDF values from the search queries, yielding even more efficient retrieval of 13.3× speedup, while still maintaining the retrieval effectiveness. This makes our approach more suitable for practical online scenarios. Our study is the first to address the efficiency of news background linking systems. We, therefore, release our source code to promote research in that direction.
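The query-reduction idea described in this abstract, dropping terms with relatively low TF-IDF weight from the search query, can be sketched roughly as follows. This is a minimal illustration and not the authors' released code: the function name `reduce_query`, the smoothed IDF formula, the `keep_ratio` parameter, and the toy corpus in the usage note are all assumptions.

```python
import math
from collections import Counter


def reduce_query(article_tokens, corpus, keep_ratio=0.5):
    """Keep the highest-TF-IDF fraction of an article's distinct terms.

    Hypothetical helper: the IDF smoothing and the keep_ratio cutoff are
    illustrative choices, not taken from the cited paper.
    """
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    term_freq = Counter(article_tokens)

    scores = {}
    for term, tf in term_freq.items():
        # Smoothed IDF so that terms unseen in the corpus do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        scores[term] = tf * idf

    # Keep at least one term; ties are broken alphabetically for determinism.
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:n_keep]
```

With a toy corpus in which "the" appears in every document, calling `reduce_query(["the", "election", "election", "recount"], corpus, keep_ratio=0.5)` keeps the rarer, more frequent terms and drops "the", which is the kind of shortening the abstract credits with faster retrieval.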

“…We used two lists of normalized keyphrases for each sample from the annotators. We use the Jaccard index to measure the agreement/similarity between annotations (Sarwar, Noor, and Miah 2022). Jaccard index is defined as:…”

Section: Validation Of Annotation (mentioning)

confidence: 99%
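
The quoted passage breaks off before the formula. The standard Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|; a minimal sketch for comparing two annotators' keyphrase lists follows. The function name `jaccard_index`, the lowercase/strip normalization, and the convention of returning 1.0 when both lists are empty are assumptions, not details from the citing paper.

```python
def jaccard_index(phrases_a, phrases_b):
    """Jaccard index between two lists of keyphrases, treated as sets.

    Hypothetical helper: the normalization (strip + lowercase) is an
    illustrative stand-in for whatever normalization the annotators used.
    """
    set_a = {p.strip().lower() for p in phrases_a}
    set_b = {p.strip().lower() for p in phrases_b}
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty annotations agree completely.
        return 1.0
    return len(set_a & set_b) / len(union)
```

For example, two lists that share two phrases out of four distinct phrases in total score 0.5.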

Theme-Driven Keyphrase Extraction to Analyze Social Media Discourse

Romano, Sharif, Basak et al. 2024. ICWSM

Social media platforms are vital resources for sharing self-reported health experiences, offering rich data on various health topics. Despite advancements in Natural Language Processing (NLP) enabling large-scale social media data analysis, a gap remains in applying keyphrase extraction to health-related content. Keyphrase extraction is used to identify salient concepts in social media discourse without being constrained by predefined entity classes. This paper introduces a theme-driven keyphrase extraction framework tailored for social media, a pioneering approach designed to capture clinically relevant keyphrases from user-generated health texts. Themes are defined as broad categories determined by the objectives of the extraction task. We formulate this novel task of theme-driven keyphrase extraction and demonstrate its potential for efficiently mining social media text for the use case of treatment for opioid use disorder. This paper leverages qualitative and quantitative analysis to demonstrate the feasibility of extracting actionable insights from social media data and efficiently extracting keyphrases using minimally supervised NLP models. Our contributions include the development of a novel data collection and curation framework for theme-driven keyphrase extraction and the creation of SuboxoPhrase, the first dataset of its kind comprising human-annotated keyphrases from a Reddit community. We also identify the scope of minimally supervised NLP models to extract keyphrases from social media data efficiently. Lastly, we found that a large language model (ChatGPT) outperforms unsupervised keyphrase extraction models, showcasing its efficacy in this task.

“…When the available knowledge is in the form of textual documents, this step is referred to as word embedding, whereas, when dealing with graph-shaped knowledge, as graph embedding. Examples about word embedding for semantic relatedness are proposed in [35], [52], and [65]. In particular, [35] aims at achieving a better accuracy on the semantic relatedness of both isolated words and words in contexts.…”

Section: Related Work (mentioning)

confidence: 99%

“…In particular, [35] aims at achieving a better accuracy on the semantic relatedness of both isolated words and words in contexts. In [52], word embedding is applied to represent keyphrases in a corpus of textual documents in order to find similar news articles. In [65], a semantic relatedness graph is constructed in order to detect sentiment polarities in a long sentence towards multiple aspect categories.…”

Section: Related Work (mentioning)

confidence: 99%
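
One common way to measure semantic relatedness between keyphrases with word embeddings is to average the word vectors of each phrase and compare the averages by cosine similarity. The sketch below uses a tiny hand-made embedding table purely for illustration; the vectors, the `phrase_vector` and `cosine_similarity` helpers, and the averaging strategy are assumptions and are not taken from [52] or any other cited work.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "election": [0.9, 0.1, 0.0],
    "vote": [0.8, 0.2, 0.0],
    "weather": [0.0, 0.1, 0.9],
    "forecast": [0.1, 0.0, 0.8],
}


def phrase_vector(phrase, embeddings):
    """Average the vectors of the known words in a phrase (None if none known)."""
    vectors = [embeddings[w] for w in phrase.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In practice the toy table would be replaced by vectors from a trained embedding model; with these toy values, "election vote" scores much closer to "vote" than "election" does to "weather forecast".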