Proceedings of the 12th International Conference on Natural Language Generation 2019
DOI: 10.18653/v1/w19-8617

KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents

Abstract: Keyphrase generation is the task of predicting a set of lexical units that conveys the main content of a source text. Existing datasets for keyphrase generation are only readily available for the scholarly domain and include non-expert annotations. In this paper we present KPTimes, a large-scale dataset of news texts paired with editor-curated keyphrases. Exploring the dataset, we show how editors tag documents, and how their annotations differ from those found in existing datasets. We also train and evaluate …

Cited by 27 publications (32 citation statements)
References 12 publications
“…That being said, CopyRNN, which is the best overall model, fails to consistently outperform the baselines on all datasets. One reason for that is the limited generalization ability of neural-based models [10,15,34], which means that their performance degrades on documents that differ from the ones encountered during training. This is besides confirmed by the extremely low performance of these models on DUC-2001 and KPCrowd.…”
Section: Results (mentioning)
confidence: 99%
“…Similar to paper abstracts, online news are available in large quantities and can be easily mined from the internet. We selected the following three datasets: DUC-2001 [43], 500N-KPCrowd [33] and KP-Times [15]. The first two datasets provide reader-assigned keyphrases, while KPTimes supplies indexer-assigned keyphrases extracted from metadata and initially intended for search engines.…”
Section: Benchmark Datasets (mentioning)
confidence: 99%
“…For future work, we plan to explore directions that would enable us to simultaneously optimize for quality and diversity metrics. *Note that our test set for KPTimes is a combination of 10k records from KPTimes and 10k records from JPTimes (Gallina et al, 2019).…”
Section: Discussion (mentioning)
confidence: 99%
“…We adopt the standard metric and compute the f-measure at top 5, as it corresponds to the average number of keyphrases in KP20k and NTCIR-2, that is, 5.3 and 4.8, respectively. We also examine cross-domain generalization using the KPTimes news dataset (Gallina et al, 2019), and include a state-of-the-art unsupervised keyphrase extraction model (Boudin, 2018, henceforth mp-rank) for comparison purposes. This latter baseline also provides an additional relevance signal based on graph-based ranking whose usefulness in retrieval will be tested in subsequent experiments.…”
Section: Keyphrase Generation (mentioning)
confidence: 99%
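
The "f-measure at top 5" mentioned in the last statement can be illustrated with a small sketch. It is not the citing authors' evaluation code: the function name `f1_at_k` is hypothetical, matching is assumed to be exact after lowercasing and whitespace normalization (published evaluations often also apply stemming), and precision is assumed to be computed over the number of predictions actually kept, not a fixed k.

```python
def f1_at_k(predicted, gold, k=5):
    """F1 between the top-k predicted keyphrases and the gold keyphrases.

    Assumptions (not taken from the cited papers):
    - phrases match only if identical after lowercasing and
      whitespace normalization (no stemming);
    - duplicate predictions are dropped before truncating to k;
    - precision is divided by the number of predictions kept.
    """
    def normalize(phrase):
        return " ".join(phrase.lower().split())

    # Keep the first k distinct predictions, preserving rank order.
    top_k, seen = [], set()
    for phrase in predicted:
        norm = normalize(phrase)
        if norm not in seen:
            seen.add(norm)
            top_k.append(norm)
        if len(top_k) == k:
            break

    gold_set = {normalize(phrase) for phrase in gold}
    if not top_k or not gold_set:
        return 0.0

    correct = sum(1 for phrase in top_k if phrase in gold_set)
    if correct == 0:
        return 0.0

    precision = correct / len(top_k)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical ranked predictions and gold keyphrases.
    ranked = ["keyphrase generation", "news", "dataset",
              "editors", "neural models", "evaluation"]
    reference = ["Keyphrase Generation", "dataset", "annotation"]
    print(f1_at_k(ranked, reference))
```

Corpus-level scores are typically obtained by averaging this per-document value over all test documents.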