“…To name just a few examples, experiments are most often conducted on different benchmark datasets, all of which differ in domain, size, language or quality of the gold standard (that is, the reference keyphrases supplied by authors, readers or professional indexers). This not only makes the reported results hard to compare, but also has a profound impact on trained model performance [15]. In addition, since there is no consensus as to which evaluation metric is most reliable for keyphrase extraction [21,24,49], a diverse range of measures is used in the literature, further preventing direct comparison.…”
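As a concrete illustration of why metric choices hinder comparison (this sketch is ours, not from the quoted paper), the snippet below computes F1@k, one commonly reported keyphrase evaluation measure, under exact matching after a simple lowercasing step. The function names, the choice of k, and the whitespace-only normalization are illustrative assumptions; published benchmarks vary in all three (for example, many apply stemming before matching), which is precisely the source of incomparability described above.

```python
# Illustrative sketch of one common keyphrase evaluation measure: F1@k with
# exact string matching after a simple normalization. Real setups differ in
# normalization (e.g. stemming), in the cutoff k, and in how scores are
# averaged across documents, so reported numbers are rarely comparable.

def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace; a stand-in for stemming pipelines."""
    return " ".join(phrase.lower().split())

def f1_at_k(predicted: list[str], gold: list[str], k: int = 10) -> float:
    """F1 of the top-k predicted keyphrases against the gold keyphrases."""
    top_k = [normalize(p) for p in predicted[:k]]
    gold_set = {normalize(g) for g in gold}
    matches = sum(1 for p in top_k if p in gold_set)
    if matches == 0:
        return 0.0
    precision = matches / len(top_k)
    recall = matches / len(gold_set)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    predicted = ["keyphrase extraction", "Neural Networks", "benchmarks"]
    gold = ["keyphrase extraction", "neural networks", "evaluation metrics"]
    # 2 of 3 predictions match 2 of 3 gold phrases: P = R = F1 = 2/3.
    print(f"F1@10 = {f1_at_k(predicted, gold):.3f}")
```

Note that swapping the lowercasing step for a stemmer, changing k, or macro-averaging per document instead of pooling matches would each yield a different score on the same system output, which is the incomparability the passage points to.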