Large-Scale Evaluation of Keyphrase Extraction Models

Gallina, Ygor; Boudin, Florian; Daille, Béatrice

doi:10.1145/3383583.3398517

Cited by 11 publications

(14 citation statements)

References 30 publications


“…However, this test set still allow for evaluation of how much important information these vocabularies can extract. Our performance evaluation uses a widely used exact match evaluation method [3,21], where only the extract matches with the gold standard are considered as true positives. Specifically, $\mathrm{Precision} = \frac{\text{Number of Matched in Abstract}_i}{\text{Number of Extracted in Abstract}_i}$, $\mathrm{Recall} = \frac{\text{Number of Matched in Abstract}_i}{\text{Number of Annotated in Abstract}_i}$.…”

Section: Evaluation Of Phrase Extraction Based On Human Annotated Data (mentioning)

confidence: 99%
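
The exact-match precision and recall described in the statement above can be sketched for a single abstract as follows. This is a minimal illustration, not the citing authors' code: the function name `exact_match_scores` and the light normalisation (lowercasing, whitespace collapsing) are assumptions, and the example phrases are made up.

```python
def exact_match_scores(extracted, annotated):
    """Return (precision, recall) for one abstract using exact string matching."""
    # Assumption: phrases are compared after lowercasing and collapsing whitespace.
    def norm(phrase):
        return " ".join(phrase.lower().split())

    extracted_set = {norm(p) for p in extracted}
    annotated_set = {norm(p) for p in annotated}
    matched = extracted_set & annotated_set  # only exact matches count as true positives

    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    recall = len(matched) / len(annotated_set) if annotated_set else 0.0
    return precision, recall


# Hypothetical phrases for one abstract: 2 of 3 extracted phrases match 2 of 4 annotated ones.
precision, recall = exact_match_scores(
    ["neural network", "keyphrase extraction", "data"],
    ["Keyphrase Extraction", "neural network", "evaluation", "benchmark"],
)
```

Here two of the three extracted phrases match, so precision is 2/3, and two of the four annotated phrases are recovered, so recall is 1/2.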

WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia

Han, Yang, Mishra, et al. 2020

ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium


Hierarchical domain-specific classification schemas (or subject heading vocabularies) are often used to identify, classify, and disambiguate concepts that occur in scholarly articles. In this work, we develop, apply, and evaluate a human-in-the-loop workflow that first extracts an initial category tree from crowd-sourced Wikipedia data, and then combines community detection, machine learning, and hand-crafted heuristics or rules to prune the initial tree. This work resulted in WikiCSSH, a large-scale, hierarchically organized vocabulary for the domain of computer science (CS). Our evaluation suggests that WikiCSSH outperforms alternative CS vocabularies in terms of vocabulary size as well as the performance of lexicon-based key-phrase extraction from scholarly data. WikiCSSH can further distinguish between coarse-grained and fine-grained CS concepts. The outlined workflow can serve as a template for building hierarchically organized subject heading vocabularies for other domains that are covered in Wikipedia.


“…We follow these guidelines strictly, when it comes to the use of identical datasets and gold-standard keyword sets, but somewhat deviate from them when it comes to the employment of identical preprocessing techniques and parameter settings employed for different approaches. Since all unsupervised approaches operate on a set of keyphrase candidates, extracted from the input document, Gallina et al (2020) argues that the extraction of these candidates and other parameters should be identical (e.g., they select the sequences of adjacent nouns with one or more preceding adjectives of length up to five words in order to extract keyword candidates) for a fair comparison between algorithms. On the other hand, we are more interested in comparison between keyword extraction approaches instead of algorithms alone and argue that the distinct keyword candidate extraction techniques are inseparable from the overall approach and should arguably be optimized for each distinct algorithm.…”

Section: Discussion (mentioning)

confidence: 99%
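
The candidate selection rule mentioned in the statement above (adjacent nouns with preceding adjectives, up to five words) can be sketched roughly as follows. This is an illustrative sketch, not the implementation used by Gallina et al.: the input is assumed to be already part-of-speech tagged with `"ADJ"`/`"NOUN"` tags, the function name and the `require_adjective` switch are hypothetical, and dropping over-long sequences (rather than truncating them) is an assumption.

```python
def extract_candidates(tagged_tokens, max_len=5, require_adjective=False):
    """Collect maximal ADJ* NOUN+ sequences of at most max_len words.

    tagged_tokens: list of (token, tag) pairs; tags are assumed to be
    coarse labels such as "ADJ" and "NOUN" produced by some tagger.
    """
    candidates = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        # Consume any run of adjectives starting at position i.
        j = i
        while j < n and tagged_tokens[j][1] == "ADJ":
            j += 1
        # Consume the run of nouns that follows the adjectives.
        k = j
        while k < n and tagged_tokens[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun was found
            has_adjective = j > i
            if k - i <= max_len and (has_adjective or not require_adjective):
                candidates.append(" ".join(tok for tok, _ in tagged_tokens[i:k]))
            i = k
        else:
            # No noun here: skip past the adjectives, or past one other token.
            i = max(j, i + 1)
    return candidates


# Hypothetical pre-tagged input.
sentence = [
    ("Large", "ADJ"), ("scale", "NOUN"), ("evaluation", "NOUN"), ("of", "ADP"),
    ("keyphrase", "NOUN"), ("extraction", "NOUN"), ("models", "NOUN"),
]
all_candidates = extract_candidates(sentence)
adj_only = extract_candidates(sentence, require_adjective=True)
```

The `require_adjective` flag is there because the quoted wording ("one or more preceding adjectives") can be read as making the adjective mandatory; the default treats adjectives as optional.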

TNT-KID: Transformer-based neural tagger for keyword identification

2021


With growing amounts of available textual data, development of algorithms capable of automatic analysis, categorization, and summarization of these data has become a necessity. In this research, we present a novel algorithm for keyword identification, that is, an extraction of one or multiword phrases representing key aspects of a given document, called Transformer-Based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture for a specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model is capable of overcoming deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction by offering competitive and robust performance on a variety of different datasets while requiring only a fraction of manually labeled data required by the best-performing systems. This study also offers thorough error analysis with valuable insights into the inner workings of the model and an ablation study measuring the influence of specific components of the keyword identification workflow on the overall performance.


“…Many processes in various fields can benefit from successful keyword or keyphrase extraction, such as document clustering (Shubankar et al, 2011;Kim and Gil, 2019;Karpagam and Saradha, 2019), text classification (Hulth and Megyesi, 2006;Meng et al, 2019), information retrieval problems (Ji et al, 2019;Boudin et al, 2020), such as query expansion (Song et al, 2006) and faceted search, text summarization (Zhang et al, 2004;Litvak and Last, 2008;Song et al, 2019), entity recognition (Du et al, 2018) and event detection (Hossny et al, 2020). The important role of keyphrases in methods from various fields (such as those above), combined with the growing amount of digital text information on the Internet (online digital libraries, electronic newspapers, online magazines, customer reviews on e-commerce platforms, etc.)…”

Section: Discussion (unclassified)

“…Several tasks can benefit from accurate keyword or keyphrase extraction outcomes, including document clustering (Shubankar et al, 2011;Kim and Gil, 2019;Karpagam and Saradha, 2019), text classification (Hulth and Megyesi, 2006;Meng et al, 2019), information retrieval tasks (Ji et al, 2019;Boudin et al, 2020), such as query expansion (Song et al, 2006) and faceted search, text summarization (Zhang et al, 2004;Litvak and Last, 2008;Song et al, 2019), entity recognition (Du et al, 2018), and event detection (Hossny et al, 2020). The crucial role of keyphrases in these tasks along with the increasing online digital textual information (e.g., scientific digital libraries, e-newspapers, online magazines, customer reviews on e-commerce platforms, blogs, etc.)…”

Section: The Task Of Keyphrase Extraction (mentioning)

confidence: 99%

“…The crucial role of keyphrases in these tasks along with the increasing online digital textual information (e.g., scientific digital libraries, e-newspapers, online magazines, customer reviews on e-commerce platforms, blogs, etc.) has excited the interest of both the research community (Nasar et al, 2019;Çano and Bojar, 2019;Vega-Oliveros et al, 2019;Firoozeh et al, 2020;Gallina et al, 2020;Papagiannopoulou and Tsoumakas, 2020;Merrouni et al, 2019) and companies. Automatic keyphrase extraction has been used to improve customer service and monitor time-evolving product opportunities in real-time (Choi et al, 2020).…”

Section: The Task Of Keyphrase Extraction (mentioning)

confidence: 99%


Keyphrase extraction techniques

Papagiannopoulou (Παπαγιαννοπούλου)


This dissertation contributes original research in the field of keyphrase extraction. Keyphrase extraction from text concerns the automatic extraction of representative phrases from a document that express all the main aspects of its content. Keyphrases form a conceptual summary of the document, which is very useful in digital information management systems, in semantic indexing, and in document clustering/classification. Our study focuses on unsupervised keyphrase extraction methods. The basic steps of an unsupervised keyphrase extraction method are the following. First, the method selects candidate lexical units based on certain rules, such as selecting words that belong to specific parts of speech. Next, it assigns a score to the candidate lexical units and forms phrases by selecting the lexical units with the highest scores. Although the scope of the dissertation is text, its contributions could be extended to other application domains where graphs prevail as a means of representing information. In this work we address the following topics: (a) a deeper understanding of keyphrase extraction methods, (b) the proposal of an alternative representation and way of exploiting the statistical information of the document under examination, (c) the study of how strongly different evaluation metrics/approaches and sets of keyphrases used for evaluation affect the estimated performance of the methods, as well as the proposal of new evaluation metrics/approaches. (d) Finally, we present a study of the semantic evolution of words in the Greek language using word vector representations. In this dissertation, we present keyphrase extraction methods in an organized manner, proposing schemes for categorizing them. Next, we present a new unsupervised keyphrase extraction method whose main novelty is the use of local word vector representations. As this is the first time that such a local word vector representation has been used in the field of keyphrase extraction, we also place particular emphasis on keyword extraction methods for improving the ranking/scoring of individual words. Next, we present an evaluation study of the performance of commercial software packages and of the main unsupervised keyphrase extraction methods, as well as an analysis assessing the performance of the methods with respect to the use of different evaluation metrics/approaches and sets of keyphrases for evaluation. Finally, in the context of our interest in keyphrase extraction from texts of Greek literature from the 19th to the 21st century, using word vector representations, we began a study of the semantic evolution of words as reflected in word vectors.
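
The basic steps of an unsupervised method named in this abstract (rule-based selection of candidate words, scoring, then forming phrases from the top-scoring words) can be sketched as a toy pipeline. Everything below is an assumption made for illustration and is not the dissertation's method: a small stopword list stands in for part-of-speech rules, raw frequency stands in for the scoring function, and the name `extract_keyphrases` is hypothetical.

```python
import re
from collections import Counter

# Toy stopword list (assumption); a real system would use part-of-speech rules.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for", "from", "on"}


def extract_keyphrases(text, top_words=4):
    tokens = re.findall(r"[a-z]+", text.lower())

    # Step 1: rule-based candidate selection (stopword filter as a stand-in).
    candidates = [t for t in tokens if t not in STOPWORDS]

    # Step 2: score each candidate word; raw frequency is only a placeholder score.
    scores = Counter(candidates)
    kept = {word for word, _ in scores.most_common(top_words)}

    # Step 3: form phrases by joining adjacent top-scoring words.
    phrases, current = [], []
    for token in tokens + [""]:  # the empty sentinel flushes the final phrase
        if token in kept:
            current.append(token)
        elif current:
            phrase = " ".join(current)
            if phrase not in phrases:
                phrases.append(phrase)
            current = []
    return phrases


result = extract_keyphrases(
    "Graph ranking methods score words. Graph ranking is unsupervised.",
    top_words=2,
)
```

In this made-up input, "graph" and "ranking" are the two most frequent candidate words and always occur next to each other, so the only phrase formed is "graph ranking".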

