2019
DOI: 10.1002/pra2.49
Introducing orbis: An extendable evaluation pipeline for named entity linking performance drill‐down analyses

Abstract: Most current evaluation tools are focused solely on benchmarking and comparative evaluations and thus only provide aggregated statistics such as precision, recall and F1‐measure to assess overall system performance. They do not offer comprehensive analyses up to the level of individual annotations. This paper introduces Orbis, an extendable evaluation pipeline framework developed to allow visual drill‐down analyses of individual entities, computed by annotation services, in the context of the text they appear in, …
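To make the contrast drawn in the abstract concrete, here is a minimal sketch, not Orbis's actual API: the Annotation class, the example DBpedia URIs and the function names are all illustrative assumptions. It shows how aggregated precision/recall/F1 collapse information that a per-annotation drill-down keeps available for inspection.

```python
# Minimal illustrative sketch (not Orbis's actual API).
# Entity annotations are modeled as (start, end, uri) triples.
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    uri: str


def drill_down(gold: set, predicted: set):
    """Classify every individual annotation instead of only aggregating."""
    tp = gold & predicted   # correctly linked entities
    fp = predicted - gold   # spurious or wrongly linked entities
    fn = gold - predicted   # missed entities
    return tp, fp, fn


def aggregate(tp, fp, fn):
    """The aggregated statistics most evaluation tools report."""
    precision = len(tp) / (len(tp) + len(fp)) if tp or fp else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if tp or fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical gold standard and annotation-service output.
gold = {Annotation(0, 5, "http://dbpedia.org/resource/Orbis"),
        Annotation(10, 16, "http://dbpedia.org/resource/Vienna")}
pred = {Annotation(0, 5, "http://dbpedia.org/resource/Orbis"),
        Annotation(20, 24, "http://dbpedia.org/resource/Graz")}

tp, fp, fn = drill_down(gold, pred)
print("aggregated (P, R, F1):", aggregate(tp, fp, fn))
print("false positives:", fp)   # drill-down: inspect each individual error
print("false negatives:", fn)
```

The aggregated numbers alone would only say that half the gold entities were found; the drill-down additionally exposes which entity was missed and which prediction was spurious, which is the kind of per-annotation inspection the paper argues for.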

Cited by 4 publications (3 citation statements)
References 3 publications
“…Appropriate benchmarking suites and gold standard data are key towards evaluating content extraction methods, identifying their strengths and weaknesses. We, therefore, have created a gold standard dataset that is used in conjunction with the Open Source Orbis benchmarking framework [23] to evaluate Harvest's performance.…”
Section: Discussion
confidence: 99%
“…GERBIL (Röder et al., 2018) standardizes ED evaluation over multiple datasets in a unifying framework, but does not define the training data and thus only focuses on comparing already-trained models. Similarly, a range of prior works have sought to refine and standardize ED evaluation (Waitelonis et al., 2019; Nait-Hamoud et al., 2021; Noullet et al., 2021; Odoni et al., 2019; van Erp and Groth, 2020; Braşoveanu et al., 2018). In contrast, ZELDA defines the full experimental setup, including training data, the entity vocabulary and other training signals.…”
Section: Related Work
confidence: 99%
“…Future work will focus on: (i) improving the slot filling performance by enhancing page segmentation, increasing the coverage of the proprietary knowledge graph used for entity linking, and fine-tuning the entity recognition component. Given the importance of the created benchmarking framework for the research and development process, we plan on (ii) further increasing its size and coverage; and (iii) integrating the gold standard with explainable benchmarking frameworks such as Orbis [9] to make it more accessible to third-party researchers.…”
Section: Outlook and Conclusion
confidence: 99%