Abstract: We release EDGAR-CORPUS, a novel corpus comprising annual reports from all publicly traded companies in the US, spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are word2vec embeddings for the financial domain. We employ these embedd…
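As a rough illustration of the "clean, easy-to-use JSON format" the abstract describes, the sketch below parses one report object split into its items. The field names (`cik`, `year`, `item_1`, `item_7`) are assumptions for illustration, not the corpus's documented schema.

```python
import json

# Hypothetical filing shaped the way the abstract describes: one JSON
# object per annual report, with each item (section) as its own field.
# The key names here are illustrative assumptions, not the real schema.
filing_json = """
{
  "cik": "0000320193",
  "year": "2020",
  "item_1": "Item 1. Business. The Company designs ...",
  "item_7": "Item 7. Management's Discussion and Analysis ..."
}
"""

filing = json.loads(filing_json)

# Collect only the item sections, ignoring the metadata fields.
items = {k: v for k, v in filing.items() if k.startswith("item_")}
print(sorted(items))   # ['item_1', 'item_7']
print(filing["year"])  # '2020'
```

With the sections isolated like this, each item's text can be fed independently into tokenization or embedding pipelines.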
“…In this section, we briefly review the key points of KPI-BERT, a model tailored for key performance indicator extraction, introduced and illustrated in much greater detail in [2]. Thereafter, one span-level approach is briefly touched upon, namely the SpERT model introduced by [16]. We then provide two further baselines building on EDGAR-W2V [5] and GloVe [43], leveraging a setup similar to [2]. These four models are the baselines we provide for other researchers to benchmark their models against on KPI-EDGAR.…”
Section: Methods
confidence: 99%
“…Due to EDGAR's popularity, many researchers ([3], [4], [6]) have developed methods to extract data from EDGAR and used it in their own research. [5] even released a comprehensive corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years and accompanied it with a word2vec [37] model titled EDGAR-W2V, which we will also use as a baseline in our experiments. Furthermore, [38], [39], and [40] have applied machine learning methods to EDGAR data and have provided useful results that can be used in real-world financial applications.…”
Section: Related Work
confidence: 99%
“…As defined in the methodology in Section IV, the introduced joint NER and RE model consists of three building blocks: a sentence encoder, an NER decoder, and an RE decoder. For generating further baselines, the sentence encoder of the described model is replaced by EDGAR-W2V [5] and GloVe [43]. The rest of the architecture is left unchanged.…”
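The encoder swap described in the snippet above can be sketched as replacing contextual token representations with a static embedding lookup. The tiny embedding table and the mean-pooling choice below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: swapping a contextual sentence encoder for static word
# embeddings, as the EDGAR-W2V / GloVe baselines do. The toy table
# and mean pooling are assumptions for illustration only.
from typing import Dict, List

def encode_sentence(tokens: List[str],
                    table: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Look up each token's static vector and mean-pool over the
    sentence; out-of-vocabulary tokens fall back to a zero vector."""
    zero = [0.0] * dim
    vecs = [table.get(t.lower(), zero) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Toy 2-dimensional "pretrained" table standing in for EDGAR-W2V/GloVe.
table = {"revenue": [1.0, 0.0], "grew": [0.0, 1.0]}

sent = encode_sentence(["Revenue", "grew", "strongly"], table, dim=2)
print([round(x, 3) for x in sent])  # [0.333, 0.333]
```

The downstream NER and RE decoders then consume these fixed vectors exactly as they would the contextual encoder's output, which is what lets the rest of the architecture stay unchanged.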
Section: Further Token-level Baselines
confidence: 99%
“…V. EXPERIMENTS. Herein, we briefly evaluate the unmodified KPI-BERT [2] (see subsection IV-A), the same structure but with word embeddings from EDGAR-W2V [5] and GloVe [43] instead of embeddings from BERT [27] (see subsection IV-C for these two), and a competing span-based approach (see subsection IV-B), introduced by [16] and named SpERT, on our novel dataset. The results should be seen as baselines against which further research can be benchmarked.…”
Section: Adjusted F1 Metric
confidence: 99%
“…• We introduce a novel dataset, named KPI-EDGAR, for joint named entity recognition and relation extraction in the financial text domain based on actual corporate reports submitted to the EDGAR database. In contrast to other resources scraped from EDGAR (see [3], [4], [5], and [6]), we provide manual annotations along with the corpus to allow for joint named entity recognition and relation extraction.…”
We introduce KPI-EDGAR, a novel dataset for joint Named Entity Recognition and Relation Extraction built on financial reports uploaded to the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. The main objective is to extract Key Performance Indicators (KPIs) from financial documents and link them to their numerical values and other attributes. We further provide four accompanying baselines for benchmarking potential future research. Additionally, we propose a new way of measuring the success of the extraction process by incorporating a word-level weighting scheme into the conventional F1 score, to better model the inherently fuzzy boundaries of the entity pairs of a relation in this domain.
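To make the idea of a word-level weighting inside an F1 score concrete, here is a minimal sketch in which each overlapping word contributes its weight (rather than a flat 0/1 count) to precision and recall. The specific weighting and matching rule are illustrative assumptions, not the paper's exact adjusted-F1 definition.

```python
# Sketch: folding per-word weights into precision/recall before
# computing F1. Weights and matching rule are illustrative only.
def weighted_f1(pred_words, gold_words, weight):
    """pred_words / gold_words: sets of words in the predicted and
    gold entity spans; weight: maps a word to its importance."""
    overlap = pred_words & gold_words
    tp = sum(weight.get(w, 1.0) for w in overlap)
    pred_mass = sum(weight.get(w, 1.0) for w in pred_words)
    gold_mass = sum(weight.get(w, 1.0) for w in gold_words)
    if tp == 0 or pred_mass == 0 or gold_mass == 0:
        return 0.0
    precision = tp / pred_mass
    recall = tp / gold_mass
    return 2 * precision * recall / (precision + recall)

# A fuzzy boundary word ("net") carries less weight than the head
# word ("revenue"), so omitting it is penalised only mildly.
w = {"revenue": 1.0, "net": 0.2}
print(round(weighted_f1({"revenue"}, {"net", "revenue"}, w), 3))  # 0.909
```

Under a flat 0/1 count the same partial match would score noticeably lower (F1 ≈ 0.667), which is the intuition behind down-weighting boundary words whose inclusion is genuinely ambiguous.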