Abstract: We release EDGAR-CORPUS, a novel corpus comprising annual reports from all publicly traded companies in the US, spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are word2vec embeddings for the financial domain. We employ these embedd…
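As a rough illustration of the "clean, easy-to-use JSON format" the abstract describes, the sketch below parses one report object split into its items. The field names (`cik`, `year`, `item_1`, `item_7`) are assumptions for illustration, not the corpus's documented schema.

```python
import json

# Hypothetical filing shaped the way the abstract describes: one JSON
# object per annual report, with each item (section) as its own field.
# The key names here are illustrative assumptions, not the real schema.
filing_json = """
{
  "cik": "0000320193",
  "year": "2020",
  "item_1": "Item 1. Business. The Company designs ...",
  "item_7": "Item 7. Management's Discussion and Analysis ..."
}
"""

filing = json.loads(filing_json)

# Collect only the item sections, ignoring the metadata fields.
items = {k: v for k, v in filing.items() if k.startswith("item_")}
print(sorted(items))   # ['item_1', 'item_7']
print(filing["year"])  # '2020'
```

With the sections isolated like this, each item's text can be fed independently into tokenization or embedding pipelines.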
“…In this section, we briefly review the key points of KPI-BERT, a model tailored for key performance indicator extraction, introduced and illustrated in much greater detail in [2]. Thereafter, one span-level approach is briefly touched upon, namely the SpERT model introduced by [16]. We then provide two further baselines building on EDGAR-W2V [5] and GloVe [43], leveraging a setup similar to [2]. These four models are the baselines we provide for other researchers to benchmark their models against on KPI-EDGAR.…”
Section: Methods
confidence: 99%
“…Due to EDGAR's popularity, many researchers ([3], [4], [6]) have developed methods to extract data from EDGAR and used it in their own research. [5] even released a comprehensive corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years and accompanied it with a word2vec [37] model titled EDGAR-W2V, which we will also use as a baseline in our experiments. Furthermore, [38], [39], and [40] have applied machine learning methods to EDGAR data and have provided useful results that can be used in real-world financial applications.…”
Section: Related Work
confidence: 99%
“…As defined in the methodology in Section IV, the introduced joint NER and RE model consists of three building blocks: a sentence encoder, an NER decoder, and an RE decoder. For generating further baselines, the sentence encoder of the described model is replaced by EDGAR-W2V [5] and GloVe [43]. The rest of the architecture is left unchanged.…”
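The encoder swap described in the snippet above can be sketched as replacing contextual token representations with a static embedding lookup. The tiny embedding table and the mean-pooling choice below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: swapping a contextual sentence encoder for static word
# embeddings, as the EDGAR-W2V / GloVe baselines do. The toy table
# and mean pooling are assumptions for illustration only.
from typing import Dict, List

def encode_sentence(tokens: List[str],
                    table: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Look up each token's static vector and mean-pool over the
    sentence; out-of-vocabulary tokens fall back to a zero vector."""
    zero = [0.0] * dim
    vecs = [table.get(t.lower(), zero) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Toy 2-dimensional "pretrained" table standing in for EDGAR-W2V/GloVe.
table = {"revenue": [1.0, 0.0], "grew": [0.0, 1.0]}

sent = encode_sentence(["Revenue", "grew", "strongly"], table, dim=2)
print([round(x, 3) for x in sent])  # [0.333, 0.333]
```

The downstream NER and RE decoders then consume these fixed vectors exactly as they would the contextual encoder's output, which is what lets the rest of the architecture stay unchanged.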
Section: Further Token-level Baselines
confidence: 99%
“…V. EXPERIMENTS. Herein, we briefly evaluate the unmodified KPI-BERT [2] (see subsection IV-A), the same structure but with word embeddings from EDGAR-W2V [5] and GloVe [43] instead of embeddings from BERT [27] (see subsection IV-C for these two), and a competing span-based approach (see subsection IV-B), introduced by [16] and named SpERT, on our novel dataset. The results should be seen as baselines against which further research can be benchmarked.…”
Section: Adjusted F1 Metric
confidence: 99%
“…• We introduce a novel dataset, named KPI-EDGAR, for joint named entity recognition and relation extraction in the financial text domain based on actual corporate reports submitted to the EDGAR database. In contrast to other resources scraped from EDGAR (see [3], [4], [5], and [6]), we provide manual annotations along with the corpus to allow for joint named entity recognition and relation extraction.…”
We introduce KPI-EDGAR, a novel dataset for joint Named Entity Recognition and Relation Extraction built on financial reports uploaded to the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. The main objective is to extract Key Performance Indicators (KPIs) from financial documents and link them to their numerical values and other attributes. We further provide four accompanying baselines for benchmarking potential future research. Additionally, we propose a new way of measuring the success of the extraction process by incorporating a word-level weighting scheme into the conventional F1 score, to better model the inherently fuzzy boundaries of the entity pairs of a relation in this domain.
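To make the idea of a word-level weighting inside an F1 score concrete, here is a minimal sketch in which each overlapping word contributes its weight (rather than a flat 0/1 count) to precision and recall. The specific weighting and matching rule are illustrative assumptions, not the paper's exact adjusted-F1 definition.

```python
# Sketch: folding per-word weights into precision/recall before
# computing F1. Weights and matching rule are illustrative only.
def weighted_f1(pred_words, gold_words, weight):
    """pred_words / gold_words: sets of words in the predicted and
    gold entity spans; weight: maps a word to its importance."""
    overlap = pred_words & gold_words
    tp = sum(weight.get(w, 1.0) for w in overlap)
    pred_mass = sum(weight.get(w, 1.0) for w in pred_words)
    gold_mass = sum(weight.get(w, 1.0) for w in gold_words)
    if tp == 0 or pred_mass == 0 or gold_mass == 0:
        return 0.0
    precision = tp / pred_mass
    recall = tp / gold_mass
    return 2 * precision * recall / (precision + recall)

# A fuzzy boundary word ("net") carries less weight than the head
# word ("revenue"), so omitting it is penalised only mildly.
w = {"revenue": 1.0, "net": 0.2}
print(round(weighted_f1({"revenue"}, {"net", "revenue"}, w), 3))  # 0.909
```

Under a flat 0/1 count the same partial match would score noticeably lower (F1 ≈ 0.667), which is the intuition behind down-weighting boundary words whose inclusion is genuinely ambiguous.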