Introducing the CLEF 2020 HIPE Shared Task: Named Entity Recognition and Linking on Historical Newspapers

Ehrmann, Maud; Romanello, Matteo; Bircher, Stefan; Clematide, Simon

doi:10.1007/978-3-030-45442-5_68

Cited by 10 publications

(12 citation statements)

References 17 publications


“…The HIPE dataset was created by the organisers of the CLEF 2020 Evaluation Lab HIPE challenge [8]. It is composed of articles from several Swiss, Luxembourgish, and American historical newspapers from 1790 to 2010 [9].…”

Section: Hipe Dataset (mentioning)

confidence: 99%

“…In order to overcome these problems, we utilised the multilingual end-to-end entity linking (MEL) models described in [18] to process historical documents and disambiguate entities in Finnish, French, German, and Swedish. This system achieved the best results in terms of EL in the CLEF 2020 Evaluation Lab HIPE challenge [8]. To minimise the impact of historical documents on the EL task, this system is composed of modules to overcome problems related to multilingualism and OCR errors.…”

Section: Entity Linking (mentioning)

confidence: 99%


A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers

Hamdi, Pontes, Boroş et al. 2021

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval


Named entity processing over historical texts is more and more being used due to the massive documents and archives being stored in digital libraries. However, due to the poor annotated resources of historical nature, information extraction performances fall behind those on contemporary texts. In this paper, we introduce the development of the NewsEye resource, a multilingual dataset for named entity recognition and linking enriched with stances towards named entities. The dataset is comprised of diachronic historical newspaper material published between 1850 and 1950 in French, German, Finnish, and Swedish. Such historical resource is essential in the context of developing and evaluating named entity processing systems. It evenly allows enhancing the performances of existing approaches on historical documents which enables adequate and efficient semantic indexing of historical documents on digital cultural heritage collections.

CCS Concepts: • Information systems → Information retrieval; Digital libraries and archives; • General and reference → Cross-computing tools and techniques.



“…The HIPE dataset was created by the CLEF 2020 Evaluation Lab HIPE challenge (Ehrmann et al, 2020a). It is composed of articles from several Swiss, Luxembourgish, and American historical newspapers from 1790 to 2010 (Ehrmann et al, 2020b).…”

Section: Datasets (mentioning)

confidence: 99%

Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Boroş, Hamdi, Pontes et al. 2020

Proceedings of the 24th Conference on Computational Natural Language Learning


This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.


“…These particularities have then a significant impact on NLP and IR applications over historical documents. To illustrate some of the aforementioned problems, let us consider Figure 1(a) which includes some English documents used in the evaluation campaign CLEF HIPE 2020 [9]. Figure 1(b) and Figure 1(c) are zoomed and cropped portions of most left document presented in Figure 1(a).…”

Section: Introduction (mentioning)

confidence: 99%

“…Moreover, our EL approach decreases possible bias by not limiting or focusing the explored entities to a specific dataset. We evaluate our methods in two recent historical corpora, CLEF HIPE 2020 [9], and NewsEye datasets, that are composed of documents in English, Finnish, French, German, and Swedish. Our study shows that our techniques improve the performance of EL systems and partially solve the issues of historical data.…”

Section: Introduction (mentioning)

confidence: 99%

Entity Linking for Historical Documents: Challenges and Solutions

Pontes, Cabrera-Diego, Moreno et al. 2020

Digital Libraries at Times of Massive Societal Transition


Named entities (NEs) are among the most relevant type of information that can be used to efficiently index and retrieve digital documents. Furthermore, the use of Entity Linking (EL) to disambiguate and relate NEs to knowledge bases, provides supplementary information which can be useful to differentiate ambiguous elements such as geographical locations and peoples' names. In historical documents, the detection and disambiguation of NEs is a challenge. Most historical documents are converted into plain text using an optical character recognition (OCR) system at the expense of some noise. Documents in digital libraries will, therefore, be indexed with errors that may hinder their accessibility. OCR errors affect not only document indexing but the detection, disambiguation, and linking of NEs. This paper aims at analysing the performance of different EL approaches on two multilingual historical corpora, CLEF HIPE 2020 (English, French, German) and NewsEye (Finnish, French, German, Swedish), while proposes several techniques for alleviating the impact of historical data problems on the EL task. Our findings indicate that the proposed approaches not only outperform the baseline in both corpora but additionally they considerably reduce the impact of historical document issues on different subjects and languages.

