Impresso Inspect and Compare. Visual Comparison of Semantically Enriched Historical Newspaper Articles

Düring, Marten; Kalyakin, Roman; Bunout, Estelle; Guido, Daniele

doi:10.3390/info12090348

Cited by 6 publications

(4 citation statements)

References 34 publications

“…At this stage of development, six evaluators found them "either difficult to read or [they] did not provide useful insights." In addition, recommendations for future development addressed the already foreseen integration of impresso's Inspect & Compare component (Düring et al, 2021) for side-by-side comparisons of article sets, higher speed for the creation of collections, API access to the data, and new filters based on a yet to be created taxonomy of text reuse types.…”

Section: Discussion Of Evaluation Results (mentioning)

confidence: 99%

impresso Text Reuse at Scale. An interface for the exploration of text reuse data in semantically enriched historical newspapers

Düring, Romanello, Ehrmann et al. 2023, Front. Big Data (self-citation)

Text reuse reveals meaningful reiterations of text in large corpora. Humanities researchers use text reuse to study, e.g., the posterior reception of influential texts or to reveal evolving publication practices of historical media. This research is often supported by interactive visualizations which highlight relations and differences between text segments. In this paper, we build on earlier work in this domain. We present impresso Text Reuse at Scale, to our knowledge the first interface that integrates text reuse data with other forms of semantic enrichment to enable a versatile and scalable exploration of intertextual relations in historical newspaper corpora. The Text Reuse at Scale interface was developed as part of the impresso project and combines powerful search and filter operations with close and distant reading perspectives. We integrate text reuse data with enrichments derived from topic modeling, named entity recognition and classification, language and document type detection, as well as a rich set of newspaper metadata. We report on historical research objectives and common user tasks for the analysis of historical text reuse data and present the prototype interface together with the results of a user evaluation.

“…Compared to standard OCR results, these models achieve good layout segmentation, but they lack the article-level information that is required to improve searchability in historical collections. Just as text data can be classified according to its characteristics, or content, illustrations can also be classified according to their context [58], location, and features such as color or shape [59], enabling the evaluation of visual content.…”

Section: Digitization and Extraction (mentioning)

confidence: 99%

Context-Aware Querying, Geolocalization, and Rephotography of Historical Newspaper Images

et al. 2022

Newspapers contain a wealth of historical information in the form of articles and illustrations. Libraries and cultural heritage institutions have been digitizing their collections for decades to enable web-based access to and retrieval of information. A number of challenges arise when dealing with digitized collections, such as those of KBR, the Royal Library of Brussels (used in this study), which contain only page-level metadata, making it difficult to extract information from specific contexts. A context-aware search relies heavily on metadata enhancement. Therefore, when using metadata at the page level, it is even more challenging to geolocalize less-known landmarks. To overcome this challenge, we have developed a pipeline for geolocalization and visualization of historical photographs. The first step of this pipeline consists of converting page-level metadata to article-level metadata. In the next step, all articles with building images were classified based on image classification algorithms. Moreover, to correctly geolocalize historical photographs, we propose a hybrid approach that uses both textual metadata and image features. We conclude this research paper by addressing the challenge of visualizing historical content in a way that adds value to humanities research. It is noteworthy that a number of historical urban scenes are visualized using rephotography, which is notoriously challenging to get right. This study serves as an important step towards enriching historical metadata and facilitating cross-collection linkages, geolocalization, and the visualization of historical newspaper images. Furthermore, the proposed methodology is generic and can be used to process untagged photographs from social media, including Flickr and Instagram.

“…This has been further emphasized by the global COVID-19 pandemic (Samaroudi et al, 2020;Sułkowski, 2020). In particular, the digitization of large-scale textual collections, such as historical newspapers, has sparked much interest from the digital humanities community (Allen, 2015;Düring et al, 2021;Oberbichler et al, 2022). Some of the remarkable initiatives for digitization of newspaper collections, their conservation in digital format and access provision using digital platforms has been undertaken by Google Newspaper Search (Chaudhury et al, 2009), Europeana (Pekárek and Willems, 2012;Willems and Atanassova, 2015), Bibliothèque nationale du Luxembourg (Zaagsma, 2019), KB Lab of Research department of the Koninklijke Bibliotheek, National Library of the Netherlands (Smits and Faber, 2018;Wevers and Lonij, 2017), Bibliothèque nationale de France, National Library of France (Moreux 2017), Library of Congress -Chronicling America (Lee et al, 2020), Biblioteca Digitale Italiana -BDI, an Italian digital library promoted by the Ministry for Cultural Heritage and Activities (Leombroni, 2004;Paoli, 2005), Australian Newspaper Digitization program (Holley, 2009), British Library (Hiltunen, 2021) and The National Library of Sweden (Nilsson, 2012).…”

Section: Introduction (mentioning)

confidence: 99%

Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections

Ali, Milleville, Verstockt et al. 2023

Purpose: Historical newspaper collections provide a wealth of information about the past. Although the digitization of these collections significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, are not yet searchable at article-level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential for further enriching the metadata to improve the searchability of these historical newspaper collections. This paper aims to discuss the aforementioned issue.

Design/methodology/approach: In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow, using computer vision and machine learning approaches to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons – literary supplements from Le Peuple from 1938), (3) conduct image similarity analysis using (un)supervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.

Findings: The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers, and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.

Originality/value: The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool (https://tw06v072.ugent.be/kbr/).
