This paper presents experiments on optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages, Finnish and Swedish, and is printed in both Blackletter and Antiqua fonts. We investigate how much training data is needed to train high-accuracy models, and we try to train a joint model for both languages and all fonts. So far we have not succeeded in obtaining a single model that is best for all of the data, but it is promising that the mixed model gives the best results on the Finnish test set, with a character accuracy rate (CAR) of 95%, which clearly surpasses previous results on this data set.
CCS Concepts: Applied computing → Optical character recognition.
The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed language model for the entire corpus and finding a voting setup that further improves the results.
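The abstracts above report results as character error rate (CER) and character accuracy rate (CAR). As a minimal sketch of how such figures are commonly computed, the Python below takes CER as the Levenshtein edit distance between reference and OCR output divided by the reference length, and CAR as 1 minus CER. This is the conventional definition and an assumption here; the papers may normalize or align text differently. The function names `levenshtein`, `cer`, and `car` are illustrative and do not come from the papers.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing from hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def car(ref: str, hyp: str) -> float:
    """Character accuracy rate, assumed here to be 1 - CER."""
    return 1.0 - cer(ref, hyp)


if __name__ == "__main__":
    reference = "sanomalehti"   # ground-truth transcription (made-up example)
    ocr_output = "sanomalehtl"  # OCR output with one substituted character
    print(f"CER = {cer(reference, ocr_output):.3f}")
    print(f"CAR = {car(reference, ocr_output):.3f}")
```

Under this definition, CER can exceed 1 when the OCR output contains many spurious characters, so CAR is not guaranteed to be non-negative.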
In the Semantic Parliament (Semanttinen parlamentti) project, 2020–2022, the databases of the Parliament of Finland and other related materials are used to create a new kind of Linked Open Data (LOD) service, a data infrastructure, and the semantic portal Parlamenttisampo: Parliament of Finland on the Semantic Web, which are used to study political culture and language. Linking makes it possible to enrich parliamentary data with other data sources, such as biographical information, terminologies, and legislative documents. Parlamenttisampo is a suite of services based on language and Semantic Web technologies, intended for researchers, citizens, the media, and public administration. The article presents the project's vision, its first results, and ways to make use of them. A linked data knowledge graph has been completed from more than 900,000 speeches given in all plenary sessions of Parliament from 1907 to 2021; the data is also available in XML, using the new international Parla-CLARIN format. For the first time, the entire time series of parliamentary speeches has been converted into data and a data service in a uniform format. In addition, the speeches have been linked to a second knowledge graph, created from the Parliament's database of members of parliament and enriched from other data sources, to form a broader ontology-based data service, FinParla. The data service can be used for research on parliamentary and representative culture and on the use of political language, by applying data analysis to the speeches that members of parliament give in plenary sessions and to politicians' networks. The service's API can also be used to develop applications for different user groups, such as the Parlamenttisampo.fi portal being completed in the project.
The paper presents and evaluates various NLP tools that have been created using the open-source library HFST (Helsinki Finite-State Technology) and outlines the minimal extensions to a pure finite-state system that these tools have required. In particular, the paper describes an implementation and application of Pmatch, which was presented by Karttunen at SFCM 2011.