Senka Drobac scite author profile

Kauppinen

2019

This paper presents experiments on optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages, Finnish and Swedish, and is printed in both Blackletter and Antiqua fonts. We investigate how much training data is needed to train high-accuracy models, and we try to train a joint model for both languages and all fonts. So far we have not succeeded in obtaining a single best model for all of the data, but the mixed model gives the best results on the Finnish test set, with a 95% character accuracy rate (CAR), which clearly surpasses previous results on this data set. CCS Concepts: Applied computing → Optical character recognition.

Optical character recognition with neural networks and post-correction with finite state methods

2020

IJDAR

The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is too low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, such as Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far none of these attempts has produced a mixed model that recognizes all of the data in the corpus, which would be essential for efficient re-OCRing of the corpus. The difficulty is that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and carry out an error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed-language model for the entire corpus and the identification of a voting setup that further improves the results.

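Parlamenttisampo: eduskunnan aineistojen linkitetyn avoimen datan palvelu ja sen käyttömahdollisuudet (ParliamentSampo: a linked open data service for the materials of the Parliament of Finland and its potential uses)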

Hyvönen, Sinikallio, Leskinen

et al. 2021

INF

The Semantic Parliament project (2020–2022) is building, from the databases of the Parliament of Finland and other related materials, a new kind of Linked Open Data (LOD) service, data infrastructure, and semantic portal, Parlamenttisampo – eduskunta semanttisessa webissä (ParliamentSampo: the Parliament of Finland on the Semantic Web), for studying political culture and language. Linking the data makes it possible to enrich parliamentary data with other sources, such as biographical information, terminologies, and legislative documents. Parlamenttisampo is a suite of services based on language technology and Semantic Web technologies, intended for researchers, citizens, the media, and public administration. The article presents the project's vision, its first results, and the ways they can be used: a linked data knowledge graph has been completed from the more than 900,000 speeches given in all plenary sessions of Parliament from 1907 to 2021; the data is also available in XML, using the new international Parla-CLARIN format. For the first time, the complete time series of parliamentary speeches has been converted into data and a data service in a uniform format. In addition, the speeches have been linked to a second knowledge graph, created from Parliament's database of members of parliament and enriched from other sources, to form a broader ontology-based data service, Fin-Parla. The data service can be used for research on parliamentary and representative culture and on the use of political language, by analysing the speeches that members of parliament give in plenary sessions and the networks of politicians with data analysis methods. The service's API can also be used to develop applications for different user groups, such as the Parlamenttisampo.fi portal being completed in the project.

HFST — A System for Creating NLP Tools

Axelson

et al. 2013

Using HFST for Creating Computational Linguistic Applications

Axelson

et al. 2013