Matjaž Kragelj scite author profile

Automatic classification of older electronic texts into the Universal Decimal Classification–UDC

Kragelj¹,

Borštnar²

2020

Purpose: The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.

Design/methodology/approach: The general research approach is inherent to design science research, in which the problem of UDC assignment of the old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was used for classification of old texts on a corpus of 200,000 items. Human experts evaluated the performance of the model.

Findings: Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.

Research limitations/implications: The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.

Practical implications: The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.

Social implications: The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These contribute to making knowledge more widely available and useable.

Originality/value: These findings contribute to the field of automated classification of bibliographical information with the usage of full texts, especially in cases in which the texts are old, unstructured and in which archaic language and vocabulary are used.

Problematika ohranjanja zasebnosti pri podatkovnem rudarjenju dokumentov z občutljivimi podatki [Privacy preservation issues in data mining of documents containing sensitive data]

Kragelj¹,

Borštnar²,

Brezavšček³

2022

UI

This paper addresses the problem that arises when using documents which, alongside their substantive content, contain sensitive personal data that can reveal an individual's identity even when this is not desired. Areas that generate large volumes of such data include health care, transport, law enforcement and national security, education, modern internet services, modern application ecosystems, the Internet of Things, the financial sector and open government data. The goal is to protect the privacy of the data subject while still providing high-quality data for further in-depth analysis and, with it, new knowledge. To address these challenges, a dedicated subfield of data mining has emerged, called PPDM (Privacy Preserving Data Mining), which deals with preserving privacy during the mining process. We systematically reviewed the relevant PPDM literature and described the main methods and techniques. PPDM techniques are designed to guarantee a certain level of privacy while preserving the utility of the data, so that the data can still be used effectively in their transformed form. Methods that protect the individual on the one hand and retain the practical value of the data on the other are broadly divided into data dispersal methods, distortion methods (using anonymisation, randomisation, rotation and the introduction of noise into the data) and data encryption methods. Combinations of these methods can be used to achieve a higher level of protection. In addition to the review of methods, we give several practical examples and list the domains in which there is a need for further analysis and reuse of data together with a need to anonymise or conceal the owner (subject) and the owner's data (attributes).

From Scans to Digital Repository: Digitization Process Rationalization for Culture Public Sector in the Field of Culture

Kragelj¹

2014

knj

Public institutions hold important physical collections of materials from the fields of culture and science. For more than a decade, they have been facing digitization and archiving requirements as part of the digitization process when generating e-content, and they use more or less the same procedures to cope with those difficulties. At the same time, we note a much larger set of institutions that are still dealing, more or less successfully, with the initial problems of the digitization process. The main reasons for the gap between the two groups include a lack of knowledge in the field of informatics, the limited financial and staff resources of public institutions and, in particular, limited resources for the preservation management and archive maintenance subprocesses. The National and University Library is developing a business model that will provide for a successful implementation of the digitization process with all its main subprocesses (submission and/or digitization, archiving, dissemination). This will give similar institutions an end product for digital preservation and for public access to e-content.

The main components of the digitization process are as follows: selection and preparation of physical collections, determination of access/use restrictions or copyright, application of a Creative Commons license, preparation of project parameters and bibliographic information, digitization, integration of bibliographic data into e-content, automatic text recognition, archiving, and providing access to digitized collections. Some activities, such as access and long-term preservation, need to be ongoing throughout all phases of the digitization workflow, whereas others occur only in individual phases of the process. In special cases, it might be necessary to change some bibliographic data, such as license terms and conditions or access modality. The status of a digital copy can change over time (the license expires or the digitized material enters the public domain) or through modification of the agreement with the copyright holder.

By implementing a business model, we wish to contribute to the rationalization of the use of public funds for this purpose and, at the same time, to encourage a greater volume of e-content available online and to ensure safe digital preservation of e-content. It should be pointed out that the problems related to long-term preservation of physical collections and their digital representations do not end with the digitization process; they only expand from the physical to the digital environment.
