Purpose
The purpose of this study is to develop a model for the automated classification of old digitised texts into the Universal Decimal Classification (UDC) using machine-learning methods.

Design/methodology/approach
The research follows a design science approach, in which the problem of assigning UDC to old digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was then used to classify old texts in a corpus of 200,000 items. Human experts evaluated the performance of the model.

Findings
Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.

Research limitations/implications
The main limitations of this study were the unavailability of labelled older texts and the limited availability of librarians.

Practical implications
The classification model can provide recommendations to librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in library databases.

Social implications
The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. Both contribute to making knowledge more widely available and usable.

Originality/value
These findings contribute to the field of automated classification of bibliographical information using full texts, especially in cases where the texts are old and unstructured and use archaic language and vocabulary.
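The finding that the UDC can be assigned correctly "at some level" can be illustrated with a small sketch. It assumes, purely for illustration, that agreement is measured as the number of leading digits two UDC numbers share; the abstract does not say how the study measured it. The function names are hypothetical, and the sketch handles only simple main numbers, not auxiliaries or compound notations.

```python
def udc_digits(code: str) -> str:
    """Keep only the digits of a simple UDC main number, dropping the dots."""
    return "".join(ch for ch in code if ch.isdigit())


def shared_udc_depth(predicted: str, actual: str) -> int:
    """Count the leading digits on which two UDC numbers agree.

    Assumption for illustration: each digit is one hierarchy level.
    """
    depth = 0
    for p, a in zip(udc_digits(predicted), udc_digits(actual)):
        if p != a:
            break
        depth += 1
    return depth


# The prediction is wrong at the finest level but right at the first three.
print(shared_udc_depth("004.8", "004.4"))  # 3
# No agreement even at the top-level class.
print(shared_udc_depth("51", "62"))        # 0
```

Under this assumption, a prediction with a depth of at least 1 is correct at the level of the main class, even when the full number differs.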
This paper addresses a problem that arises when using documents which, in addition to their substantive content, contain sensitive personal data that can reveal an individual's identity even when this is not desired. Domains that generate large amounts of such data include health care, transport, law enforcement and national security, education, modern internet services, modern application ecosystems, the Internet of Things, the financial sector, and open government data. The goal is to protect the privacy of the data subject while still providing high-quality data for further in-depth analysis and thus for gaining new knowledge. To address these challenges, a dedicated subfield of data mining has emerged, called PPDM (Privacy Preserving Data Mining), which deals with preserving privacy during the mining process. We systematically reviewed the relevant PPDM literature and described the main methods and techniques. PPDM techniques are designed to guarantee a certain level of privacy while preserving the utility of the data, so that the data can still be used effectively after transformation. The methods that protect the individual on the one hand and preserve the utility of the data on the other can be roughly divided into data dispersal methods, data distortion methods (using anonymisation, randomisation, rotation, and the addition of noise to the data), and data encryption methods. Combinations of these methods can be used to achieve a higher level of protection. In addition to reviewing the methods, we give several practical examples and list the domains in which there is a need for further analysis and reuse of data, but also a need to anonymise or conceal the owner (subject) and their data (attributes).
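One of the distortion techniques named above, adding noise to the data, can be sketched as follows. This is a minimal illustration, not the method of any particular reviewed paper: the function name, the sample values, and the choice of zero-mean Gaussian noise with a standard deviation of 3.0 are all assumptions.

```python
import random
import statistics


def add_gaussian_noise(values, sigma, seed=None):
    """Return a copy of values with zero-mean Gaussian noise added to each item.

    Individual values are perturbed, but aggregate statistics such as the
    mean stay approximately usable because the noise averages out.
    """
    rng = random.Random(seed)  # a fixed seed makes the sketch reproducible
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical sensitive attribute (ages of eight data subjects).
ages = [34, 45, 29, 61, 50, 38, 42, 57]
noisy_ages = add_gaussian_noise(ages, sigma=3.0, seed=0)

print("original mean:", statistics.mean(ages))
print("perturbed mean:", statistics.mean(noisy_ages))
```

A larger sigma hides individual values better but distorts the statistics more, which is the trade-off between privacy and utility described in the abstract.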
Public institutions hold important physical collections of cultural and scientific materials. For more than a decade, they have faced digitization and archiving requirements as part of the digitization process when generating e-content, and they use more or less the same procedures to cope with these difficulties. At the same time, a much larger set of institutions is still dealing, more or less successfully, with the initial problems of the digitization process. The gap between the two groups has several causes, including a lack of knowledge in the field of information science, the limited financial and staff resources available to public institutions, and, in particular, limited resources for the preservation management and archive maintenance subprocesses. The National and University Library is developing a business model that will support a successful implementation of the digitization process with all its main subprocesses (submission and/or digitization, archiving, dissemination). This will give similar institutions an end product for digital preservation and for providing public access to e-content.

The main components of the digitization process are as follows: selection and preparation of physical collections, determination of access/use restrictions or copyright, application of a Creative Commons license, preparation of project parameters and bibliographic information, digitization, integration of bibliographic data into e-content, automatic text recognition, archiving, and providing access to digitized collections. Some activities, such as access and long-term preservation, need to be ongoing throughout all phases of the digitization workflow, whereas others occur only in individual phases of the process. In special cases, it might be necessary to change some bibliographic data, e.g. license terms and conditions or access modality. The status of a digital copy can change over time (the licence expires or the digitized material enters the public domain) or through a change in the agreement with the copyright holder.

By implementing a business model, we wish to contribute to a more rational use of public funds for this purpose and, at the same time, to encourage a greater volume of e-content available online and to ensure the safe digital preservation of e-content. It should be pointed out that the problems related to the long-term preservation of physical collections and their digital representations do not end with the digitization process; they merely extend from the physical to the digital environment.