Purpose: This paper presents a methodology for the semantic enrichment of the scanned collection of Migne’s Patrologia Graeca (PG). It has two goals: to make a scanned PG source easy to locate on the Web when that source is referenced, described or commented on in another scanned or textual document, and to semantically enrich PG through related scanned or textual documents published by third parties, termed “satellite texts”. The present enrichment uses Dorotheos Scholarios's Synoptic Index (DSSI) as the satellite text, which acts as metadata for PG. Design/methodology/approach: The methodology consists of two parts. The first part addresses the transcription of DSSI with a suitable web tool. The second part has two steps: building a matching function that links the printed column numbers of each scanned PG page to the page's actual filename, and building a web interface for PG based on the Uniform Resource Identifiers (URIs) generated in the first step. Findings: The result of the implemented methodology is a Web portal that provides server-less topic search with direct (single-click) navigation to sources. The system is static, scalable, easy to manage and inexpensive to build and maintain. The transcribed DSSI data sets and the JavaScript Object Notation (JSON) matching functions are available for personal use by students and scholars under a Creative Commons license (CC-BY-NC-SA). Social implications: Scholars, or anyone interested in a particular subject, can easily locate topics in PG and reference them using URIs that are easy to remember, which contributes significantly to the related scientific dialogue. Originality/value: The methodology uses the transcribed satellite texts of DSSI, which act as metadata for PG, to semantically enrich the PG collection. Furthermore, the PG Web interface can serve as a reference basis for other satellite texts to further enrich PG, as it provides direct identification of sources. The presented methodology is general and can be applied to any scanned collection using its own satellite texts.
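The matching function described above, which maps a printed column number to the filename of the scanned page that contains it, can be illustrated with a minimal sketch. The JSON layout, the field names `first_column` and `file`, the filenames and the URI pattern are all assumptions made for illustration; the abstract does not specify the actual format of the published matching functions.

```python
import bisect
import json

# Hypothetical matching data: for each PG volume, the first printed column
# that appears on each scanned page. Field names and filenames are invented.
MATCHING_JSON = """
{
  "36": [
    {"first_column": 9,  "file": "pg036_0012.png"},
    {"first_column": 11, "file": "pg036_0013.png"},
    {"first_column": 13, "file": "pg036_0014.png"}
  ]
}
"""


def build_index(raw_json):
    """Turn the JSON into per-volume sorted lists suitable for binary search."""
    index = {}
    for volume, pages in json.loads(raw_json).items():
        pages = sorted(pages, key=lambda page: page["first_column"])
        starts = [page["first_column"] for page in pages]
        files = [page["file"] for page in pages]
        index[volume] = (starts, files)
    return index


def locate(index, volume, column):
    """Return the scanned file whose page contains the given printed column."""
    starts, files = index[str(volume)]
    # Find the last page whose first column is <= the requested column.
    position = bisect.bisect_right(starts, column) - 1
    if position < 0:
        raise KeyError(f"column {column} precedes the first indexed page")
    return files[position]


def make_uri(base, volume, column):
    """Build a short, memorable URI (the pattern is an assumption)."""
    return f"{base}/{volume}/{column}"


index = build_index(MATCHING_JSON)
print(locate(index, 36, 12))
print(make_uri("https://example.org/pg", 36, 12))
```

Because the lookup needs only a static JSON file and a binary search, it can run entirely in the client, which is consistent with the server-less design the abstract describes. This sketch does not check whether a column lies beyond the last indexed page.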
A wealth of knowledge is held by libraries and cultural institutions in various digital forms, yet without the possibility of a simple term search, let alone a substantial semantic search. One such important collection, which contains knowledge accumulated over the ages and remains inaccessible for the greater part, is Patrologia Graeca. So far, little research has been conducted to make this digital collection searchable to a certain degree, in order to retrieve and reveal its gathered knowledge in an efficient way. In this study, a novel approach is proposed for recognizing words in large printed corpora such as Patrologia Graeca. The proposed framework first applies an efficient segmentation process at word level and transforms the word-images of the Greek polytonic script of Patrologia Graeca into special compact shapes. The contours of these shapes are then extracted and compared with the contour of a similarly transformed query word-image in order to locate the specific word in the digitized documents. For the comparison, we use a series of three descriptors: Hu's invariant moments for discarding unlikely matches, Shape Context for contour similarity, and Pearson's correlation coefficient for final pruning of dissimilar words and additional verification. Comparative results are also presented in which Pearson's correlation coefficient is replaced by the Long Short-Term Memory neural network engine of the Tesseract Optical Character Recognition system. Owing to its simplicity and efficiency, the described framework can be applied to the massive creation of search indexes and, consequently, to the semantic enrichment of Patrologia Graeca. The framework has the potential to be applicable to other printed collections with proper configuration of its parameters.
An additional and very significant consequence of our method's effectiveness and simplicity is that it can be used as a pre-stage to provide a large number of word-image and label pairs. These pairs can be used for training neural networks or common classifiers such as k-nearest neighbor or support vector machine.
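The final pruning step of the descriptor series, Pearson's correlation coefficient, can be sketched as follows. This is only an illustration under stated assumptions: word-images are represented as equal-sized lists of pixel rows, the threshold value of 0.8 is hypothetical, and the earlier stages (Hu's invariant moments and Shape Context) are not reproduced here.

```python
from math import sqrt


def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A constant image carries no shape information; treat as uncorrelated.
        return 0.0
    return covariance / sqrt(var_a * var_b)


def flatten(image):
    """Flatten a list of pixel rows into a single list of pixel values."""
    return [pixel for row in image for pixel in row]


def prune(query_image, candidates, threshold=0.8):
    """Keep only the candidates whose pixels correlate strongly with the query.

    `candidates` maps a hypothetical candidate name to an image with the same
    dimensions as the query. The threshold is an assumed value.
    """
    query = flatten(query_image)
    return [
        name
        for name, image in candidates.items()
        if pearson(query, flatten(image)) >= threshold
    ]


query = [[0, 1, 1], [1, 0, 1]]
candidates = {
    "same": [[0, 1, 1], [1, 0, 1]],
    "inverted": [[1, 0, 0], [0, 1, 0]],
}
print(prune(query, candidates))
```

In practice the word-images would first need to be scaled to a common size, since the coefficient is defined only for sequences of equal length.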
Purpose: This study aims to provide a system capable of static searching over a large number of unstructured texts directly on the Web while keeping costs to a minimum. The proposed framework is applied to the unstructured texts of Migne’s Patrologia Graeca (PG) collection, which serves as an implementation example of the method. Design/methodology/approach: The unstructured texts of PG have been automatically transformed into a read-only Not only Structured Query Language (NoSQL) database whose structure is identical to that of a representational state transfer (REST) access point interface. The transformation makes it possible to execute queries and retrieve ranked results based on a specialized application of the extended Boolean model. Findings: Using a specifically built Web-browser-based search tool, the user can quickly locate ranked relevant fragments of text and navigate back and forth among them. The user can search by the initial part of a word, and the diacritics of the Greek language are ignored. The performance of the search system is comparatively examined for different versions of the Hypertext Transfer Protocol (HTTP), various network latencies and different modes of network connection. Queries over HTTP/2 perform by far the best, compared with any of the HTTP/1.1 modes. Originality/value: The system is not limited to the case study of PG and has generic application in the field of humanities. The system can be extended towards semantic enrichment by taking synonyms and topics into account, if they are available. The system’s main advantage is that it is totally static, which implies important features such as simplicity, efficiency, fast response, portability, security and scalability.
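Two of the ideas above, searching by word prefix while ignoring Greek diacritics and ranking with the extended Boolean model, can be sketched together. The scoring below uses the standard p-norm formulas of the extended Boolean model with binary term weights; the abstract only says that a "specialized application" of the model is used, so the weighting scheme, the value p = 2, the whitespace tokenization and all function names are assumptions for illustration.

```python
import unicodedata


def normalize(text):
    """Strip diacritics and fold case, so that polytonic forms compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # casefold() also maps the Greek final sigma to the regular sigma.
    return stripped.casefold()


def term_weight(prefix, fragment):
    """Binary weight: 1.0 if any word of the fragment starts with the prefix."""
    wanted = normalize(prefix)
    words = normalize(fragment).split()
    return 1.0 if any(word.startswith(wanted) for word in words) else 0.0


def score_or(weights, p=2.0):
    """Extended Boolean (p-norm) similarity for an OR query."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def score_and(weights, p=2.0):
    """Extended Boolean (p-norm) similarity for an AND query."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


def rank(prefixes, fragments, mode="and", p=2.0):
    """Rank text fragments against a list of query prefixes, best first."""
    scorer = score_and if mode == "and" else score_or
    scored = [
        (scorer([term_weight(prefix, fragment) for prefix in prefixes], p), fragment)
        for fragment in fragments
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


fragments = ["ὁ Θεὸς ἀγάπη ἐστίν", "ἐν ἀρχῇ ἦν ὁ Λόγος"]
for score, fragment in rank(["θεο", "αγαπ"], fragments):
    print(round(score, 4), fragment)
```

Unlike strict Boolean retrieval, the p-norm scores give partial credit to a fragment that matches only some of the query terms, which is what allows results to be ranked rather than merely filtered.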