Content-based Image Retrieval using Tesseract OCR Engine and Levenshtein Algorithm

Adjetey, Charles; Adu-Manu, Kofi Sarpong

doi:10.14569/ijacsa.2021.0120776

Cited by 7 publications

(5 citation statements)

References 18 publications

“…The improvement in the quality of the content-based search for the digital correspondence document was achieved through the availability of the required criteria such as the classification and display of the ontology relationships information to ease the users' understanding of the hierarchy of the letter found and also to display the required document. This facility is unavailable in conventional search which is designed based on the document name or annotations [3] as well as unclassified content [10]-[15]. A trial was conducted as an example by searching for document names through the input of the query "bantuan pemerintah" and no document was shown and this complicated the search process.…”

Section: Results (mentioning)

confidence: 99%

“…Several research [10]-[15] have been conducted about content-based image document search using OCR technology but they are only limited to searching base content for scanned documents without focusing on classified documents for a more specific search. Meanwhile, the increasing number and diversity of documents are making the classification process important to direct, summarize, and organize the documents easily, with efficient and cost-effective solutions [16].…”

Section: Introduction (mentioning)

confidence: 99%

Advanced content-based retrieval for digital correspondence documents with ontology classification

et al. 2022

The growth of digital correspondence documents of various types, with different naming rules and no adequate search system, complicates searching for specific content; when documents are unclassified, the search becomes inaccurate and slow. This research proposed an archiving method with automatic hierarchical classification and a content-based search method that displays ontology classification information as the solution to these content-based search problems. The method consists of preprocessing (creation of an automatic hierarchical classification model using a combination of a convolutional neural network (CNN) and the regular expression method), archiving (document archiving with automatic classification), and retrieval (content-based search that displays ontology relationships from the document classification). The archiving of 100 documents using the automatic hierarchical classification was found to be 79% accurate, as indicated by the 99% accuracy for CNN and 80% for Regex. Moreover, the search results for classified content-based documents through the display of ontology relationships were found to be 100% accurate. This research succeeded in improving the quality of search results for digital correspondence documents, as indicated by its higher specificity, accuracy, and speed compared to conventional methods based on file names, annotations, and unclassified content.

“…where TLD is the Total Levenshtein Distance. The Levenshtein Distance, also known as the Edit-Distance algorithm, measures the number of characters that must be changed, added, or deleted in the predicted word so that it matches the true word [31]. Total Levenshtein distance does not apply to a word; it applies to the whole text.…”

Section: Evaluation Metrics (mentioning)

confidence: 99%

Instance Segmentation of Characters Recognized in Palmyrene Aramaic Inscriptions

Hamplová, Lyavdansky, Novák et al. 2024

CMES

This study presents a single-class and multi-class instance segmentation approach applied to ancient Palmyrene inscriptions, employing two state-of-the-art deep learning algorithms, namely YOLOv8 and Roboflow 3.0. The goal is to contribute to the preservation and understanding of historical texts, showcasing the potential of modern deep learning methods in archaeological research. Our research culminates in several key findings and scientific contributions. We comprehensively compare the performance of YOLOv8 and Roboflow 3.0 in the context of Palmyrene character segmentation; this comparative analysis mainly focuses on the strengths and weaknesses of each algorithm in this context. We also created and annotated an extensive dataset of Palmyrene inscriptions, a crucial resource for further research in the field. The dataset serves for training and evaluating the segmentation models. We employ comparative evaluation metrics to quantitatively assess the segmentation results, ensuring the reliability and reproducibility of our findings, and we present custom visualization tools for predicted segmentation masks. Our study advances the state of the art in semi-automatic reading of Palmyrene inscriptions and establishes a benchmark for future research. The availability of the Palmyrene dataset and the insights into algorithm performance contribute to the broader understanding of historical text analysis.
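The Levenshtein distance described in the citation statement above (Evaluation Metrics section) can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' code: the function name `levenshtein` and the sample strings are assumptions. Because the excerpt elides the formula that uses TLD, the sketch simply applies the distance to two whole texts rather than to single words.

```python
def levenshtein(predicted: str, truth: str) -> int:
    """Minimum number of single-character substitutions, insertions,
    or deletions needed to turn `predicted` into `truth`."""
    # previous[j] is the distance between the prefix of `predicted`
    # processed so far and the first j characters of `truth`.
    previous = list(range(len(truth) + 1))
    for i, p_char in enumerate(predicted, start=1):
        current = [i]  # turning i characters into an empty string costs i deletions
        for j, t_char in enumerate(truth, start=1):
            substitution = previous[j - 1] + (p_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Hypothetical OCR output compared against its ground truth, as whole texts.
predicted_text = "the qick brown fx"
true_text = "the quick brown fox"
print(levenshtein(predicted_text, true_text))  # prints 2 (two missing characters)
```

Keeping only two rows of the dynamic-programming table keeps memory proportional to the length of the true text, which matters when the inputs are whole pages rather than single words.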

“…Adjetey & Sarpong proposed a novel algorithm for recognizing text [34], making documents editable and searchable in images, extracting text from images using Levenshtein Algorithm and Tesseract OCR to search text from images to find in the document. Begin by locating and comparing the texts extracted from the images using the Levenshtein text-matching algorithm.…”

Section: Optical Character Recognition (mentioning)

confidence: 99%

OCR-based Hybrid Image Text Summarizer using Luhn Algorithm with Finetune Transformer Models for Long Document

Singco, Trillo, Abalorio et al. 2023

IJETAE

The accessibility of an enormous number of image text documents on the internet has expanded the opportunities to develop a system for image text recognition with text summarization. Several approaches used in ATS in the literature are based on extractive and abstractive techniques; however, few implementations of the hybrid approach were observed. This paper employed state-of-the-art transformer models with the Luhn algorithm for texts extracted using Tesseract OCR. Nine models were generated and tested using the hybrid text summarization approach. Using ROUGE metrics, we compared the proposed system's finetune abstractive models against existing abstractive models that use the same dataset, Xsum. As a result, the finetune model got the highest ROUGE score during evaluation: the ROUGE-1 score was 57%, the ROUGE-2 score was 43%, and the ROUGE-L score was 42%. Furthermore, even when better algorithms and models were available for summarization, the Luhn algorithm and T5 finetune model provided significant results.
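The text-matching step in the citation statement above (comparing texts extracted from images with the Levenshtein algorithm) could look roughly like the sketch below. It assumes the OCR step has already produced one string per image. The file names, the OCR strings, and the helper names `closest_window_distance` and `rank_images` are hypothetical, and the sliding word window is one simple matching strategy chosen for illustration, not necessarily the one used in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (substitute, insert, delete)."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j - 1] + (ca != cb),  # substitution
                               new_row[j - 1] + 1,       # insertion
                               row[j] + 1))              # deletion
        row = new_row
    return row[-1]


def closest_window_distance(query: str, text: str) -> int:
    """Smallest distance between the query and any run of words in
    `text` that has as many words as the query (case-insensitive)."""
    q_words = query.lower().split()
    t_words = text.lower().split()
    n = len(q_words)
    target = " ".join(q_words)
    # At least one window, even when the text is shorter than the query.
    windows = range(max(1, len(t_words) - n + 1))
    return min(levenshtein(target, " ".join(t_words[i:i + n])) for i in windows)


def rank_images(query: str, extracted_texts: dict[str, str]) -> list[tuple[int, str]]:
    """Return (distance, image name) pairs, best match first."""
    return sorted((closest_window_distance(query, text), name)
                  for name, text in extracted_texts.items())


# Hypothetical OCR results; "pemerintab" imitates a one-character OCR error.
extracted_texts = {
    "letter_01.png": "surat bantuan pemerintab daerah",
    "invoice_07.png": "total amount due in thirty days",
}
print(rank_images("bantuan pemerintah", extracted_texts)[0])
# prints (1, 'letter_01.png')
```

Ranking by distance instead of requiring an exact match lets a query still find a document whose extracted text contains small recognition errors.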
