Analysis of the Logical Layout of Documents

Dengel, Andreas

doi:10.1007/978-0-85729-859-1_6

Cited by 14 publications

(8 citation statements)

References 42 publications

“…In addition, Dengel and Shafait [1] pointed out that understanding books was researched in a different way, by analysing page sections and generating a table of contents to make the digitised books searchable. They also found that all the research data sets they were aware of were not publicly available.…”

Section: Related Work (mentioning)

confidence: 99%

“…Dengel and Shafait [1] offered a review of the state of the art, which included six main approaches for logical labelling. Many of them require the existence of additional information like OCR results or document domain knowledge, for example, knowledge about the layout of business letters or invoices.…”

Section: Related Work (mentioning)

confidence: 99%

“…Finding a method to reveal the reading order of text paragraphs is one of many additional concerns of logical analysis research. According to Dengel and Shafait [1], document logical layout analysis provides other higher-level functionality options like automatic routing of business letters, automatic processing of invoices and within-book navigation facility.…”

Section: Introduction (mentioning)

confidence: 99%

Text and metadata extraction from scanned Arabic documents using support vector machines

Qin, Elanwar, Betke (2020), Journal of Information Science

Text information in scanned documents becomes accessible only when extracted and interpreted by a text recognizer. For a recognizer to work successfully, it must have detailed location information about the regions of the document images that it is asked to analyse. It needs to focus on page regions with text, skipping non-text regions that include illustrations or photographs. However, text recognizers do not work as logical analyzers. Logical layout analysis automatically determines the function of a document text region, that is, it labels each region as a title, paragraph, caption, and so on, and thus is an essential part of a document understanding system. In the past, rule-based algorithms have been used to conduct logical layout analysis, using limited-size data sets. We here instead focus on supervised learning methods for logical layout analysis. We describe LABA, a system based on multiple support vector machines to perform logical Layout Analysis of scanned Book pages in Arabic. The system detects the function of a text region based on the analysis of various image features and a voting mechanism. For a baseline comparison, we implemented an older but state-of-the-art neural network method. We evaluated LABA using a data set of scanned pages from illustrated Arabic books and obtained high recall and precision values. We also found that the F-measure of LABA is higher for five of the six tested classes compared to the state-of-the-art method.

“…Some of these have been addressed by the software developed for the Alberti Magni e-corpus project. In particular, after preparing the scholar's texts in a suitable XML-tagged form, a system built on top of sgrep for search and Dragoman for display can address many of those needs. Alternative XML-aware search engines (such as BaseX [20], eXist [21], Wumpus [3], or XQEngine [16]) could equally well have been used in this project, simplifying some solutions but requiring more effort to address other concerns.…”

Section: Retrospective and Further Work (mentioning)

confidence: 99%

“…The most difficult part of starting a project with a new corpus is to convert the text into XML that reflects its logical structure, an extremely challenging task when physical layout must be interpreted [8], but also quite challenging when the input is plain text with embedded font information. In most of the text, each feature to be tagged can be recognized fairly easily, but unexpected difficulties arise when the features overlap in unanticipated ways.…”

Section: Retrospective and Further Work (mentioning)

confidence: 99%

Fashioning a Search Engine to Support Humanities Research

Tompa (2018), Proceedings of the ACM Symposium on Document Engineering 2018

Scholarship in the humanities often requires the ability to search curated electronic corpora and to display search results in a variety of formats. Challenges that need to be addressed include transforming the texts into a suitable form, typically XML, and catering to the scholars' search and display needs. We describe our experience in creating such a search and display facility.

CCS Concepts: Applied computing → Document searching; Extensible Markup Language (XML); Information systems → Digital libraries and archives.


Clipping the Page – Automatic Article Detection and Marking Software in Production of Newspaper Clippings of a Digitized Historical Journalistic Collection

Kettunen, Pääkkönen, Liukkonen (2019), Digital Libraries for Open Knowledge

This paper describes the use of article detection and extraction on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869-1918. We use PIVAJ software [1] for detection and marking of articles in our collection. From the separated articles we can produce automatic clippings for the user. The user can collect clippings for their own use both as images and as OCRed text. Together these functionalities improve the usability of the digitized journalistic collection by providing structured access to the contents of a page.
