Reconstructing Manual Information Extraction with DB-to-Document Backprojection: Experiments in the Life Science Domain

Müller, Mark-Christoph; Ghosh, Sucheta; Rey, Maja; Wittig, Ulrike; Müller, Wolfgang; Strube, Michael

doi:10.18653/v1/2020.sdp-1.9

Cited by 1 publication

(1 citation statement)

References 19 publications

“…What is more, inpainted backgrounds are only required if highlighting detection is desired: For text-only alignment, plain scans are sufficient. The actual highlighting extraction works as follows (see Müller et al (2020) for details): Since document highlighting comes mostly in strong colors, which are characterized by large differences among their three component values in the RGB color model, we create a binarized version of each page by going over all pixels in the background image and setting each pixel to 1 if the pairwise differences between the R, G, and B components are above a certain threshold (50), and to 0 otherwise. This yields an image with regions of higher and lower density of black pixels.…”

Section: Highlighting Detection (mentioning)

confidence: 99%

Word-Level Alignment of Paper Documents with their Electronic Full-Text Counterparts

Müller, Ghosh, Wittig et al. 2021

Proceedings of the 20th Workshop on Biomedical Language Processing

Self Cite

We describe a simple procedure for the automatic creation of word-level alignments between printed documents and their respective full-text versions. The procedure is unsupervised, uses standard, off-the-shelf components only, and reaches an F-score of 85.01 in the basic setup and up to 86.63 when using pre- and post-processing. Potential areas of application are manual database curation (incl. document triage) and biomedical expression OCR.
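
The binarization step described in the citation statement above can be illustrated with a minimal sketch. This is not the authors' code: the function names (`is_strongly_colored`, `binarize_page`) and the nested-list image representation are hypothetical. The statement does not say whether all pairwise differences or only one must exceed the threshold. The sketch assumes that one is enough, since a yellow marker such as (255, 255, 0) has an R-G difference of 0.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each component in 0..255


def is_strongly_colored(pixel: Pixel, threshold: int = 50) -> bool:
    """Return True if the largest pairwise RGB difference exceeds the threshold.

    Assumption: a single pairwise difference above the threshold is enough.
    """
    r, g, b = pixel
    return max(abs(r - g), abs(r - b), abs(g - b)) > threshold


def binarize_page(image: List[List[Pixel]], threshold: int = 50) -> List[List[int]]:
    """Map every pixel to 1 (strongly colored) or 0 (near-gray)."""
    return [
        [1 if is_strongly_colored(pixel, threshold) else 0 for pixel in row]
        for row in image
    ]


# Tiny hypothetical 2x3 "page": white, yellow marker, black text,
# mid-gray, pink marker, and a slightly tinted paper tone.
page = [
    [(255, 255, 255), (255, 255, 0), (0, 0, 0)],
    [(128, 128, 128), (255, 105, 180), (200, 190, 180)],
]

mask = binarize_page(page)
print(mask)  # [[0, 1, 0], [0, 1, 0]]
```

Only the two marker-colored pixels are set to 1. White, black, gray, and the lightly tinted paper tone stay at 0 because their three components are close together. The default of 50 is the threshold quoted in the statement.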
