Multi-Modal Association based Grouping for Form Structure Extraction

Aggarwal, Milan; Sarkar, Mausoom; Gupta, Hiresh; Krishnamurthy, Balaji

doi:10.1109/wacv45572.2020.9093376

Cited by 10 publications

(5 citation statements)

References 33 publications

(46 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Sarkar et al [16] predicts all levels of the document hierarchy in parallel, making it quite efficient. Aggarwal et al [2] offers an approach that is architecturally like a languagebased approach but uses contextual pooling of CNN features like [5]. They determine a context window by identifying a neighborhood of form elements and use a CNN to extract image features from this context window.…”

Section: Prior Workmentioning

confidence: 99%

Visual FUDGE: Form Understanding via Dynamic Graph Editing

Davis¹,

Morse²,

Price³

et al. 2021

Preprint

View full text Add to dashboard Cite

We address the problem of form understanding: finding text entities and the relationships/links between them in form images. The proposed FUDGE model formulates this problem on a graph of text elements (the vertices) and uses a Graph Convolutional Network to predict changes to the graph. The initial vertices are detected text lines and do not necessarily correspond to the final text entities, which can span multiple lines. Also, initial edges contain many false-positive relationships. FUDGE edits the graph structure by combining text segments (graph vertices) and pruning edges in an iterative fashion to obtain the final text entities and relationships. While recent work in this area has focused on leveraging large-scale pre-trained Language Models (LM), FUDGE achieves the same level of entity linking performance on the FUNSD dataset by learning only visual features from the (small) provided training set. FUDGE can be applied on forms where text recognition is difficult (e.g. degraded or historical forms) and on forms in resource-poor languages where pre-training such LMs is challenging. FUDGE is stateof-the-art on the historical NAF dataset.

show abstract

Section: Prior Workmentioning

confidence: 99%

Visual FUDGE: Form Understanding via Dynamic Graph Editing

Davis¹,

Morse²,

Price³

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

“…NLP-based methods work on low-level elements (e.g., tokens) and model layout analysis as a sequence labeling task. MMPAN [1] is presented to recognize form structures. DocBank [20] is proposed as a large scale dataset of multimodal layout analysis and several NLP baselines have been released.…”

Section: Related Workmentioning

confidence: 99%

“…Some regions (e.g.,Figure , Table) can be easily identified by visual features, while semantic features are important for separating visually similar regions (e.g.,Abstract and Paragraph). Therefore, some recent efforts try to combine both modalities [1,20,39,3]. Here we summarize them into two categories.…”

Section: Introductionmentioning

confidence: 99%

“…NLP-based methods (Fig 1 (a)) model layout analysis as a sequence labeling task and apply a bottom-up strategy. They first serialize texts into 1D token sequence 1 . Then, using both semantic and visual features (such as coordinates and image embedding) of each token, they determine token labels sequentially through a sequence labeling model.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations

Zhang¹,

Li²,

Liang³

et al. 2021

Preprint

View full text Add to dashboard Cite

Document layout analysis is crucial for understanding document structures. On this task, vision and semantics of documents, and relations between layout components contribute to the understanding process. Though many works have been proposed to exploit the above information, they show unsatisfactory results. NLP-based methods model layout analysis as a sequence labeling task and show insufficient capabilities in layout modeling. CV-based methods model layout analysis as a detection or segmentation task, but bear limitations of inefficient modality fusion and lack of relation modeling between layout components. To address the above limitations, we propose a unified framework VSR for document layout analysis, combining vision, semantics and relations. VSR supports both NLP-based and CV-based methods. Specifically, we first introduce vision through document image and semantics through text embedding maps. Then, modality-specific visual and semantic features are extracted using a two-stream network, which are adaptively fused to make full use of complementary information. Finally, given component candidates, a relation module based on graph neural network is incorported to model relations between components and output final results. On three popular benchmarks, VSR outperforms previous models by large margins. Code will be released soon.

show abstract

“…The above mentioned methods are CVbased,considering layout analysis as detection or segmentation tasks. There are also some NLP-based methods [10,40],viewing layout analysis as a sequence-labeling task.These methods usually obtain text information through PDF parsing or OCR recognition.Thetextinformationprovides auxiliary NLP modality enhancement when mixed with CV-based methods, while for CV-based unimodal, the performance depends heavily on optimized visual feature representation.…”

Section: Introductionmentioning

confidence: 99%

HSCA-Net: A Hybrid Spatial-Channel Attention Network in Multi-Scale Feature Pyramid for Document Layout Analysis

Zhang

Shi

et al. 2022

JAIT

View full text Add to dashboard Cite

Document images often contain various page components and complex logical structures, which makes document layout analysis task challenging. For most deep learning based document layout analysis methods, convolutional neural networks (CNNs) are adopted as the image feature extraction networks. In this paper, a hybrid spatial-channel attention network (HSCA-Net) is proposed to improve feature extraction capability by exerting attention mechanism to explore more salient properties within document pages. The HSCA-Net contains two modules: spatial attention module (SAM) and channel attention module (CAM). They are embedded in the multi-scale feature network by lateral attention connection. SAM extracts contextual information with learning offset in spatial dimension and CAM performs feature recalibration by focusing more on feature channels with important contents. The lateral attention connection is to incorporate SAM and CAM into multi-scale feature pyramid network and retain more of the original feature information. The effectiveness and adaptability of HSCA-Net are evaluated through multiple experiments on publicly available datasets PubLayNet, ICDAR-POD and Article Regions. The mAP on these datasets is as high as 0.940,0.939 and 0.967 respectively, which demonstrate that our HSCA-Net achieves competitive results on document layout analysis task. Text.

show abstract

Multi-Modal Association based Grouping for Form Structure Extraction

Cited by 10 publications

References 33 publications

Visual FUDGE: Form Understanding via Dynamic Graph Editing

Visual FUDGE: Form Understanding via Dynamic Graph Editing

VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations

HSCA-Net: A Hybrid Spatial-Channel Attention Network in Multi-Scale Feature Pyramid for Document Layout Analysis

Contact Info

Product

Resources

About