2022
DOI: 10.3390/app12031457

EmmDocClassifier: Efficient Multimodal Document Image Classifier for Scarce Data

Abstract: Document classification is one of the most critical steps in the document analysis pipeline. Approaches to document classification fall into two types: image-based and multimodal. Image-based approaches rely solely on the inherent visual cues of the document images, whereas the multimodal approach co-learns visual and textual features and has proved to be more effective. Nonetheless, these approaches require a huge amount of data. This paper pre…

Cited by 16 publications (24 citation statements)
References 50 publications (78 reference statements)
“…For example, the two classes Presentation and Scientific Report have an overlap of 3-4%. This finding is similar to that reported by Kanchi et al (2022) [45,Fig. 9] on their multimodal approach.…”
Section: B. Overall Evaluation (supporting)
confidence: 92%
“…For example, the Scientific class is mainly confused with the Report and News classes, which makes perfect sense since these classes usually have similar visual semantics. This is again very similar to the results of Kanchi et al (2022) [45,Fig. 10] who found a large overlap between the Scientific and Report classes.…”
Section: B. Overall Evaluation (supporting)
confidence: 91%
“…For example, the two classes Presentation and Scientific Report have an overlap of 3-4%. This finding is similar to that reported by Kanchi et al (2022) [48,Fig. 9] on their multimodal approach.…”
Section: B. Overall Evaluation (supporting)
confidence: 92%
“…It is interesting to note that even our lightest variant DocXClassifier-B achieved a comparable accuracy of 94.00%, and performed significantly better than all existing image-based models as well as some of the more sophisticated multimodal approaches [35], [46], [47], thus representing a good trade-off between accuracy and computational cost. It is important to note that two of the best performing multimodal solutions, those of Kanchi et al (2022) [48] and Bakkali et al (2020) [17], simply combined ConvNet-based visual backbones (EfficientNet and NasNet, respectively) with a Transformer-based textual backbone (BERT) to achieve extraordinary improvements in document classification. We suspect that using our improved ConvNet models as visual backbones in such multimodal approaches could lead to even better results.…”
Section: B. Overall Evaluation (mentioning)
confidence: 99%
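To make the fusion pattern described in the citation statement above concrete, the Python sketch below pairs a ConvNet visual backbone with a BERT textual backbone and concatenates their pooled features before a small classification head. This is a minimal illustration, not the EmmDocClassifier or DocXClassifier implementation: the choice of EfficientNet-B0, the 16-class output (as in RVL-CDIP), the head dimensions, and the placeholder preprocessing are all assumptions made for the example.

# Minimal multimodal fusion sketch (Python, PyTorch + HuggingFace Transformers).
# Assumptions: EfficientNet-B0 visual backbone, BERT-base textual backbone,
# simple concatenation fusion, 16 document classes.
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel, BertTokenizerFast

class MultimodalDocClassifier(nn.Module):
    def __init__(self, num_classes: int = 16):
        super().__init__()
        # Visual backbone: EfficientNet-B0 with its classifier removed (1280-d features).
        self.visual = models.efficientnet_b0(weights=None)
        vis_dim = self.visual.classifier[1].in_features
        self.visual.classifier = nn.Identity()
        # Textual backbone: BERT-base; its pooled [CLS] output is 768-d.
        self.textual = BertModel.from_pretrained("bert-base-uncased")
        txt_dim = self.textual.config.hidden_size
        # Fusion head: concatenate both feature vectors, then classify.
        self.head = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes),
        )

    def forward(self, image, input_ids, attention_mask):
        vis_feat = self.visual(image)                      # (B, 1280)
        txt_out = self.textual(input_ids=input_ids, attention_mask=attention_mask)
        txt_feat = txt_out.pooler_output                   # (B, 768)
        return self.head(torch.cat([vis_feat, txt_feat], dim=-1))

# Usage: one document image plus its OCR'd text (both placeholders here).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = MultimodalDocClassifier(num_classes=16)
tokens = tokenizer(["quarterly earnings report ..."], padding=True,
                   truncation=True, max_length=128, return_tensors="pt")
image = torch.randn(1, 3, 224, 224)                        # stand-in for a preprocessed page image
logits = model(image, tokens["input_ids"], tokens["attention_mask"])  # (1, 16)

Concatenation is only the simplest fusion choice; the cited multimodal approaches report further gains from how the visual and textual streams are jointly trained, which this sketch does not attempt to reproduce.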