2022
DOI: 10.3390/app12031457

EmmDocClassifier: Efficient Multimodal Document Image Classifier for Scarce Data

Abstract: Document classification is one of the most critical steps in the document analysis pipeline. Approaches to document classification fall into two types: image-based and multimodal. Image-based approaches rely solely on the inherent visual cues of the document images, whereas the multimodal approach co-learns visual and textual features and has proved to be more effective. Nonetheless, these approaches require a huge amount of data. This paper pre…

Cited by 16 publications (24 citation statements)
References 50 publications (78 reference statements)
“…For example, the two classes Presentation and Scientific Report have an overlap of 3-4%. This finding is similar to that reported by Kanchi et al (2022) [45,Fig. 9] on their multimodal approach.…”
Section: B. Overall Evaluation (supporting)
confidence: 92%
“…For example, the Scientific class is mainly confused with the Report and News classes, which makes perfect sense since these classes usually have similar visual semantics. This is again very similar to the results of Kanchi et al (2022) [45,Fig. 10] who found a large overlap between the Scientific and Report classes.…”
Section: B. Overall Evaluation (supporting)
confidence: 91%
“…For example, the two classes Presentation and Scientific Report have an overlap of 3-4%. This finding is similar to that reported by Kanchi et al (2022) [48,Fig. 9] on their multimodal approach.…”
Section: B. Overall Evaluation (supporting)
confidence: 92%
“…It is interesting to note that even our lightest variant DocXClassifier-B achieved a comparable accuracy of 94.00%, and performed significantly better than all existing image-based models as well as some of the more sophisticated multimodal approaches [35], [46], [47], thus representing a good trade-off between accuracy and computational cost. It is important to note that two of the best performing multimodal solutions, those of Kanchi et al (2022) [48] and Bakkali et al (2020) [17], simply combined ConvNet-based visual backbones (EfficientNet and NasNet, respectively) with a Transformer-based textual backbone (BERT) to achieve extraordinary improvements in document classification. We suspect that using our improved ConvNet models as visual backbones in such multimodal approaches could lead to even better results.…”
Section: B. Overall Evaluation (mentioning)
confidence: 99%
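To make the fusion pattern described in the citation statement above concrete, the Python sketch below pairs a ConvNet visual backbone with a BERT textual backbone and concatenates their pooled features before a small classification head. This is a minimal illustration, not the EmmDocClassifier or DocXClassifier implementation: the choice of EfficientNet-B0, the 16-class output (as in RVL-CDIP), the head dimensions, and the placeholder preprocessing are all assumptions made for the example.

# Minimal multimodal fusion sketch (Python, PyTorch + HuggingFace Transformers).
# Assumptions: EfficientNet-B0 visual backbone, BERT-base textual backbone,
# simple concatenation fusion, 16 document classes.
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel, BertTokenizerFast

class MultimodalDocClassifier(nn.Module):
    def __init__(self, num_classes: int = 16):
        super().__init__()
        # Visual backbone: EfficientNet-B0 with its classifier removed (1280-d features).
        self.visual = models.efficientnet_b0(weights=None)
        vis_dim = self.visual.classifier[1].in_features
        self.visual.classifier = nn.Identity()
        # Textual backbone: BERT-base; its pooled [CLS] output is 768-d.
        self.textual = BertModel.from_pretrained("bert-base-uncased")
        txt_dim = self.textual.config.hidden_size
        # Fusion head: concatenate both feature vectors, then classify.
        self.head = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes),
        )

    def forward(self, image, input_ids, attention_mask):
        vis_feat = self.visual(image)                      # (B, 1280)
        txt_out = self.textual(input_ids=input_ids, attention_mask=attention_mask)
        txt_feat = txt_out.pooler_output                   # (B, 768)
        return self.head(torch.cat([vis_feat, txt_feat], dim=-1))

# Usage: one document image plus its OCR'd text (both placeholders here).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = MultimodalDocClassifier(num_classes=16)
tokens = tokenizer(["quarterly earnings report ..."], padding=True,
                   truncation=True, max_length=128, return_tensors="pt")
image = torch.randn(1, 3, 224, 224)                        # stand-in for a preprocessed page image
logits = model(image, tokens["input_ids"], tokens["attention_mask"])  # (1, 16)

Concatenation is only the simplest fusion choice; the cited multimodal approaches report further gains from how the visual and textual streams are jointly trained, which this sketch does not attempt to reproduce.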