2023
DOI: 10.48550/arxiv.2302.11713
Preprint

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Abstract: Large language models [5,7] have demonstrated an emergent capability in answering knowledge-intensive questions. With recent progress on web-scale visual and language pre-training [2,6,38], do these models also understand how to answer visual information-seeking questions? To answer this question, we present INFOSEEK, a Visual Question Answering dataset that focuses on asking information-seeking questions, where the information cannot be answered by common sense knowledge. We perform a multi-stage human annotation…

Cited by 2 publications (1 citation statement)
References 45 publications
“…A few works focus on direct question answering on charts, such as DVQA [11], FigureQA [9], and PlotQA [10], and made their datasets public. Meanwhile, Chen et al. [22] introduced a benchmark dataset for visual information-seeking questions on natural images. The work by Samira et al. [9] covers five distinct chart types: line, dot-line, vertical and horizontal bar charts, and pie plots.…”
Section: B. Question Answering on Charts
Mentioning confidence: 99%