2018
DOI: 10.1016/j.datak.2018.07.006
Knowledge-rich image gist understanding beyond literal meaning

Abstract: We investigate the problem of understanding the message (gist) conveyed by images and their captions as found, for instance, on websites or news articles. To this end, we propose a methodology to capture the meaning of image-caption pairs on the basis of large amounts of machine-readable knowledge that has previously been shown to be highly effective for text understanding. Our method identifies the connotation of objects beyond their denotation: where most approaches to image understanding focus on the denota…

Cited by 11 publications (4 citation statements)
References 66 publications
“…Without this knowledge and previous experience, it is not possible to correctly classify or understand the image, because it is an entirely new pattern to be recognized. Previously gained experience allows new observed patterns to be compared with previously recognized ones and referred to in the process of understanding [6]. In the described model of knowledge-based perception, it is also difficult to correctly classify patterns that are already known but are shown in an unusual situation, because the expectations generated in the cognitive model for such a pattern are completely different.…”
Section: Perceptual Inference Model
confidence: 99%
“…They detect whether the image and the text make the same point, whether one modality is unclear without the other, whether the modalities, when considered separately, imply opposing ideas, and whether one of the modalities is sufficient to convey the message. Weiland et al. (2018) focus on detecting whether captions of images contain complementary information. Vempala and Preoţiuc-Pietro (2019) infer relationship categories between the text and image of Twitter posts to see how the meaning of the entire tweet is composed.…”
Section: Related Work
confidence: 99%
“…Collected multimodal corpora. Recent computational work has examined diverse multimodal corpora collected from in-vivo social processes, e.g., visual/textual advertisements (Hussain et al., 2017; Ye and Kovashka, 2018), images with non-literal captions in news articles (Weiland et al., 2018), and image/text instructions in cooking how-to documents (Alikhani et al., 2019). In these cases, multimodal classification tasks are often proposed over these corpora as a means of testing different theories from semiotics (Barthes, 1988; O'Toole, 1994; Lemke, 1998; O'Halloran, 2004, inter alia); unlike many VQA-style datasets, they are generally not specifically balanced to force models to learn crossmodal interactions.…”
Section: Related Work
confidence: 99%