2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01280
Engaging Image Captioning via Personality

Abstract: Standard image captioning tasks such as COCO and Flickr30k are factual, neutral in tone, and (to a human) state the obvious (e.g., "a man playing a guitar"). While such tasks are useful to verify that a machine understands the content of an image, they are not engaging to humans as captions. With this in mind, we define a new task, PERSONALITY-CAPTIONS, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits. We collect and release a large dataset of 2…

Cited by 131 publications (110 citation statements) | References 49 publications
“…Again, the results reveal that the ranking of the three systems is identical across all evaluation scores, even though BISON measures different aspects of the system than the captioning scores. In line with prior work [50], we find that the UpDown captioning system outperforms its competitors in terms of all evaluation measures, including BISON.…”
Section: Results (supporting)
confidence: 89%
“…As a result, the evaluations may be sensitive to changes in the reference caption set and incorrectly assess the semantics of the generated caption. We perform an analysis designed to study these effects on the COCO captions validation set by asking human annotators to assess image captions generated by the state-of-the-art UpDown [4,50] captioning system 1 . Specifically, we followed the COCO guidelines for human evaluation [1] and asked annotators to evaluate the "correctness" of image-caption pairs on a Likert scale from 1 (low) to 5 (high).…”
Section: Figure (mentioning)
confidence: 99%
“…Thus, methods developed on such datasets might not be easily adopted in the wild. Nevertheless, great efforts have been made to extend captioning to out-of-domain data [3,9,69] or different styles beyond mere factual descriptions [22,55]. In this work we explore unsupervised captioning, where image and language sources are independent.…”
Section: Language Domain (mentioning)
confidence: 99%
“…jealous girlfriend) versus high-level personality models such as the Big Five. We believe that TV Tropes is better for our purpose of fictional character modeling than data sources used in works such as Shuster et al. (2019) because TV Tropes' content providers are rewarded for correctly providing content through community acknowledgement.…”
Section: Human-Level Attributes (HLA) (mentioning)
confidence: 99%