Nikolai Ilinykh scite author profile

Nikolai Ilinykh

5 Publications

17 Citation Statements Received

132 Citation Statements Given

Affiliations

University of Gothenburg

Publications

Tell Me More: A Dataset of Visual Scene Description Sequences

Ilinykh, Zarrieß, Schlangen (2019)

We present a dataset consisting of what we call image description sequences. These multisentence descriptions of the contents of an image were collected in a pseudo-interactive setting, where the describer was told to describe the given image to a listener who needs to identify the image within a set of images, and who successively asks for more information. As we show, this setup produced nicely structured data that, we think, will be useful for learning models capable of planning and realising such description discourses.

The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description

Ilinykh, Zarrieß, Schlangen (2018)

Image captioning models are typically trained on data that is collected from people who are asked to describe an image, without being given any further task context. As we argue here, this context independence is likely to cause problems for transferring to task settings in which image description is bound by task demands. We demonstrate that careful design of data collection is required to obtain image descriptions which are contextually bounded to a particular meta-level task. As a task, we use MeetUp!, a text-based communication game where two players have the goal of finding each other in a visual environment. To reach this goal, the players need to describe images representing their current location. We analyse a dataset from this domain and show that the nature of image descriptions found in MeetUp! is diverse, dynamic and rich with phenomena that are not present in descriptions obtained through a simple image captioning task, which we ran for comparison.

Describe me an Aucklet: Generating Grounded Perceptual Category Descriptions

Noble, Ilinykh (2023). Preprint

Human language users can generate descriptions of perceptual concepts beyond instance-level representations and also use such descriptions to learn provisional class-level representations. However, the ability of computational models to learn and operate with class representations is under-investigated in the language-and-vision field. In this paper, we train separate neural networks to generate and interpret class-level descriptions. We then use the zero-shot classification performance of the interpretation model as a measure of communicative success and class-level conceptual grounding. We investigate the performance of prototype- and exemplar-based neural representations grounded category description. Finally, we show that communicative success reveals performance issues in the generation model that are not captured by traditional intrinsic NLG evaluation metrics, and argue that these issues can be traced to a failure to properly ground language in vision at the class level. We observe that the interpretation model performs better with descriptions that are low in diversity on the class level, possibly indicating a strong reliance on frequently occurring features.

We went to look for meaning and all we got were these lousy representations: aspects of meaning representation for computational semantics

Dobnik, Cooper, Ek et al. (2021). Preprint

What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations

Ilinykh, Dobnik (2021). Front. Artif. Intell.

Neural networks have proven to be very successful in automatically capturing the composition of language and different structures across a range of multi-modal tasks. Thus, an important question to investigate is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their respective interplay, and the task’s effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. Our results indicate that the information about different relations between objects in the visual stream is hierarchical and varies from local to a global object-level understanding of the image. In particular, while visual representations in the first layers encode the knowledge of relations between semantically similar object detections, often constituting neighbouring objects, deeper layers expand their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, indicating a critical finding - the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question that we investigate is whether the insights from cognitive science echo the structure of representations that the current neural architecture learns. 
The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve on such problems as pre-training of large-scale multi-modal architectures, multi-modal information fusion and probing of attention weights. In general, we contribute to the explainable multi-modal natural language processing and currently shallow understanding of how the input representations and the structure of the multi-modal transformer affect visual representations.
