2005
DOI: 10.1016/j.csl.2004.08.003

Towards situated speech understanding: visual context priming of language models

Cited by 53 publications (51 citation statements). References 14 publications.
“…In a follow-on simulation, it moreover developed the behavior that we observed in people for Experiment 2: a greater relative priority of the immediate events over stereotypical thematic role knowledge in thematic role assignment shortly after the verb is encountered. A further computational model of spoken language comprehension suitable for modeling our findings appears to be Fuse by Roy and Mukherjee (2005). It includes a dynamic model of visual attention that enables anticipating the most likely objects in a scene based on processing of the unfolding utterance.…”
Section: Discussion (mentioning; confidence: 99%)
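
To illustrate the kind of anticipation this statement attributes to Fuse, here is a minimal sketch in Python. It is not the Fuse model itself; the scene, object names, and weighting scheme are invented for the example. It shows how a partial spoken utterance can shift attention toward the scene objects it is most likely to refer to:

    # Toy sketch of utterance-driven visual attention, not the Fuse model.
    # SCENE is a hypothetical input invented for illustration.
    SCENE = {"ball": 0.25, "balloon": 0.25, "cup": 0.25, "plate": 0.25}

    def attend(prefix, scene):
        """Re-weight attention over scene objects as the utterance unfolds:
        objects whose names are consistent with the spoken prefix keep
        their weight, the rest are suppressed, and weights renormalize."""
        scores = {obj: (w if obj.startswith(prefix) else 1e-6)
                  for obj, w in scene.items()}
        total = sum(scores.values())
        return {obj: s / total for obj, s in scores.items()}

    for prefix in ("b", "ba", "ball", "ballo"):
        print(prefix, attend(prefix, SCENE))
    # After "b" attention splits between ball and balloon; by "ballo" it
    # has converged on the balloon before the word is even complete.

The point of the sketch is the incrementality: attention is updated word-fragment by word-fragment, so the most likely referents are anticipated while the utterance is still unfolding.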
“…Future research on this phenomenon will benefit greatly from combining human experimentation with the further development of explicit computational models of this kind of interaction between language and vision. As implemented models of spoken word recognition are interfaced with implemented models of visual processing (see, e.g., Roy & Mukherjee, 2005; Spivey, Grosjean, & Knoblich, 2005), we can begin to formulate a richer understanding of exactly how language comprehension and visual perception manage to interact so fluidly.…”
Section: Discussion (mentioning; confidence: 99%)
“…It is, at the time of writing, not yet fully implemented in a robot, and as specified makes no attempt to deal with asynchronous changes to representations in different parts of the system. In other systems [13, 2], binding can occur at a very early stage in processing, allowing even information from the speech signal to influence visual hypotheses for object references, and vice versa.…”
Section: Background and Motivation (mentioning; confidence: 99%)
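
The reverse direction of that early binding, vision influencing speech hypotheses, is what the cited paper's title calls visual context priming of language models. A minimal sketch, with an invented vocabulary and boost factor rather than either cited system's actual model, of a language model whose word probabilities are rescaled toward objects currently in view:

    # Toy sketch of visual context priming, not either cited system.
    # The vocabulary, probabilities, and boost factor are invented here.
    def primed(base_probs, scene_objects, boost=5.0):
        """Multiply the probability of words naming visible objects by a
        fixed boost, then renormalize to a proper distribution."""
        scores = {w: p * (boost if w in scene_objects else 1.0)
                  for w, p in base_probs.items()}
        total = sum(scores.values())
        return {w: s / total for w, s in scores.items()}

    lm = {"cup": 0.02, "ball": 0.02, "idea": 0.02, "the": 0.10}
    print(primed(lm, scene_objects={"cup", "ball"}))
    # "cup" and "ball" now outweigh "idea", so acoustically ambiguous
    # input is more readily recognized as a word naming a visible object.

Because the priming happens at the level of the language model's probabilities, it can influence recognition before any word is fully heard, which is what makes the early, bidirectional binding described above possible.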