Emotional facial expressions play an important role in social communication across primates. Despite major progress made in our understanding of categorical information processing such as for objects and faces, little is known, however, about how the primate brain evolved to process emotional cues. In this study, we used functional magnetic resonance imaging (fMRI) to compare the processing of emotional facial expressions between monkeys and humans. We used a 2 × 2 × 2 factorial design with species (human and monkey), expression (fear and chewing) and configuration (intact versus scrambled) as factors. At the whole brain level, selective neural responses to conspecific emotional expressions were anatomically confined to the superior temporal sulcus (STS) in humans. Within the human STS, we found functional subdivisions with a face-selective right posterior STS area that also responded selectively to emotional expressions of other species and a more anterior area in the right middle STS that responded specifically to human emotions. Hence, we argue that the latter region does not show a mere emotion-dependent modulation of activity but is primarily driven by human emotional facial expressions. Conversely, in monkeys, emotional responses appeared in earlier visual cortex and outside face-selective regions in inferior temporal cortex that responded also to multiple visual categories. Within monkey IT, we also found areas that were more responsive to conspecific than to non-conspecific emotional expressions but these responses were not as specific as in human middle STS. Overall, our results indicate that human STS may have developed unique properties to deal with social cues such as emotional expressions.
Abstract-Humans are able to merge information from multiple perceptional modalities and formulate a coherent representation of the world. Our thesis is that robots need to do the same in order to operate robustly and autonomously in an unstructured environment. It has also been shown in several fields that multiple sources of information can complement each other, overcoming the limitations of a single perceptual modality. Hence, in this paper we introduce a data set of actions that includes both visual data (RGB-D video and 6DOF object pose estimation) and acoustic data. We also propose a method for recognizing and segmenting actions from continuous audiovisual data. The proposed method is employed for extensive evaluation of the descriptive power of the two modalities, and we discuss how they can be used jointly to infer a coherent interpretation of the recorded action.
We present a real-time technique for the spatiotemporal segmentation of color/depth movies. Images are segmented using a parallel Metropolis algorithm implemented on a GPU utilizing both color and depth information, acquired with the Microsoft Kinect. Segments represent the equilibrium states of a Potts model, where tracking of segments is achieved by warping obtained segment labels to the next frame using real-time optical flow, which reduces the number of iterations required for the Metropolis method to encounter the new equilibrium state. By including depth information into the framework, true objects boundaries can be found more easily, improving also the temporal coherency of the method. The algorithm has been tested for videos of medium resolutions showing human manipulations of objects. The framework provides an inexpensive visual front end for visual preprocessing of videos in industrial settings and robot labs which can potentially be used in various applications.
This paper discusses options for testing correspondence algorithms in stereo or motion analysis that are designed or considered for vision-based driver assistance. It introduces a globally available database, with a main focus on testing on video sequences of real-world data. We suggest the classification of recorded video data into situations defined by a cooccurrence of some events in recorded traffic scenes. About 100-400 stereo frames (or 4-16 s of recording) are considered a basic sequence, which will be identified with one particular situation. Future testing is expected to be on data that report on hours of driving, and multiple hours of long video data may be segmented into basic sequences and classified into situations. This paper prepares for this expected development. This paper uses three different evaluation approaches (prediction error, synthesized sequences, and labeled sequences) for demonstrating ideas, difficulties, and possible ways in this future field of extensive performance tests in vision-based driver assistance, particularly for cases where the ground truth is not available. This paper shows that the complexity of real-world data does not support the identification of general rankings of correspondence techniques on sets of basic sequences that show different situations. It is suggested that correspondence techniques should adaptively be chosen in real time using some type of statistical situation classifiers.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.