Modeling human visual attention as a computational attention model leads to splitting the visual features into several independent channels. A difficult problem then arises: combining the resulting maps, which have different dynamic ranges and distributions. When several maps are considered, such a combination is mandatory in order to compute a single measure of interest for each location, regardless of which features contributed to the salience. Several cue-combination strategies are proposed in this paper, for the spatial cues as well as for the temporal saliency. Finally, user tests on still-image and video databases highlight one operator.
Index Terms: Visual attention, computational model, map fusion, user experiments, eye-tracker.
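The core difficulty described above is that per-feature conspicuity maps live on different scales, so they must be normalized before any combination operator is applied. The sketch below is a minimal illustration of that idea, not the paper's actual operators: `normalize` and `fuse` are hypothetical names, and the two operators shown (mean and max) are only common baseline choices for map fusion.

```python
import numpy as np

def normalize(m):
    """Rescale one conspicuity map to [0, 1] so that maps with
    different dynamic ranges become directly comparable."""
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m)

def fuse(maps, operator="mean"):
    """Combine per-feature maps into a single saliency map.
    'mean' averages the normalized maps; 'max' keeps, at each
    location, the strongest response across features."""
    norm = [normalize(m) for m in maps]
    if operator == "max":
        return np.maximum.reduce(norm)
    return np.mean(norm, axis=0)
```

Either operator yields one scalar measure of interest per location, which is exactly the property the combination step must guarantee.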
A purely bottom-up model of visual attention is proposed and compared to five state-of-the-art models. The role of low-level visual features is examined in two contexts, using two datasets: one containing data from an eye-tracking experiment in a free-viewing task, and a second containing 5000 hand-labeled pictures (observers had to enclose the most visually interesting objects in a rectangle). The relevance of the bottom-up models, i.e. their ability to predict where the salient areas are located, is evaluated. Whatever the metric and the dataset, the degree of similarity between predictions and ground truth is significantly above chance. The proposed model, which rests on a small number of features, is shown to be a good predictor not only of human visual fixations but also of the objects chosen as interesting by observers. This study suggests that low-level visual features play a significant role both in a free-viewing task and in a high-level visual task, such as choosing the object of interest in a complex visual scene. Another outcome concerns the viewing duration used in eye-tracking experiments: the results suggest that this parameter is not as critical as one might expect.
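Comparing a predicted saliency map to recorded fixations, as described above, requires a similarity metric for which chance level is known. A widely used option is Normalized Scanpath Saliency (NSS); the abstract does not state which metrics were used, so the sketch below is only an illustrative example of such a metric, with a hypothetical function name.

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: z-score the predicted map,
    then average its values at the human fixation points.
    Chance level is 0; higher means a better prediction."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    rows, cols = zip(*fixations)
    return float(s[list(rows), list(cols)].mean())
```

A uniform map scores 0 by construction, which makes "significantly above chance" a directly testable statement.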
Abstract. Most cell phones today can receive and display video content. Nonetheless, we are still significantly behind the point where premium made-for-mobile content is mainstream, widely available, and affordable. Significant issues must be overcome, and the small screen size is one of them. Indeed, directly transferring conventional content (i.e. content not specifically shot for mobile devices) yields a video in which the main characters or objects of interest may become indistinguishable from the rest of the scene. It is therefore necessary to retarget the content. Different solutions exist, based either on distortion of the image, on removal of redundant areas, or on cropping. The most efficient ones are based on dynamic adaptation of a cropping window; they significantly improve the viewing experience by zooming in on the regions of interest. Currently, there is no common agreement on how to compare different solutions. A retargeting metric is therefore proposed in order to gauge the quality of such solutions. Eye-tracking experiments, the zooming effect measured through a coverage ratio, and temporal consistency are introduced and discussed.
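The two quantitative notions named above, coverage ratio and temporal consistency, can be sketched very simply. The definitions below are plausible assumptions, not the paper's exact formulas: coverage ratio is taken as the fraction of the frame area kept by the cropping window, and temporal consistency as the mean displacement of the window centre between consecutive frames.

```python
import math

def coverage_ratio(crop_wh, frame_wh):
    """Fraction of the frame area kept by the cropping window;
    smaller values indicate a stronger zoom on the region of interest."""
    cw, ch = crop_wh
    fw, fh = frame_wh
    return (cw * ch) / (fw * fh)

def temporal_consistency(centers):
    """Mean displacement (in pixels) of the cropping-window centre
    between consecutive frames; lower means a smoother, less
    jittery window trajectory."""
    steps = [math.dist(a, b) for a, b in zip(centers, centers[1:])]
    return sum(steps) / len(steps) if steps else 0.0
```

Together these capture the trade-off a dynamic cropping window must manage: zooming in tightly (low coverage ratio) without moving so abruptly that the result is unwatchable (low displacement per frame).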