A person seeking another person's attention is normally able to quickly assess how interruptible the other person currently is. Such assessments allow behavior that we consider natural, socially appropriate, or simply polite. This is in sharp contrast to current computer and communication systems, which are largely unaware of the social situations surrounding their usage and of the impact their actions have on those situations. If systems could model human interruptibility, they could use this information to negotiate interruptions at appropriate times, thus improving human-computer interaction. This article presents a series of studies that quantitatively demonstrate that simple sensors can support the construction of models that estimate human interruptibility as well as people do. These models can be constructed without complex sensors, such as vision-based techniques, and their use in everyday office environments is therefore both practical and affordable. Although currently based on a demographically limited sample, our results indicate a substantial opportunity for future research to validate these findings with larger groups of office workers. Our results also motivate the development of systems that use such models to negotiate interruptions at socially appropriate times.
This article considers tools to support remote gesture in video systems being used to complete collaborative physical tasks: tasks in which two or more individuals work together manipulating three-dimensional objects in the real world. We first discuss the process of conversational grounding during collaborative physical tasks, particularly the role of two types of gestures in the grounding process: pointing gestures, which are used to refer to task objects and locations, and representational gestures, which are used to represent the form of task objects and the nature of actions to be used with those objects. We then consider ways in which both pointing and representational gestures can be instantiated in systems for remote collaboration on physical tasks. We present the results of two studies that use a "surrogate" approach to remote gesture, in which images are intended to express the meaning of gestures through visible embodiments, rather than direct views of the hands. In Study 1, we compare performance with a cursor-based … (HUMAN-COMPUTER INTERACTION, 2004, Volume 19, pp. 273-309)
A person seeking someone else's attention is normally able to quickly assess how interruptible that person is. This assessment allows for behavior we perceive as natural, socially appropriate, or simply polite. By contrast, today's computer systems are almost entirely oblivious to the human world they operate in and typically have no way to take the interruptibility of the user into account. This paper presents a Wizard of Oz study exploring whether and how robust sensor-based predictions of interruptibility might be constructed, which sensors might be most useful to such predictions, and how simple such sensors might be. The study simulates a range of possible sensors through human coding of audio and video recordings. Experience sampling is used to simultaneously collect randomly distributed self-reports of interruptibility. Based on these simulated sensors, we construct statistical models predicting human interruptibility and compare their predictions with the collected self-report data. The results of these models, although covering a demographically limited sample, are very promising, with the overall accuracy of several models reaching about 78%. Additionally, a model tuned to avoiding unwanted interruptions does so for 90% of its predictions, while retaining 75% overall accuracy.
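The abstract describes building statistical models that map simulated sensor readings to self-reported interruptibility. As a rough, minimal sketch of that idea (not the paper's actual models or sensors), the following trains a tiny naive-Bayes classifier over hypothetical binary sensor features such as "speech detected" and predicts a binary interruptibility label; all sensor names and data here are illustrative assumptions.

```python
import math
from collections import defaultdict

def fit_naive_bayes(samples, labels):
    """Fit class priors and per-sensor conditional counts.

    samples: list of dicts, sensor name -> 0/1 (simulated sensor readings)
    labels:  list of 0/1 self-reports (0 = not interruptible, 1 = interruptible)
    """
    priors = defaultdict(int)                     # label -> count
    cond = defaultdict(lambda: defaultdict(int))  # (sensor, label) -> value -> count
    for x, y in zip(samples, labels):
        priors[y] += 1
        for name, value in x.items():
            cond[(name, y)][value] += 1
    return priors, cond, len(labels)

def predict(model, x):
    """Most probable label under the naive (independent-sensor) assumption."""
    priors, cond, n = model
    best, best_score = None, float("-inf")
    for y, count in priors.items():
        score = math.log(count / n)
        for name, value in x.items():
            # Laplace smoothing so an unseen sensor value never zeroes out a class
            score += math.log((cond[(name, y)][value] + 1) / (count + 2))
        if score > best_score:
            best, best_score = y, score
    return best
```

A model like this can then be tuned (e.g., by thresholding the class posterior rather than taking the argmax) to trade overall accuracy against avoiding unwanted interruptions, as the abstract's 90%/75% figures suggest.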
A user's focus of attention plays an important role in human-computer interaction applications, such as ubiquitous computing environments and intelligent spaces, where the user's goal and intent have to be continuously monitored. We are interested in modeling people's focus of attention in a meeting situation. We propose to model participants' focus of attention from multiple cues and have developed a system that estimates it from gaze directions and sound sources. We employ an omnidirectional camera to simultaneously track participants' faces around a meeting table and use neural networks to estimate their head poses. In addition, we use microphones to detect who is speaking. The system predicts participants' focus of attention from acoustic and visual information separately and then combines the outputs of the audio- and video-based predictors. We have evaluated the system using data from three recorded meetings. Combining the acoustic and visual information provided an 8% relative error reduction on average compared to using a single modality. The focus of attention model can be used as an index for a multimedia meeting record; it can also be used for analyzing a meeting.
We introduce the first visual dataset of fast foods, with a total of 4,545 still images, 606 stereo pairs, 303 360° videos for structure from motion, and 27 privacy-preserving videos of volunteers' eating events. This work was motivated by research on fast food recognition for dietary assessment. The data was collected by obtaining three instances of 101 foods from 11 popular fast food chains and capturing images and videos in both restaurant conditions and a controlled lab setting. We benchmark the dataset using two standard approaches, color histograms and bags of SIFT features, in conjunction with a discriminative classifier. Our dataset and the benchmarks are designed to stimulate research in this area and will be released freely to the research community.
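One of the two baselines mentioned is a color-histogram representation fed to a classifier. As a self-contained sketch of the representation (the benchmark itself used a discriminative classifier; the nearest-reference matcher below is a simplification for illustration), this quantizes RGB pixels into a normalized joint histogram and compares histograms by intersection:

```python
def color_histogram(pixels, bins=4):
    """Quantized joint RGB histogram, normalized to sum to 1.

    pixels: iterable of (r, g, b) tuples with 0-255 channel values.
    """
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    n = len(pixels)
    return [h / n for h in hist]

def intersection(h1, h2):
    """Histogram intersection similarity: higher means more similar."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def classify(query_hist, reference_hists):
    """Label of the most similar reference histogram (toy nearest-neighbor)."""
    return max(reference_hists, key=lambda label: intersection(query_hist, reference_hists[label]))
```

In the actual benchmark setting one would extract such histograms (and bag-of-SIFT features) per image and train a discriminative classifier over them instead of nearest-neighbor matching.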
In this paper, we present an approach to the automatic detection and recognition of signs in natural scenes, and its application to a sign translation task. The proposed approach embeds multiresolution and multiscale edge detection, adaptive searching, color analysis, and affine rectification in a hierarchical framework for sign detection, with different emphases at each phase to handle text of different sizes, orientations, color distributions, and backgrounds. We use affine rectification to recover deformation of the text regions caused by an inappropriate camera view angle; this procedure significantly improves the text detection rate and optical character recognition (OCR) accuracy. Instead of using binary information for OCR, we extract features directly from the intensity image. We propose a local intensity normalization method to handle lighting variations effectively, followed by a Gabor transform to obtain local features, and finally a linear discriminant analysis (LDA) method for feature selection. We have applied the approach in developing a Chinese sign translation system, which can automatically detect and recognize Chinese signs captured by a camera and translate the recognized text into English.
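The local intensity normalization step mentioned above compensates for lighting variations before feature extraction. A minimal pure-Python sketch of one such scheme, normalizing each pixel by the mean and standard deviation of its local neighborhood, is shown below; the window size and epsilon are illustrative choices, not the paper's parameters.

```python
def local_normalize(img, win=3, eps=1e-6):
    """Zero-mean, unit-variance normalization over a win x win neighborhood.

    img: 2-D list of grayscale intensities. Windows are clipped at borders.
    eps guards against division by zero in flat regions.
    """
    h, w = len(img), len(img[0])
    r = win // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            out[y][x] = (img[y][x] - mean) / (var ** 0.5 + eps)
    return out
```

After normalization, a bank of Gabor filters can be applied to the result to obtain lighting-insensitive local features for the LDA selection stage.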