CVPR 2011
DOI: 10.1109/cvpr.2011.5995586
A large-scale benchmark dataset for event recognition in surveillance video

Abstract: We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms, with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual [15,8]. Datasets have been developed for movies [11] and sports [12], but these actions and scene conditions do not apply effectively…

Cited by 583 publications (396 citation statements); references 18 publications.
“…Popular datasets for this task include the Pascal dataset (4), the LabelMe dataset (24), and the Lotus Hill dataset (25), all populated by relatively unconstrained natural images, but varying considerably in size and in the level of annotation, ranging from a few keywords to hierarchical representations (Lotus Hill). Finally, a few other datasets have been assembled and annotated to evaluate the quality of detected object attributes such as color, orientation, and activity; examples are the Core dataset (26), with annotated object parts and attributes, and the Virat dataset (27) for event detection in videos.…”
Section: Current Evaluation Practice
confidence: 99%
“…Four surveillance videos with a resolution of 1920 × 1080 and a length of one minute were used. They were taken from the VIRAT database, which was designed for performance assessment of activity detection algorithms [6]. Representative frames from the four scenarios are shown in Figure 3.…”
Section: Results
confidence: 99%
“…The baseline method used for comparison is very similar to the one used by the authors of [9]. This method is the usual BoW pipeline [10], where the STIPs are detected with 3-D Harris corners and HOG/HOF is used as the descriptor.…”
Section: B. Protocol
confidence: 99%
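The bag-of-words (BoW) baseline mentioned in the last excerpt quantizes local spatio-temporal descriptors (e.g. HOG/HOF vectors at detected STIPs) against a learned vocabulary and represents each clip as a word-count histogram. A minimal sketch of that quantization step, using plain k-means in NumPy (function names and parameters here are illustrative, not from the cited papers):

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    # Plain k-means over stacked local descriptors (rows = descriptors,
    # e.g. HOG/HOF vectors pooled from many training clips).
    # Returns k cluster centers, i.e. the "visual words".
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest center (Euclidean distance)
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    # Quantize one clip's descriptors against the vocabulary and
    # return an L1-normalized histogram of word occurrences.
    dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

In the full pipeline these per-clip histograms would then be fed to a classifier (commonly an SVM); real systems use a dedicated k-means implementation rather than this didactic loop.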