2013 IEEE International Conference on Robotics and Automation
DOI: 10.1109/icra.2013.6631291

Language for learning complex human-object interactions

Abstract: In this paper we use a Hierarchical Hidden Markov Model (HHMM) to represent and learn complex activities/tasks performed by humans or robots in everyday life. Action primitives are used as a grammar to represent complex human behaviour and to learn how humans and robots interact with different objects. The main contribution is the use of a probabilistic model capable of representing behaviours at multiple levels of abstraction to support the proposed hypothesis. The hierarchical nature of the…
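The abstract only sketches the model, so the following is a minimal illustrative sketch, not the authors' implementation: it builds a toy two-level hierarchy in which top-level activity states each own a sub-HMM over hypothetical action primitives, then flattens the hierarchy into an ordinary HMM over joint (activity, primitive) states so that standard HMM machinery applies. All names (pick_up, put_down, reach, grasp, move, release), the transition matrices, and the exit probability p_exit are invented for illustration.

```python
import numpy as np

# Hypothetical action primitives shared by all activities.
primitives = ["reach", "grasp", "move", "release"]

# Hypothetical per-activity sub-HMMs over the primitives (row-stochastic).
activities = {
    "pick_up":  np.array([[0.7, 0.3, 0.0, 0.0],
                          [0.0, 0.6, 0.4, 0.0],
                          [0.0, 0.0, 0.8, 0.2],
                          [0.2, 0.0, 0.0, 0.8]]),
    "put_down": np.array([[0.6, 0.0, 0.4, 0.0],
                          [0.3, 0.7, 0.0, 0.0],
                          [0.0, 0.2, 0.6, 0.2],
                          [0.0, 0.0, 0.3, 0.7]]),
}

# Top-level transition matrix between activities.
top = np.array([[0.9, 0.1],
                [0.2, 0.8]])

# Flatten the hierarchy: a joint state is (activity, primitive). With
# probability 1 - p_exit the current sub-HMM keeps running; with probability
# p_exit the top level switches activity (possibly back to the same one)
# and the new sub-HMM is entered with a uniform distribution.
p_exit = 0.05
names = list(activities)
n_act, n_prim = len(names), len(primitives)
A = np.zeros((n_act * n_prim, n_act * n_prim))
for i, a in enumerate(names):
    for j in range(n_act):
        block = p_exit * top[i, j] * np.full((n_prim, n_prim), 1.0 / n_prim)
        if i == j:
            block += (1.0 - p_exit) * activities[a]
        A[i * n_prim:(i + 1) * n_prim, j * n_prim:(j + 1) * n_prim] = block

# The flat matrix is still row-stochastic, so standard HMM inference
# applies directly to the 8 joint states.
assert np.allclose(A.sum(axis=1), 1.0)
print(A.shape)  # (8, 8)
```

The flattened matrix can then be fed to any standard HMM routine (forward-backward, Viterbi, Baum-Welch); the activity level of the hierarchy is recovered by reading the activity index back out of the joint state.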

Cited by 9 publications (5 citation statements). References 21 publications.
“…Language in Vision: The community has recently been incorporating natural language into computer vision, such as generating sentences from images [20,15,36], producing visual models from sentences [44,38], and aiding in contextual models [26,22]. In our work, we seek to mine language models trained on a massive text corpus to extract some knowledge that can assist computer vision systems.…”
Section: Related Work
confidence: 99%
“…were located, not the actual interactions between the user and objects. Grasping and object manipulation activities have also been studied with RGB-D cameras within the context of HHMM frameworks [22] in small 3D working envelopes. A framework based on an H-DBN was able to infer the user's mode of transportation and desired destination [23] in an urban setting, and guidance cues were proposed when it was felt the user was deviating from his normal activities.…”
Section: Related Work
confidence: 99%
“…Each image depicts the output of the hand-object tracking algorithm. For more details please refer to [77], i.e. the origin of this figure.…”
Section: 9
confidence: 99%
“…Here, we present two approaches that exploit our 3D hand tracking methods to perform higher level inference, which concerns the understanding of hand motion in the context of object manipulation. These approaches have been proposed by Song et al. [92] and Patel et al. [77].…”
Section: Higher Level Inference
confidence: 99%