Semantic Object Parsing with Graph LSTM

Liang, Xiaodan; Shen, Xiaohui; Feng, Jiashi; Li, Lin; Yan, Shuicheng

doi:10.1007/978-3-319-46448-0_8

Cited by 299 publications

(198 citation statements)

References 49 publications

Supporting

Mentioning

197

Contrasting

Order By: Relevance

“…Recurrent Neural Networks (RNN) have been widely used to address the sequential prediction problems in literatures, such as speech recognition [9], human action/activity recognition [34], [41], scene labeling [39], [45], image caption [19], and object segmentation [31]. RNN and its variants LSTM [34], GRNN [4], etc.…”

Section: Related Workmentioning

confidence: 99%

Early Action Prediction by Soft Regression

Zheng

et al. 2019

IEEE Trans. Pattern Anal. Mach. Intell.

View full text Add to dashboard Cite

Abstract-We propose a novel approach for predicting on-going action with the assistance of a low-cost depth camera. Our approach introduces a soft regression-based early prediction framework. In this framework, we estimate soft labels for the subsequences at different progress levels, jointly learned with an action predictor. Our formulation of soft regression framework 1) overcomes a usual assumption in existing early action prediction systems that the progress level of on-going sequence is given in the testing stage; and 2) presents a theoretical framework to better resolve the ambiguity and uncertainty of subsequences at early performing stage. The proposed soft regression framework is further enhanced in order to take the relationships among subsequences and the discrepancy of soft labels over different classes into consideration, so that a Multiple Soft labels Recurrent Neural Network (MSRNN) is finally developed. For real-time performance, we also introduce a new RGB-D feature called "local accumulative frame feature (LAFF)", which can be computed efficiently by constructing an integral feature map. Our experiments on three RGB-D benchmark datasets and an unconstrained RGB action set demonstrate that the proposed regression-based early action prediction model outperforms existing models significantly and also show that the early action prediction on RGB-D sequence is more accurate than that on RGB channel.

show abstract

Section: Related Workmentioning

confidence: 99%

Early Action Prediction by Soft Regression

Zheng

et al. 2019

IEEE Trans. Pattern Anal. Mach. Intell.

View full text Add to dashboard Cite

show abstract

“…Hierarchical/Graphical Models in Computer Vision: Hierarchical/graphical models are powerful for building structured representations, which can reflect task-specific relations and constraints. From early distributional semantic models, part-based models [16,17], MRF/CRF [31], And-Or grammar model [59], to deep structural networks [30,15], graph neural networks [20], trainable CRF [79], etc., hierarchical/graphical models have found applications in a wide variety of core computer vision tasks, such as object recognition [55], human parsing [40,41,81], pose estimation [34,66,61,68,35], visual dialog etc., to the extent that they are now ubiquitous in the field. Inspired by their general success, we leverage structural information to design our approach.…”

Section: Related Workmentioning

confidence: 99%

“…With the renaissance of connectionism in the computer vision community, recent research efforts take deep neural networks as their main building blocks [70,54,78,50,49,77,45]. More specifically, some efforts address the task as an active template regression problem [39], propagate semantic information from a retrieved, annotated image corpus [44], merge multi-level image context in a unified convolutional neural network [42], or use Graph LSTMs to model human configurations [40,41]. Some others leverage extra pose information to assist the task [72,22,71,14,54].…”

Section: Related Workmentioning

confidence: 99%

Learning Compositional Neural Information Fusion for Human Parsing

Wang

Zhang

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

116

View full text Add to dashboard Cite

This work proposes to combine neural networks with the compositional hierarchy of human bodies for efficient and complete human parsing. We formulate the approach as a neural information fusion framework. Our model assembles the information from three inference processes over the hierarchy: direct inference (directly predicting each part of a human body using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). The bottom-up and top-down inferences explicitly model the compositional and decompositional relations in human bodies, respectively. In addition, the fusion of multi-source information is conditioned on the inputs, i.e., by estimating and considering the confidence of the sources. The whole model is end-to-end differentiable, explicitly modeling information flows and structures. Our approach is extensively evaluated on four popular datasets, outperforming the state-of-the-arts in all cases, with a fast processing speed of 23fps. Our code and results have been released to help ease future research in this direction. * Equal contribution. † Corresponding author: Yanwei Pang.

show abstract

“…Human-image parsing is a specific semantic object segmentation task for which various CNN-based methods have been proposed [15][16][17][18][19][20]. In particular, some CNN-based methods use training datasets with different domains.…”

Section: Related Workmentioning

confidence: 99%

Transferring pose and augmenting background for deep human-image parsing and its applications

et al. 2018

View full text Add to dashboard Cite

Parsing of human images is a fundamental task for determining semantic parts such as the face, arms, and legs, as well as a hat or a dress. Recent deep-learning-based methods have achieved significant improvements, but collecting training datasets with pixel-wise annotations is labor-intensive. In this paper, we propose two solutions to cope with limited datasets. Firstly, to handle various poses, we incorporate a pose estimation network into an end-to-end humanimage parsing network, in order to transfer common features across the domains. The pose estimation network can be trained using rich datasets and can feed valuable features to the human-image parsing network. Secondly, to handle complicated backgrounds, we increase the variation in image backgrounds automatically by replacing the original backgrounds of human images with others obtained from large-scale scenery image datasets. Individually, each solution is versatile and beneficial to human-image parsing, while their combination yields further improvement. We demonstrate the effectiveness of our approach through comparisons and various applications such as garment recoloring, garment texture transfer, and visualization for fashion analysis.

show abstract

Semantic Object Parsing with Graph LSTM

Cited by 299 publications

References 49 publications

Early Action Prediction by Soft Regression

Early Action Prediction by Soft Regression

Learning Compositional Neural Information Fusion for Human Parsing

Transferring pose and augmenting background for deep human-image parsing and its applications

Contact Info

Product

Resources

About