2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00691

Frame-to-Frame Aggregation of Active Regions in Web Videos for Weakly Supervised Semantic Segmentation

Abstract: When a deep neural network is trained on data with only image-level labeling, the regions activated in each image tend to identify only a small region of the target object. We propose a method of using videos automatically harvested from the web to identify a larger region of the target object by using temporal information, which is not present in the static image. The temporal variations in a video allow different regions of the target object to be activated. We obtain an activated region in each frame of a v…
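The abstract describes computing an activated region in every frame of a web video and combining those regions over time so that parts of the object activated at different moments all contribute. The snippet below is a minimal sketch of that general idea only; the ResNet-50 classifier, the element-wise-maximum aggregation rule, and the helper names are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: per-frame class activation maps (CAMs) aggregated over time.
# The max-based aggregation is an illustrative choice, not the paper's rule.
import torch
import torch.nn.functional as F
import torchvision.models as models

def frame_cams(frames: torch.Tensor, class_idx: int) -> torch.Tensor:
    """frames: (T, 3, H, W) video frames; returns (T, H, W) CAMs for one class."""
    backbone = models.resnet50(weights=None)
    backbone.eval()
    # Feature extractor up to the last conv stage (drop global pooling and fc).
    features = torch.nn.Sequential(*list(backbone.children())[:-2])
    fc_weight = backbone.fc.weight  # (num_classes, C)
    with torch.no_grad():
        feats = features(frames)                                  # (T, C, h, w)
        cams = torch.einsum("c,tchw->thw", fc_weight[class_idx], feats)
        cams = F.relu(cams)
        cams = F.interpolate(cams.unsqueeze(1), size=frames.shape[-2:],
                             mode="bilinear", align_corners=False).squeeze(1)
    # Normalize each frame's CAM to [0, 1].
    return cams / (cams.amax(dim=(1, 2), keepdim=True) + 1e-6)

def aggregate_over_time(cams: torch.Tensor) -> torch.Tensor:
    """Union of per-frame activations via an element-wise maximum."""
    return cams.amax(dim=0)  # (H, W)

# Usage: 8 frames of a 224x224 clip, hypothetical class index 12.
video = torch.rand(8, 3, 224, 224)
region = aggregate_over_time(frame_cams(video, class_idx=12))
```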

Cited by 47 publications (62 citation statements)
References 47 publications (125 reference statements)
“…Previous works on weakly-supervised semantic segmentation have used image-level annotations [17,20,26,27,42,50,52], points/clicks [4], scribbles [29,47,48,49], bounding box annotations [11,19,37,41,46,55] and adversarial training [3,21]. We take a closer look at some of these methods and categorize them based on the labels required and their methodology.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Our segmentation network architecture is similar to UPerNet [51], where the encoder backbone is ResNet-101 [16] and each decoder consists of 2 convolutional layers. We employ the ResNet-101 backbone to ensure a fair comparison with the three most recent SOTA works, SDI [23], Li et al. [28], and BCM [46], as well as 4 other recent methods [27,47,48,49] in Table 1. We have three decoders, one each for the y, α, and β branches.…”
Section: Implementation Details (citation type: mentioning)
confidence: 99%
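The quoted implementation details name a UPerNet-style layout with a ResNet-101 encoder and three lightweight decoders of two convolutional layers each, one per branch (y, α, β). The sketch below only mirrors that description; the hidden width, the output channel counts, and the meaning of the α and β outputs are assumptions, and the real UPerNet decoder with its feature-pyramid fusion is not reproduced here.

```python
# Sketch of the described layout: ResNet-101 encoder + three two-conv decoders.
# Channel widths and output dimensionalities are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoConvDecoder(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_ch, 1),
        )

    def forward(self, x):
        return self.net(x)

class ThreeBranchSegNet(nn.Module):
    def __init__(self, num_classes: int = 21):
        super().__init__()
        resnet = models.resnet101(weights=None)
        # Keep everything up to the final residual stage as the encoder.
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        feat_ch = 2048
        self.dec_y = TwoConvDecoder(feat_ch, num_classes)      # segmentation logits
        self.dec_alpha = TwoConvDecoder(feat_ch, num_classes)  # assumed auxiliary branch
        self.dec_beta = TwoConvDecoder(feat_ch, num_classes)   # assumed auxiliary branch

    def forward(self, x):
        f = self.encoder(x)
        up = lambda t: nn.functional.interpolate(
            t, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return up(self.dec_y(f)), up(self.dec_alpha(f)), up(self.dec_beta(f))

# Usage: one 512x512 image yields three full-resolution outputs.
model = ThreeBranchSegNet().eval()
y, alpha, beta = model(torch.rand(1, 3, 512, 512))
```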
“…Various weak annotations have been adopted in this research field, such as bounding boxes [17][18][19], scribbles [20], points [21], and image-level labels [5,11,13]. Moreover, some research studies, e.g., [22,23], improve performance with additional, unlabeled data. Usually, the data are obtained from the Internet and are therefore called web data.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Usually, the data are obtained from the Internet and are therefore called web data, so these are also called webly-supervised segmentation methods [22]. In this paper, we utilize image-level labels, which are very cheap to obtain and do not provide any localization information about the object in the image.…”
Section: Related Work (citation type: mentioning)
confidence: 99%