Rethinking BiSeNet For Real-time Semantic Segmentation

Fan, Mingyuan; Lai, Shupeng; Huang, Junshi; Wei, Xiaoming; Chai, Zhenhua; Luo, Junfeng; Wei, Xiaolin

doi:10.1109/cvpr46437.2021.00959

Cited by 350 publications

(182 citation statements)

References 16 publications

Supporting

Mentioning

173

Contrasting


“…To obtain more representative features, FCN-based models [60], Encoder-Decoder [3,81], Coarse-to-Fine [96], Predict-Refine [78,90], Vision Transformer [118] and so on are developed. Besides, many real-time models are designed [27,44,51,70,71,107,114] to balance the performance and the time costs. Other methods, such as weights regularization [37], dropout [86], dense supervision [49,77,102], and hybrid loss [61,78,116], focus on alleviating the over-fitting.…”

Section: Related Work (mentioning)

confidence: 99%

“…Competitors. To provide comprehensive evaluations, we compared our IS-Net with 16 popular networks designed for different segmentation tasks, including (i) popular medical image segmentation model, U-Net [81]; (ii) salient object detection models such as BASNet [78], GateNet [117], F³Net [99], GCPA [10] and U²-Net [77]; (iii) models designed for COD like SINet-V2 [24] and PFNet [66]; (iv) semantic segmentation models: PSPNet [115], DeepLab-V3+ [7] and HRNet [93]; (v) real-time semantic segmentation models: BiSeNetV1 [107], ICNet [114], MobileNet-V3-Large [43], STDC [28] and HyperSegM [70]. All models are re-trained using DIS-TR set (on Tesla V100 or RTX A6000) and the time costs in Tab. 2 are all tested on RTX A6000.…”

Section: Dis5k Benchmark (mentioning)

confidence: 99%

“…Many different deep architectures have been proposed to achieve better performance, such as FCN-based [60] feature aggregation models [9,42,62,93,99,110,111,117], Encoder-Decoder architectures [3,10,77,81], Coarse-to-Fine (or Predict-Refine) models [13,18,55,78,90,95,96], Vision Transformers [58,118], etc. Besides, many real-time models [27,44,51,70,71,107,114] are developed to balance the performance and time costs. To achieve highly accurate results in our DIS, the models are expected to capture fine details (and complicated structures) and large components of the diversified objects from large-size (e.g., 2K, 4K or even larger) images with affordable memory, computation and time costs.…”

Section: Existing Models (mentioning)

confidence: 99%


Highly Accurate Dichotomous Image Segmentation

Qin, Dai, Hu, et al. 2022. Preprint.



“…To test the effect of different backbones on the experimental results, we replace the backbone module with the previous grasp detection networks GGCNN2 [43], GRCNN [44] and semantic segmentation networks UNet [45], SegNet [46], DANet [47], DeepLabv3+ [48] and STDC [49]. The results are reported in Sec. V-B.…”

Section: A Network Architecture (mentioning)

confidence: 99%

On-Policy Pixel-Level Grasping Across the Gap Between Simulation and Reality

Wang, Chang, Liu, et al. 2022. Preprint.


Grasp detection in cluttered scenes is a very challenging task for robots. Generating synthetic grasping data is a popular way to train and test grasp methods, as in Dex-net and GraspNet; yet these methods generate training grasps on 3D synthetic object models but evaluate on images or point clouds with different distributions, which reduces performance on real scenes due to sparse grasp labels and covariate shift. To solve these problems, we propose a novel on-policy grasp detection method, which can train and test on the same distribution with dense pixel-level grasp labels generated on RGB-D images. A Parallel-Depth Grasp Generation (PDG-Generation) method is proposed to generate a parallel depth image through a new imaging model of projecting points in parallel; this method then generates multiple candidate grasps for each pixel and obtains robust grasps through flatness detection, a force-closure metric and collision detection. Then, a large comprehensive Pixel-Level Grasp Pose Dataset (PLGP-Dataset) is constructed and released; unlike previous datasets with off-policy data and sparse grasp samples, this dataset is the first pixel-level grasp dataset with an on-policy distribution, where grasps are generated based on depth images. Lastly, we build and test a series of pixel-level grasp detection networks with a data augmentation process for imbalanced training, which learn grasp poses in a decoupled manner on the input RGB-D images. Extensive experiments show that our on-policy grasp method can largely overcome the gap between simulation and reality, and achieves state-of-the-art performance. Code and data are provided at https://github.com/liuchunsense/PLGP-Dataset.


“…For instance, the PASCAL VOC segmentation dataset only contains about 2k images, while the BDD100K [114] focuses on road scenes. Numerous approaches have achieved impressive results on these restricted environments [13,14,31,100,131,125,66,61,65,116]. Significantly scale up the problem often results in research modality change, e.g., from PASCAL VOC [28] to ImageNet [84].…”

Section: Introduction (mentioning)

confidence: 99%

Large-scale Unsupervised Semantic Segmentation

Gao, Li, Yang, et al. 2021. Preprint.


Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two major challenges to allowing such an attractive learning modality for segmentation tasks: i) a large-scale benchmark for assessing algorithms is missing; ii) unsupervised shape representation learning is difficult. We propose a new problem of large-scale unsupervised semantic segmentation (LUSS) with a newly created benchmark dataset to track the research progress. Based on the ImageNet dataset, we propose the ImageNet-S dataset with 1.2 million training images and 40k high-quality semantic segmentation annotations for evaluation. Our benchmark has a high data diversity and a clear task objective. We also present a simple yet effective baseline method that works surprisingly well for LUSS. In addition, we benchmark related un/weakly supervised methods accordingly, identifying the challenges and possible directions of LUSS. The benchmark is available on https://github.com/UnsupervisedSemanticSegmentation.

