“…Currently, state-of-the-art methods handle image semantic segmentation as a dense prediction task and adopt fully convolutional networks to make predictions [26], [27]. To make pixel-level dense predictions, encoder-decoder structures [28], [29], [30], [31], [17], [32], [33], [34], [35], [36], [37], [38], [39] are widely used to reconstruct high-resolution prediction maps. Typically an encoder gradually downsamples the feature maps, aiming to acquire large field-of-view and capture the semantic object information.…”