“…Many different deep architectures have been proposed to achieve better performance, such as FCN-based [60] feature aggregation models [9,42,62,93,99,110,111,117], Encoder-Decoder architectures [3,10,77,81], Coarse-to-Fine (or Predict-Refine) models [13,18,55,78,90,95,96], Vision Transformers [58,118], etc. Besides, many real-time models [27,44,51,70,71,107,114] are developed to balance the performance and time costs. To achieve highly accurate results in our DIS, the models are expected to capture fine details (and complicated structures) and large components of the diversified objects from large-size (e.g., 2K, 4K or even larger) images with affordable memory, computation and time costs.…”