“…In recent years, many efforts [19], e.g., developing novel network architectures [20,21,22,23,24,25] and pipelines [26,27,28,29], publishing large-scale datasets [30,31], introducing multi-modal and multi-temporal data [32,33,34,35], have been deployed to address this task, and most of them treat it as a single-label classification problem. A common assumption shared by these researches is that an aerial image belongs to only one scene category, while in real-world scenarios, it is more often that there exist various scenes in a single image (cf.…”