Primary object segmentation plays an important role in understanding videos captured by unmanned aerial vehicles. In this paper, we propose a large-scale dataset of 500 aerial videos with manually annotated primary objects. To the best of our knowledge, it is the largest dataset to date for primary object segmentation in aerial videos. From this dataset, we find that most aerial videos contain large-scale scenes, small primary objects, and consistently varying scales and viewpoints. Inspired by these observations, we propose a hierarchical deep co-segmentation approach that repeatedly divides a video into two sub-videos formed by its odd and even frames, respectively. In this manner, the primary objects shared by the sub-videos can be co-segmented by training two-stream CNNs and finally refined within neighborhood reversible flows. Experimental results show that our approach substantially outperforms 17 state-of-the-art methods in segmenting primary objects in various types of aerial videos.

Recently, unmanned aerial vehicles (drones) have become very popular since they provide a new way to observe and explore the world. As a result, aerial videos generated by drones have been growing explosively. For these videos, one of the key tasks is to segment the primary objects, which can facilitate subsequent tasks such as event understanding, scene reconstruction, drone navigation, and visual tracking.

Hundreds of models have been proposed in the past decade to segment primary objects [15], and they can be roughly divided into two categories. The first category contains image-based models that focus on detecting salient (primary) objects in images. In this category, classic models [1-4] design rules to pop out salient targets and suppress distractors, while recent models [5-8] usually adopt the deep learning framework due to the availability of large-scale image datasets (e.g., the XPIE dataset [4]). The second category contains video-based models [16] that aim to segment a sequence of primary/foreground objects that consistently pop out throughout a video. Similar to the image-based category, classic video-based models design rules to segment primary objects by jointly considering per-frame accuracy and inter-frame consistency [9]. More recently, with the emergence of large-scale video datasets [17], several deep learning models [10, 11] have been proposed as well. In addition, video object co-segmentation approaches [12, 13] have been proposed to simultaneously segment a common category of objects across two or more videos.
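
To make the hierarchical odd/even division described above concrete, the sketch below recursively splits a video into interleaved sub-videos. This is an illustrative reconstruction, not the authors' implementation; the stopping length `min_len` is a hypothetical parameter, and the co-segmentation of sibling sub-videos by two-stream CNNs is indicated only by a comment.

```python
# Minimal sketch of the hierarchical odd/even video division described
# above. Illustrative only, not the authors' implementation; `min_len`
# is a hypothetical stopping parameter.

def hierarchical_split(frames, min_len=8):
    """Recursively divide a video (a list of frames) into sub-videos.

    Each split interleaves the video into its even- and odd-indexed
    frames, so sibling sub-videos cover the same scene and share the
    same primary objects, which makes them suitable for co-segmentation.
    """
    if len(frames) <= min_len:
        return [frames]
    even, odd = frames[0::2], frames[1::2]
    # Sibling sub-videos (even, odd) would be co-segmented against each
    # other, e.g., by the two-stream CNNs mentioned in the abstract.
    return hierarchical_split(even, min_len) + hierarchical_split(odd, min_len)


# Example: a 32-frame video is divided into 4 interleaved 8-frame sub-videos.
video = list(range(32))
sub_videos = hierarchical_split(video)
assert len(sub_videos) == 4
assert all(len(s) == 8 for s in sub_videos)
```

Because each split interleaves rather than bisects the timeline, every sub-video still spans the whole sequence, so the shared primary objects remain present in both halves at every level of the hierarchy.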