Multi-class detection in remote sensing images (RSIs) has garnered wide attention and introduced several service applications in many fields, including civil and military fields. However, several reasons make detection from aerial images very challenging and more difficult than nature scene images: Objects do not have a fixed size, often appear at very various scales and sometimes appear in dense groups, like vehicles and storage tanks, and have different surroundings or background areas. Furthermore, all of this makes the manual annotation of objects very complex and costly. The powerful effect of the feature extraction methods on object detection and the successes of deep convolutional neural networks (CNN) extract deep features more than traditional methods. This study introduced a novel network structure and designed a unique feature extraction which employs squeeze and excitation network (SENet) and residual network (ResNet) to obtain feature maps, named a shallow-deep feature extraction (SDFE), that improves the resolution and the localization at the same time. Furthermore, this novel model reduces the loss of dense groups and small objects, and provides higher and more stable detection accuracy which is not significantly affected by changing the value of the threshold of the intersection over union (IoU) and overcomes the difficulties of RSIs. Moreover, this study introduced strong evidence about the factors that affect the detection of RSIs. The proposed shallow-deep and multi-scale (SD-MS) method outperforms other approaches for the given ten classes of the NWPU VHR-10 dataset.