Deep-learning object detection methods designed for computer vision applications tend to under-perform when applied to remote sensing data. This is because, unlike in computer vision, training data in remote sensing are harder to collect, and targets can be very small, occupying only a few pixels in the entire image, while also exhibiting arbitrary perspective transformations. Detection performance can improve by fusing data from multiple remote sensing modalities, including RGB, IR, hyper-spectral, multi-spectral, synthetic aperture radar, and LiDAR, to name a few. In this work, we propose YOLOrs: a new convolutional neural network specifically designed for real-time object detection in multimodal remote sensing imagery. YOLOrs can detect objects at multiple scales, with smaller receptive fields to account for small targets, and can also predict target orientations. In addition, YOLOrs introduces a novel mid-level fusion architecture that renders it applicable to multimodal aerial imagery. Our experimental studies compare YOLOrs with contemporary alternatives and corroborate its merits.
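To illustrate the general idea behind mid-level fusion, the sketch below processes each modality (e.g., RGB and IR) through its own feature-extraction stream and concatenates the resulting feature maps along the channel dimension before a shared detection head. This is a minimal NumPy sketch of the concept only; the function names, layer choices, and shapes are illustrative assumptions, not the actual YOLOrs implementation.

```python
import numpy as np

def extract_features(x, weight):
    # Hypothetical per-modality stream: a 1x1 conv-like channel mix
    # followed by ReLU, standing in for the early backbone layers.
    # x: (H, W, C_in), weight: (C_in, C_out)
    return np.maximum(x @ weight, 0.0)

def midlevel_fusion(rgb, ir, w_rgb, w_ir):
    # Each modality is first processed independently; the mid-level
    # feature maps are then concatenated along the channel axis and
    # would feed a shared detection head (omitted here).
    f_rgb = extract_features(rgb, w_rgb)
    f_ir = extract_features(ir, w_ir)
    return np.concatenate([f_rgb, f_ir], axis=-1)

rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 8, 3))   # 3-channel RGB patch
ir = rng.standard_normal((8, 8, 1))    # 1-channel IR patch
fused = midlevel_fusion(rgb, ir,
                        rng.standard_normal((3, 16)),
                        rng.standard_normal((1, 16)))
print(fused.shape)  # (8, 8, 32)
```

Fusing at this intermediate depth lets each stream learn modality-specific low-level filters while the shared layers after concatenation exploit cross-modal correlations.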