Object recognition at different scales is a fundamental problem in computer vision, and small object recognition in particular has attracted increasing attention recently. However, because they operate on a single frame only, many recognizers perform unacceptably in practical application scenarios such as very low resolution, barely visible small targets, and extremely similar appearances. Motivated by the way humans deal with these challenging recognition scenarios, this paper introduces frame sequences and an attention mechanism to compensate for the missing information. Specifically, this paper proposes a spatio-temporal neural network (dubbed STNet) for small object recognition. STNet restores the regions of interest with a super-resolution module and focuses on the discriminative region with a spatio-temporal attention module. In addition, STNet applies a two-layer long short-term memory (LSTM) subnetwork to make full use of the inter-frame information. Furthermore, this paper presents a challenging air-target recognition dataset, ATSETC4, for evaluating how well each method identifies small targets. Our model outperforms many state-of-the-art models on ATSETC4, including MobileNetV2 and SENet. In particular, STNet surpasses VGG11 by 3.67% on average and reaches 87.50% and 82.50% at scales 28 and 14 on ATSETC4, respectively.