The control problem for quadrotor UAVs is challenging due to their complex nonlinear dynamics and time-varying disturbances. In this paper, a supplementary controller based on reinforcement learning (RL) is proposed to improve the control performance of quadrotor UAVs. The proposed RL method is built on an actor-critic structure and incorporates several established techniques, e.g., Q-learning, temporal-difference learning, and experience replay, which greatly improve the speed and stability of training. On one hand, the supplementary controller works together with the traditional controller online, which guarantees the stability of the system. On the other hand, model uncertainties and external disturbances can be suppressed through online RL training. Lyapunov theory is used to prove the convergence of the RL controller's weights. Finally, three simulations are provided to illustrate the effectiveness of the proposed controller.

INDEX TERMS Quadrotor UAVs, reinforcement learning, ADP, control system.

I. INTRODUCTION

Recently, quadrotor UAVs have attracted considerable attention in many areas [1]-[4], especially in logistics and agriculture. One of their advantages is low cost and a low failure rate, owing to their simple mechanical structure. Another is that their Vertical Take-Off and Landing (VTOL) capability, stable hovering, and high maneuverability extend their range of applications. However, high-performance control of quadrotor UAVs is challenging because their dynamics are Multi-Input Multi-Output (MIMO), strongly coupled, nonlinear, and underactuated, and they suffer from model uncertainties and unknown external disturbances. Many linear and nonlinear methods have been designed for quadrotor UAVs with coupled, nonlinear dynamics. Linear methods are convenient to apply, e.g.
PID and Linear Quadratic Regulator (LQR). But linear methods simplify away many nonlinear characteristics and can only guarantee convergence near the equilibrium [5], [6]. In order to obtain a larger convergence region and higher performance, several nonlinear methods were developed, like Nonlinear

The associate editor coordinating the review of this manuscript and approving it for publication was Okyay Kaynak.
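The abstract's scheme, in which a learned supplementary term is added to a traditional controller and trained online with temporal-difference updates and experience replay, can be sketched on a toy plant. This is only an illustrative sketch, not the paper's controller: the double-integrator plant, PD baseline, quadratic features, gains, and the CACLA-style actor update are all assumptions introduced here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, u, dt=0.02):
    """Toy plant (assumption): a double integrator with an unknown
    constant disturbance of 0.5 added to the commanded acceleration."""
    pos, vel = x
    acc = u + 0.5
    return np.array([pos + vel * dt, vel + acc * dt])

# Traditional controller: a fixed PD law (gains are illustrative).
kp, kd = 8.0, 4.0
def u_pd(x):
    return -kp * x[0] - kd * x[1]

# Linear critic and actor over hand-picked quadratic features (assumption).
def phi(x):
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2, 1.0])

Wc = np.zeros(6)   # critic weights: value estimate V(x) ~ Wc @ phi(x)
Wa = np.zeros(6)   # actor weights: supplementary action ~ Wa @ phi(x)
gamma, alpha_c, alpha_a = 0.95, 0.05, 0.01
replay = []        # experience-replay buffer of (x, action, reward, x_next)

x = np.array([1.0, 0.0])
for t in range(3000):
    # Supplementary action: actor output + exploration noise, clipped so
    # the learning term cannot overpower the stabilizing PD baseline.
    u_sup = float(np.clip(Wa @ phi(x) + 0.1 * rng.normal(), -1.0, 1.0))
    x_next = step(x, u_pd(x) + u_sup)          # combined control input
    r = -(x_next[0] ** 2 + 0.1 * x_next[1] ** 2)  # negative quadratic cost
    replay.append((x, u_sup, r, x_next))
    # Minibatch of transitions sampled from replay; TD(0) critic update,
    # CACLA-style actor update (move toward actions with positive TD error).
    for i in rng.integers(0, len(replay), size=8):
        s, a, rew, s2 = replay[i]
        delta = rew + gamma * (Wc @ phi(s2)) - (Wc @ phi(s))
        Wc += alpha_c * delta * phi(s)
        if delta > 0:
            Wa += alpha_a * (a - Wa @ phi(s)) * phi(s)
    x = x_next

# The PD baseline keeps the loop stable throughout; training only shapes
# the bounded supplementary term, so the state stays regulated.
print(abs(x[0]) < 0.5)
```

The clipping of the supplementary action mirrors the abstract's design intent: the traditional controller guarantees stability while the RL term, learned online, works against the disturbance the baseline cannot model.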