In this paper, a novel Q-learning-based approach is proposed for estimating the parameters of synchronous generators using PMU measurements. Event playback is used to generate model outputs under different parameter values for training the Q-learning agent. We assume that the exact values of some model parameters are unknown to the agent. An optimal history-dependent policy is then planned to balance the exploration-exploitation trade-off. Given prior knowledge, the parameter vector can be viewed as a state with a specific reward, defined as a function of the fitting error between the model output and the measurements. The agent takes an action (either increasing or decreasing a parameter), moving the estimated parameter to a new state. Based on the reward function, the optimal action policy drives the parameter set toward the state with the highest reward. If multiple events are available, they are used sequentially, so that the updated Q-values can be reused to improve computational efficiency. The effectiveness of the proposed approach is validated by estimating the parameters of the dynamic model of a synchronous generator.
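To make the state-action-reward formulation above concrete, the following is a minimal tabular Q-learning sketch for a single scalar parameter. Everything here is an illustrative assumption rather than the paper's actual setup: the parameter grid, the `TRUE_PARAM` value, the quadratic stand-in for the event-playback fitting error, and the learning hyperparameters are all hypothetical. The state is the discretized parameter value, the actions decrease, keep, or increase it, and the reward is the negative fitting error.

```python
import random

# Hypothetical toy setup (illustrative only, not the paper's model):
# one generator parameter is discretized into a grid of candidate states,
# and "event playback" is mimicked by a simple quadratic fitting error.
TRUE_PARAM = 1.8                                      # unknown ground truth
GRID = [round(1.0 + 0.1 * i, 1) for i in range(21)]  # candidate values 1.0 .. 3.0
ACTIONS = (-1, 0, +1)                                 # decrease / keep / increase

def reward(state_idx):
    """Reward = negative fitting error between model output and 'measurement'."""
    return -(GRID[state_idx] - TRUE_PARAM) ** 2

def q_learning(episodes=200, steps=30, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(len(GRID)) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randrange(len(GRID))                  # random initial guess
        for _ in range(steps):
            # epsilon-greedy exploration-exploitation trade-off
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            s_next = min(max(s + a, 0), len(GRID) - 1)  # clip to the grid
            r = reward(s_next)
            # standard tabular Q-value update
            best_next = max(Q[(s_next, act)] for act in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    # greedy rollout to read off the estimated parameter
    s = 0
    for _ in range(len(GRID)):
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s = min(max(s + a, 0), len(GRID) - 1)
    return GRID[s]

print(q_learning())  # should settle near TRUE_PARAM = 1.8
```

In this sketch the "keep" action makes the optimal parameter a fixed point of the greedy policy; in the paper's multi-parameter, multi-event setting, each event would supply its own fitting-error reward and the learned Q-values would carry over between events.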