Most modern multi-object tracking (MOT) systems for video follow the tracking-by-detection paradigm: objects of interest are first located in each frame and then associated across frames to form complete trajectories. In this setting, the appearance features of objects usually provide the most important cues for data association, but they are highly susceptible to occlusions, illumination variations, and inaccurate detections, which easily leads to incorrect trajectories. To address this issue, we propose to make full use of neighboring information. Our motivation derives from the observation that people tend to move in groups: when an individual target's appearance changes markedly, an observer can still identify it from the context of its neighbors. To model this contextual information, we first exploit the spatiotemporal relations among trajectories to efficiently select suitable neighbors for each target. We then construct a neighbor graph over each target and its selected neighbors, and employ graph convolutional networks (GCNs) to model their relations and learn graph features. To the best of our knowledge, this is the first work to explicitly leverage neighbor cues via GCNs in MOT. Finally, standardized evaluations on the MOT16 and MOT17 datasets demonstrate that our approach substantially reduces identity switches while achieving state-of-the-art overall performance.
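To make the neighbor-graph step concrete, the following is a minimal sketch in PyTorch of how a single GCN layer could aggregate a target's appearance feature with those of its selected neighbors over a star graph. The star topology, single-layer design, and the `NeighborGCN` name are illustrative assumptions for exposition, not the paper's exact architecture.

```python
# Minimal sketch of a neighbor-graph GCN, assuming PyTorch.
# Star topology and single layer are illustrative, not the authors' design.
import torch
import torch.nn as nn


class NeighborGCN(nn.Module):
    """One GCN layer over a star graph: target node 0 linked to K neighbors."""

    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(feat_dim, out_dim, bias=False)

    def forward(self, target_feat: torch.Tensor,
                neighbor_feats: torch.Tensor) -> torch.Tensor:
        # Stack target + neighbors into node feature matrix H: (K+1, feat_dim)
        h = torch.cat([target_feat.unsqueeze(0), neighbor_feats], dim=0)
        n = h.size(0)
        # Star-graph adjacency with self-loops: node 0 <-> every neighbor
        adj = torch.eye(n, device=h.device)
        adj[0, 1:] = 1.0
        adj[1:, 0] = 1.0
        # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.pow(-0.5)
        adj_norm = adj * d_inv_sqrt.unsqueeze(0) * d_inv_sqrt.unsqueeze(1)
        # Graph convolution + ReLU; return the target node's
        # context-enhanced embedding for use in data association.
        h_out = torch.relu(adj_norm @ self.weight(h))
        return h_out[0]


# Usage: enhance a target's 128-d appearance feature with 5 neighbors.
gcn = NeighborGCN(feat_dim=128, out_dim=128)
target = torch.randn(128)
neighbors = torch.randn(5, 128)
enhanced = gcn(target, neighbors)  # shape: (128,)
```

Returning only the target node's output reflects the intent that the graph exists to contextualize a single target; pooling over all nodes would be an equally plausible readout choice.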