In the real world, complex dynamic scenes often arise from the composition of simpler parts. The visual system exploits this structure by hierarchically decomposing dynamic scenes: when we see a person walking on a train or an animal running in a herd, we recognize the individual's movement as nested within a reference frame that is itself moving. Despite its ubiquity, surprisingly little is understood about the computations underlying hierarchical motion perception. To address this gap, we develop a novel class of stimuli that grant tight control over statistical relations among object velocities in dynamic scenes. We first demonstrate that structured motion stimuli benefit human multiple object tracking performance. This performance advantage is well-explained by a Bayesian observer model that exploits hierarchical structure. A second experiment, using a motion prediction task, reinforced this conclusion and provided fine-grained information about how the visual system flexibly exploits motion structure.The visual scenes our brains perceive in everyday life are filled with complex dynamics. Information hitting the retina changes not only with every motion in the scene, but also with every head movement and saccade. Noise, occlusions, and ambiguities further make visual information inherently unreliable. In order to maintain stable, coherent percepts in the face of complex and unreliable inputs, our brains exploit the spatially and temporally structured nature of the environment [1].Motion structure refers to statistical relations of velocities. One form of structure common in natural scenes is motion grouping: when a rigid object is set in motion, all of its visual features move coherently. This structure allows us to infer the existence of objects based on the coherent motion of features-the Gestalt grouping cue known as common fate [2]. Grouping based on common fate has been shown to influence our ability to track objects [3][4][5][6], to search displays [7], and to store information in short-term memory [8].However, the strict definition of common fate is too brittle to accommodate natural scenes in which visual features do not move together rigidly and yet are still grouped together. In some cases this is because we perceive objects as deforming non-rigidly. In other cases, we perceive objects that are hierarchically structured [9][10][11]: the parts of an object move rigidly relative to a reference frame (the object), which itself could be a rigidly moving part of another object, and so on. For example, we perceive the motion of hands relative to the motion of arms, and the motion of arms relative to the motion of the torso (Fig. 1A,B). The entire body may be moving relative to another reference frame (a train or escalator). The perception of hierarchically organized motion suggests a powerful "divide-and-conquer" strategy for parsing complex dynamic scenes.Recent work has formalized the representation and discovery of hierarchical motion structures [10]. However, we still know relatively little about if a...