Aesthetic crowd formation transformation is considered to have high artistic value and is widely used in large-scale performances. In this paper, we propose a spatio-temporal hierarchical model that divides a crowd formation transformation into multiple levels of granularity. Its core idea is to apply spatio-temporal constraints specified by directors to the transformation process after this multi-level division. Within the model, average hash values and energy optimization are used to arrange crowd formations reasonably, while constrained region growth and the Kuhn-Munkres algorithm produce smooth, collision-free formation transformations. We also propose a framework that generates visually pleasing crowd formation transformation performances from these constraints. In addition, we built a virtual crowd formation transformation simulation to verify the effectiveness of the proposed model. Simulation experiments and comparisons show that the hierarchical model can generate aesthetic crowd formation transformations with a satisfactory transformation process.
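To make the role of the Kuhn-Munkres algorithm concrete, the following is a minimal sketch of how performers could be matched to target positions so that total travel cost is minimized. It is not the paper's implementation: the squared Euclidean cost, the function names `kuhn_munkres` and `assign_agents`, and the sample coordinates are all assumptions chosen for illustration, and the spatio-temporal constraints and region growth steps are omitted.

```python
def kuhn_munkres(cost):
    """Minimum-cost assignment of rows to columns (Kuhn-Munkres / Hungarian).

    cost is a list of n rows with m columns each, n <= m.
    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need at least as many columns as rows")
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = none)
    way = [0] * (m + 1)   # predecessor column on the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            # Find the unused column with the smallest reduced cost.
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Update potentials so that a new zero-reduced-cost edge appears.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break

    assignment = [0] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def assign_agents(agents, targets):
    """Match each agent to a target, minimizing total squared distance.

    The squared Euclidean cost is an assumption made for this sketch.
    """
    cost = [
        [(ax - tx) ** 2 + (ay - ty) ** 2 for (tx, ty) in targets]
        for (ax, ay) in agents
    ]
    return kuhn_munkres(cost)


# Hypothetical example: two performers swap to the nearer target.
agents = [(0, 0), (10, 0)]
targets = [(10, 1), (0, 1)]
print(assign_agents(agents, targets))  # [1, 0]
```

An optimal assignment of this kind keeps total movement low, which tends to reduce path crossings, but it does not by itself guarantee collision-free motion; the model described above pairs it with constrained region growth for that purpose.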