Action quality assessment (AQA) automatically evaluates how well a person performs an action in a given video and is widely used in fields such as rehabilitation medicine, sports competition, and specialized skill assessment. However, existing methods that uniformly divide the video sequence into short clips of equal length suffer from intra-clip confusion and inter-clip incoherence, hindering the further development of AQA. To address this issue, we propose a hierarchical graph convolutional network (GCN). First, semantic information confusion is corrected through clip refinement, generating 'shots' as the basic action units. We then construct a scene graph by combining several consecutive shots into meaningful scenes that capture local dynamics; these scenes can be viewed as different procedures of the action, providing valuable assessment cues. Finally, the video-level representation is extracted via sequential action aggregation over the scenes and used to regress the predicted score distribution, enhancing discriminative features and improving assessment performance. Experiments on the AQA-7, MTL-AQA, and JIGSAWS datasets demonstrate the superiority of the proposed hierarchical GCN over state-of-the-art methods.
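To make the described clip-to-shot-to-scene hierarchy concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module names (`SceneGCN`, `HierarchicalAQA`), the feature dimension, the number of score bins, and the simplification of sequential aggregation to mean pooling are all illustrative assumptions.

```python
# Hypothetical sketch of the hierarchical GCN pipeline described above,
# assuming pre-extracted clip features. All names and hyperparameters are
# illustrative, not the paper's actual architecture.
import torch
import torch.nn as nn


class SceneGCN(nn.Module):
    """One graph-convolution layer over shots linked into scenes."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (num_shots, dim); adj: (num_shots, num_shots) scene adjacency
        h = adj @ self.proj(x)  # aggregate features from neighboring shots
        return torch.relu(h)


class HierarchicalAQA(nn.Module):
    def __init__(self, dim=256, num_bins=101):
        super().__init__()
        self.refine = nn.Linear(dim, dim)    # clip refinement -> shots
        self.gcn = SceneGCN(dim)             # scene graph over shots
        self.head = nn.Linear(dim, num_bins) # score-distribution regressor

    def forward(self, clips, adj):
        shots = torch.relu(self.refine(clips))  # correct intra-clip confusion
        scenes = self.gcn(shots, adj)           # local dynamics within scenes
        # Sequential action aggregation, simplified here to mean pooling
        video = scenes.mean(dim=0)
        return torch.softmax(self.head(video), dim=-1)


# Usage: 16 clip features of dim 256; a chain adjacency links consecutive shots
clips = torch.randn(16, 256)
adj = torch.eye(16) + torch.diag(torch.ones(15), 1) + torch.diag(torch.ones(15), -1)
adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalize the adjacency
model = HierarchicalAQA()
dist = model(clips, adj)
print(dist.shape)  # torch.Size([101]) -- predicted score distribution
```

The chain adjacency here encodes only consecutive-shot links; the paper's scene graph presumably groups shots into richer scene-level structures.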