Human understanding is grounded in prediction, which relies on concepts formed through the categorization of experience. To reproduce this mechanism in robots, multimodal categorization, which enables a robot to form concepts, has been studied. In parallel, the segmentation and categorization of human motion have been studied to recognize and predict future motions. This paper addresses the question of how these different kinds of concepts can be integrated to form higher-level concepts and, more importantly, how the higher-level concepts in turn affect the formation of each lower-level concept. To this end, we propose the multi-layered multimodal latent Dirichlet allocation (mMLDA), an extension of MLDA that learns and represents the hierarchical structure of concepts. We also examine a simple integration model and compare it with the mMLDA. The experimental results show that the mMLDA achieves better inference performance and indeed forms higher-level concepts that integrate motions and objects, which is necessary for real-world understanding.