This project aims to align facial and vocal characteristics in a shared latent space by constructing multi-modal generative adversarial networks (GANs). We propose a visually grounded multi-modal approach that uses the Graph Cut algorithm to align feature components with the image features of their corresponding local contexts, making the model adaptive to multi-modal information. To improve both the speed and the accuracy of modeling, a regional attention strategy is integrated. Experimental results show that the proposed algorithm improves accuracy on image recognition tasks.
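To make the shared-space idea concrete, the sketch below projects face and voice features into a common embedding space and scores their alignment with a cosine-based loss. All specifics here are illustrative assumptions, not the paper's architecture: the dimensions (512-d face, 128-d voice, 64-d shared), the linear encoders, and the loss are stand-ins for the learned GAN components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (assumptions, not from the paper).
FACE_DIM, VOICE_DIM, SHARED_DIM = 512, 128, 64

# Linear "encoders" standing in for the learned modality networks.
W_face = rng.standard_normal((FACE_DIM, SHARED_DIM)) * 0.01
W_voice = rng.standard_normal((VOICE_DIM, SHARED_DIM)) * 0.01

def embed(x, W):
    """Project modality features into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment_loss(z_face, z_voice):
    """Mean (1 - cosine similarity) over paired face/voice embeddings.

    Lower values mean the two modalities of each sample point in more
    similar directions in the shared space.
    """
    return float(np.mean(1.0 - np.sum(z_face * z_voice, axis=1)))

# Toy batch of 8 paired samples (random stand-ins for extracted features).
face = rng.standard_normal((8, FACE_DIM))
voice = rng.standard_normal((8, VOICE_DIM))

z_f = embed(face, W_face)
z_v = embed(voice, W_voice)
loss = alignment_loss(z_f, z_v)
```

In a full system, the encoders would be trained adversarially so that embeddings of paired face/voice samples become indistinguishable to a discriminator, driving a loss of this kind toward zero.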