Multimodal sentiment analysis extends traditional text-based sentiment analysis techniques. However, the field still faces challenges such as inconsistent feature information across modalities, weak cross-modal interaction, and insufficient feature fusion. To address these issues, this paper proposes a cross-modal sentiment model based on CLIP image-text attention interaction. The model uses pre-trained ResNet50 and RoBERTa to extract primary image and text features. After contrastive learning with the CLIP model, it applies a multi-head attention mechanism for cross-modal feature interaction, which strengthens the exchange of information between modalities. A cross-modal gating module then fuses the feature networks, combining features at different levels while controlling their weights. The fused output is fed into a fully connected layer for sentiment recognition. Comparative experiments are conducted on the publicly available datasets MVSA-Single and MVSA-Multiple. On these datasets, the model achieves accuracies of 75.38% and 73.95% and F1-scores of 75.21% and 73.83%, respectively. These results indicate that the proposed approach generalizes better and is more robust than existing sentiment analysis models.
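The interaction and fusion stages described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the random matrices stand in for learned weights, the random inputs stand in for RoBERTa token features and ResNet50 region features, and the feature width (64), the number of heads (8), the mean pooling, the specific gate formula, and the three output classes are all assumptions made for illustration. The CLIP contrastive learning step is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(query, context, num_heads, rng):
    """Let `query` (n_q, d) attend over `context` (n_c, d).

    Random projections are hypothetical stand-ins for trained weights.
    Returns the attended features (n_q, d) and the attention weights
    with shape (num_heads, n_q, n_c).
    """
    n_q, d = query.shape
    assert d % num_heads == 0, "feature width must be divisible by num_heads"
    d_head = d // num_heads
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    def split_heads(x):
        # (n, d) -> (num_heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(query @ w_q)
    k = split_heads(context @ w_k)
    v = split_heads(context @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, n_q, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n_q, d)
    return merged @ w_o, weights


def gated_fusion(text_vec, image_vec, rng):
    """Blend two pooled vectors with a sigmoid gate (one assumed gating form)."""
    d = text_vec.shape[0]
    w_g = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    logits = np.concatenate([text_vec, image_vec]) @ w_g
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * text_vec + (1.0 - gate) * image_vec, gate


rng = np.random.default_rng(0)
text_tokens = rng.standard_normal((12, 64))    # placeholder text features
image_regions = rng.standard_normal((49, 64))  # placeholder image features

# Each modality queries the other one.
text_attended, attn_t2i = multi_head_cross_attention(text_tokens, image_regions, 8, rng)
image_attended, attn_i2t = multi_head_cross_attention(image_regions, text_tokens, 8, rng)

# Mean-pool each sequence, fuse with the gate, then classify.
fused, gate = gated_fusion(text_attended.mean(axis=0), image_attended.mean(axis=0), rng)
w_cls = rng.standard_normal((64, 3)) / 8.0     # assumed 3 sentiment classes
probs = softmax(fused @ w_cls)
```

In this sketch the gate produces a per-dimension weight between 0 and 1, so each fused feature is a convex combination of its text-side and image-side values. That is one simple way to read "controlling feature weights"; the paper's gating module may differ.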