Depression is a mental illness that affects many people worldwide, and a thorough diagnosis must take into account the patient's medical history, present symptoms, and pertinent examination findings. In recent years, researchers have increasingly turned to machine learning algorithms to build depression diagnosis and prediction models. However, achieving high prediction accuracy remains challenging when a model relies on a single modality. This paper proposes a multi-modal feature fusion model for predicting depression tendency, based on a bidirectional LSTM (BiLSTM) and a Vision Transformer (ViT). The model achieves an accuracy of 70.00%, surpassing single-modal depression tendency prediction models and offering a novel approach to depression tendency prediction.
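To make the idea of feature fusion concrete, the following is a minimal sketch of one possible fusion step, not the model's actual implementation. It assumes that the bi-directional LSTM branch and the ViT branch each produce a fixed-length feature vector, that the two vectors are fused by simple concatenation, and that a logistic head maps the fused vector to a depression-tendency probability. The fusion rule, the function names `fuse_features` and `predict_tendency`, and all numeric values are hypothetical placeholders.

```python
import math


def fuse_features(seq_features, img_features):
    """Fuse two modality feature vectors by concatenation.

    Concatenation is an assumed fusion rule chosen for illustration;
    the source does not specify how the two branches are combined.
    """
    return list(seq_features) + list(img_features)


def predict_tendency(fused, weights, bias):
    """Apply a logistic head to the fused vector; returns a value in (0, 1)."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder vectors standing in for the two encoder outputs
# (hypothetical values, not taken from any experiment).
seq_features = [0.2, -0.1, 0.4]   # e.g. output of the sequence branch
img_features = [0.3, 0.5]         # e.g. output of the image branch

fused = fuse_features(seq_features, img_features)

# Untrained head: all-zero weights, used only to show the data flow.
weights = [0.0] * len(fused)
prob = predict_tendency(fused, weights, bias=0.0)
```

In a real system the two feature vectors would come from trained encoders and the head's weights would be learned jointly with them; the sketch only shows how features from two modalities can feed a single classifier.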