Speech Emotion Recognition (SER) is a fast-developing area of study whose primary goal is to automatically identify and analyze the emotional states expressed in speech. Emotions are crucial in human communication, as they shape the effectiveness and meaning of linguistic expressions. SER aims to create computational approaches and models that detect and interpret emotions from speech signals. One of its primary applications is in Human-Computer Interaction (HCI), where it can be used to build interactive systems that adapt to a user's emotional state based on their voice. This paper investigates the use of speech data for speech emotion recognition. In addition, we apply a transformation that converts the speech signals into 2D images and compare the results obtained from these images with those obtained from the original speech data, using two labeled speech datasets, one in Arabic and one in English. Our experiments compare three methods: a transformer-based model, a Vision Transformer (ViT) based model, and a wav2vec-based model. The transformer model is trained from scratch on two audio datasets, the Arabic Natural Audio Dataset (ANAD) and the Toronto Emotional Speech Set (TESS), while the ViT and wav2vec models are evaluated in a transfer-learning setting. The transformer model achieved accuracies of 94% and 99% on the ANAD and TESS datasets, respectively. ViT also performed strongly, reaching accuracies of 88% and 98% on ANAD and TESS, respectively. To assess the transfer-learning potential further, we also fine-tuned the wav2vec model; however, the results suggest limited success, with an accuracy of only 56% on the ANAD dataset.
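As a concrete illustration of the speech-to-image transformation mentioned above, the sketch below converts a waveform into a mel-spectrogram image with librosa. The abstract does not specify the exact transform, so the choice of a mel-spectrogram, the sampling rate, the number of mel bands, and the file names here are all assumptions rather than the paper's exact pipeline.

```python
# Sketch: convert a speech clip into a 2D mel-spectrogram image.
# Assumptions: mel-spectrogram as the 2D representation, 16 kHz audio,
# 128 mel bands, and the file name "clip.wav" are all illustrative.
import librosa
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)        # load and resample the waveform
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)     # log scale for better contrast

# Save the spectrogram as an image that an image model such as ViT can consume.
plt.imsave("clip_mel.png", mel_db, origin="lower", cmap="magma")
```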
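Similarly, the following is a minimal sketch of the wav2vec fine-tuning setup, assuming the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, the number of emotion labels, and the single dummy training step are illustrative assumptions, not the paper's configuration.

```python
# Sketch: fine-tune wav2vec 2.0 for speech emotion classification.
# Assumptions: the "facebook/wav2vec2-base" checkpoint and the five
# emotion labels are illustrative placeholders, not the paper's setup.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=5
)

# One training step on a dummy 1-second clip; a real run loops over batches.
waveform = torch.randn(16000)                     # placeholder 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([2])                        # placeholder emotion label
outputs = model(**inputs, labels=labels)
outputs.loss.backward()                           # gradients for fine-tuning
```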