As globalization intensifies, people from different regions communicate ever more closely. Demand for second-language learning is therefore growing steadily, which is accelerating the development of both English oral evaluation and online education. This study proposes a text-prior-based oral evaluation model built on the Transformer architecture, in which the target phonemes serve as input to the Decoder. The model successfully predicts the relationship between the actual pronunciation and the error labels. In addition, a self-supervised oral evaluation model that accounts for accent is constructed; it simulates training on misread data by calculating semantic distance. The experimental results show that, at the largest training-set ratio, the proposed method achieves F1 values of 0.612 on the Speed Ocean dataset and 0.596 on the L2 Arctic dataset, and that the length of the target phoneme sequence affects this model less than it affects the other models compared. These experiments indicate that the proposed deep learning method can alleviate deployment difficulties, directly improve the effectiveness of oral evaluation, provide more accurate feedback, and give users a better learning experience. The work therefore has practical significance for the development of the field of oral evaluation.
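The F1 values reported above combine precision and recall over error labels. As a minimal sketch of how such a score can be computed, the following assumes binary per-phoneme labels (1 for a mispronounced phoneme, 0 for a correct one); this labeling scheme, the function name `f1_score`, and the example labels are illustrative assumptions and are not taken from the study.

```python
from typing import Sequence


def f1_score(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 over binary per-phoneme error labels (assumed: 1 = mispronounced, 0 = correct)."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    # Count true positives, false positives, and false negatives for the error class.
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    if tp == 0:
        # No correctly detected errors: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for a six-phoneme utterance (illustrative, not data from the study).
reference = [0, 1, 0, 0, 1, 1]
predicted = [0, 1, 1, 0, 0, 1]
score = f1_score(reference, predicted)
```

In this example two of the three true errors are detected and one correct phoneme is wrongly flagged, so precision and recall are both 2/3 and the F1 value is 2/3.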