Mainstream deep-learning algorithms for spoken language evaluation rely on speech recognition to detect mispronunciations. For speech recognition, the Deep Neural Network-Hidden Markov Model (DNN-HMM) outperforms the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM). This paper analyzes the structure of speech recognition systems and the current mainstream algorithms for evaluating spoken English, and concludes that these algorithms are largely not end-to-end and score poorly on evaluation metrics. The paper therefore designs an end-to-end speech evaluation model based on deep learning and tests it on the TIMIT speech dataset to verify and analyze its advantages in temporal modeling. The results show that the proposed speech recognition model performs better in spoken language evaluation.