This report presents the results of a research and development effort for SpeechRater℠ Version 1.0 (v1.0), an automated scoring system for the spontaneous speech of English language learners used operationally in the Test of English as a Foreign Language™ (TOEFL®) Practice Online assessment (TPO), which is used by prospective test takers to prepare for the official TOEFL iBT test. The report includes a summary of the validity considerations and analyses that drive both the development and the evaluation of the quality of automated scoring. These considerations include perspectives on the construct of interest, the context of use, and the empirical performance of SpeechRater in relation to both the human scores and the intended use of the scores. The outcomes of this work have implications for short- and long-term goals for iterative improvements to SpeechRater scoring. This study reports the development and validation of the system for low-stakes practice purposes. The process we followed to build this system represented a principled approach to maximizing two essential qualities: substantive meaningfulness and technical soundness. In developing and evaluating the features and the scoring models used to predict human-assigned scores, we actively engaged both content and technical experts to ensure the construct representation and technical soundness of the system. We primarily compared two alternative methodologies for building scoring models, multiple regression and classification trees, in terms of their construct representation and empirical performance in predicting human scores. Based on the evaluation results, we concluded that a multiple regression model with feature weights determined by content experts was superior to the other competing models evaluated. We then used an argument-based approach to integrate and evaluate the existing evidence to support the use of SpeechRater v1.0 in a low-stakes practice environment.
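The comparison of scoring models described above can be sketched in simplified form. The snippet below is a minimal illustration, not the actual SpeechRater pipeline: it uses simulated data, a hypothetical three-feature set (fluency, pronunciation, vocabulary), and made-up expert weights, and it contrasts a multiple regression model whose weights are fit by least squares with a model whose weights are fixed a priori by content experts, judging each by its correlation with the human scores it tries to predict.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for SpeechRater-style features (the names are
# illustrative, not the actual v1.0 feature set): fluency,
# pronunciation, vocabulary.
n = 500
features = rng.normal(size=(n, 3))
true_w = np.array([0.5, 0.3, 0.2])          # unknown "true" relationship
human = features @ true_w + rng.normal(scale=0.5, size=n)

# Model 1: weights estimated by multiple regression (least squares),
# with an intercept column appended to the feature matrix.
X = np.column_stack([features, np.ones(n)])
w_ols, *_ = np.linalg.lstsq(X, human, rcond=None)

# Model 2: weights fixed in advance by content experts
# (hypothetical values chosen for construct coverage, not fit).
w_expert = np.array([0.4, 0.4, 0.2])

pred_ols = X @ w_ols
pred_expert = features @ w_expert

def corr(a, b):
    """Pearson correlation between two score vectors."""
    return np.corrcoef(a, b)[0, 1]

print(f"regression-weight model r with human scores: {corr(pred_ols, human):.3f}")
print(f"expert-weight model r with human scores:     {corr(pred_expert, human):.3f}")
```

In this toy setting the expert-weight model gives up only a little predictive accuracy relative to the fitted model, which mirrors the report's rationale: when features are correlated, expert weights can be chosen for construct representation at a small empirical cost.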
The argument-based approach provided a mechanism for us to articulate the strengths and weaknesses in the validity argument for using SpeechRater v1.0 and to put forward a transparent argument for its use in a low-stakes practice environment. In particular, the construct representation of the multiple regression model with expert weights was sufficiently broad to justify its use in a low-stakes application. While some higher-order aspects of the speaking construct (such as content and organization) are missing, more basic aspects of the construct (such as pronunciation and fluency) are richly represented. In addition, these different parts of the speaking construct tend to be highly correlated, so that the absence of higher-order factors is not as detrimental to the model's agreement with human raters as it otherwise might be. The model's agreement with human raters was not sufficiently high to support high-stakes decisions but was still suitable for use in low-stakes applications. The correlation of the 6-item aggregate score with human raters was .57 and was deemed acceptable given the lo...
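The role of aggregation in the .57 figure can be illustrated with a small simulation. The snippet below uses entirely simulated machine and human item scores (not TPO data, and the noise levels are arbitrary assumptions) to show why a correlation computed on a 6-item aggregate is typically higher than the item-level correlations: summing over items averages out item-specific noise while preserving the shared ability signal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated per-item machine and human scores for 200 test takers on
# 6 speaking items. Both score sources share a latent "ability" signal
# plus independent item-level noise (all values are made up).
n_takers, n_items = 200, 6
ability = rng.normal(size=(n_takers, 1))
human = ability + rng.normal(scale=1.0, size=(n_takers, n_items))
machine = ability + rng.normal(scale=1.0, size=(n_takers, n_items))

# Mean machine-human correlation at the single-item level.
item_r = np.mean([np.corrcoef(machine[:, j], human[:, j])[0, 1]
                  for j in range(n_items)])

# Correlation of the 6-item aggregate scores: noise partially cancels,
# so the aggregate correlation exceeds the item-level average.
agg_r = np.corrcoef(machine.sum(axis=1), human.sum(axis=1))[0, 1]

print(f"mean item-level r:  {item_r:.2f}")
print(f"6-item aggregate r: {agg_r:.2f}")
```

The same mechanism means an aggregate-level correlation such as .57 should be interpreted at the level at which it was computed, not read as the per-response agreement of the scoring model.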