The number of latent topics that best models the latent substructure of a source code corpus is an open question in source code analysis. Most estimates of the number of latent topics in a software corpus rest on the assumption that the data resembles natural language, but there is little empirical evidence to support this. To help determine the number of topics needed to accurately represent source code, we generate a series of Latent Dirichlet Allocation models with varying topic counts. We use a heuristic to evaluate each model's ability to identify related source code blocks, and demonstrate the consequences of choosing too few or too many latent topics.
This project examines whether learners feel less foreign language anxiety (FLA) in an online multiuser 3D virtual world simulation than in the real-world classroom. Previous research has shown FLA to have negative effects on learner performance and learning outcomes. Research into learning in virtual worlds has indicated that performance anxiety may be lessened in these environments; however, such environments also demand that learners develop a range of technical skills to facilitate interaction. The project therefore also attempts to establish what impact these demands have on learner performance and FLA. On the basis of preliminary analysis, this work-in-progress paper has found that 1) there are multiple sources of FLA in both classroom and virtual environments; 2) students found the virtual environment less stressful in terms of language use; and 3) there was no significant inherent level of technology-related anxiety.
The two-year study reported in this article found that a single collaborative language lesson using Second Life can result in a statistically significant increase in student self-efficacy beliefs across a range of specific and general language skills. However, students with different 'real life' prior experience varied in the durability of their language performance beliefs over time. A between-group analysis revealed differences in the pre- and post-tests, which is explained by the specificity of the curriculum: that is, the curriculum within the Second Life environment, and not just the environment itself, has a significant impact on student beliefs. This helps to dispel some critics' concerns about the pedagogical value of these environments. However, a within-group analysis revealed that students with infrequent experience of the 'real life' language context increased in their beliefs, while students with frequent experience initially responded similarly to the other students but varied more in their responses over time. It is proposed that these variations over time result from an interaction between the domain specificity of the curriculum and the authenticity, or salience, of the enactive mastery experiences in Second Life relative to 'real life'.
Latent Dirichlet Allocation (LDA) is a data clustering algorithm that performs especially well on text documents. In natural-language applications it automatically finds groups of related words (called "latent topics") and clusters the documents into sets that are about the same "topic". LDA has also been applied to source code, where the documents are natural source code units such as methods or classes, and the words are the keywords, operators, and programmer-defined names in the code. Determining the topic count that most appropriately describes a set of source code documents remains an open problem. We address this empirically by constructing clusterings with different numbers of topics for a large number of software systems, and then use a pair of measures based on source code locality and topic model similarity to assess how well the topic structure identifies related source code units. Results suggest that the required topic count can be closely approximated from the number of code fragments in the system. We extend these results to recommend appropriate topic counts for arbitrary software systems, based on an analysis of a set of open source systems.
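The kind of topic-count sweep this abstract describes can be sketched in a few lines. The sketch below uses scikit-learn's LatentDirichletAllocation on a toy corpus of tokenized "source code units"; the corpus, the candidate topic counts, and the use of perplexity as the comparison score are illustrative assumptions — the paper's own evaluation uses measures based on source code locality and topic model similarity, not perplexity.

```python
# Sketch: fit LDA models with varying topic counts over a tiny corpus of
# tokenized source code units, then compare them by model perplexity.
# (Illustrative only; ideally perplexity is measured on held-out data.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" is one source code unit (e.g. a method body),
# flattened into its identifiers, keywords, and operators.
docs = [
    "parse token stream return ast node",
    "open file read buffer close file",
    "parse expression token precedence ast",
    "write buffer flush close file stream",
]

X = CountVectorizer().fit_transform(docs)  # document-term count matrix

scores = {}
for k in (2, 3, 4):  # candidate topic counts (hypothetical values)
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)  # lower perplexity = better fit

best_k = min(scores, key=scores.get)
print(best_k, scores)
```

On a realistic system, the same loop would range over far larger topic counts, and the scoring function would be swapped for a measure of how well topics group related code units.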