Today's scientific applications increasingly rely on a variety of data sources, storage facilities, and computing infrastructures, and there is a growing demand for data analysis and visualization in these applications. In this context, exploiting Big Data frameworks for scientific computing is an opportunity to incorporate high-level libraries, platforms, and algorithms for machine learning, graph processing, and streaming; to inherit their data awareness and fault tolerance; and to increase productivity. Nevertheless, limitations arise when Big Data platforms are integrated with an HPC environment, namely poor scalability, severe memory overhead, and substantial development effort. This paper focuses on a popular Big Data framework, Apache Spark, and proposes an architecture to support the integration of highly scalable MPI block-based data models and communication patterns with a map-reduce-based programming model. The resulting platform preserves the data abstraction and programming interface of Spark, without requiring any changes to the framework, but allows the user to delegate operations to the MPI layer. The evaluation of our prototype shows that our approach integrates Spark and MPI efficiently at scale, so end users can take advantage of the productivity facilitated by the rich ecosystem of high-level Big Data tools and libraries based on Spark, without compromising efficiency and scalability.
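The delegation pattern described in this abstract can be illustrated with a minimal sketch. Note that `MPIDelegatingRDD` and its block layout are hypothetical names invented for illustration, not the paper's actual API; the "MPI layer" here is simulated with a pairwise tree combination, the way an `MPI_Reduce` tree algorithm would merge per-rank partial results.

```python
from functools import reduce as _reduce

class MPIDelegatingRDD:
    """Toy stand-in (hypothetical) for a Spark-like RDD whose reduce()
    is delegated to an MPI-style block-based tree reduction."""

    def __init__(self, blocks):
        # Data is kept as blocks, mirroring an MPI block-based layout
        # where each block would live on one rank.
        self.blocks = blocks

    def map(self, f):
        # Preserves the Spark-like interface: map applies f per element,
        # block by block, keeping the block structure intact.
        return MPIDelegatingRDD([[f(x) for x in b] for b in self.blocks])

    def reduce(self, op):
        # Stand-in for delegation to the MPI layer: each "rank" reduces
        # its local block, then partial results are combined pairwise,
        # as a tree-based MPI_Reduce would.
        partials = [_reduce(op, b) for b in self.blocks if b]
        while len(partials) > 1:
            merged = [op(a, b) for a, b in zip(partials[0::2], partials[1::2])]
            if len(partials) % 2:
                merged.append(partials[-1])
            partials = merged
        return partials[0]

rdd = MPIDelegatingRDD([[1, 2, 3], [4, 5], [6]])
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
```

The point of the sketch is that the user-facing `map`/`reduce` interface is unchanged, while the reduction strategy underneath follows a communication pattern chosen by the lower layer rather than Spark's shuffle.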
Many scientific areas make extensive use of computer simulations to study complex real-world processes. These computations are typically very resource-intensive and present scalability issues as experiments get larger, even in dedicated clusters, since these are limited by their own hardware resources. Cloud computing arises as an option to move toward the ideal of unlimited scalability by providing virtually infinite resources, yet applications must be adapted to this new paradigm. This process of converting and/or migrating an application and its data in order to make use of cloud computing is sometimes known as cloudifying the application. We propose a generalist cloudification method based on the MapReduce paradigm to migrate scientific simulations into the cloud and provide greater scalability. We analysed its viability by applying it to a real-world railway power consumption simulator and running the resulting implementation on Hadoop YARN over Amazon EC2. Our tests show that the cloudified application is highly scalable and that there is still a large margin to improve the theoretical model and its implementations, as well as to extend it to a wider range of simulations. We also propose and evaluate a multidimensional analysis tool based on the cloudified application. It generates, executes, and evaluates several experiments in parallel for the same simulation kernel. The results we obtained indicate that our methodology is suitable for resource-intensive simulations and multidimensional analysis, as it improves infrastructure utilization, efficiency, and scalability when running many complex experiments.
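The cloudification idea above can be sketched as a Hadoop-streaming-style map/reduce pair, where each independent experiment becomes a map task. Everything here is illustrative: `run_simulation` is a hypothetical stand-in for the resource-intensive kernel, and the key scheme is invented for the example.

```python
from collections import defaultdict

def run_simulation(config):
    """Hypothetical simulation kernel: a cheap formula standing in
    for a resource-intensive railway power-consumption run."""
    trains, voltage = config
    return trains * voltage * 0.1  # pretend consumption figure

def mapper(config):
    # Map phase: each cloud worker runs one independent experiment
    # and emits a (experiment-key, result) pair.
    key = "trains=%d" % config[0]
    return key, run_simulation(config)

def reducer(pairs):
    # Reduce phase: aggregate results per key (here, per fleet size).
    out = defaultdict(float)
    for key, value in pairs:
        out[key] += value
    return dict(out)

# A small grid of experiment configurations (trains, line voltage).
configs = [(2, 750), (2, 1500), (4, 750), (4, 1500)]
results = reducer(map(mapper, configs))
```

Because the map tasks are mutually independent, scaling out is a matter of adding workers, which is what makes the MapReduce formulation a natural fit for multidimensional parameter sweeps on elastic infrastructure.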
Nowadays, the modelling, design, evaluation, and testing stages involved in the development of railway infrastructures are extensively assisted by computer simulators. Moreover, some expert systems take a step further to improve and propose designs, taking the user's knowledge as a baseline. These systems can generate and assess a large number of complex scenarios, which entails the execution of numerous, and potentially very complex, simulations. Railway infrastructure projects rely heavily on these applications to analyze potential deployments prior to their installation. In this paper, we propose the railway power consumption simulator model (RPCS), a cloud-based model for the design, simulation, and evaluation of railway electric infrastructures. This model integrates the parameters of an infrastructure within a search engine that generates and evaluates a set of simulations to achieve optimal designs, according to a given set of objectives and restrictions. The knowledge of the domain is represented as an ontology, which translates the elements in the infrastructure into an electric circuit that is simulated to obtain a wide range of metrics for each element of the infrastructure. In order to support the execution of thousands of scenarios in a scalable, efficient, and fault-tolerant manner, we also propose an architecture to deploy the model in a Cloud environment. To illustrate how our model would adapt to a specific problem, we describe a case study that aims to maximize energy savings while maintaining high power-provisioning quality. Using our model, we were able to obtain the optimal substation distribution that allowed the infrastructure to operate under both normal and faulty conditions. Additionally, we include the economic costs that arose from the externalization of the computations to the Amazon Elastic Compute Cloud, which were minimized by our dimensioning model and the usage of spot instances.
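The generate-and-evaluate loop of such a search engine can be sketched as follows. The design variables, ranges, cost formula, and quality threshold are all hypothetical placeholders; in the RPCS model the evaluation would come from simulating the electric circuit derived from the ontology, not from a closed-form expression.

```python
from itertools import product

def evaluate(substations, spacing_km):
    """Hypothetical evaluation of one scenario, returning
    (energy_cost, provisioning_quality). A real run would
    simulate the electric circuit built from the ontology."""
    energy_cost = substations * 12 + spacing_km * 2
    quality = min(1.0, substations * spacing_km / 40.0)
    return energy_cost, quality

# Generate every scenario in a small design space
# (number of substations x feeder spacing in km; illustrative ranges).
scenarios = list(product(range(2, 7), [5, 10, 15]))

# Restriction: keep provisioning quality above a threshold;
# objective: minimize energy cost among the feasible designs.
feasible = [(s, d) for (s, d) in scenarios if evaluate(s, d)[1] >= 0.9]
best = min(feasible, key=lambda sd: evaluate(*sd)[0])
```

Since each scenario evaluation is independent, the same loop maps directly onto the cloud deployment described in the abstract, with scenarios fanned out across workers and only the objective values gathered back.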