Recent advances in microprocessor architecture have brought a large amount of computational power through numerous heterogeneous computing devices. For instance, a typical modern supercomputer comprises a set of interconnected computing nodes, each of which has several multi-core processors and hardware accelerators such as Nvidia GPUs and Intel MICs. Moreover, moving toward exascale platforms, estimates indicate that computing nodes will feature up to 1000 cores. However, to take full advantage of these resources, a profound shift in the implementation of numerical applications is required. In this paper, we study the design and implementation of a seismic wave propagation simulator, based on a finite-difference scheme and specifically tailored for massively parallel architectures. The application data flow is built on top of PaRSEC, a generic task-based runtime system targeting distributed heterogeneous architectures. Considering the memory-bound nature of the stencil scheme, we designed the numerical kernels to maximize data reuse and thus increase their arithmetic intensity. Such a strategy makes it possible to efficiently exploit the large SIMD units available in modern CPU cores. To illustrate the performance of the application, we conducted a strong-scalability study on a cluster of Intel KNL processors. The results compare favorably to an existing seismic wave propagation code.
This paper presents the design of an efficient multi-target (CPU+GPU) implementation of the Parallel_for skeleton. Emerging massively parallel architectures promise very high performance at low cost. However, these architectures change faster than ever, so optimizing codes becomes a very complex and time-consuming task. We have identified data storage as the main difference between the CPU and the GPU implementations of a code, and we introduce an abstract data layout in order to adapt the data storage. Based on this layout, the Parallel_for skeleton makes it possible to compile and execute the same program on both CPU and GPU. Once compiled, the program runs close to the hardware limits.

Keywords: C++ templates, parallel computing, Nvidia CUDA, Intel TBB, parallel skeletons, data layout

MOTIVATION AND MAIN OBJECTIVES
In many scientific applications, computation time is a strong constraint. Optimizing for rapidly changing computer hardware is a very expensive and time-consuming task, and emerging hybrid architectures tend to make this process even more complex. The classical way to ease this optimization process is to build applications on top of High Performance Computing (HPC) libraries. Each HPC library offers the scientific developer a well-defined Application Programming Interface (API) tailored to its specific scientific sub-domain. Because of their limited scope, it is possible to produce specialized HPC implementations of these libraries for a large variety of target hardware architectures.

Legolas++ is a generic library developed at EDF R&D that provides building blocks for the specific domain of Highly Structured Sparse Linear Algebra (HSSLA) problems arising in many simulation codes. In particular, it handles recursively blocked matrices (matrices of blocks of blocks of ...) that appear, for example, in neutron transport simulations [16]. In order to build HPC codes meeting EDF's industrial quality standards, a multi-target version of Legolas++ is currently being developed that should provide a unified interface to optimal implementations for both multi-core CPUs and Graphics Processing Units (GPUs). A large fraction (though not all) of the Legolas++ operations are embarrassingly parallel and consist in applying the same function independently to multiple data. This kind of problem is well described by a Parallel_for algorithm, an instance of the parallel algorithmic skeletons introduced in [4]. In this article we propose a design for a C++ multi-target (CPU/GPU) implementation of the Parallel_for skeleton.
Studies of massive open online course (MOOC) users discuss the existence of typical profiles and their impact on students' learning processes. However, defining these typical behaviors, as well as classifying users accordingly, is a difficult task. In this paper we propose two methods to model MOOC users' behavior from their log data. We cast their behavior into a Markov Decision Process (MDP) framework, associate the users' intentions with the MDP reward, and argue that this allows us to classify them.