In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers can help in achieving asynchronous execution of skeleton calls while providing implicit synchronization capabilities in a data-consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU system.

Keywords SkePU · Smart containers · Skeleton programming · Memory management · Runtime optimizations · GPU-based systems

1 Introduction

Skeleton programming [4] for GPU-based systems is becoming increasingly popular for mapping common computational patterns. Several skeleton libraries have been written from scratch specifically targeting GPU-based systems, including SkePU [10, 6], SkelCL [24] and Marrow [20]. Moreover, many existing skeleton libraries initially written for execution on MPI clusters and/or multicore CPUs have been ported for GPU execution, such as FastFlow [12] and Muesli [11]. These libraries differ in their approach and feature offering, but they all aim to provide performance comparable to hand-written code while requiring much less programming effort.

Providing high-level abstraction with good execution performance in a library requires special design consideration.
The question comes down to what is exposed to the programmer and what is handled implicitly by the skeleton library. For example, the Marrow library exposes concurrency to the application by executing skeleton calls asynchronously; it returns a handle which can be used to synchronize execution when needed. This allows Marrow to effectively overlap computation and communication from different skeleton computations. SkelCL makes data distribution explicit so that the application programmer can choose how to map a computation to the underlying computing platform.

Another important aspect of GPU computation is managing communication between CPU (main) memory and GPU (device) memory over the PCIe interconnect. In Muesli, FastFlow, SkePU and SkelCL, skeleton calls can execute on a single-core or multicore CPU as well as on a GPU. Considering that CPUs and GPUs have separate physical memory, execution on a certain compute device may require transferring data back and forth to its associated memory if the data is not already available in that memory. For example, in the following code,

// 1D arrays: v0, v1
skel_c...