In this paper, we review the state of the art in parallel computational model research and introduce the various models developed over the past decades. Based on the features of their target architectures, especially memory organization, we classify these parallel computational models into three generations, and we discuss the models and their characteristics within this three-generation classification. We believe that, with the ever-increasing speed gap between CPUs and memory systems, incorporating a non-uniform memory hierarchy into computational models will become unavoidable. With the emergence of multi-core CPUs, the parallelism hierarchy of current computing platforms is becoming increasingly complicated, and describing this hierarchy in future computational models is becoming increasingly important. A semi-automatic toolkit that can extract model parameters and their values on real computers would reduce the complexity of model analysis, allowing more complicated models with more parameters to be adopted. Hierarchical memory and hierarchical parallelism will be two essential features to consider in future model design and research.

Keywords: parallel computational models, hierarchical memory, hierarchical parallelism, three generations, memory model

Background

The simplified, abstract description of a computer is called a "computational model". Computer architects, algorithm designers, and program developers can use such a model as a basis to assess their work, including the suitability of a computer architecture for various applications, the computational complexity of an algorithm, and the potential performance of a program on various computers. A good computational model simplifies the complicated work of the architect, algorithm designer, and program developer while mapping that work effectively onto real computers. Such a computational model is therefore sometimes called a "bridging model" [1].
The bridging model between sequential computers and algorithm designers/program developers is the Von Neumann architecture together with the RAM (Random Access Machine) model [2]. For parallel computers and parallel programs, however, no commonly recognized bridging model exists, and no model maps a user's parallel program onto parallel computers as smoothly as the Von Neumann and RAM models do for sequential machines. This situation is largely due to the immaturity of parallel computer design: there are many different parallel architectures that change rapidly from year to year, and the demand for performance is greater [3], so a clean, simplified description is almost impossible. However, parallel computer design is converging toward a common architecture model (such as the cluster), and the communication layer of parallel computing is no longer strongly dependent on the interconnect network (we have the standard MPI interface); hence we have the BSP and LogP models [1,7]. Based on the historical development of parallel computational models, we think they can be classified ...
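As a concrete illustration of how such a bridging model supports algorithm analysis, the following minimal sketch evaluates the standard BSP cost formula, in which a program is a sequence of supersteps and each superstep costs w + g·h + l (the parameter names follow the usual BSP convention; the numeric values in the example are made up for illustration):

```python
# Standard BSP cost model: each superstep costs w + g*h + l, where
#   w = maximum local computation performed by any processor,
#   h = maximum number of words sent or received by any processor,
#   g = per-word communication cost (gap),
#   l = barrier synchronization latency.

def superstep_cost(w, h, g, l):
    """Cost of a single BSP superstep."""
    return w + g * h + l

def bsp_cost(supersteps, g, l):
    """Total cost of a BSP program given a list of (w, h) pairs."""
    return sum(superstep_cost(w, h, g, l) for w, h in supersteps)

# Example: two supersteps on a machine with g = 4 cycles/word, l = 100.
total = bsp_cost([(1000, 50), (500, 20)], g=4, l=100)
print(total)  # (1000 + 200 + 100) + (500 + 80 + 100) = 1980
```

The value of such a model for the "bridging" role discussed above is that g and l can be measured once per machine, after which algorithm costs are predicted without rerunning experiments.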
Abstract. Shared memory systems are an important platform for high-performance computing. In traditional parallel programming, the Message Passing Interface (MPI) is widely used, but current MPI implementations do not take full advantage of shared memory for communication: a double-copy method copies data to and from a system buffer for message passing. In this paper, we propose a novel design and implementation of the communication protocol for MPI on shared memory systems. The double-copy method is replaced by a single-copy method, so messages are transferred without the system buffer. We compare the new communication protocol with that of MPICH, an implementation of MPI. Our performance measurements indicate that the new communication protocol outperforms MPICH with lower latency: the new protocol performs up to about 15 times faster than MPICH for point-to-point communication, and up to about 300 times faster for collective communication.
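The double-copy versus single-copy distinction can be sketched as follows. This is an illustrative toy model, not the paper's actual protocol: real implementations operate on shared-memory segments with synchronization, which is omitted here, and the function names are assumptions for this sketch.

```python
# Toy contrast between the two message-transfer schemes described above.
# Buffers are modeled as bytearrays; each [:] assignment is one data copy.

def send_double_copy(src, system_buffer, dst):
    """Traditional scheme: message passes through a system buffer."""
    system_buffer[:] = src   # copy 1: sender's buffer -> system buffer
    dst[:] = system_buffer   # copy 2: system buffer -> receiver's buffer
    return 2                 # number of copies performed

def send_single_copy(src, dst):
    """Proposed scheme: sender writes directly into the receiver's buffer."""
    dst[:] = src             # the only copy
    return 1

msg = bytearray(b"hello")
sysbuf = bytearray(len(msg))
recv = bytearray(len(msg))

copies_double = send_double_copy(msg, sysbuf, recv)
copies_single = send_single_copy(msg, recv)
print(copies_double, copies_single, recv.decode())  # 2 1 hello
```

Halving the number of copies roughly halves the memory traffic per message, which is where the latency advantage of the single-copy protocol comes from.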
Advances in technology have made image capture and storage very convenient, resulting in an explosion in the amount of visual information and making it difficult to find useful information within such tremendous volumes of data. Content-Based Visual Information Retrieval (CBVIR) is emerging as one of the best solutions to this problem. Unfortunately, CBVIR is a very compute-intensive task. With the boom of multi-core processors, however, CBVIR can be accelerated by exploiting multi-core processing capability. In this paper, we propose a parallel implementation of a server-oriented CBVIR system and apply several serial and parallel optimization techniques to improve its performance on an 8-core and a 16-core system. Experimental results show that the optimized implementation achieves very fast retrieval on both multi-core systems. We also compare the performance of the application on the two systems and explain the performance difference between them. Furthermore, we conduct detailed scalability and memory performance analyses to identify possible bottlenecks in the application. Based on these experimental results and performance analyses, we gain many insights into developing efficient applications on future multi-core architectures.
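A hypothetical sketch of the data-parallel core such a retrieval system might use (the names, structure, and distance metric are assumptions for illustration, not the paper's code): the database of feature vectors is split into chunks, each chunk is scored against the query concurrently, and the partial rankings are merged. A production CBVIR server would use processes or native threads pinned to cores; Python threads keep the sketch self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def euclidean_sq(query, vec):
    # Squared Euclidean distance between two feature vectors.
    return sum((q - v) ** 2 for q, v in zip(query, vec))

def score_chunk(query, chunk):
    # Score one slice of the database; chunk holds (index, vector) pairs.
    return [(euclidean_sq(query, vec), idx) for idx, vec in chunk]

def retrieve(query, database, top_k=3, workers=4):
    # Round-robin split so chunks have balanced sizes.
    indexed = list(enumerate(database))
    chunks = [indexed[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: score_chunk(query, c), chunks)
    # Merge the per-chunk rankings and keep the top_k nearest indices.
    merged = sorted(s for part in parts for s in part)
    return [idx for _, idx in merged[:top_k]]

db = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
print(retrieve([0.0, 0.0], db, top_k=2))  # [0, 2]
```

The scalability bottlenecks the paper analyzes would show up here as the cost of the sequential merge step and as memory-bandwidth pressure when all workers stream the feature database simultaneously.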