Moore's law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to address this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms, which together enable highly efficient use of on-chip resources. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software co-design methodology bridges high-end computing and mainstream programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.
Synchronization between threads has a serious impact on the performance of many-core architectures. When communication is frequent, coarse-grained synchronization incurs significant overhead and is therefore unsuitable, whereas the overhead of fine-grained synchronization remains small. For many-core architectures that support fine-grained synchronization with on-chip storage, we propose fine-grained synchronization algorithms for two scientific computing applications: 2-D wavefront and LU decomposition. First, an efficient data allocation method is proposed based on the memory access pattern. Then, thread partitioning and synchronization schemes are discussed. Finally, we evaluate the two algorithms on the Godson-T many-core architecture. The experimental results show that the relative speedup is almost linear and that the execution time is only 53.2% of that with coarse-grained synchronization. After the global barriers are eliminated, LU decomposition achieves a 13.1% performance improvement. Moreover, the experiments show that the fine-grained synchronization mechanism improves processor performance and scales well.
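To make the idea of fine-grained synchronization in a 2-D wavefront concrete, the following is a minimal sketch in Python. It is not the algorithm evaluated on Godson-T: the function name `wavefront_fine_grained`, the cyclic row-to-thread partitioning, the recurrence used for each cell, and the use of one `threading.Event` per cell as a software stand-in for hardware-supported synchronization are all assumptions made for illustration. Each thread waits only for the single cell it depends on in the row above, instead of stopping at a global barrier after every anti-diagonal.

```python
import threading


def wavefront_fine_grained(n_rows, n_cols, n_threads):
    """Fill a grid where each interior cell depends on its upper and left neighbours.

    Illustrative recurrence (an assumption, not taken from the paper):
    boundary cells are 1, and grid[i][j] = grid[i-1][j] + grid[i][j-1].
    """
    grid = [[0] * n_cols for _ in range(n_rows)]
    # One flag per cell, set once that cell's value has been produced.
    # This plays the role of a per-datum synchronization primitive.
    ready = [[threading.Event() for _ in range(n_cols)] for _ in range(n_rows)]

    def worker(tid):
        # Hypothetical partitioning: rows are assigned to threads cyclically.
        for i in range(tid, n_rows, n_threads):
            for j in range(n_cols):
                if i > 0:
                    # Fine-grained wait: block only on the one cell above,
                    # which is produced by another thread.
                    ready[i - 1][j].wait()
                if i == 0 or j == 0:
                    grid[i][j] = 1
                else:
                    # The left neighbour was computed by this same thread
                    # in the previous iteration, so no wait is needed.
                    grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
                ready[i][j].set()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return grid


if __name__ == "__main__":
    for row in wavefront_fine_grained(4, 4, 3):
        print(row)
```

In this sketch a thread working on row `i` can start as soon as the first cell of row `i - 1` is ready, so consecutive rows overlap in a pipelined fashion. A coarse-grained version would instead place a barrier across all threads after each anti-diagonal, forcing every thread to wait for the slowest one at each step.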