Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

The Sunway TaihuLight supercomputer is the world's first system with a peak performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the TaihuLight system. In contrast with other existing heterogeneous supercomputers, which include both CPU processors and PCIe-connected many-core accelerators (NVIDIA GPU or Intel Xeon Phi), the computing power of TaihuLight is provided by a homegrown many-core SW26010 CPU that includes both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip. With 260 processing elements in one CPU, a single SW26010 provides a peak performance of over three TFlops. To alleviate the memory bandwidth bottleneck in most applications, each CPE comes with a scratch pad memory, which serves as a user-controlled cache. To support the parallelization of programs on the new many-core architecture, in addition to the basic C/C++ and Fortran compilers, the system provides a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax. This paper also reports our preliminary efforts on developing and optimizing applications on the TaihuLight system, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.

show abstract

“…The general architecture of the SW26010 processor [10] is shown in Figure 2. The processor includes four core-groups (CGs).…”

mentioning

confidence: 99%

The Sunway TaihuLight supercomputer: system and applications

Liao

Yang

et al. 2016

Sci. China Inf. Sci.

394

165

View full text Add to dashboard Cite

show abstract

“…This limiting number can be increased by organizing the processors into clusters: then the first computer must speak directly only to the head of the cluster. Another way is to distribute the job near to the processing units, either inside the processor [34] or using processors to let do the job by the processing units of a GPGPU.…”

Section: A Non-technical Model Of Parallelized Sequential Operationmentioning

confidence: 99%

“…The T aihulight is an exception on both axes: it has the highest number of cores and the best parallelization efficiency. Its secret is in the processor comprising cooperating cores [34], i.e. it uses a (slightly) different computing paradigm.…”

Section: A Non-technical Model Of Parallelized Sequential Operationmentioning

confidence: 99%

“…5 * 10 3 , i.e the resulting value of (1 − α ef f ) is expected to be around 10 −7 . Notice that deploying "cooperative computing" [34] enhances further the value of (1 − α ef f ), but it means already utilizing a (slightly) different computing paradigm: the cores have a kind of direct connection and can communicate with the exclusion of the main memory.…”

Section: Estimating the Different Limiting Factors Of αmentioning

confidence: 99%

See 1 more Smart Citation

Finally, how many efficiencies the supercomputers have?

Végh¹

2020

J Supercomput

View full text Add to dashboard Cite

Using extremely large number of processing elements in computing systems leads to unexpected phenomena, such as different efficiencies of the same system for different tasks, that cannot be explained in the frame of the classical computing paradigm. The introduced simple non-technical model enables to set up a frame and formalism needed to explain the unexpected experiences around supercomputing. The paper shows that the degradation of the efficiency of the parallelized sequential system is a natural consequence of the computing paradigm, rather than an engineering imperfectness. The workload is greatly responsible for wasting the energy as well as limiting the size and the type of tasks the supercomputers can run. Case studies provide insight how the different contributions compete for dominating the resulting payload performance of the computing system, and how enhancing the technology made the computing+communication the dominating contribution in defining the efficiency of supercomputers. The model also enables to derive predictions about the supercomputer performance limitations for the near future as well as provides hints for enhancing the supercomputer components. The phenomena show interesting parallels with the phenomena experienced in science more than a century ago and through their studying a modern science was developed.1 There are some doubts about the definition of exaFLOPS, whether it means R P eak or R M ax , in the former case whether it includes accelerator cores, and in the latter case measured by which benchmark and finally using what operand length. Here the term is used as R HP L M ax , using 64-bit floating operands. 2 A special issue https://link.springer.com/journal/11714/19/10 3 https://en.wikipedia.org/wiki/PEZY Computing: The name PEZY is an acronym derived from the greek derived Metric prefixs Peta, Eta, Zetta, Yotta 4 https://blogs.nvidia.com/blog/2019/06/17/hpc-ai-performance-record-summit/ https://www.olcf.ornl.gov/2018/06/08/genomics-code-exceeds-exaops-on-summitsupercomputer/ 5 The related work and speedup deserved the Gordon Bell Prize 6 It was also learned that specific processor design is needed for exascale: As part of the announcement the development line Knights Hill [20] was canceled and instead be replaced by a "new platform and new microarchitecture specifically designed for exascale". 7 Despite its failure, the SpiNNaker2 is also under construction [19] 8 https://www.scmp.com/tech/policy/article/3015997/china-has-decided-not-fan-flamessuper-computing-rivalry-amid-us 9 https://ec.europa.eu/newsroom/dae/document.cfm? doc id =60156

show abstract

“…• as a new QT receives a new Processing Unit (PU)(s), there is no need to save/restore registers and return address (less memory utilization and less instruction cycles) • the OS can receive its own PU, which is initialized in kernel mode and can promptly (i.e. without the need of context change) service the requests from the requestor core • for resource sharing, temporarily a PU can be delegated to protect the critical section; the next call to run the code fragment with the same offset will be delayed until the processing by the first PU terminates • the processor can natively accommodate to the variable need of parallelization • the actually out-of-use cores are waiting in low energy consumption mode • the hierarchic core-to-core communication greatly increases the memory throughput • the asynchronous-style computing [57] largely reduces the loss due to the gap [58] between speed of the processor and that of the memory • the direct core-to-core connection (more dynamic than in [46]) greatly enhances efficacy in large systems [59] • the thread-like feature to fork() and the hierarchic buses change the dependence of on the number of cores from linear to logarithmic [8] (enables to build really exa-scale supercomputers) The very first version of EMPA [11] has been implemented in a form of simple (practically untimed) simulator [60], now an advanced (Transaction Level Modelled) simulator is prepared in SystemC. The initial version adapted Y86 cores [61], the new one RISC-V cores.…”

Section: Some Advantages Of Empamentioning

confidence: 99%

The Need for Modern Computing Paradigm: Science Applied to Computing

Végh¹,

Tisan

2019

2019 International Conference on Computational Science and Computational Intelligence (CSCI)

View full text Add to dashboard Cite

More than hundred years ago the 'classic physics' was it in its full power, with just a few unexplained phenomena; which however led to a revolution and the development of the 'modern physics'. Today the computing is in a similar position: computing is a sound success story, with exponentially growing utilization, but with a growing number of difficulties and unexpected issues as moving towards extreme utilization conditions. In physics studying the nature under extreme conditions has lead to the understanding of the relativistic and quantal behavior. Quite similarly in computing some phenomena, acquired in connection with extreme (computing) conditions, cannot be understood based on of the 'classic computing paradigm'. The paper draws the attention that under extreme conditions qualitatively different behaviors may be encountered in both classic and world, and pinpointing that certain, formerly unnoticed or neglected aspects enable to explain new phenomena as well as to enhance computing features. Moreover,an idea of modern computing paradigm implementation is proposed.

show abstract

Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

Cited by 41 publications

References 45 publications

The Sunway TaihuLight supercomputer: system and applications

The Sunway TaihuLight supercomputer: system and applications

Finally, how many efficiencies the supercomputers have?

The Need for Modern Computing Paradigm: Science Applied to Computing

Contact Info

Product

Resources

About