New architectural solutions for parallel systems built of bus-based shared memory processor clusters are presented. A new paradigm for interprocessor communication, called communication on the fly, is proposed. With this paradigm, processors can be dynamically switched between clusters at program run-time to bring into their caches data that many processors in a cluster can read at the same time the data are written to the cluster memory. A cache-controlled macro data flow program execution paradigm is also proposed: programs are structured into tasks for which all required data are brought into the processor data cache before task execution. A new graph representation of programs is introduced, which enables modeling the behaviour of data caches, memories, bus arbiters, processor switching between clusters and parallel reads of data on the fly. This representation is used for realistic simulation of the execution of a numerical algorithm based on the distribution of parallel tasks between dynamic SMP clusters and on communication on the fly. Performance evaluation results are presented for different configurations of the programs and of the shared memory clusters in the system.
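The cache-controlled macro data-flow paradigm described above can be illustrated with a minimal sketch (all class and function names here are hypothetical, not taken from the paper): each task declares the shared-memory addresses it needs, those data are prefetched into the processor data cache before the task runs, and the task body then reads from the cache only.

```python
# Hedged sketch of cache-controlled macro data-flow execution:
# phase 1 prefetches all of a task's input data from cluster shared
# memory into the processor's data cache; phase 2 runs the task body,
# which touches only the cache. All names are illustrative.

class DataCache:
    def __init__(self):
        self.lines = {}                  # address -> cached value

    def prefetch(self, memory, addresses):
        # Bring all required data into the cache before task execution.
        for a in addresses:
            self.lines[a] = memory[a]

    def read(self, address):
        # During task execution, every read must hit the cache.
        return self.lines[address]

def run_task(cache, memory, inputs, body):
    cache.prefetch(memory, inputs)       # phase 1: fill the cache
    return body(cache)                   # phase 2: compute from cache only

# Toy usage: sum two values held in cluster shared memory.
shared_memory = {0: 3, 1: 4}
cache = DataCache()
result = run_task(cache, shared_memory, [0, 1],
                  lambda c: c.read(0) + c.read(1))
print(result)  # -> 7
```

The point of the two-phase structure is that, once prefetching completes, task execution generates no shared-memory bus traffic, which is what makes the task granularity a natural unit for scheduling on dynamic SMP clusters.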
Introduction

Scalability of shared memory systems can be much improved by applying a cluster-based system architecture, which has become quite common today [1,2]. Some systems are based on shared memory processor clusters with inter-cluster communication done by message passing [3-7]; communication between clusters goes through networks such as Fast Ethernet, Gigabit Ethernet or Myrinet. Other shared memory cluster systems are CC-NUMA distributed shared memory systems, in which different interconnection means implement intra- and inter-cluster communication. In the GigaMax system of Encore Computer Corporation [9], bus-based shared memory clusters communicated through a global bus. In the Stanford DASH [10], bus-based processor clusters are interconnected by two-dimensional meshes. In the Convex Exemplar [11], shared memory clusters based on crossbar switches are interconnected by a multiple-ring network. Intra-cluster and inter-cluster communication can have different latencies, the former usually being much lower for small clusters. To map a parallel program structure optimally onto the system structure, areas of intensive inter-process communication in programs should be mapped into shared memory clusters. In current implementations, the size of clusters is fixed, and the physical number of processors in clusters can differ from the optimal cluster sizes requested by programs. In such cases, the fixed system structure can decrease the efficiency of program execution. This paper describes a cluster-based shared memory system architecture oriented towards much more efficient computations and communication during parallel program execution than in existing systems. These goals are achieved by dynamic reconfiguration of shared memory processor clusters, a new paradigm of data cache behaviour and a new typ...