Generalized MultiAmdahl: Optimization of Heterogeneous Multi-Accelerator SoC

IEEE Comput. Arch. Lett.

2014

Self Cite

Power consumption, off-chip memory bandwidth, chip area and Network on Chip (NoC) capacity are among the main chip resources limiting the scalability of Chip Multiprocessors (CMP). A closed-form analytical solution is developed for optimizing the CMP cache hierarchy and optimally allocating area among hierarchy levels under such constrained resources. The optimization framework is extended by incorporating the impact of data sharing on cache miss rate. An analytical model for cache access time as a function of cache size is proposed and verified using CACTI simulation.

Section: Optimizing Cache Hierarchy (mentioning)

confidence: 99%

Cache Hierarchy Optimization

IEEE Comput. Arch. Lett.

2014

Self Cite

“…PiM and SIMD taxonomy. to account for the "uncore" components, concluding that to sustain the scalability of future many-core systems, the uncore components must be designed to scale sublinearly with respect to the overall core count. Morad et al. [2013] presented several frameworks that, given (a) a multicore architecture consisting of last-level cache (LLC), processing cores, and an NoC interconnecting the cores and the LLC; (b) workloads consisting of sequential and concurrent tasks; and (c) physical resource constraints (area, power, execution time, off-chip bandwidth), find the optimal selection of a subset of the available processing cores and the optimal resource allocation among all blocks.…”

Section: Related Work (mentioning)

confidence: 99%

GP-SIMD Processing-in-Memory

ACM Trans. Archit. Code Optim.

2015

Self Cite

AMIR MORAD, LEONID YAVITS, and RAN GINOSAR, Technion

GP-SIMD, a novel hybrid general-purpose SIMD computer architecture, resolves the issue of data synchronization by in-memory computing through combining data storage and massively parallel processing. GP-SIMD employs a two-dimensional access memory with modified SRAM storage cells and a bit-serial processing unit per memory row. An analytic performance model of the GP-SIMD architecture is presented, comparing it to associative processor and conventional SIMD architectures. Cycle-accurate simulation of four workloads supports the analytical comparison. Assuming a moderate die area, the GP-SIMD architecture outperforms both the associative processor and conventional SIMD coprocessor architectures by almost an order of magnitude while consuming less power.

“…Morad et al. [7] [8] proposed models that minimized sequential and concurrent execution time of heterogeneous and asymmetric SoC processing cores. The limitations of the frameworks presented in [38], [8] and [7] are: (a) modeling the processing cores, but not addressing common building blocks such as NoC and LLC; (b) modeling workloads containing either a sequence of sequential heterogeneous tasks [38] [8], or modeling workloads containing a sequence of concurrent sections [5], but not both together; (c) utilizing Lagrange multipliers, thus identifying the necessary condition for optimality, but not the optimal point; (d) modeling constrained area [38] [8], or constrained area/power designs [5], but not addressing off-chip bandwidth; and (e) solving for optimal execution time under area/power constraints, but not addressing optimization of power or area under constraints.…”

Section: Related Work (mentioning)

confidence: 99%

“…Further, we assume that a task's runtime depends only on the core's speedup function at its designated area and power (in a similar manner to [14] [3] [37] [22] [38] [8]). Our model, however, does account for microarchitecture differences, as each core may have its own area- and power-to-performance model.…”

Section: A Workload (mentioning)

confidence: 99%

Convex optimization of resource allocation in asymmetric and heterogeneous SoC

2014 24th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)

2014

Self Cite

Chip area, power consumption, execution time, off-chip memory bandwidth, overall cache miss rate and Network on Chip (NoC) capacity are limiting the scalability of SoCs. Consider a workload comprising a sequential task and multiple concurrent tasks, and an asymmetric or heterogeneous SoC architecture. A convex optimization framework is proposed for selecting the optimal set of processing cores and allocating area and power resources among them, the NoC and the last level cache, under constrained total area, total average power, total execution time and off-chip bandwidth. The framework relies on analytical performance and power models of the processing cores, NoC and last level cache as a function of their allocated resources. Due to practical implementation of the cores, the optimal architecture under constraints may exclude several of the cores. Several asymmetric and heterogeneous configurations are explored. Convex optimization is shown to extend optimizations based on Lagrange multipliers. We find that our framework obtains the optimal chip resource allocation over a wide spectrum of parameters and constraints, and thus can automate complex architectural design, analysis and verification.
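
As a rough illustration of the kind of area allocation the abstract describes, the sketch below splits a fixed area budget among cores that each run one sequential task. It is not the paper's framework: it assumes every core's speedup grows as the square root of its area (Pollack's rule), ignores power, the NoC, the cache and off-chip bandwidth, and uses the hypothetical names `allocate_area` and `execution_time`. Under those assumptions, minimizing the sum of t_i / sqrt(a_i) subject to the areas summing to A gives, via a single Lagrange multiplier, a_i proportional to t_i^(2/3).

```python
def allocate_area(task_times, total_area):
    """Split total_area among cores to minimize sum(t_i / sqrt(a_i)).

    Assumption (not from the source): speedup of core i is sqrt(a_i).
    Setting the derivative of the Lagrangian to zero gives
    a_i proportional to t_i ** (2/3).
    """
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    if any(t <= 0 for t in task_times):
        raise ValueError("task times must be positive")
    weights = [t ** (2.0 / 3.0) for t in task_times]
    scale = total_area / sum(weights)
    return [w * scale for w in weights]


def execution_time(task_times, areas):
    """Total runtime when task i runs on a core of area areas[i]."""
    return sum(t / a ** 0.5 for t, a in zip(task_times, areas))


if __name__ == "__main__":
    times = [8.0, 1.0, 1.0]          # illustrative task runtimes on a unit-area core
    areas = allocate_area(times, total_area=100.0)
    uniform = [100.0 / len(times)] * len(times)
    print("optimal areas:", areas)
    print("time, optimal split:", execution_time(times, areas))
    print("time, uniform split:", execution_time(times, uniform))
```

The closed form works only because this toy objective is smooth and has a single constraint. With several constraints, or with cores that may be excluded entirely, as the abstract describes, a general convex solver would be needed instead.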