Sparse matrix-matrix multiplication (SpGEMM) is a key kernel with applications in several domains, such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high-performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance differences between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.

arXiv:1801.03065v1 [cs.DC] 9 Jan 2018

These architectures have very different characteristics. For example, traditional CPUs have powerful cores with large caches, while Xeon Phi processors have many lightweight cores, and GPUs provide extensive hierarchical parallelism with very simple computational units. The algorithms in this paper aim to minimize revisiting algorithmic design for these different architectures. The code divergence in the implementation is limited to the access strategies of different data structures and to how different levels of parallelism in the algorithm are mapped to computational units. An earlier version of this paper [13] focused on SpGEMM from the perspective of performance portability. It addressed this issue with an algorithm called kkmem, which demonstrated better performance on GPUs and on the current generation of Xeon Phi processors, Knights Landing (KNL), relative to
state-of-the-art libraries. Our contributions in [13] are summarized below.

• We design two thread-scalable data structures (multilevel hashmap accumulators and a memory pool) to achieve scalability on various platforms, and a graph compression technique to speed up the symbolic factorization of SpGEMM.
• We design hierarchical, thread-scalable SpGEMM algorithms and implement them using the Kokkos programming model. Our implementation is available at https://github.com/kokkos/kokkos-kernels and also in the Trilinos framework (https://github.com/trilinos/Trilinos).
• We also present results for the practical case of matrix-structure reuse, and demonstrate its importance for application performance.

This paper extends [13] with several new algorithm design choices and additional data structures. Its contributions are summarized below.

• We present results for the selection of kernel parameters, e.g., the partitioning scheme and the data structures, with trade-offs between memory access cost and computational overhead, and provide heuristics to choose the best parameters depending on the prob...