Francis B. Moreira scite author profile

Nowadays, several different architectures are available not only to industry but also to ordinary consumers. Traditional multicore processors, GPUs, accelerators such as the Sunway SW26010, and even energy-efficiency-driven processors such as the ARM family present very different architectural characteristics. This wide range of characteristics poses a challenge for application developers, who must deal with different instruction sets, memory hierarchies, and even programming paradigms when targeting these architectures. As a result, the same application can perform well on one architecture but poorly on another. To optimize an application, it is important to have a deep understanding of how it behaves on different architectures. Related work in this area mostly focuses on a limited analysis encompassing execution time and energy. In this paper, we perform a detailed investigation of the impact of the memory subsystem of different architectures, which is one of the most important aspects to consider. For this study, we performed experiments on a Broadwell CPU and a Pascal GPU, using applications from the Rodinia benchmark suite. In this way, we were able to understand why an application performs well on one architecture and poorly on others.


Saving memory movements through vector processing in the DRAM

Alves, Santos, Moreira et al. 2015


Machine Learning Migration for Efficient Near-Data Processing

Cordeiro, Santos, Moreira et al. 2021


Survey on Near-Data Processing: Applications and Architectures

Santos, Moreira, Cordeiro et al. 2021. JICS


One of the main challenges for modern processors is data transfer between the processor and memory. Such data movement incurs high latency and high energy consumption. In this context, Near-Data Processing (NDP) proposals have started to gain acceptance as accelerator devices. Such proposals alleviate the memory bottleneck by moving computation to where the data resides. The first proposals date back to the 1990s, but it was only in the 2010s that the number of papers addressing NDP increased. This increase coincided with the appearance of 3D-stacked chips with stacked logic and memory layers. This survey presents a brief history of these accelerators, focusing on the application domains migrated to near-data processing and on the proposed architectures. We also introduce a new taxonomy that classifies such architectural proposals according to their data distance.


Vector In Memory Architecture for simple and high efficiency computing

Alves, Santos, Cordeiro et al. 2022. Preprint


Investigating memory prefetcher performance over parallel applications: From real to simulated

Girelli, Moreira, Serpa et al. 2021. Concurrency and Computation


Memory prefetcher algorithms are widely used in processors to mitigate the performance gap between the processor and the memory subsystem. The complexity of the architectures and prefetcher algorithms, however, hinders not only the development of accurate architecture simulators but also the understanding of the prefetcher's contribution to performance, both on real hardware and in a simulated environment. In this paper, we shed light on the memory prefetcher's role in the performance of parallel High‐Performance Computing applications, considering the prefetcher algorithms offered by both the real hardware and the simulators. We performed a careful experimental investigation, executing the NAS Parallel Benchmarks (NPB) on a real Skylake machine as well as in a simulated environment with the ZSim and Sniper simulators, taking into account the prefetcher algorithms offered by both Skylake and the simulators. Our experimental results show that: (i) prefetching from the L3 to the L2 cache presents better performance gains, (ii) memory contention in parallel execution constrains the prefetcher's effect, (iii) Skylake's parallel memory contention is poorly simulated by ZSim and Sniper, and (iv) Skylake's noninclusive L3 cache hinders the accurate simulation of NPB with Sniper's prefetchers.


