Late-stage (post-RTL implementation) optimization is important for achieving target performance in realistic processor designs. However, several challenges remain for modern out-of-order ILP (instruction-level parallelism) processors, such as simulation speed, flexibility, and complexity. This paper revisits FPGA simulation as an effective performance simulation method and proposes an FPGA-enhanced design flow to address these problems. The flow features a late-stage-aware RTL design that parameterizes the potential design options arising from early-stage optimization, which makes late-stage design space exploration feasible. To address the performance accuracy of the FPGA system for peripheral designs, reference models are introduced. With an example implementation of an out-of-order core running at up to 80 MHz, the experimental results show that the proposed method is practical and makes fine-grained optimization of the processor core more effective.
Today's desktop rendering platforms typically use GPUs, which have become the most powerful computing chips for meeting growing visual demands, especially in ray tracing. However, ray tracing is challenging on mobile platforms because mobile GPUs must cope with limited computing power, hardware resources, and memory bandwidth. This paper presents a novel architecture for the mobile domain called Mobile Multiple stacks Ray Tracing (MMsRT). The most complicated calculations in ray tracing are completed through a lightweight embedded design. MMsRT has three key features. First, it uses multiple stacks so that multiple rays can be processed in parallel. Second, it provides a stack cache to hold stack data when the storage space of the multiple stacks is insufficient. Third, it adopts a data prefetching mechanism for its caches to improve the cache hit rate and performance. Tests on an accurate simulator show that the design can be applied to mobile devices. The calculated performance is about 82.9 million rays per second (MRPS) with a chip area of about 0.856 mm², which gives 96.85 MRPS/mm².
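The interplay between per-ray stacks and a stack cache can be illustrated with a small software model. This is only a sketch under stated assumptions, not the MMsRT hardware design: the class name `SpillingStack`, the `capacity` parameter, and the policy of spilling the oldest entry are all hypothetical choices made for illustration, and the backing store is modeled as unbounded.

```python
class SpillingStack:
    """Toy model of one per-ray traversal stack with a spill area.

    Assumption: `fast` stands in for a small on-chip stack and `spill`
    for the stack cache; the real sizes and spill policy are not given
    in the abstract.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.fast = []   # limited on-chip entries (top of stack at the end)
        self.spill = []  # overflow entries held in the stack cache

    def push(self, node):
        if len(self.fast) == self.capacity:
            # On-chip storage is full: move the oldest entry to the cache.
            self.spill.append(self.fast.pop(0))
        self.fast.append(node)

    def pop(self):
        if not self.fast and self.spill:
            # On-chip storage is empty: refill from the cache.
            self.fast.append(self.spill.pop())
        return self.fast.pop() if self.fast else None


# One independent stack per in-flight ray (four rays, two on-chip entries each).
stacks = [SpillingStack(capacity=2) for _ in range(4)]
for node in (10, 11, 12, 13):
    stacks[0].push(node)
popped = [stacks[0].pop() for _ in range(5)]
```

In this model the last-in, first-out order is preserved across spills, so a traversal sees the same node order it would with an unbounded stack; only the cost of a push or pop changes when the cache is involved.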
The realism of the rendering that ray tracing produces is becoming increasingly apparent in computer vision and industrial applications. However, designing efficient ray tracing hardware is challenging due to memory access issues, divergent branches, and daunting computational intensity. This article presents a novel architecture, the RT engine (Ray Tracing engine), that accelerates ray tracing. First, we set up multiple stacks to store information for each ray so that the RT engine can process many rays in parallel. The information in these stacks can effectively improve the performance of the system. Second, we use a three-phase break method during the triangle intersection test, which lets the loop exit earlier. Third, the reciprocal unit adopts an approximation method that combines Parabolic Synthesis and second-degree interpolation. Combining these strategies, we implement our system at the RTL level with agile chip development. Simulation and experimental results show that our architecture achieves a performance per area that is 2.4× greater than the best reported results for ray tracing on dedicated hardware.
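The idea of leaving the triangle intersection test early can be sketched in software. The abstract does not define the three phases, so the example below swaps in the standard Möller-Trumbore ray-triangle test, which rejects a ray as soon as one barycentric check fails. The function name `ray_triangle` and the `eps` tolerance are assumptions for illustration, not the paper's implementation.

```python
def ray_triangle(orig, direction, v0, v1, v2, eps=1e-9):
    """Möller-Trumbore test: return hit distance t, or None on a miss."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None              # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    s = sub(orig, v0)

    u = dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None              # early exit: first barycentric check fails

    q = cross(s, e1)
    v = dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None              # early exit: second barycentric check fails

    t = dot(e2, q) * inv_det
    return t if t > eps else None  # reject hits behind the ray origin


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = ray_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri)
miss = ray_triangle((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri)
```

Each early return skips the remaining cross and dot products, which is the kind of saving an early-break scheme aims for when most candidate triangles are misses.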