The expedient design of precision components in aerospace and other high-tech industries requires simulations of physical phenomena that are often described by partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution that is difficult to achieve in reasonable amounts of time, even in effectively parallelized solvers. Though the scale of the problem relative to available computing power is the greatest impediment to accelerating these applications, significant performance gains can be achieved through careful attention to the details of memory communication and access. The swept time-space decomposition rule reduces communication between subdomains by exhausting the domain of influence before communicating boundary values. Here we present a GPU implementation of the swept rule, which modifies the algorithm for improved performance on this processing architecture by prioritizing the use of private (shared) memory, avoiding interblock communication, and overwriting unnecessary values. It shows significant improvement in the execution time of finite-difference solvers for one-dimensional unsteady PDEs, producing speedups of 2-9× across a range of problem sizes compared with simple GPU versions, and 7-300× compared with parallel CPU versions. However, for a more sophisticated one-dimensional system of equations discretized with a second-order finite-volume scheme, the swept rule performs 1.2-1.9× worse than a standard implementation for all problem sizes.

A motivating goal of this work is real-time execution, that is, simulation at the speed of nature, in accordance with the high-performance computing development goals set out in the CFD Vision 2030 report [1]. Classic approaches to domain decomposition for parallelized, explicit, time-stepping partial differential equation (PDE) solvers incur substantial computational performance costs from the communication between nodes required every timestep.
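The core idea of the swept rule, exhausting a subdomain's domain of influence before exchanging boundary values, can be illustrated with a minimal sketch. The following is a hypothetical illustration using a first-order explicit heat-equation stencil, not the paper's GPU kernel: each local substep shrinks the region of valid points by one on each side, so a block of width n can advance roughly n/2 timesteps with no neighbor communication (the "triangle" phase of the swept rule).

```python
import numpy as np

def swept_triangle(u, nu=0.25):
    """Advance a 1-D explicit heat-equation stencil multiple substeps
    on a local block WITHOUT exchanging boundary values. After `step`
    substeps, only points whose full domain of dependence lies inside
    the block remain valid, so the valid interior shrinks each level.
    Hypothetical sketch for illustration only."""
    u = u.copy()
    n = len(u)
    levels = [u.copy()]
    for step in range(1, n // 2):
        new = u.copy()
        # Only indices [step, n - step) are still computable from
        # block-local data after `step` substeps.
        for i in range(step, n - step):
            new[i] = u[i] + nu * (u[i - 1] - 2.0 * u[i] + u[i + 1])
        u = new
        levels.append(u.copy())
    return levels  # one array per locally computed time level

# Usage: a block of 16 points advances 7 substeps before any
# neighbor communication is needed.
block = np.linspace(0.0, 1.0, 16)
history = swept_triangle(block)
print(len(history))
```

Only after this triangle is exhausted are the accumulated edge values communicated, amortizing one exchange over many substeps instead of one exchange per substep.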
This communication cost consists of two parts: latency and bandwidth, where latency is the fixed cost of each communication event and bandwidth is the variable cost that depends on the amount of data transferred. Latency in inter-node communication is a fundamental barrier to this goal, and advancements in network latency have historically been slower than improvements in other computing performance barriers such as bandwidth and computational power [2]. Performance may be improved by avoiding external node communication until exhausting the domain of dependence, allowing the calculation to advance multiple timesteps while requiring a smaller number of communication events. This idea is the basis of swept time-space decomposition [3,4].

Extreme-scale computing clusters have recently been used to solve the compressible Navier-Stokes equations on over 1.97 million CPU cores [5]. The monetary cost, power consumption, and size of such a cluster impede the realization of the widespread peta- and exascale computing required for real-time, high-fidelity CFD simulations. While these are significant challenges, they also pr...
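The latency/bandwidth split described above is commonly captured by the "alpha-beta" cost model, T = α + n/β, where α is the per-message latency and β the bandwidth. A small sketch, with illustrative (not measured) parameter values, shows why batching many timesteps into one exchange pays off when latency dominates:

```python
def comm_time(n_bytes, latency_s=1e-6, bandwidth_Bps=10e9):
    """Latency-bandwidth ('alpha-beta') model: T = alpha + n / beta.
    Parameter values are illustrative assumptions, not measurements."""
    return latency_s + n_bytes / bandwidth_Bps

# Classic decomposition: exchange one 8-byte boundary value every
# timestep, for 1000 timesteps.
classic = 1000 * comm_time(8)

# Swept-style decomposition: advance ~100 timesteps per exchange,
# sending the accumulated boundary data (100 x 8 bytes) each time.
swept = 10 * comm_time(100 * 8)

print(classic / swept)  # communication-time reduction factor
```

With these assumed numbers the swept schedule pays the fixed latency cost 10 times instead of 1000, so total communication time drops by well over an order of magnitude even though the same amount of data crosses the network.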