Many disciplines need to solve block tridiagonal systems with hundreds or thousands of right-hand sides for the same block tridiagonal matrix. The Accelerated Recursive Doubling Algorithm was developed to meet this need: after a phase that is independent of the right-hand side, the algorithm allows solutions for different right-hand sides to be computed quickly and online. In this work, we present methods to optimize the memory usage and computation time of the Accelerated Recursive Doubling Algorithm in a hybrid parallelization model. The right-hand side independent phase of the naïve implementation requires ≥ 11/3 times the memory needed to store the block tridiagonal matrix, while our implementation reduces this factor to ≈ 5/3. The right-hand side dependent phase of the naïve implementation requires ≥ 6 times the memory needed to store the right-hand side, while our implementation reduces this factor to ≈ 3. The computation time of the independent phase is reduced to ≈ 2/3 of that of the naïve implementation, and the computation time of the dependent phase is reduced to ≈ 5/9. With an increasing number q of shared-memory threads on every distributed processing element, the theoretical speedup is O(q).