Single-Instruction-Multiple-Data (SIMD) extensions like Intel's AVX2 offer a great potential to accelerate elliptic curve cryptography compared to a straightforward implementation using only base x64 instructions. All existing AVX2 implementations of scalar multiplication on Curve25519 and alternative elliptic curves are optimized for low latency. We argue in this paper that many applications, most notably server-side TLS handshake processing, would benefit more from throughput-optimized implementations than latency-optimized ones. To support this argument we introduce throughput-optimized AVX2 implementations of variable-base scalar multiplication on Curve25519 and fixed-base scalar multiplication on Ed25519. Both implementations perform four scalar multiplications in parallel, whereby each scalar multiplication uses a 64-bit element of a 256-bit AVX2 vector. The field arithmetic is based on a radix-2 29 representation of the field elements, which makes it possible to execute four parallel multiplications modulo a multiple of p = 2 255 − 19 in just 88 Skylake cycles. Four variable-base scalar multiplications on Curve25519 require less than 250,000 Skylake cycles, which translates into a throughput of 32,318 scalar multiplications per second at a clock frequency of 2 GHz. For comparison, the currently best latency-optimized AVX2 implementation reaches a throughput of only about 21,000 scalar multiplications per second on the same Skylake processor.