This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions, and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\widetilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S, A, B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$, leaving a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\widetilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\widetilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. Towards understanding learning objectives in Markov games other than finding the Nash equilibrium, we also present a computational hardness result for learning the best response against a fixed opponent; this in turn implies the computational hardness of achieving sublinear regret when playing against adversarial opponents.

A best response $\nu^\dagger(\mu)$ to a max-player policy $\mu$ satisfies $V^{\mu,\nu^\dagger(\mu)}_1(s_1) = \inf_\nu V^{\mu,\nu}_1(s_1)$ at step 1. We remark that the best response of a general policy is not necessarily Markovian.
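To make the best-response definition concrete, the following is a minimal sketch (not from the paper) for the one-step special case, a zero-sum matrix game: given the max-player's mixed strategy $\mu$ and a payoff matrix $G$ of payoffs to the max player, some pure min-player action attains the infimum, so $V^{\mu,\nu^\dagger(\mu)} = \min_b (\mu^\top G)_b$. The function name `best_response_value` and the example matrices are illustrative assumptions, not artifacts of the paper.

```python
import numpy as np

def best_response_value(G: np.ndarray, mu: np.ndarray) -> tuple[int, float]:
    """Value of the min player's best response to a fixed max-player strategy.

    G:  (A, B) payoff matrix; G[a, b] is the payoff to the max player.
    mu: length-A mixed strategy of the max player (nonnegative, sums to 1).

    Returns the minimizing column b* and the value min_b (mu^T G)_b; in a
    matrix game a pure action always attains the infimum over mixed nu.
    """
    expected = mu @ G          # expected payoff for each min-player action
    b_star = int(np.argmin(expected))
    return b_star, float(expected[b_star])

# Example: matching pennies. Against the uniform strategy every min-player
# response yields value 0, which is also the Nash value of this game; a
# biased strategy is exploitable by the best response.
G = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
print(best_response_value(G, np.array([0.5, 0.5])))  # (0, 0.0)
print(best_response_value(G, np.array([0.8, 0.2])))  # (1, -0.6): exploitable
```

In the episodic Markov game this computation would be applied backwards over the $H$ steps rather than to a single matrix, and, as remarked above, the best response to a non-Markov policy need not itself be Markovian.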