Most of today's high-speed switches and routers adopt an input-queued crossbar switch architecture. Such a switch needs to compute a matching (crossbar schedule) between the input ports and output ports during each switching cycle (time slot). A key research challenge in designing large (in the number of input/output ports N) input-queued crossbar switches is to develop crossbar scheduling algorithms that can compute, at line rates, "high quality" matchings, i.e., those that result in high switch throughput (ideally 100%) and low queueing delays for packets. SERENA is one such algorithm: it outputs excellent matching decisions that result in 100% switch throughput and reasonably good queueing delays. However, since SERENA is a centralized algorithm with O(N) computational complexity, it cannot support switches that are large and also have a very high line rate per port. In this work, we propose SERENADE (SERENA, the Distributed Edition), a parallel iterative algorithm that emulates SERENA in only O(log N) iterations between input ports and output ports, and hence has a time complexity of only O(log N) per port. We prove that SERENADE can exactly emulate SERENA. We also propose an early-stop version of SERENADE, called O-SERENADE, that only approximately emulates SERENA. Through extensive simulations, we show that O-SERENADE can achieve 100% throughput and that its delay performance is similar to or slightly better than that of SERENA under various load conditions and traffic patterns.
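To make the notion of a crossbar matching concrete, the following minimal Python sketch checks that a schedule pairs each input port with at most one output port (and vice versa) and computes a weight for it. This is an illustration only, not SERENA or SERENADE. The helper names `is_valid_matching` and `matching_weight`, the `voq_len` table of per-(input, output) queue lengths, and the use of queue length as the weight are all assumptions made for this example.

```python
from typing import Dict, List


def is_valid_matching(matching: Dict[int, int], n: int) -> bool:
    """Return True if `matching` (input port -> output port) is a valid
    crossbar schedule for an n x n switch: all ports are in range and no
    output port is used by more than one input port."""
    outputs = list(matching.values())
    return (
        all(0 <= i < n for i in matching)
        and all(0 <= o < n for o in outputs)
        and len(set(outputs)) == len(outputs)
    )


def matching_weight(matching: Dict[int, int], voq_len: List[List[int]]) -> int:
    """Sum the (assumed) queue lengths of the input/output pairs served
    by the matching. voq_len[i][o] is the number of packets queued at
    input i for output o."""
    return sum(voq_len[i][o] for i, o in matching.items())


# Hypothetical 3x3 switch: voq_len[i][o] = packets waiting at input i for output o.
voq_len = [
    [3, 0, 1],
    [0, 2, 0],
    [4, 0, 5],
]

schedule_a = {0: 0, 1: 1, 2: 2}  # serves queues of length 3, 2, 5
schedule_b = {0: 2, 1: 1, 2: 0}  # serves queues of length 1, 2, 4

print(is_valid_matching(schedule_a, 3), matching_weight(schedule_a, voq_len))
print(is_valid_matching(schedule_b, 3), matching_weight(schedule_b, voq_len))
```

Under this assumed weight, `schedule_a` serves more queued packets than `schedule_b` in the time slot, which gives one informal reading of a "high quality" matching.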