The dramatic proliferation of visual displays, from cell phones, through video iPods, PDAs, and notebooks, to high-quality HDTV screens, has increased the demand for a video compression scheme capable of decoding a "once-encoded" video at a range of supported video resolutions and with high quality. A promising solution to this problem has recently been proposed in the form of wavelet video coding based on motion-compensated temporal filtering (MCTF); scalability is naturally supported, while efficiency is comparable to that of state-of-the-art hybrid coders. However, although rate (quality) and temporal scalability are natural in mainstream "t+2D" wavelet video coders, spatial scalability suffers from drift problems. In light of the recently proposed "2D+t+2D" modification, which targets spatial scalability performance, we present a framework for the modeling of spatially scalable motion that is well matched to this new structure. We propose a motion estimation scheme in which motion fields at different spatial scales are jointly estimated and coded. In addition, at lower spatial resolutions, we extend the block-wise constant motion model to a higher-order model based on cubic splines, effectively creating a "mixture motion model" that combines different models at different supported spatial scales. This advanced spatial modeling of motion significantly improves the coding efficiency of motion at low resolutions and leads to excellent overall compression performance; the spatial scalability performance of the proposed scheme approaches that of a non-scalable coder.