Abstract. In this chapter we overview methods that represent video sequences in terms of their content. These methods differ from those developed for the MPEG/H.26x coding standards in that sequences are described in terms of extended images rather than collections of frames. We describe how these extended images, e.g., mosaics, are generated by basically the same principle: the incremental composition of photometric, geometric, and multi-view visual information into one or more extended images. Different outputs, ranging from single 2-D mosaics to full 3-D mosaics, are obtained depending on the quality and quantity of the photometric, geometric, and multi-view information. In particular, we detail a framework well suited to the representation of scenes with independently moving objects. We address the following two important cases: i) the moving objects can be represented by 2-D silhouettes (the generative video approach); or ii) the camera motion is such that the moving objects must be described by their 3-D shape (recovered through rank 1 surface-based factorization). A basic pre-processing step in content-based image sequence representation is to extract and track the relevant background and foreground objects. This is achieved by 2-D shape segmentation, for which there is a wealth of methods and approaches. The chapter includes a brief description of active contour methods for image segmentation.