This paper builds on previous work on local interest point detection and description and proposes the extraction and representation of novel Local Invariant Feature Tracks (LIFT). These features compactly capture not only the spatial attributes of 2D local regions, as SIFT and related techniques do, but also the long-term trajectories of these regions over time. This and other desirable properties of LIFT enable the generation of Bags-of-Spatiotemporal-Words models that capture the dynamics of video content, which is necessary for detecting high-level video features that by definition have a strong temporal dimension. A preliminary experimental evaluation and comparison of the proposed approach yields promising results.