In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens over a longer temporal horizon for videos, or over the spatial content for images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, because our tokens are adaptive, we achieve competitive results at a significantly reduced compute cost.

Recent advancements in image understanding demonstrate improved accuracy on vision classification tasks. For example, departing from standard convolutional approaches, the Vision Transformer (ViT) [9] treats the image as a sequence of patches, utilizing the Transformer architecture [38] similarly to text understanding.

Standard approaches for video recognition take videos as stacked images (i.e., a space-time volume) and tend to extend 2D neural architectures to 3D (e.g., [5, 37, 11]). In parallel to the Vision Transformer for images, some approaches [2, 3] proposed to create 3D 'cubelet' video tokens on a regular 3D grid, which are further processed by a Transformer, resulting in computationally heavy models. There are too many tokens to process, especially for longer videos.

The main question addressed in this work is how to adaptively learn the representation from visual inputs to most effectively capture the spatial information for images and the spatio-temporal interactions for videos. Here are our main ideas: