<p>With the growing number of vehicles in modern cities, camera-based traffic surveillance has become an important application. Cities have installed thousands of roadside cameras that send video feeds to a cloud center, where computer vision algorithms are run on them. This requires high bandwidth. Current techniques reduce the bandwidth requirement either by sending a limited number of frames, pixels, or regions, or by re-encoding the important parts of the video. Doing so requires running DNNs to extract the important portions of a frame so that they can be sent again from the camera to the server at a higher resolution. This imposes significant overhead on camera-side compute, as re-encoding is known to be expensive, and makes the system less real-time. In this work, we propose VISTA, a system that uses tile sampling, in which only a limited number of rectangular areas within the frames, known as tiles, are sent to the server. We further propose an adaptive tile sampling algorithm that estimates the presence of moving objects by comparing statistics of the tiles' bitrates (in kbps) and then retains only the necessary tiles, eliminating the need for a DNN on the camera side. We evaluate VISTA on different datasets comprising 56 videos in total and show that, on average, our technique reduces the total amount of data sent to the cloud by $17$-$40\%$ while providing a detection accuracy of over $85\%$. VISTA also runs in real time even on cheap edge devices such as the Raspberry Pi and NVIDIA Jetson Nano, and it requires minimal calibration compared to prior work.</p>
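<p>To make the idea of bitrate-based tile selection concrete, the following is a minimal sketch and not the VISTA algorithm itself. The abstract only states that tile bitrate statistics are compared, so the specific rule used here (a tile is retained when its current bitrate exceeds its historical median by a multiple of the median absolute deviation) is an assumption made for illustration. The function name <code>select_tiles</code> and the parameters <code>k</code> and <code>min_spread</code> are likewise hypothetical.</p>

```python
from statistics import median


def select_tiles(history, current, k=3.0, min_spread=1.0):
    """Return the indices of tiles whose current bitrate looks unusually high.

    history    -- one list of past bitrates (kbps) per tile
    current    -- the latest bitrate (kbps) of each tile
    k          -- hypothetical sensitivity factor (assumption, not from the paper)
    min_spread -- floor on the spread so that perfectly static tiles
                  do not trigger on tiny fluctuations (assumption)
    """
    keep = []
    for idx, (past, now) in enumerate(zip(history, current)):
        baseline = median(past)
        # Median absolute deviation: a robust estimate of normal variation.
        spread = median([abs(b - baseline) for b in past])
        threshold = baseline + k * max(spread, min_spread)
        # A bitrate well above the tile's own baseline suggests motion.
        if now > threshold:
            keep.append(idx)
    return keep


if __name__ == "__main__":
    history = [[10, 11, 9, 10], [50, 52, 49, 51], [5, 5, 6, 5]]
    current = [10, 120, 30]
    print(select_tiles(history, current))
```

<p>Under these assumptions, each tile is compared only against its own past, so a tile covering a busy background (with a high but stable bitrate) is not retained unless its bitrate rises noticeably. No neural network is involved, which mirrors the motivation stated above.</p>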