Common approaches to vision tasks such as character and object recognition rely on Convolutional Neural Networks (CNNs) owing to their practical effectiveness on images and their theoretical grounding. In this work, we take a different perspective on the task of Baybayin script recognition by exploring Vision Transformers (ViTs), a newer paradigm for image processing inspired by the Transformer model. We compare the performance of CNNs and ViTs and analyze model confidence on a set of test images using Local Interpretable Model-Agnostic Explanations (LIME). Results show that convolution-based architectures (CNNs) still outperform the sequence-based method (ViT) at discriminating Baybayin characters, achieving 84.5% accuracy compared to 48.8% for the ViT, nearly double.
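For concreteness, the following is a minimal sketch of the kind of LIME analysis described above, using the public `lime` Python package. It assumes a trained Keras-style classifier `model` and a test image `img` as an H x W x 3 NumPy array scaled to [0, 1]; these names and the specific parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def classifier_fn(batch: np.ndarray) -> np.ndarray:
    """Map a batch of perturbed images to class probabilities.

    `model` is a hypothetical trained classifier whose predict()
    returns an array of shape (n_samples, n_classes).
    """
    return model.predict(batch)

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    img.astype("double"),   # image to explain
    classifier_fn,          # black-box prediction function
    top_labels=1,           # explain only the top predicted class
    hide_color=0,           # value used to mask hidden superpixels
    num_samples=1000,       # number of perturbed samples
)

# Overlay the superpixels that most support the top prediction.
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0],
    positive_only=True,
    num_features=5,
    hide_rest=False,
)
overlay = mark_boundaries(temp, mask)
```

Inspecting which superpixels LIME highlights for correct versus incorrect predictions gives a model-agnostic view of what image regions each architecture attends to when classifying a character.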