Vision Transformers (ViTs) bring the Transformer architecture, originally designed for natural language processing, to computer vision.