nanoVLM: Lightweight Vision-Language Model Training with PyTorch

  • nanoVLM is a PyTorch library for training Vision-Language Models (VLMs) from scratch, implemented in roughly 750 lines of code, with the training script itself taking only about 200 lines.
  • The library enables training of a 222M parameter VLM to achieve 35.3% on MMStar in just 6 hours on a single NVIDIA H100, matching the performance of SmolVLM-256M but with 100x fewer GPU hours.
  • The model architecture pairs a SigLIP ViT vision encoder with a LLaMA-style language decoder through a modality projection layer, drawing inspiration from nanoGPT for simplicity and readability (see the architecture sketch after this list).
  • According to additional sources, the repository targets training and fine-tuning of small VLMs with a focus on speed and simplicity; however, those sources do not specify the technical stack, installation steps, API, licensing, performance metrics, or a precise definition of "small-sized".
  • A key question raised in the reactions concerns the optimization strategies employed to match SmolVLM's performance with significantly fewer GPU hours, specifically whether architectural choices or training strategies were the primary drivers.
  • According to reactions, the project's ability to achieve strong results by training the entire model in a single stage challenges current best practices in the field (a second sketch after this list illustrates what single-stage training means here).
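
To make the architecture bullet concrete, here is a minimal, hypothetical PyTorch sketch of the three-part layout described above: a vision encoder, a modality projection layer, and a causal language decoder. Class and parameter names (`TinyVLM`, `vision_dim`, `text_dim`) are illustrative assumptions, not nanoVLM's actual API.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Hypothetical sketch of a vision encoder + projection + LLaMA-style decoder VLM.
    Component modules are passed in; dimensions are illustrative, not nanoVLM's."""

    def __init__(self, vision_encoder: nn.Module, decoder: nn.Module,
                 vision_dim: int = 768, text_dim: int = 576):
        super().__init__()
        self.vision_encoder = vision_encoder               # e.g. a SigLIP-style ViT
        self.projector = nn.Linear(vision_dim, text_dim)   # modality projection layer
        self.decoder = decoder                             # LLaMA-style causal decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the image into patch features, project them into the language
        # model's embedding space, then prepend them to the text embeddings so
        # the decoder attends over both modalities in a single sequence.
        image_feats = self.vision_encoder(pixel_values)    # (B, N_img, vision_dim)
        image_embeds = self.projector(image_feats)         # (B, N_img, text_dim)
        inputs = torch.cat([image_embeds, text_embeds], dim=1)
        return self.decoder(inputs)                        # (B, N_img + N_txt, vocab)
```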
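
The single-stage claim in the last bullet can be sketched in the same spirit: rather than first training only the projection with a frozen encoder and decoder and then unfreezing, all parameters are optimized together from the start. The loop below is an assumed illustration of that idea (loss computation, data format, and hyperparameters are placeholders), not nanoVLM's training script.

```python
import torch.nn.functional as F
from torch.optim import AdamW

def train_single_stage(model, dataloader, lr=1e-4, max_steps=1000):
    # One optimizer over *all* parameters: vision encoder, projector, and
    # decoder are updated jointly, with no frozen warm-up stage.
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for step, (pixel_values, text_embeds, labels) in enumerate(dataloader):
        logits = model(pixel_values, text_embeds)
        # Standard next-token cross-entropy over the flattened sequence.
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step + 1 >= max_steps:
            break
```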