Matryoshka ML: Embeddings, Transformers, and Quantization
- The "Learning Together Series" on YouTube covers three papers, Matryoshka Representation Learning, MatFormer, and Matryoshka Quantization, starting with the fundamentals of embeddings, transformers, and quantization.
- **MatFormer: Nested Transformer for Elastic Inference**: Introduces a Transformer architecture with nested Feed Forward Network (FFN) blocks. Multiple nested FFN sub-blocks of varying sizes are optimized jointly during training, enabling elastic inference: smaller models can be extracted without retraining (see the nested-FFN sketch after this list). Validated across model classes (decoders, encoders) and modalities (language, vision).
- **MatFormer Key Results**: Smaller models extracted from an 850M parameter decoder-only model (MatLM) outperform independently trained counterparts. Smaller encoders extracted from a MatFormer-based ViT (MatViT) preserve metric-space structure for adaptive large-scale retrieval. Speculative decoding with MatFormer submodels reduces inference latency.
- **Matryoshka Quantization (MatQuant)**: A multi-scale quantization technique that trains a single quantized model servable at several precisions by jointly optimizing the quantization loss for multiple target bit-widths. An _r_-bit model is extracted from a _c_-bit model by slicing out the _r_ most significant bits (MSBs); see the bit-slicing sketch after this list.
- **Matryoshka Quantization Results**: MatQuant's int2 models outperform standard int2 quantization by 4% (OmniQuant) and 7% (QAT). Adding an extra bit for outliers improves performance by 6% (OmniQuant).
- **Matryoshka Representation Learning (MRL)**: Encodes information at different granularities, allowing a single embedding to adapt to the computational constraints of downstream tasks (see the MRL objective sketch after this list). Achieves up to 14x smaller embedding size for ImageNet-1K classification at the same accuracy, and up to 14x real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K.
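
The nested-FFN idea behind MatFormer can be illustrated with a minimal PyTorch sketch. The class name, sizes, granularities, and ReLU activation below are illustrative assumptions, not the paper's implementation; the point is that each smaller FFN reuses a prefix of the largest FFN's hidden dimension, so one set of weights serves several model sizes.

```python
# Minimal sketch of a MatFormer-style nested FFN block (illustrative, not the
# paper's code). Smaller FFNs use only the first m hidden units of the shared
# projections, so a sub-model is a slice of the full model's weights.
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, granularities=(256, 512, 1024, 2048)):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)    # shared "up" projection
        self.w_out = nn.Linear(d_ff, d_model)   # shared "down" projection
        self.granularities = granularities      # nested hidden sizes (assumed values)

    def forward(self, x, m=None):
        # m selects the nested sub-block: only the first m hidden units are used.
        m = m or self.granularities[-1]
        h = torch.relu(x @ self.w_in.weight[:m].T + self.w_in.bias[:m])
        return h @ self.w_out.weight[:, :m].T + self.w_out.bias

# Usage: the same module acts as a small or large FFN depending on m.
ffn = NestedFFN()
x = torch.randn(4, 512)
y_small, y_full = ffn(x, m=256), ffn(x)
```

During training, the granularity m would be sampled per step (or the losses at all granularities combined) so that every nested sub-block is optimized jointly; at deployment, a fixed prefix can be extracted as a standalone smaller FFN without retraining.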
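The MSB-slicing step in MatQuant can be sketched as plain integer bit arithmetic. The code below is an illustrative assumption (unsigned integer codes, dequantization with the corresponding scale omitted), not the paper's implementation: an _r_-bit code is obtained from a _c_-bit code by a right shift of (c - r) bits.

```python
# Minimal sketch of MatQuant-style MSB slicing (illustrative).
import numpy as np

def slice_msb(codes: np.ndarray, c: int, r: int) -> np.ndarray:
    """Extract r-bit codes from unsigned c-bit codes by keeping the r MSBs."""
    assert 1 <= r <= c
    return codes >> (c - r)

# Example: int8 codes sliced down to int4 and int2.
codes_int8 = np.array([255, 170, 96, 7], dtype=np.uint8)
print(slice_msb(codes_int8, c=8, r=4))   # -> [15 10  6  0]
print(slice_msb(codes_int8, c=8, r=2))   # -> [3 2 1 0]
```

Because MatQuant optimizes the quantization loss at several target bit-widths jointly, the sliced lower-precision codes remain accurate rather than being an afterthought of the full-precision quantization.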
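Similarly, the MRL objective can be sketched as supervising one embedding at several prefix lengths. The class name, granularities, and per-prefix linear classifiers below are assumptions loosely based on the paper's simple variant, not its exact code.

```python
# Minimal sketch of a Matryoshka Representation Learning objective (illustrative):
# one d-dimensional embedding is supervised at several prefix lengths, so a
# truncated prefix remains a usable embedding at lower cost.
import torch
import torch.nn as nn

class MRLHead(nn.Module):
    def __init__(self, d=2048, num_classes=1000, granularities=(64, 256, 1024, 2048)):
        super().__init__()
        self.granularities = granularities
        # One classifier per nesting dimension (the paper also describes a
        # weight-tied efficient variant; this is the simple version).
        self.heads = nn.ModuleList([nn.Linear(m, num_classes) for m in granularities])

    def loss(self, z, labels):
        # z: (batch, d) embedding from any backbone; labels: (batch,)
        ce = nn.CrossEntropyLoss()
        return sum(ce(head(z[:, :m]), labels)
                   for m, head in zip(self.granularities, self.heads))
```

At inference or retrieval time only the first m dimensions of the stored embedding need to be read, which is what enables the smaller-embedding and retrieval speed-up results cited above.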
Source: