
LLM Inference Optimization Techniques

  • Multi-Query Attention reduces KV cache size by a factor of 8 by sharing a single key/value head across all query heads, addressing a major inference bottleneck (see the cache-size sketch after this list).
  • Hybrid Attention Horizons interleaves local (sliding-window) attention layers with global attention layers, bringing most layers to O(n) complexity instead of full O(n^2) attention without significantly impacting evaluation metrics (masking sketch below).
  • Cross-Layer KV-Sharing reduces KV cache size by a further 2-3x by sharing the KV cache across neighboring attention layers (grouping sketch below).
  • Stateful Caching, a custom CharacterAI implementation based on RadixAttention, organizes cached KV tensors in an LRU cache with a tree structure so that previously computed prefixes can be reused (prefix-cache sketch below).
  • Quantization uses int8 CUDA kernels for matrix multiplications and attention; CharacterAI trains its models natively in int8 precision rather than quantizing after training (int8 matmul sketch below).
  • According to the original poster, these optimizations are crucial for companies like CharacterAI to handle more than 20,000 LLM queries per second and enable more widespread LLM adoption.
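
To make the Multi-Query Attention saving concrete, here is a minimal cache-size sketch. The dimensions (32 layers, 8 KV heads in the baseline vs. 1 shared KV head, 4096-token context, int8 storage) are illustrative assumptions, not figures from the source.

```python
# Minimal sketch: KV-cache size with per-head keys/values vs. a single shared
# KV head (multi-query attention). All dimensions are illustrative assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, seq_len, head_dim, batch, bytes_per_elem):
    # 2x accounts for storing both keys and values
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * batch * bytes_per_elem

baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, seq_len=4096,
                          head_dim=128, batch=1, bytes_per_elem=1)  # int8
mqa      = kv_cache_bytes(n_layers=32, n_kv_heads=1, seq_len=4096,
                          head_dim=128, batch=1, bytes_per_elem=1)

print(f"baseline: {baseline / 2**20:.0f} MiB, MQA: {mqa / 2**20:.0f} MiB, "
      f"reduction: {baseline / mqa:.0f}x")   # -> 8x under these assumptions
```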
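
The masking sketch below shows how a hybrid local/global schedule cuts the number of attention scores. The window size (256), layer count, and "1 in 6 layers is global" layout are hypothetical; the source does not give the actual schedule.

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Causal mask; with `window` set, each token attends only to the most
    recent `window` positions (local / sliding-window attention)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                        # causal: no attending to the future
    if window is not None:
        mask &= (i - j) < window         # local attention horizon
    return mask

# Hypothetical schedule: 1 in 6 layers uses full global attention, the rest
# use a fixed local window, so most layers scale as O(n * window) ~ O(n).
seq_len, window, n_layers, global_every = 2048, 256, 24, 6
local_scores = int(attention_mask(seq_len, window).sum())
global_scores = int(attention_mask(seq_len).sum())

total = sum(global_scores if layer % global_every == 0 else local_scores
            for layer in range(n_layers))
full = n_layers * global_scores
print(f"hybrid: {total:,} scores vs. full attention: {full:,} "
      f"({full / total:.1f}x fewer)")
```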
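
The grouping sketch below illustrates cross-layer KV sharing. The group size of 3 neighboring layers per cache slot is an illustrative assumption chosen to show a 3x reduction; the actual grouping is not specified in the summary.

```python
# Minimal sketch: neighboring layers share one KV-cache slot.
# With a group size of 3, 24 layers need only 8 cache entries (a 3x reduction).

GROUP_SIZE = 3          # illustrative assumption
N_LAYERS = 24

def kv_slot(layer_idx):
    # layers 0-2 share slot 0, layers 3-5 share slot 1, ...
    return layer_idx // GROUP_SIZE

kv_cache = {}           # slot index -> (keys, values), filled during decoding

for layer in range(N_LAYERS):
    slot = kv_slot(layer)
    if slot not in kv_cache:
        kv_cache[slot] = (f"K for slot {slot}", f"V for slot {slot}")
    k, v = kv_cache[slot]   # layers in the same group read the same tensors

print(f"{N_LAYERS} layers, {len(kv_cache)} KV entries "
      f"({N_LAYERS / len(kv_cache):.0f}x smaller cache)")
```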
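
The prefix-cache sketch below shows the general idea behind a RadixAttention-style tree of cached prefixes with LRU eviction. The node layout, 4-token chunking, and eviction scan are illustrative assumptions, not CharacterAI's implementation; the `kv` field stands in for real KV tensors.

```python
import time

CHUNK = 4   # tokens per tree node; illustrative assumption

class Node:
    def __init__(self):
        self.children = {}            # token chunk (tuple) -> Node
        self.kv = None                # placeholder for the cached KV tensors
        self.last_used = 0.0

class PrefixCache:
    """Cached prefixes organized as a tree, evicted in LRU order."""
    def __init__(self, max_nodes=1024):
        self.root = Node()
        self.max_nodes = max_nodes
        self.count = 0

    def _chunks(self, tokens):
        return [tuple(tokens[i:i + CHUNK]) for i in range(0, len(tokens), CHUNK)]

    def lookup(self, tokens):
        """Walk the tree; return the deepest cached node and tokens matched."""
        node, matched = self.root, 0
        for chunk in self._chunks(tokens):
            child = node.children.get(chunk)
            if child is None:
                break
            child.last_used = time.monotonic()
            node, matched = child, matched + len(chunk)
        return node, matched

    def insert(self, tokens):
        """Attach KV for any uncached suffix, evicting LRU leaves if full."""
        node, matched = self.lookup(tokens)
        for chunk in self._chunks(tokens[matched:]):
            if self.count >= self.max_nodes:
                self._evict_lru_leaf()
            child = Node()
            child.kv = f"KV for {chunk}"
            child.last_used = time.monotonic()
            node.children[chunk] = child
            node = child
            self.count += 1

    def _evict_lru_leaf(self):
        # Remove the least-recently-used leaf (simple O(n) scan for brevity).
        stack, best = [(self.root, None, None)], None
        while stack:
            node, parent, key = stack.pop()
            if not node.children and parent is not None:
                if best is None or node.last_used < best[0].last_used:
                    best = (node, parent, key)
            for k, c in node.children.items():
                stack.append((c, node, k))
        if best is not None:
            del best[1].children[best[2]]
            self.count -= 1

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8])                 # first turn: nothing reusable
_, hit = cache.lookup([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # next turn shares a prefix
print(f"{hit} of 10 prompt tokens already cached")       # -> 8
```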
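
Finally, the int8 matmul sketch below shows the kind of arithmetic an int8 kernel performs: quantize, multiply with integer accumulation, rescale. The symmetric per-tensor scaling here is an illustrative assumption and the math runs in NumPy rather than in a CUDA kernel.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map floats onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    """Quantize both operands, multiply with int32 accumulation, rescale."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)   # integer accumulation
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 256)).astype(np.float32)
b = rng.standard_normal((256, 64)).astype(np.float32)

err = np.abs(int8_matmul(a, b) - a @ b).mean()
print(f"mean abs error vs. float matmul: {err:.3f}")
```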