LLM API Leaderboard: Performance and Price Comparison
- Comprehensive API Provider Comparison: The leaderboard compares over 100 LLM API endpoints from providers including OpenAI, Google, DeepSeek, and Mistral, evaluating each on price (USD per 1M tokens), output speed (tokens/s), latency (time to first token, in seconds), and context window size (tokens).
- Key Performance Metrics Defined: Output speed is measured as tokens per second received after the first chunk arrives; latency is the time to first token. Price is a blended rate of input and output token prices, weighted 3:1 input-to-output (see the pricing sketch after this list).
- Grok-3 Mini Reasoning Performance: The Grok-3 mini Reasoning model is priced at $0.35 per 1M tokens with an output speed of 97.1 tokens/s, a first-token latency of 0.25 seconds, and an end-to-end response time of 26.01 seconds (see the timing sketch after this list). The "Fast" variant reaches 225.7 tokens/s at $1.45 per 1M tokens.
- Qwen3 235B A22B (Reasoning) Series: This series demonstrates varying performance based on context window and quantization. For example, the base version with a 41k context window costs $0.30 per 1M tokens and achieves 40.1 tokens/s, while the FP8 quantized version with the same context window shows 18.8 tokens/s at the same price.
- DeepSeek R1 Performance: Different configurations of the DeepSeek R1 model show a wide range of performance. For instance, one 128k context version costs $2.36 per 1M tokens and achieves 59.3 tokens/s with a 0.43s latency, while another 128k version costs $7.00 and achieves 29.8 tokens/s with a 0.53s latency.
- Llama 4 Maverick Efficiency: The Llama 4 Maverick models show high output speeds, with one 8k context version reaching 789.8 tokens/s at a cost of $0.92 per 1M tokens and a latency of 0.36s.
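
A minimal sketch of the 3:1 blended-price calculation described above. The function name and the example per-token prices are illustrative assumptions, not figures taken from the leaderboard:

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blend input and output token prices at a 3:1 input:output ratio.

    Prices are in USD per 1M tokens, matching the leaderboard's units.
    The 3:1 weighting reflects a workload assumed to send roughly three
    input tokens for every output token.
    """
    return (3 * input_price + output_price) / 4


# Assuming, for illustration, per-token prices of $0.30/1M input and
# $0.50/1M output, the blend works out to the $0.35/1M figure quoted
# above for Grok-3 mini Reasoning.
print(blended_price(0.30, 0.50))  # 0.35
```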
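The Grok-3 mini timing figures can be sanity-checked with a simple model: end-to-end response time is roughly time-to-first-token plus output tokens divided by output speed. The token count below is an assumption inferred by inverting that formula, not a number reported by the leaderboard:

```python
def end_to_end_seconds(ttft: float, n_output_tokens: int,
                       tokens_per_second: float) -> float:
    """Approximate total response time: first-token latency plus streaming time."""
    return ttft + n_output_tokens / tokens_per_second


# Inverting the formula for the Grok-3 mini figures above
# (0.25 s TTFT, 97.1 tokens/s, 26.01 s end to end) implies the
# benchmark generated roughly (26.01 - 0.25) * 97.1 ≈ 2,500 tokens.
print(end_to_end_seconds(0.25, 2500, 97.1))  # ≈ 26.0 s
```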