Large Language Model Comparison
Quality Evaluations
Evaluation results measured independently (Higher is better)
Artificial Analysis independently runs quality evaluations on every language model endpoint covered on our site. Our current set of evaluations includes MMLU, GPQA, MATH-500, and HumanEval. Different use cases warrant considering different evaluations.
Artificial Analysis Quality Index: Average result across our evaluations covering different dimensions of model intelligence. Currently includes MMLU, GPQA, MATH-500 & HumanEval. OpenAI o1 model figures are preliminary and are based on numbers stated by OpenAI.
Median across providers: All quality evaluation results above, including the Artificial Analysis Quality Index, are reported as the median result across all providers that support each model.
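The aggregation described above reduces to two steps: average the individual evaluation results for each provider's endpoint, then take the median of that average across providers. A minimal sketch of that calculation, assuming hypothetical scores on a 0-100 scale (the data, names, and layout are illustrative, not Artificial Analysis' actual pipeline):

```python
from statistics import mean, median

# Hypothetical results: provider -> evaluation -> score (0-100)
provider_scores = {
    "provider_a": {"MMLU": 86.4, "GPQA": 53.1, "MATH-500": 76.2, "HumanEval": 90.9},
    "provider_b": {"MMLU": 85.9, "GPQA": 52.7, "MATH-500": 75.8, "HumanEval": 91.5},
    "provider_c": {"MMLU": 86.1, "GPQA": 53.0, "MATH-500": 76.0, "HumanEval": 90.2},
}

def quality_index(scores: dict[str, float]) -> float:
    """Average of the individual evaluation results for one provider's endpoint."""
    return mean(scores.values())

# Reported figure: median of the per-provider Quality Index across all
# providers that serve the model.
reported = median(quality_index(s) for s in provider_scores.values())
print(f"Quality Index (median across providers): {reported:.1f}")
```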
Quality vs. Price
Artificial Analysis Quality Index; Price: USD per 1M Tokens
Quality vs. Output Speed
Artificial Analysis Quality Index; Output Speed: Output Tokens per Second; Price: USD per 1M Tokens
Output Speed
Output Tokens per Second; Higher is better
Latency
Seconds to First Token Received; Lower is better