Large Language Model Comparison

Quality Evaluations

Evaluation results measured independently (Higher is better)

Artificial Analysis independently runs quality evaluations on every language model endpoint covered on our site. Our current set of evaluations includes MMLU, GPQA, MATH-500, and HumanEval. Different use cases warrant considering different evaluations.

Artificial Analysis Quality Index: Average result across our evaluations covering different dimensions of model intelligence, currently MMLU, GPQA, MATH-500, and HumanEval. OpenAI o1 model figures are preliminary and are based on figures stated by OpenAI.
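
For illustration, the simplest way to combine the four evaluation scores into one index is an equal-weighted average. The sketch below assumes equal weighting and uses hypothetical scores; it is not the exact Artificial Analysis methodology.

```python
# Minimal sketch of an equal-weighted quality index, assuming each
# evaluation score is already on a comparable 0-100 scale.
# The scores below are hypothetical placeholders.
eval_scores = {
    "MMLU": 85.0,
    "GPQA": 50.0,
    "MATH-500": 75.0,
    "HumanEval": 90.0,
}

quality_index = sum(eval_scores.values()) / len(eval_scores)
print(f"Quality Index (sketch): {quality_index:.1f}")
```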

Median across providers: All quality evaluation results above, including the Artificial Analysis Quality Index, are reported as the median result across all providers that support each model.
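
As a sketch of this aggregation step, the per-model figure can be taken as the median of the results measured on each provider's endpoint; the provider names and scores below are hypothetical.

```python
from statistics import median

# Hypothetical per-provider MMLU results for the same model.
provider_scores = {
    "provider_a": 84.8,
    "provider_b": 85.2,
    "provider_c": 84.5,
}

# The reported figure is the median across providers, which limits the
# influence of any single outlier endpoint.
reported_mmlu = median(provider_scores.values())
print(f"Median MMLU across providers: {reported_mmlu:.1f}")
```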

Quality vs. Price

Artificial Analysis Quality Index vs. Price (USD per 1M Tokens)

Quality vs. Output Speed

Artificial Analysis Quality Index vs. Output Speed (Output Tokens per Second); Price (USD per 1M Tokens)

Output Speed

Output Tokens per Second; Higher is better
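
A rough sketch of how output speed could be estimated from a streaming response follows. The token stream and the choice to start timing at the first token (so that latency is excluded) are assumptions, not the exact measurement methodology.

```python
import time

def output_tokens_per_second(token_stream):
    """Estimate output speed from an iterable that yields tokens as they
    arrive (hypothetical streaming client). Timing starts at the first
    token so that time-to-first-token is excluded."""
    first_token_time = None
    token_count = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.monotonic()
        token_count += 1
    if first_token_time is None or token_count < 2:
        return 0.0  # not enough tokens to estimate a rate
    elapsed = time.monotonic() - first_token_time
    return (token_count - 1) / elapsed
```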

Latency

Seconds to First Token Received; Lower is better
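
A minimal sketch of measuring this latency, assuming a hypothetical send_request callable that issues a streaming request and returns an iterator of tokens:

```python
import time

def time_to_first_token(send_request):
    """Measure seconds from issuing the request to receiving the first
    token. `send_request` is a hypothetical callable returning a token
    iterator from a streaming endpoint."""
    start = time.monotonic()
    stream = iter(send_request())
    next(stream)  # blocks until the first token arrives
    return time.monotonic() - start
```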