Large Language Model (LLM) Comparison

Quality Evaluations

Evaluation results measured independently (Higher is better)

Bar chart comparing AI models across evaluations: Artificial Analysis Quality Index, Reasoning & Knowledge, Scientific Reasoning & Knowledge, Quantitative Reasoning, Coding, and Communication. Models include GPT-4o, o1-preview, and the Claude and Llama series, among others, with differing performance scores across tests.

Artificial Analysis independently runs quality evaluations on every language model endpoint covered on our site. Our current set of evaluations includes MMLU, GPQA, MATH-500, and HumanEval. Different use cases warrant considering different evaluation tests.

Artificial Analysis Quality Index: Average result across our evaluations covering different dimensions of model intelligence. Currently includes MMLU, GPQA, MATH-500 & HumanEval. OpenAI o1 model figures are preliminary and based on figures stated by OpenAI.

Median across providers: All quality evaluation results above, including the Artificial Analysis Quality Index, are reported as the median result across all providers that support each model.
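
Taken together, these two definitions suggest a simple aggregation pipeline: a median across providers per evaluation, then an average across evaluations. A minimal sketch follows, assuming an unweighted mean over the four evaluations (the exact weighting is not stated here) and using placeholder provider names and scores:

```python
from statistics import mean, median

# Placeholder per-provider scores for a single model; all numbers are
# illustrative, not measured results.
results = {
    "MMLU":      {"provider_a": 86.0, "provider_b": 85.4, "provider_c": 86.9},
    "GPQA":      {"provider_a": 53.2, "provider_b": 52.8, "provider_c": 53.5},
    "MATH-500":  {"provider_a": 76.0, "provider_b": 77.1, "provider_c": 75.6},
    "HumanEval": {"provider_a": 90.2, "provider_b": 90.9, "provider_c": 89.8},
}

# Step 1: take the median across providers for each evaluation.
per_eval = {name: median(scores.values()) for name, scores in results.items()}

# Step 2: Quality Index sketch, assuming an unweighted mean across the
# four evaluations (the weighting is an assumption for illustration).
quality_index = mean(per_eval.values())
print(per_eval)
print(f"Quality Index: {quality_index:.1f}")
```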

Quality vs. Price

Artificial Analysis Quality Index vs. Price (USD per 1M Tokens)

Graph comparing AI models by Artificial Analysis Quality Index and price, featuring models such as Gemini, GPT-4o, Claude, Mistral, and Nova. The most attractive quadrant (higher quality, lower price) is highlighted in green.

Quality vs. Output Speed

Artificial Analysis Quality Index vs. Output Speed (Output Tokens per Second); Price (USD per 1M Tokens)

Graph plotting AI models by Artificial Analysis Quality Index vs. Output Speed. Models include o1-preview, o1-mini, Gemini, Claude, Llama, GPT-4o, and Nova. Colored circles represent models varying in quality, speed, and price; annotations explain the metrics and trade-offs.

Output Speed

Output Tokens per Second; Higher is better

Bar chart showing AI model output speeds measured in tokens per second. Models include o1-mini, Nova Micro, Llama 3.3 70B, Llama 3.1 70B, Gemini 2.0 Flash, Nova Lite, o1-preview, GPT-4o mini, Nova Pro, and others, with speeds ranging from roughly 30 to 230 tokens per second. Accompanying notes define output speed and the median across providers.
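
For concreteness, here is a minimal sketch of the metric, assuming output speed is computed as tokens generated divided by generation time after the first token is received (an assumption about the methodology; the function and values are illustrative):

```python
def output_speed(output_tokens: int,
                 total_time_s: float,
                 time_to_first_token_s: float) -> float:
    """Output tokens per second during generation.

    Assumes speed is measured after the first token arrives; this is a
    sketch of the metric, not the exact measurement methodology.
    """
    generation_time_s = total_time_s - time_to_first_token_s
    return output_tokens / generation_time_s

# Example: 500 tokens in 3.0 s total with 0.5 s to first token -> 200 tok/s.
print(output_speed(500, 3.0, 0.5))
```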

Latency

Seconds to First Token Received; Lower is better

Bar chart comparing latency for various AI models, measured in seconds. Models shown include Llama 3, the Nova models, GPT-4, and Claude models. Latency ranges from 0.35 to 25.35 seconds. Accompanying notes explain latency and the median across providers.
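
As an illustration of how time to first token can be measured, here is a minimal sketch against an OpenAI-compatible streaming endpoint (the model name and prompt are placeholders; assumes the openai Python package is installed and an API key is set in the environment):

```python
import time

from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

# Latency as defined above: seconds until the first content token arrives.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        print(f"Time to first token: {ttft:.2f}s")
        break
```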