Large Language Model (LLM) Comparison

Quality Evaluations

Evaluation results measured independently (Higher is better)

Bar chart comparing AI models across evaluations: Artificial Analysis Quality Index, Reasoning & Knowledge, Scientific Reasoning & Knowledge, Quantitative Reasoning, Coding, and Communication. Models include GPT-4o, o1-preview, and the Claude and Llama series, among others, with differing performance scores across tests.

Artificial Analysis independently runs quality evaluations on every language model endpoint covered on our site. Our current set of evaluations includes MMLU, GPQA, MATH-500, and HumanEval. Different use cases warrant considering different evaluation tests.

Artificial Analysis Quality Index: Average result across our evaluations covering different dimensions of model intelligence. Currently includes MMLU, GPQA, MATH-500 & HumanEval. OpenAI o1 model figures are preliminary and based on figures stated by OpenAI.

Median across providers: All quality evaluation results above, including the Artificial Analysis Quality Index, are reported as the median result across all providers that support each model.
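
Taken together, these two definitions suggest a simple aggregation pipeline: a median across providers per evaluation, then an average across evaluations. A minimal sketch follows, assuming an unweighted mean over the four evaluations (the exact weighting is not stated here) and using placeholder provider names and scores:

```python
from statistics import mean, median

# Placeholder per-provider scores for a single model; all numbers are
# illustrative, not measured results.
results = {
    "MMLU":      {"provider_a": 86.0, "provider_b": 85.4, "provider_c": 86.9},
    "GPQA":      {"provider_a": 53.2, "provider_b": 52.8, "provider_c": 53.5},
    "MATH-500":  {"provider_a": 76.0, "provider_b": 77.1, "provider_c": 75.6},
    "HumanEval": {"provider_a": 90.2, "provider_b": 90.9, "provider_c": 89.8},
}

# Step 1: take the median across providers for each evaluation.
per_eval = {name: median(scores.values()) for name, scores in results.items()}

# Step 2: Quality Index sketch, assuming an unweighted mean across the
# four evaluations (the weighting is an assumption for illustration).
quality_index = mean(per_eval.values())
print(per_eval)
print(f"Quality Index: {quality_index:.1f}")
```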

Quality vs. Price

Artificial Analysis Quality Index vs. Price (USD per 1M Tokens)

Graph comparing AI models by Artificial Analysis Quality Index and price, featuring models such as Gemini, GPT-4o, Claude, Mistral, and Nova. The most attractive quadrant (higher quality, lower price) is highlighted in green.

Quality vs. Output Speed

Artificial Analysis Quality Index vs. Output Speed (Output Tokens per Second); Price (USD per 1M Tokens)

Graph plotting AI models by Artificial Analysis Quality Index vs. Output Speed. Models include o1-preview, o1-mini, Gemini, Claude, Llama, GPT-4o, and Nova. Colored circles represent models varying in quality, speed, and price; annotations explain the metrics and trade-offs.

Output Speed

Output Tokens per Second; Higher is better

Bar chart showing AI model output speeds measured in tokens per second. Models include o1-mini, Nova Micro, Llama 3.3 70B, Llama 3.1 70B, Gemini 2.0 Flash, Nova Lite, o1-preview, GPT-4o mini, Nova Pro, and others, with speeds ranging from roughly 30 to 230 tokens per second. Accompanying notes define output speed and the median across providers.
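
For concreteness, here is a minimal sketch of the metric, assuming output speed is computed as tokens generated divided by generation time after the first token is received (an assumption about the methodology; the function and values are illustrative):

```python
def output_speed(output_tokens: int,
                 total_time_s: float,
                 time_to_first_token_s: float) -> float:
    """Output tokens per second during generation.

    Assumes speed is measured after the first token arrives; this is a
    sketch of the metric, not the exact measurement methodology.
    """
    generation_time_s = total_time_s - time_to_first_token_s
    return output_tokens / generation_time_s

# Example: 500 tokens in 3.0 s total with 0.5 s to first token -> 200 tok/s.
print(output_speed(500, 3.0, 0.5))
```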

Latency

Seconds to First Token Received; Lower is better

Bar chart comparing latency for various AI models, measured in seconds. Models shown include Llama 3, the Nova models, GPT-4, and Claude models. Latency ranges from 0.35 to 25.35 seconds. Accompanying notes explain latency and the median across providers.
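
As an illustration of how time to first token can be measured, here is a minimal sketch against an OpenAI-compatible streaming endpoint (the model name and prompt are placeholders; assumes the openai Python package is installed and an API key is set in the environment):

```python
import time

from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

# Latency as defined above: seconds until the first content token arrives.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        print(f"Time to first token: {ttft:.2f}s")
        break
```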