YZ Index

Evaluation Data

Currently showing: Run #87 | 2026-04-27 | 212-question pool | Formula v7 | Judge set v6

Data Disclosure: To prevent benchmark contamination and overfitting, question texts and expected answers are not disclosed. This page shows model responses, scores, and judging methods for transparency. For the full methodology, see the Methodology page.
| Model | ID | Code Execution | Grounding | Engineering Judgment | Task Communication | Integrity | Overall (pts) | Value | Stability | Availability |
|---|---|---|---|---|---|---|---|---|---|---|
| Grok 3 | grok | 88.90 | 84.40 | 43.50 | 40.00 | 77.50 (pass) | 86.88 | 25.8 | 35.5 | 99.0 |
| Doubao Pro (豆包 Pro) | doubao | 92.20 | 79.40 | 46.30 | 40.00 | 77.50 (pass) | 86.44 | 93.3 | 38.8 | 100.0 |
| Gemini 2.5 Pro | gemini | 89.40 | 78.10 | 47.20 | 40.00 | 80.80 (pass) | 84.32 | 39.3 | 37.7 | 100.0 |
| Claude Sonnet 4.6 | claude | 86.50 | 81.10 | 43.80 | 40.00 | 74.20 (pass) | 84.07 | 25.1 | 35.7 | 99.0 |
| Claude Opus 4.6 | claude | 86.50 | 79.70 | 46.30 | 40.00 | 67.50 (pass) | 83.44 | 5.1 | 35.2 | 100.0 |
| DeepSeek V3 | deepseek | 83.20 | 77.80 | 44.30 | 40.00 | 59.20 (warn) | 80.77 | 99.7 | 32.8 | 100.0 |
| Qwen Max | qwen | 78.40 | 77.30 | 40.70 | 40.00 | 65.80 (pass) | 77.91 | 48.6 | 32.7 | 100.0 |
| DeepSeek R1 | deepseek | 78.90 | 72.20 | 38.70 | 40.00 | 54.20 (warn) | 75.89 | 90.3 | 30.2 | 100.0 |
| ERNIE Bot 4.0 (文心一言 4.0) | ernie | 77.00 | 72.30 | 39.70 | 40.00 | 69.20 (pass) | 74.89 | 98.6 | 31.3 | 100.0 |
| GPT-4o | gpt | 71.70 | 57.60 | 41.50 | 40.00 | 74.20 (pass) | 65.36 | 29.1 | 30.4 | 91.0 |
| GPT-o3 | gpt | 73.40 | 49.20 | 38.70 | 40.00 | 69.20 (pass) | 62.51 | 7.0 | 28.9 | 87.0 |
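The leaderboard above can be re-sorted on any score column programmatically. A minimal sketch, with the Run #87 rows transcribed from the table (column names are taken from the table headers; the helper `top` is illustrative, not part of the site):

```python
# Run #87 rows transcribed from the leaderboard table above.
# Tuple layout: (model, id, code_execution, grounding, engineering_judgment,
#                task_communication, integrity, integrity_check, overall,
#                value, stability, availability)
ROWS = [
    ("Grok 3", "grok", 88.90, 84.40, 43.50, 40.00, 77.50, "pass", 86.88, 25.8, 35.5, 99.0),
    ("Doubao Pro", "doubao", 92.20, 79.40, 46.30, 40.00, 77.50, "pass", 86.44, 93.3, 38.8, 100.0),
    ("Gemini 2.5 Pro", "gemini", 89.40, 78.10, 47.20, 40.00, 80.80, "pass", 84.32, 39.3, 37.7, 100.0),
    ("Claude Sonnet 4.6", "claude", 86.50, 81.10, 43.80, 40.00, 74.20, "pass", 84.07, 25.1, 35.7, 99.0),
    ("Claude Opus 4.6", "claude", 86.50, 79.70, 46.30, 40.00, 67.50, "pass", 83.44, 5.1, 35.2, 100.0),
    ("DeepSeek V3", "deepseek", 83.20, 77.80, 44.30, 40.00, 59.20, "warn", 80.77, 99.7, 32.8, 100.0),
    ("Qwen Max", "qwen", 78.40, 77.30, 40.70, 40.00, 65.80, "pass", 77.91, 48.6, 32.7, 100.0),
    ("DeepSeek R1", "deepseek", 78.90, 72.20, 38.70, 40.00, 54.20, "warn", 75.89, 90.3, 30.2, 100.0),
    ("ERNIE Bot 4.0", "ernie", 77.00, 72.30, 39.70, 40.00, 69.20, "pass", 74.89, 98.6, 31.3, 100.0),
    ("GPT-4o", "gpt", 71.70, 57.60, 41.50, 40.00, 74.20, "pass", 65.36, 29.1, 30.4, 91.0),
    ("GPT-o3", "gpt", 73.40, 49.20, 38.70, 40.00, 69.20, "pass", 62.51, 7.0, 28.9, 87.0),
]

COLS = ["model", "id", "code_execution", "grounding", "engineering_judgment",
        "task_communication", "integrity", "integrity_check", "overall",
        "value", "stability", "availability"]

def top(rows, key, n=3):
    """Return the n best (model, score) pairs for the given score column."""
    i = COLS.index(key)
    return [(r[0], r[i]) for r in sorted(rows, key=lambda r: r[i], reverse=True)[:n]]
```

For example, `top(ROWS, "overall")` reproduces the table's ranking, while `top(ROWS, "value")` surfaces a very different ordering (DeepSeek V3 and ERNIE Bot 4.0 lead on Value despite mid-table Overall scores).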
API Access: For programmatic access to evaluation data, please use our API.
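The API's response schema is not documented on this page. The sketch below assumes a hypothetical JSON payload with one object per model (the field names `run`, `results`, `overall`, and `integrity_check` are assumptions, not the published schema) and shows how a client might pull out models whose integrity check raised a warning:

```python
import json

# Hypothetical response payload -- field names are assumed for illustration,
# not taken from the real API documentation.
SAMPLE = json.dumps({
    "run": 87,
    "results": [
        {"model": "Grok 3", "overall": 86.88, "integrity_check": "pass"},
        {"model": "DeepSeek V3", "overall": 80.77, "integrity_check": "warn"},
        {"model": "DeepSeek R1", "overall": 75.89, "integrity_check": "warn"},
    ],
})

def flagged_models(payload: str) -> list:
    """Return models whose integrity check did not pass."""
    data = json.loads(payload)
    return [r["model"] for r in data["results"] if r["integrity_check"] != "pass"]

print(flagged_models(SAMPLE))
```

A real client would fetch the payload over HTTP instead of using the embedded sample; the parsing logic stays the same once the actual schema is known.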