YZ Index

Evaluation Data

Currently showing: Run #154 | 2026-06-08 | 212-question pool | Formula v7 | Judge set v6.1

Data Disclosure: To prevent benchmark contamination and overfitting, question texts and expected answers are not disclosed. This page shows model responses, scores, and judging methods for transparency. For the full methodology, see the Methodology page.

Model	Code Execution	Grounding	Engineering Judgment	Task Communication	Integrity	Overall (pts)	Value	Stability	Availability
Grok 4 grok	93.90	85.00	82.10	87.80	86.30 pass	89.90	29.7	68.6	100.0
Claude Opus 4.7 claude	90.30	87.50	93.10	89.40	94.30 pass	89.04	6.2	67.7	100.0
Doubao Pro doubao	94.60	81.60	88.80	84.10	92.20 pass	88.75	96.2	71.2	100.0
Claude Sonnet 4.6 claude	87.60	86.80	93.20	87.80	94.70 pass	87.24	29.7	62.7	100.0
Gemini 2.5 Pro gemini	88.10	84.20	87.70	84.60	88.80 pass	86.35	44.6	66.0	99.0
Qwen3 Max qwen	89.70	81.90	85.70	85.30	87.50 pass	86.19	58.5	59.8	100.0
Gemini 3.1 Pro gemini	88.40	80.40	85.20	84.90	87.70 pass	84.80	29.3	63.2	99.0
DeepSeek V4 Pro deepseek	87.90	77.60	82.40	85.10	81.80 pass	83.27	47.5	59.1	100.0
GPT-o3 gpt	84.80	80.40	91.50	87.50	90.60 pass	82.82	10.5	58.0	100.0
GPT-5.5 gpt	81.90	79.70	92.10	87.40	88.30 pass	80.91	20.4	51.8	100.0
ERNIE Bot 4.5 ernie	78.00	75.60	72.20	72.00	70.00 pass	76.92	99.3	44.2	100.0

API Access: For programmatic access to evaluation data, please use our API.
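
The API is not described on this page, so the sketch below does not call it. It only illustrates, in Python, how rows copied from the table above might be parsed for offline analysis. The `Row` field names and the `parse_row` helper are hypothetical. The row layout is inferred from the table as displayed: tab-separated cells, a display name followed by a lowercase family tag, and an Integrity cell that carries a score plus a pass flag.

```python
from dataclasses import dataclass

# Two rows copied verbatim from the table above (tab-separated, as displayed).
SAMPLE_ROWS = (
    "Grok 4 grok\t93.90\t85.00\t82.10\t87.80\t86.30 pass\t89.90\t29.7\t68.6\t100.0\n"
    "Doubao Pro doubao\t94.60\t81.60\t88.80\t84.10\t92.20 pass\t88.75\t96.2\t71.2\t100.0\n"
)


@dataclass
class Row:
    # Field names are assumptions chosen to mirror the column headers.
    model: str
    family: str
    code_execution: float
    grounding: float
    engineering_judgment: float
    task_communication: float
    integrity: float
    integrity_status: str
    overall: float
    value: float
    stability: float
    availability: float


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row (hypothetical helper)."""
    cells = line.rstrip("\n").split("\t")
    if len(cells) != 10:
        raise ValueError(f"expected 10 cells, got {len(cells)}")
    # Assumption: the last word of the first cell is the family tag.
    model, family = cells[0].rsplit(" ", 1)
    # Assumption: the Integrity cell is "<score> <status>", e.g. "86.30 pass".
    integrity_score, integrity_status = cells[5].split()
    return Row(
        model=model,
        family=family,
        code_execution=float(cells[1]),
        grounding=float(cells[2]),
        engineering_judgment=float(cells[3]),
        task_communication=float(cells[4]),
        integrity=float(integrity_score),
        integrity_status=integrity_status,
        overall=float(cells[6]),
        value=float(cells[7]),
        stability=float(cells[8]),
        availability=float(cells[9]),
    )


rows = [parse_row(line) for line in SAMPLE_ROWS.splitlines() if line.strip()]

# Example query: which of the sampled models has the highest Value score?
best_value = max(rows, key=lambda r: r.value)
print(best_value.model, best_value.value)  # prints: Doubao Pro 96.2
```

Keeping the Integrity score and its pass flag in separate fields lets the numeric columns be sorted or averaged without string handling. A row with a different cell count raises an error instead of being silently misread.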