YZ Index
Evaluation Data
Currently showing: Run #154 | 2026-06-08 | 212-question pool | Formula v7 | Judge set v6.1
Data Disclosure: To prevent benchmark contamination and overfitting, question texts and expected answers are not disclosed. This page shows model responses, scores, and judging methods for transparency. For the full methodology, see the Methodology page.
| Model | Code Execution | Grounding | Engineering Judgment | Task Communication | Integrity | Overall (pts) | Value | Stability | Availability |
|---|---|---|---|---|---|---|---|---|---|
| Grok 4 | 93.90 | 85.00 | 82.10 | 87.80 | 86.30 (pass) | 89.90 | 29.7 | 68.6 | 100.0 |
| Claude Opus 4.7 | 90.30 | 87.50 | 93.10 | 89.40 | 94.30 (pass) | 89.04 | 6.2 | 67.7 | 100.0 |
| Doubao Pro | 94.60 | 81.60 | 88.80 | 84.10 | 92.20 (pass) | 88.75 | 96.2 | 71.2 | 100.0 |
| Claude Sonnet 4.6 | 87.60 | 86.80 | 93.20 | 87.80 | 94.70 (pass) | 87.24 | 29.7 | 62.7 | 100.0 |
| Gemini 2.5 Pro | 88.10 | 84.20 | 87.70 | 84.60 | 88.80 (pass) | 86.35 | 44.6 | 66.0 | 99.0 |
| Qwen3 Max | 89.70 | 81.90 | 85.70 | 85.30 | 87.50 (pass) | 86.19 | 58.5 | 59.8 | 100.0 |
| Gemini 3.1 Pro | 88.40 | 80.40 | 85.20 | 84.90 | 87.70 (pass) | 84.80 | 29.3 | 63.2 | 99.0 |
| DeepSeek V4 Pro | 87.90 | 77.60 | 82.40 | 85.10 | 81.80 (pass) | 83.27 | 47.5 | 59.1 | 100.0 |
| GPT-o3 | 84.80 | 80.40 | 91.50 | 87.50 | 90.60 (pass) | 82.82 | 10.5 | 58.0 | 100.0 |
| GPT-5.5 | 81.90 | 79.70 | 92.10 | 87.40 | 88.30 (pass) | 80.91 | 20.4 | 51.8 | 100.0 |
| ERNIE Bot 4.5 | 78.00 | 75.60 | 72.20 | 72.00 | 70.00 (pass) | 76.92 | 99.3 | 44.2 | 100.0 |
API Access: For programmatic access to evaluation data, please use our API.