YZ Index

Evaluation Data

Currently showing: Run #87 | 2026-04-27 | 212-question pool | Formula v7 | Judge set v6

Data Disclosure: To prevent benchmark contamination and overfitting, question texts and expected answers are not disclosed. This page shows model responses, scores, and judging methods for transparency. For the full methodology, see the Methodology page.
| Model | ID | Code Execution | Grounding | Engineering Judgment | Task Communication | Integrity | Overall (pts) | Value | Stability | Availability |
|---|---|---|---|---|---|---|---|---|---|---|
| Grok 3 | grok | 88.90 | 84.40 | 43.50 | 40.00 | 77.50 (pass) | 86.88 | 25.8 | 35.5 | 99.0 |
| Doubao Pro (豆包 Pro) | doubao | 92.20 | 79.40 | 46.30 | 40.00 | 77.50 (pass) | 86.44 | 93.3 | 38.8 | 100.0 |
| Gemini 2.5 Pro | gemini | 89.40 | 78.10 | 47.20 | 40.00 | 80.80 (pass) | 84.32 | 39.3 | 37.7 | 100.0 |
| Claude Sonnet 4.6 | claude | 86.50 | 81.10 | 43.80 | 40.00 | 74.20 (pass) | 84.07 | 25.1 | 35.7 | 99.0 |
| Claude Opus 4.6 | claude | 86.50 | 79.70 | 46.30 | 40.00 | 67.50 (pass) | 83.44 | 5.1 | 35.2 | 100.0 |
| DeepSeek V3 | deepseek | 83.20 | 77.80 | 44.30 | 40.00 | 59.20 (warn) | 80.77 | 99.7 | 32.8 | 100.0 |
| Qwen Max | qwen | 78.40 | 77.30 | 40.70 | 40.00 | 65.80 (pass) | 77.91 | 48.6 | 32.7 | 100.0 |
| DeepSeek R1 | deepseek | 78.90 | 72.20 | 38.70 | 40.00 | 54.20 (warn) | 75.89 | 90.3 | 30.2 | 100.0 |
| ERNIE Bot 4.0 (文心一言 4.0) | ernie | 77.00 | 72.30 | 39.70 | 40.00 | 69.20 (pass) | 74.89 | 98.6 | 31.3 | 100.0 |
| GPT-4o | gpt | 71.70 | 57.60 | 41.50 | 40.00 | 74.20 (pass) | 65.36 | 29.1 | 30.4 | 91.0 |
| GPT-o3 | gpt | 73.40 | 49.20 | 38.70 | 40.00 | 69.20 (pass) | 62.51 | 7.0 | 28.9 | 87.0 |
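The leaderboard above can be re-sorted on any score column programmatically. A minimal sketch, with the Run #87 rows transcribed from the table (column names are taken from the table headers; the helper `top` is illustrative, not part of the site):

```python
# Run #87 rows transcribed from the leaderboard table above.
# Tuple layout: (model, id, code_execution, grounding, engineering_judgment,
#                task_communication, integrity, integrity_check, overall,
#                value, stability, availability)
ROWS = [
    ("Grok 3", "grok", 88.90, 84.40, 43.50, 40.00, 77.50, "pass", 86.88, 25.8, 35.5, 99.0),
    ("Doubao Pro", "doubao", 92.20, 79.40, 46.30, 40.00, 77.50, "pass", 86.44, 93.3, 38.8, 100.0),
    ("Gemini 2.5 Pro", "gemini", 89.40, 78.10, 47.20, 40.00, 80.80, "pass", 84.32, 39.3, 37.7, 100.0),
    ("Claude Sonnet 4.6", "claude", 86.50, 81.10, 43.80, 40.00, 74.20, "pass", 84.07, 25.1, 35.7, 99.0),
    ("Claude Opus 4.6", "claude", 86.50, 79.70, 46.30, 40.00, 67.50, "pass", 83.44, 5.1, 35.2, 100.0),
    ("DeepSeek V3", "deepseek", 83.20, 77.80, 44.30, 40.00, 59.20, "warn", 80.77, 99.7, 32.8, 100.0),
    ("Qwen Max", "qwen", 78.40, 77.30, 40.70, 40.00, 65.80, "pass", 77.91, 48.6, 32.7, 100.0),
    ("DeepSeek R1", "deepseek", 78.90, 72.20, 38.70, 40.00, 54.20, "warn", 75.89, 90.3, 30.2, 100.0),
    ("ERNIE Bot 4.0", "ernie", 77.00, 72.30, 39.70, 40.00, 69.20, "pass", 74.89, 98.6, 31.3, 100.0),
    ("GPT-4o", "gpt", 71.70, 57.60, 41.50, 40.00, 74.20, "pass", 65.36, 29.1, 30.4, 91.0),
    ("GPT-o3", "gpt", 73.40, 49.20, 38.70, 40.00, 69.20, "pass", 62.51, 7.0, 28.9, 87.0),
]

COLS = ["model", "id", "code_execution", "grounding", "engineering_judgment",
        "task_communication", "integrity", "integrity_check", "overall",
        "value", "stability", "availability"]

def top(rows, key, n=3):
    """Return the n best (model, score) pairs for the given score column."""
    i = COLS.index(key)
    return [(r[0], r[i]) for r in sorted(rows, key=lambda r: r[i], reverse=True)[:n]]
```

For example, `top(ROWS, "overall")` reproduces the table's ranking, while `top(ROWS, "value")` surfaces a very different ordering (DeepSeek V3 and ERNIE Bot 4.0 lead on Value despite mid-table Overall scores).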
API Access: For programmatic access to evaluation data, please use our API.
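The API's response schema is not documented on this page. The sketch below assumes a hypothetical JSON payload with one object per model (the field names `run`, `results`, `overall`, and `integrity_check` are assumptions, not the published schema) and shows how a client might pull out models whose integrity check raised a warning:

```python
import json

# Hypothetical response payload -- field names are assumed for illustration,
# not taken from the real API documentation.
SAMPLE = json.dumps({
    "run": 87,
    "results": [
        {"model": "Grok 3", "overall": 86.88, "integrity_check": "pass"},
        {"model": "DeepSeek V3", "overall": 80.77, "integrity_check": "warn"},
        {"model": "DeepSeek R1", "overall": 75.89, "integrity_check": "warn"},
    ],
})

def flagged_models(payload: str) -> list:
    """Return models whose integrity check did not pass."""
    data = json.loads(payload)
    return [r["model"] for r in data["results"] if r["integrity_check"] != "pass"]

print(flagged_models(SAMPLE))
```

A real client would fetch the payload over HTTP instead of using the embedded sample; the parsing logic stays the same once the actual schema is known.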