YZ Index
Evaluation Data
Currently showing: Run #164 WDCD | 2026-06-11 | Formula v7 | Judge set v6.3
Data Disclosure: To prevent benchmark contamination and overfitting, question texts and expected answers are not disclosed. This page shows model responses, scores, and judging methods for transparency. For the full methodology, see the Methodology page.
| Model | DCD Overall | R1 Constraint Acknowledgment | R2 Distraction Resistance | R3 Constraint Integrity |
|---|---|---|---|---|
| GPT-5.5 gpt | 88.33 | 100 | 87 | 167 |
| Gemini 3.1 Pro gemini | 87.50 | 100 | 90 | 160 |
| Claude Sonnet 4.6 claude | 83.33 | 97 | 83 | 153 |
| DeepSeek V4 Pro deepseek | 82.50 | 100 | 77 | 153 |
| Grok 4 grok | 81.67 | 100 | 80 | 147 |
| Qwen3 Max qwen | 81.67 | 100 | 73 | 153 |
| ERNIE Bot 4.5 ernie | 77.50 | 90 | 90 | 130 |
| Doubao Pro doubao | 75.00 | 70 | 83 | 147 |
| Gemini 2.5 Pro gemini | 73.33 | 100 | 70 | 123 |
| Claude Opus 4.7 claude | 70.00 | 100 | 83 | 97 |
| GPT-o3 gpt | 61.67 | 97 | 77 | 73 |
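A note on reading the table: R3 values exceed 100, so R3 appears to be reported on a different scale from R1 and R2. The rounded figures are also numerically consistent with the overall score being (R1 + R2 + R3) / 4. This is an inference from the displayed numbers only, not the documented Formula v7, and the helper name `estimate_overall` below is hypothetical. A minimal sketch that checks the pattern against every row:

```python
# Rounded scores copied from the table: (R1, R2, R3, published overall).
rows = {
    "GPT-5.5": (100, 87, 167, 88.33),
    "Gemini 3.1 Pro": (100, 90, 160, 87.50),
    "Claude Sonnet 4.6": (97, 83, 153, 83.33),
    "DeepSeek V4 Pro": (100, 77, 153, 82.50),
    "Grok 4": (100, 80, 147, 81.67),
    "Qwen3 Max": (100, 73, 153, 81.67),
    "ERNIE Bot 4.5": (90, 90, 130, 77.50),
    "Doubao Pro": (70, 83, 147, 75.00),
    "Gemini 2.5 Pro": (100, 70, 123, 73.33),
    "Claude Opus 4.7": (100, 83, 97, 70.00),
    "GPT-o3": (97, 77, 73, 61.67),
}


def estimate_overall(r1, r2, r3):
    """Hypothetical aggregation inferred from the table, NOT the official Formula v7."""
    return (r1 + r2 + r3) / 4


for model, (r1, r2, r3, published) in rows.items():
    estimate = estimate_overall(r1, r2, r3)
    # Small differences are expected because the sub-scores shown are rounded.
    print(f"{model:18s} published={published:6.2f} "
          f"estimate={estimate:6.2f} diff={estimate - published:+.2f}")
```

Because the sub-scores are shown as rounded integers, the estimate can differ from the published value by a fraction of a point. Refer to the Methodology page for the actual scoring formula.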
API Access: For programmatic access to evaluation data, please use our API.