YZ Index

Evaluation Data

Main Leaderboard | WDCD Compliance Test
Currently showing: Run #164 WDCD | 2026-06-11 | Formula v7 | Judge set v6.3
Data Disclosure: To prevent benchmark contamination and overfitting, question texts and expected answers are not disclosed. For transparency, this page shows model responses, scores, and judging methods. For the full methodology, see the Methodology page.
| Model | Family | DCD Overall | R1 Constraint Acknowledgment | R2 Distraction Resistance | R3 Constraint Integrity |
|---|---|---|---|---|---|
| GPT-5.5 | gpt | 88.33 | 100 | 87 | 167 |
| Gemini 3.1 Pro | gemini | 87.50 | 100 | 90 | 160 |
| Claude Sonnet 4.6 | claude | 83.33 | 97 | 83 | 153 |
| DeepSeek V4 Pro | deepseek | 82.50 | 100 | 77 | 153 |
| Grok 4 | grok | 81.67 | 100 | 80 | 147 |
| Qwen3 Max | qwen | 81.67 | 100 | 73 | 153 |
| ERNIE Bot 4.5 | ernie | 77.50 | 90 | 90 | 130 |
| Doubao Pro | doubao | 75.00 | 70 | 83 | 147 |
| Gemini 2.5 Pro | gemini | 73.33 | 100 | 70 | 123 |
| Claude Opus 4.7 | claude | 70.00 | 100 | 83 | 97 |
| GPT-o3 | gpt | 61.67 | 97 | 77 | 73 |
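
The page does not spell out Formula v7. As a rough sanity check on the scores above, the sketch below tests one guess: that DCD Overall equals (R1 + R2 + R3) / 4. The divisor and the function names `implied_overall` and `max_deviation` are assumptions for illustration, not part of the published methodology, and the round scores are the rounded values shown on this page.

```python
# Scores transcribed from the leaderboard: (model, overall, R1, R2, R3).
ROWS = [
    ("GPT-5.5", 88.33, 100, 87, 167),
    ("Gemini 3.1 Pro", 87.50, 100, 90, 160),
    ("Claude Sonnet 4.6", 83.33, 97, 83, 153),
    ("DeepSeek V4 Pro", 82.50, 100, 77, 153),
    ("Grok 4", 81.67, 100, 80, 147),
    ("Qwen3 Max", 81.67, 100, 73, 153),
    ("ERNIE Bot 4.5", 77.50, 90, 90, 130),
    ("Doubao Pro", 75.00, 70, 83, 147),
    ("Gemini 2.5 Pro", 73.33, 100, 70, 123),
    ("Claude Opus 4.7", 70.00, 100, 83, 97),
    ("GPT-o3", 61.67, 97, 77, 73),
]


def implied_overall(r1, r2, r3):
    """Hypothetical aggregate (an assumption): sum of round scores over 4."""
    return (r1 + r2 + r3) / 4


def max_deviation(rows):
    """Largest absolute gap between published and implied overall scores."""
    return max(
        abs(overall - implied_overall(r1, r2, r3))
        for _, overall, r1, r2, r3 in rows
    )


if __name__ == "__main__":
    for model, overall, r1, r2, r3 in ROWS:
        implied = implied_overall(r1, r2, r3)
        print(f"{model:<18} published={overall:6.2f} implied={implied:6.2f}")
    print(f"largest gap: {max_deviation(ROWS):.2f}")
```

On these eleven rows the guess stays within about 0.2 points of every published overall score, which is inside what rounding of the three round scores would allow. That is consistent with the guess but does not confirm how the index actually weights the rounds.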
API Access: For programmatic access to evaluation data, please use our API.