YZ Index

Evaluation Data

Main Leaderboard | WDCD Compliance Test
Currently showing: Run #154 | 2026-06-08 | 212-question pool | Formula v7 | Judge set v6.1
Data Disclosure: To prevent benchmark contamination and overfitting, question texts and expected answers are not disclosed. For transparency, this page shows model responses, scores, and judging methods. For the full methodology, see the Methodology page.
| Model | | Code Execution | Grounding | Engineering Judgment | Task Communication | Integrity | | Overall (pts) | Value | Stability | Availability |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Grok 4 | grok | 93.90 | 85.00 | 82.10 | 87.80 | 86.30 | pass | 89.90 | 29.7 | 68.6 | 100.0 |
| Claude Opus 4.7 | claude | 90.30 | 87.50 | 93.10 | 89.40 | 94.30 | pass | 89.04 | 6.2 | 67.7 | 100.0 |
| Doubao Pro | doubao | 94.60 | 81.60 | 88.80 | 84.10 | 92.20 | pass | 88.75 | 96.2 | 71.2 | 100.0 |
| Claude Sonnet 4.6 | claude | 87.60 | 86.80 | 93.20 | 87.80 | 94.70 | pass | 87.24 | 29.7 | 62.7 | 100.0 |
| Gemini 2.5 Pro | gemini | 88.10 | 84.20 | 87.70 | 84.60 | 88.80 | pass | 86.35 | 44.6 | 66.0 | 99.0 |
| Qwen3 Max | qwen | 89.70 | 81.90 | 85.70 | 85.30 | 87.50 | pass | 86.19 | 58.5 | 59.8 | 100.0 |
| Gemini 3.1 Pro | gemini | 88.40 | 80.40 | 85.20 | 84.90 | 87.70 | pass | 84.80 | 29.3 | 63.2 | 99.0 |
| DeepSeek V4 Pro | deepseek | 87.90 | 77.60 | 82.40 | 85.10 | 81.80 | pass | 83.27 | 47.5 | 59.1 | 100.0 |
| GPT-o3 | gpt | 84.80 | 80.40 | 91.50 | 87.50 | 90.60 | pass | 82.82 | 10.5 | 58.0 | 100.0 |
| GPT-5.5 | gpt | 81.90 | 79.70 | 92.10 | 87.40 | 88.30 | pass | 80.91 | 20.4 | 51.8 | 100.0 |
| ERNIE Bot 4.5 | ernie | 78.00 | 75.60 | 72.20 | 72.00 | 70.00 | pass | 76.92 | 99.3 | 44.2 | 100.0 |
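
A note on reading the Overall column: in this run, every published overall score matches 0.55 times the first numeric score column plus 0.45 times the second, to within two-decimal rounding. This is an observation from the rows above, not a statement of what Formula v7 actually is; the weights and all names in the sketch below (`W_FIRST`, `W_SECOND`, `blended`, `all_consistent`) are assumptions introduced for illustration. A minimal Python check:

```python
# Each tuple: (model, first score column, second score column, published overall).
# Values are copied from the leaderboard rows for Run #154.
ROWS = [
    ("Grok 4", 93.90, 85.00, 89.90),
    ("Claude Opus 4.7", 90.30, 87.50, 89.04),
    ("Doubao Pro", 94.60, 81.60, 88.75),
    ("Claude Sonnet 4.6", 87.60, 86.80, 87.24),
    ("Gemini 2.5 Pro", 88.10, 84.20, 86.35),
    ("Qwen3 Max", 89.70, 81.90, 86.19),
    ("Gemini 3.1 Pro", 88.40, 80.40, 84.80),
    ("DeepSeek V4 Pro", 87.90, 77.60, 83.27),
    ("GPT-o3", 84.80, 80.40, 82.82),
    ("GPT-5.5", 81.90, 79.70, 80.91),
    ("ERNIE Bot 4.5", 78.00, 75.60, 76.92),
]

# ASSUMPTION: weights inferred from the table, not taken from the site's
# documented methodology.
W_FIRST, W_SECOND = 0.55, 0.45

# Published scores are rounded to two decimals, so allow slightly more
# than half a unit in the last place.
TOLERANCE = 0.006


def blended(first: float, second: float) -> float:
    """Weighted sum of the first two score columns."""
    return W_FIRST * first + W_SECOND * second


def max_deviation(rows=ROWS) -> float:
    """Largest absolute gap between the weighted sum and the published overall."""
    return max(abs(blended(f, s) - overall) for _, f, s, overall in rows)


def all_consistent(rows=ROWS) -> bool:
    """True if every row matches the inferred weighting within TOLERANCE."""
    return max_deviation(rows) <= TOLERANCE


if __name__ == "__main__":
    for model, f, s, overall in ROWS:
        print(f"{model:18s} computed={blended(f, s):.3f} published={overall:.2f}")
    print("consistent:", all_consistent())
```

If the pattern holds, the remaining three score columns do not enter the overall figure directly in this run, which would explain why models with high values there can still rank lower overall. The Methodology page remains the authoritative source for how the score is computed.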
API Access: For programmatic access to evaluation data, please use our API.