YZ Index

Evaluation Data

Name: YZ Index Benchmark Data — Run #164
Creator: 赢政天下
License: https://creativecommons.org/licenses/by-nc/4.0/

WDCD Compliance Test

Currently showing: Run #164 WDCD | 2026-06-11 | Formula v7 | Judge set v6.3

Data Disclosure: To prevent benchmark contamination and overfitting, question texts and expected answers are not disclosed. This page shows model responses, scores, and judging methods for transparency. For the full methodology, see the Methodology page.

Model	Family	WDCD Overall	R1 Constraint Acknowledgment	R2 Distraction Resistance	R3 Constraint Integrity
GPT-5.5	gpt	88.33	100	87	167
Gemini 3.1 Pro	gemini	87.50	100	90	160
Claude Sonnet 4.6	claude	83.33	97	83	153
DeepSeek V4 Pro	deepseek	82.50	100	77	153
Grok 4	grok	81.67	100	80	147
Qwen3 Max	qwen	81.67	100	73	153
ERNIE Bot 4.5	ernie	77.50	90	90	130
Doubao Pro	doubao	75.00	70	83	147
Gemini 2.5 Pro	gemini	73.33	100	70	123
Claude Opus 4.7	claude	70.00	100	83	97
GPT-o3	gpt	61.67	97	77	73
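
This page does not spell out Formula v7, so the following is only an observation drawn from the displayed numbers, not the published scoring method. In every row of Run #164, the overall score appears to sit within rounding distance of (R1 + R2 + R3) / 4. The Python sketch below checks that relation against the table. The function names `quarter_sum` and `max_gap` are hypothetical, and the tolerance assumes the three round scores are shown rounded to whole numbers.

```python
# Displayed values from Run #164: (model, overall, R1, R2, R3).
ROWS = [
    ("GPT-5.5", 88.33, 100, 87, 167),
    ("Gemini 3.1 Pro", 87.50, 100, 90, 160),
    ("Claude Sonnet 4.6", 83.33, 97, 83, 153),
    ("DeepSeek V4 Pro", 82.50, 100, 77, 153),
    ("Grok 4", 81.67, 100, 80, 147),
    ("Qwen3 Max", 81.67, 100, 73, 153),
    ("ERNIE Bot 4.5", 77.50, 90, 90, 130),
    ("Doubao Pro", 75.00, 70, 83, 147),
    ("Gemini 2.5 Pro", 73.33, 100, 70, 123),
    ("Claude Opus 4.7", 70.00, 100, 83, 97),
    ("GPT-o3", 61.67, 97, 77, 73),
]

# Assumption: each round score is rounded to an integer (error up to 0.5 each),
# so the quarter-sum can drift by up to 1.5 / 4 = 0.375 from the true value.
TOLERANCE = 0.38


def quarter_sum(r1, r2, r3):
    """Hypothetical aggregate: the three round scores summed and divided by 4."""
    return (r1 + r2 + r3) / 4


def max_gap(rows):
    """Largest absolute gap between a displayed overall score and quarter_sum."""
    return max(
        abs(overall - quarter_sum(r1, r2, r3))
        for _, overall, r1, r2, r3 in rows
    )


if __name__ == "__main__":
    for model, overall, r1, r2, r3 in ROWS:
        estimate = quarter_sum(r1, r2, r3)
        print(f"{model:<18} displayed={overall:6.2f} estimate={estimate:6.2f}")
    print("all rows within tolerance:", max_gap(ROWS) <= TOLERANCE)
```

A match within tolerance does not confirm the formula. It only shows that the displayed columns are mutually consistent under this assumed aggregate, and the Methodology page remains the authoritative source.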

API Access: For programmatic access to evaluation data, please use our API.