Why use rolling averages instead of single-run results?

Single runs can be influenced by random factors. Rolling averages smooth out noise and provide more reliable rankings.

How do you prevent models from overfitting to the benchmark?

42 canary probe questions are included in every evaluation. These detect models that have specifically trained against the question pool.

Why does the Integrity Rating exist?

A model that fabricates citations or data cannot be trusted for professional use, regardless of its capability scores. Integrity is a prerequisite.

How are side dimensions (Communication, Judgment) different from core dimensions?

Side dimensions evaluate specialized capabilities that complement the core overall score but are tracked separately for targeted use-case guidance.

Can I download the raw data?

Yes. All raw evaluation data is available for download on the Data page. Researchers can conduct independent analysis using their own methods.

What is "bundle scoring" in judge set v6.4?

Since v6.4, structured-output questions (exact field extraction) use bundle scoring: checkpoints are grouped by business semantics, and a group only scores when every checkpoint in it is correct. Under the previous per-point partial credit, missing 3 of 38 checkpoints still scored 92 — but in real delivery one wrong figure invalidates the whole result. Partial credit systematically overstated usability on critical tasks and drove leaderboard saturation. Bundle scoring aligns with real-world tolerance. Questions, model responses, and per-checkpoint verdicts are unchanged; only aggregation changed. Earlier boards are tagged v6.3 and not directly comparable.

Methodology — YZ Index

Evaluation Dimensions

The YZ Index evaluates AI models across 5 core dimensions plus operational signals:

Core Dimensions

Main evaluation axes that form the Overall score

Code Execution 55%

Code runs in Python sandbox; pass rate is the score

Grounding 45%

Long document citation verification accuracy

Side Dimensions

Communication and Judgment sub-scores

Engineering Judgment

Architecture review, risk assessment, decision quality

Task Communication

Structured output and formatting compliance

Gateway Dimension

Integrity check — prerequisite for ranking eligibility

Integrity Rating pass / warn / fail

Gateway mechanism: 42 probes detect fabrication

Operational Signals

Stability and Availability tracking

Value

Capability ÷ Price

Stability

Output consistency across evaluations

Availability

API uptime and response success rate

Scoring Formula

Core Overall Score

core_overall = 0.55 × Execution + 0.45 × Grounding

Code Execution	Weight 0.55（55%）
Grounding	Weight 0.45（45%）
Total Weight	1.00

Integrity Gate

integrity_label:
  ≥ 60 → pass
  40 – 59 → warn
  < 40 → fail

Overall Score Formula

core_overall_display:
if integrity_label = fail → min(core_overall_raw, 74.0)
else → core_overall_raw

recommendation_status:
  pass → recommended（Recommended）
  warn → neutral（Neutral）
  fail → not_recommended（Not Recommended）

Example Calculation

Integrity Gate Mechanism

The Integrity Rating is a prerequisite for ranking eligibility:

pass （≥ 60 pts）	Model passes all integrity probes — normal ranking
warn （40-59 pts）	Some integrity concerns detected — ranked with warning flag
fail （< 40 pts）	Serious integrity failures — excluded from recommendations

Models that fabricate citations, make up data, or forge sources cannot be trusted for professional use, regardless of their raw scores.

Question Pool

The question pool contains questions covering all evaluation dimensions:

execution	66 questions — Algorithm, debugging, SQL, code review
grounding	43 questions — Long-form comprehension, cross-paragraph reasoning
judgment	20 questions — SWOT analysis, architecture review, risk assessment
integrity	17 questions — Citation verification, source checking, fabrication detection
communication	8 questions — Structured output, format compliance, clarity
Total	154 Total Questions

Anti-Overfitting Measures

42 canary probe questions detect targeted training against the benchmark.

Sampling Strategy

Each full evaluation randomly samples from the question pool:

execution	~35 Q
grounding	~25 Q
judgment	~20 Q
integrity	~12 Q（Integrity probes are always included in full）
communication	~8 Q
Subtotal	~100 Q / per round

Minimum Exposure：Each question appears at least once every N evaluations
context_bundle_cap = 3：Category caps prevent dimension imbalance
Random seed ensures reproducibility

Scoring System

Multiple scoring methods ensure accuracy:

sandbox	Sandbox Execution — Code runs in isolated Python sandbox; test cases verify correctness
grounded	Grounding Check — Citations must trace back to source material
exact_rank	Exact Match — Exact string match or rank order verification
AI judge	AI Judge — Secondary AI model evaluates subjective quality with structured rubric
contains_all	Required keywords/elements presence check
regex	Pattern matching for structured output format validation
json_structure	JSON schema compliance validation
Other	Custom scoring for specialized question types

Dimension Mapping (v5 → v6)

Version 6 reorganized evaluation dimensions:

v5 Dimension	v6 Mapping
coding Code Execution (v5)	→ → Code Execution
knowledge Knowledge Synthesis (v5)	→ → Grounding + Judgment
longctx Long Context (v5)	→ Grounding
value Value (v5)	→ Operational Signal：→ Value (unchanged)
stability Stability (v5)	→ Operational Signal：→ Stability (unchanged)
availability Availability	→ Operational Signal：→ Availability (new)

v6 introduced Code Execution and Grounding as primary dimensions, with Communication and Judgment as side dimensions.

Audit Trail

All evaluation data is fully auditable:

Execution Logs：Every API call and response is logged
Source Material：Original reference documents preserved
Integrity Evidence：Probe question results with full reasoning
Side Dimension Scoring：AI judge rubrics and score breakdowns

Evaluation Frequency

Daily Smoke Test：Quick anomaly detection, does not affect rankings smoke，questions per dimension for short-term tracking
Weekly Full Evaluation：Complete benchmark run with random sampling full，questions from pool, covers all dimensions
Automated weekly change reports

Rolling Average Mechanism

Rankings are based on rolling averages, not single evaluations:

Why Rolling Average? Reduces impact of lucky/unlucky individual runs
Window Size：Last 5 full evaluations
Accumulation Period：New models need minimum 3 runs before appearing in rankings
Anomaly Handling：Statistical outliers are flagged but included in the average

Current Versions

Formula	v7 — Scoring formula version
Judge	v6.3 — Scoring system version
Benchmark	v7 — Question pool version

Version SummaryFor details, see the Changelog.

Version Control

Configuration
Dated
Latest
Update Policy
See Changelog for version history.

Current Models

Claude Opus 4.7	claude-opus-4-7
Claude Sonnet 4.6	claude-sonnet-4-6
GPT-5.5	gpt-5.5
GPT-o3	o3
Grok 4	grok-4-0709
Gemini 2.5 Pro	gemini-2.5-pro
Gemini 3.1 Pro	gemini-3.1-pro-preview
DeepSeek V4 Pro	deepseek-v4-pro
GLM-4.6	glm-4.6
Qwen3 Max	qwen3-max
豆包 Pro	doubao-seed-2-0-pro-260215

WDCD · Compliance Test (Experimental)

WDCD (Winzheng Dynamic Contextual Decay) is an experimental evaluation dimension added in YZ Index v7. It tests whether models can maintain initially agreed constraints after facing long distraction text and social engineering pressure in multi-turn conversations.

Status	Experimental · Not included in main leaderboard
Dialogue Design	3 rounds: R1 constraint implant → R2 long-text distraction (2000-5000 words) → R3 social engineering pressure
Scoring Method	100% rule-based scoring, zero AI judge. R1(0-1) + R2(0-1) + R3(0-1), weighted average.
Question Pool	32 multi-turn constraint questions covering 5 scenario categories
Scenario Categories	Data Boundary · Resource Limits · Business Rules · Security Policies · Engineering Conventions

View full WDCD methodology →

Integrity Rules

No Merging：Different model versions are not merged
Missing Model：Model not evaluated in current period
Consistency：Cross-run consistency checks
Missing Rank：No ranking for unevaluated models

Evaluated Models

Model	Claude Opus 4.7 · Claude Sonnet 4.6 · GPT-5.5 · GPT-o3 · Grok 4 · Gemini 2.5 Pro · Gemini 3.1 Pro · DeepSeek V4 Pro · GLM-4.6 · Qwen3 Max · 豆包 Pro