Skip to main content
YZ Index v6

Methodology

How we evaluate, score, and rank AI models — the complete technical documentation.

Evaluation Dimensions

The YZ Index evaluates AI models across 5 core dimensions plus operational signals:

Core Dimensions

Main evaluation axes that form the Overall score

Code Execution 55%

Code runs in Python sandbox; pass rate is the score

Grounding 45%

Long document citation verification accuracy

Side Dimensions

Communication and Judgment sub-scores

Engineering Judgment

Architecture review, risk assessment, decision quality

Task Communication

Structured output and formatting compliance

Gateway Dimension

Integrity check — prerequisite for ranking eligibility

Integrity Rating pass / warn / fail

Gateway mechanism: 42 probes detect fabrication

Operational Signals

Stability and Availability tracking

Value

Capability ÷ Price

Stability

Output consistency across evaluations

Availability

API uptime and response success rate

Scoring Formula

Core Overall Score

core_overall = 0.55 × Execution + 0.45 × Grounding
Code Execution Weight 0.55(55%)
Grounding Weight 0.45(45%)
Total Weight 1.00

Integrity Gate

integrity_label:
  ≥ 60 → pass
  40 – 59 → warn
  < 40 → fail

Overall Score Formula

core_overall_display:
  if integrity_label = fail → min(core_overall_raw, 74.0)
  else → core_overall_raw
recommendation_status:
  pass → recommended(Recommended)
  warn → neutral(Neutral)
  fail → not_recommended(Not Recommended)

Example Calculation

Integrity Gate Mechanism

The Integrity Rating is a prerequisite for ranking eligibility:

pass (≥ 60 pts) Model passes all integrity probes — normal ranking
warn (40-59 pts) Some integrity concerns detected — ranked with warning flag
fail (< 40 pts) Serious integrity failures — excluded from recommendations

Models that fabricate citations, make up data, or forge sources cannot be trusted for professional use, regardless of their raw scores.

Question Pool

The question pool contains questions covering all evaluation dimensions:

execution 66 questions — Algorithm, debugging, SQL, code review
grounding 43 questions — Long-form comprehension, cross-paragraph reasoning
judgment 20 questions — SWOT analysis, architecture review, risk assessment
integrity 17 questions — Citation verification, source checking, fabrication detection
communication 8 questions — Structured output, format compliance, clarity
Total 154 Total Questions

Anti-Overfitting Measures

42 canary probe questions detect targeted training against the benchmark.

Sampling Strategy

Each full evaluation randomly samples from the question pool:

execution ~35 Q
grounding ~25 Q
judgment ~20 Q
integrity ~12 Q(Integrity probes are always included in full)
communication ~8 Q
Subtotal ~100 Q / per round

Scoring System

Multiple scoring methods ensure accuracy:

sandbox Sandbox Execution — Code runs in isolated Python sandbox; test cases verify correctness
grounded Grounding Check — Citations must trace back to source material
exact_rank Exact Match — Exact string match or rank order verification
AI judge AI Judge — Secondary AI model evaluates subjective quality with structured rubric
contains_all Required keywords/elements presence check
regex Pattern matching for structured output format validation
json_structure JSON schema compliance validation
Other Custom scoring for specialized question types

Dimension Mapping (v5 → v6)

Version 6 reorganized evaluation dimensions:

v5 Dimension v6 Mapping
coding Code Execution (v5) → → Code Execution
knowledge Knowledge Synthesis (v5) → → Grounding + Judgment
longctx Long Context (v5) Grounding
value Value (v5) → Operational Signal:→ Value (unchanged)
stability Stability (v5) → Operational Signal:→ Stability (unchanged)
availability Availability → Operational Signal:→ Availability (new)

v6 introduced Code Execution and Grounding as primary dimensions, with Communication and Judgment as side dimensions.

Audit Trail

All evaluation data is fully auditable:

Evaluation Frequency

Rolling Average Mechanism

Rankings are based on rolling averages, not single evaluations:

Current Versions

Formula v7 — Scoring formula version
Judge v6.3 — Scoring system version
Benchmark v7 — Question pool version

Version SummaryFor details, see the Changelog.

Version Control

Current Models

Claude Opus 4.7 claude-opus-4-7
Claude Sonnet 4.6 claude-sonnet-4-6
GPT-5.5 gpt-5.5
GPT-o3 o3
Grok 4 grok-4-0709
Gemini 2.5 Pro gemini-2.5-pro
Gemini 3.1 Pro gemini-3.1-pro-preview
DeepSeek V4 Pro deepseek-v4-pro
GLM-4.6 glm-4.6
Qwen3 Max qwen3-max
豆包 Pro doubao-seed-2-0-pro-260215

WDCD · Compliance Test (Experimental)

WDCD (Winzheng Dynamic Contextual Decay) is an experimental evaluation dimension added in YZ Index v7. It tests whether models can maintain initially agreed constraints after facing long distraction text and social engineering pressure in multi-turn conversations.

StatusExperimental · Not included in main leaderboard
Dialogue Design3 rounds: R1 constraint implant → R2 long-text distraction (2000-5000 words) → R3 social engineering pressure
Scoring Method100% rule-based scoring, zero AI judge. R1(0-1) + R2(0-1) + R3(0-1), weighted average.
Question Pool32 multi-turn constraint questions covering 5 scenario categories
Scenario CategoriesData Boundary · Resource Limits · Business Rules · Security Policies · Engineering Conventions

View full WDCD methodology →

Integrity Rules

Evaluated Models

Model Claude Opus 4.7 · Claude Sonnet 4.6 · GPT-5.5 · GPT-o3 · Grok 4 · Gemini 2.5 Pro · Gemini 3.1 Pro · DeepSeek V4 Pro · GLM-4.6 · Qwen3 Max · 豆包 Pro