YZ Index v6

Methodology

How we evaluate, score, and rank AI models — the complete technical documentation.

Evaluation Dimensions

The YZ Index evaluates AI models across five dimensions (two core, two side, and one gateway) plus operational signals:

Core Dimensions

Main evaluation axes that form the Overall score

Code Execution 55%

Code runs in Python sandbox; pass rate is the score

Grounding 45%

Long document citation verification accuracy

Side Dimensions

Communication and Judgment sub-scores

Engineering Judgment

Architecture review, risk assessment, decision quality

Task Communication

Structured output and formatting compliance

Gateway Dimension

Integrity check — prerequisite for ranking eligibility

Integrity Rating pass / warn / fail

Gateway mechanism: 42 probes detect fabrication

Operational Signals

Value, Stability, and Availability tracking

Value

Capability ÷ Price

Stability

Output consistency across evaluations

Availability

API uptime and response success rate

Scoring Formula

Core Overall Score

core_overall = 0.55 × Execution + 0.45 × Grounding
Code Execution Weight 0.55 (55%)
Grounding Weight 0.45 (45%)
Total Weight 1.00
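
The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration, not the index's actual implementation: the function name is hypothetical, the 0-100 score scale is an assumption, and the input scores are made up.

```python
# Weights as stated in the formula above; the 0-100 scale is an assumption.
EXECUTION_WEIGHT = 0.55
GROUNDING_WEIGHT = 0.45


def core_overall(execution: float, grounding: float) -> float:
    """Weighted sum of the two core dimension scores."""
    return EXECUTION_WEIGHT * execution + GROUNDING_WEIGHT * grounding


# Hypothetical scores, for illustration only:
# 0.55 * 80 + 0.45 * 60 = 44 + 27 = 71
print(round(core_overall(80.0, 60.0), 2))
```

Because the weights sum to 1.00, the result stays on the same scale as the two inputs.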

Integrity Gate

integrity_label:
  ≥ 60 → pass
  40 – 59 → warn
  < 40 → fail

Overall Score Formula

core_overall_display:
  if integrity_label = fail → min(core_overall_raw, 74.0)
  else → core_overall_raw
recommendation_status:
  pass → recommended (Recommended)
  warn → neutral (Neutral)
  fail → not_recommended (Not Recommended)

Example Calculation
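
As a worked illustration of the display rule and the recommendation mapping, the sketch below applies them to a hypothetical model with Execution 90 and Grounding 80. The scores are invented for the example, and the names `FAIL_CAP` and `RECOMMENDATION_STATUS` are assumptions; only the 74.0 cap, the weights, and the three status values come from the formulas above.

```python
FAIL_CAP = 74.0  # displayed score is capped here when the integrity label is "fail"

RECOMMENDATION_STATUS = {
    "pass": "recommended",
    "warn": "neutral",
    "fail": "not_recommended",
}


def core_overall_display(core_overall_raw: float, integrity_label: str) -> float:
    """Apply the integrity cap to the raw core score."""
    if integrity_label == "fail":
        return min(core_overall_raw, FAIL_CAP)
    return core_overall_raw


# Hypothetical model: raw = 0.55 * 90 + 0.45 * 80 = 49.5 + 36 = 85.5
raw = round(0.55 * 90 + 0.45 * 80, 1)
for label in ("pass", "warn", "fail"):
    print(label, core_overall_display(raw, label), RECOMMENDATION_STATUS[label])
# pass 85.5 recommended
# warn 85.5 neutral
# fail 74.0 not_recommended
```

Note that the cap only lowers a score: a failing model whose raw score is already below 74.0 keeps its raw score.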

Integrity Gate Mechanism

The Integrity Rating is a prerequisite for ranking eligibility:

pass (≥ 60 pts) Model passes all integrity probes — normal ranking
warn (40-59 pts) Some integrity concerns detected — ranked with warning flag
fail (< 40 pts) Serious integrity failures — excluded from recommendations

Models that fabricate citations, make up data, or forge sources cannot be trusted for professional use, regardless of their raw scores.
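
The three bands can be expressed as a small lookup. This is a sketch under one stated assumption: the published bands (≥ 60, 40-59, < 40) do not say how fractional scores between 59 and 60 are treated, so the code below places everything from 40 up to but not including 60 in `warn`. The function name is hypothetical.

```python
def integrity_gate(integrity_score: float) -> tuple[str, str]:
    """Return (label, ranking treatment) for an integrity score.

    Assumption: scores in [40, 60) are treated as "warn".
    """
    if integrity_score >= 60:
        return "pass", "normal ranking"
    if integrity_score >= 40:
        return "warn", "ranked with warning flag"
    return "fail", "excluded from recommendations"


for score in (82, 55, 31):  # hypothetical integrity scores
    print(score, integrity_gate(score))
```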

Question Pool

The question pool contains 212 questions spread across all evaluation dimensions:

execution ~87 questions — Algorithm, debugging, SQL, code review
grounding ~59 questions — Long-form comprehension, cross-paragraph reasoning
judgment ~25 questions — SWOT analysis, architecture review, risk assessment
integrity ~25 questions — Citation verification, source checking, fabrication detection
communication ~16 questions — Structured output, format compliance, clarity
Total 212 questions

Anti-Overfitting Measures

42 canary probe questions detect targeted training against the benchmark.

Sampling Strategy

Each full evaluation randomly samples from the question pool:

execution ~35 Q
grounding ~25 Q
judgment ~20 Q
integrity ~12 Q (integrity probes are always included in full)
communication ~8 Q
Subtotal ~100 Q per round
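
One way to picture the per-dimension sampling is a stratified draw without replacement. The sketch below is illustrative only: it treats the approximate counts as exact, uses placeholder question IDs, and samples the integrity dimension like any other, so it does not model the rule that integrity probes are always included in full. The function names and the `seed` parameter are assumptions.

```python
import random

# Approximate pool and sample sizes from the tables above, treated as exact here.
POOL_SIZES = {"execution": 87, "grounding": 59, "judgment": 25,
              "integrity": 25, "communication": 16}
SAMPLE_SIZES = {"execution": 35, "grounding": 25, "judgment": 20,
                "integrity": 12, "communication": 8}


def build_pool() -> dict[str, list[str]]:
    """Create placeholder question IDs such as 'execution-001'."""
    return {dim: [f"{dim}-{i:03d}" for i in range(1, n + 1)]
            for dim, n in POOL_SIZES.items()}


def sample_round(pool: dict[str, list[str]], seed=None) -> dict[str, list[str]]:
    """Draw each dimension's quota without replacement."""
    rng = random.Random(seed)
    return {dim: rng.sample(pool[dim], k) for dim, k in SAMPLE_SIZES.items()}


round_questions = sample_round(build_pool(), seed=42)
print(sum(len(qs) for qs in round_questions.values()))  # 100
```

Passing a fixed seed makes a round reproducible, which helps when an evaluation needs to be re-run or audited.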

Scoring System

Each question is scored with one of several methods, depending on the question type:

sandbox Sandbox Execution — Code runs in isolated Python sandbox; test cases verify correctness
grounded Grounding Check — Citations must trace back to source material
exact_rank Exact Match — Exact string match or rank order verification
AI judge AI Judge — Secondary AI model evaluates subjective quality with structured rubric
contains_all Required keywords/elements presence check
regex Pattern matching for structured output format validation
json_structure JSON schema compliance validation
Other Custom scoring for specialized question types
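
The rule-based methods in this list are simple enough to sketch directly. The functions below are hypothetical stand-ins, not the index's scorers: they assume all-or-nothing scoring (1.0 or 0.0), case-insensitive keyword matching, and a simplified structure check (required keys with expected Python types) in place of full JSON Schema validation.

```python
import json
import re


def score_exact(answer: str, expected: str) -> float:
    """exact match: compare after trimming surrounding whitespace."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def score_contains_all(answer: str, required: list[str]) -> float:
    """contains_all: every required keyword must appear (case-insensitive)."""
    text = answer.lower()
    return 1.0 if all(word.lower() in text for word in required) else 0.0


def score_regex(answer: str, pattern: str) -> float:
    """regex: the pattern must match somewhere in the answer."""
    return 1.0 if re.search(pattern, answer) else 0.0


def score_json_structure(answer: str, required: dict[str, type]) -> float:
    """json_structure (simplified): valid JSON object with typed required keys."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    ok = all(k in data and isinstance(data[k], t) for k, t in required.items())
    return 1.0 if ok else 0.0


print(score_contains_all("Add an index and a cache", ["index", "cache"]))
print(score_json_structure('{"risk": "high", "score": 3}', {"risk": str, "score": int}))
```

Deterministic checks like these need no judge model, so the same answer always receives the same score.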

Dimension Mapping (v5 → v6)

Version 6 reorganized evaluation dimensions:

v5 Dimension → v6 Mapping
coding: Code Execution (v5) → Code Execution
knowledge: Knowledge Synthesis (v5) → Grounding + Judgment
longctx: Long Context (v5) → Grounding
value: Value (v5) → Operational Signal: Value (unchanged)
stability: Stability (v5) → Operational Signal: Stability (unchanged)
availability: Availability → Operational Signal: Availability (new)

v6 introduced Code Execution and Grounding as primary dimensions, with Communication and Judgment as side dimensions.

Audit Trail

All evaluation data is fully auditable.

Evaluation Frequency

Rolling Average Mechanism

Rankings are based on rolling averages, not single evaluations.
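
The window length and any weighting used for the rolling average are not stated here. Purely to illustrate the idea, the sketch below assumes an unweighted mean over the most recent three evaluations; the window size, function name, and scores are all hypothetical.

```python
from collections import deque


def rolling_averages(scores: list[float], window: int = 3) -> list[float]:
    """Mean of the most recent `window` scores after each evaluation.

    The window of 3 is an assumption made for this example.
    """
    recent = deque(maxlen=window)  # older scores drop out automatically
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages


# Hypothetical evaluation history: a single weak run (60) moves the
# ranking score far less than it would on its own.
print([round(a, 2) for a in rolling_averages([70, 80, 90, 60])])
# [70.0, 75.0, 80.0, 76.67]
```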

Current Versions

Formula v7 — Scoring formula version
Judge v6 — Scoring system version
Benchmark v6 — Question pool version

Version Summary: for details, see the Changelog.

Version Control

Current Models

Claude Opus 4.7 claude-opus-4-7
Claude Sonnet 4.6 claude-sonnet-4-6-20250514
GPT-5.5 gpt-5.5
GPT-o3 o3
Grok 4 grok-4-0709
Gemini 2.5 Pro gemini-2.5-pro
Gemini 3.1 Pro gemini-3.1-pro-preview
DeepSeek V4 Pro deepseek-v4-pro
Qwen3 Max qwen3-max
Doubao Pro (豆包 Pro) doubao-seed-2-0-pro-260215
ERNIE Bot 4.5 (文心一言 4.5) ernie-4.5-8k-preview

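WDCD · Constraint-Keeping Test (Experimental)

WDCD (Winzheng Dynamic Contextual Decay) is an experimental evaluation dimension added in YZ Index (赢政指数) v7. It tests whether a model, over a multi-turn conversation, still honors the constraints agreed at the start after being exposed to long distractor text and social-engineering pressure.

Status: experimental · not counted in the main leaderboard
Conversation design: three rounds, R1 constraint setup → R2 long-text distraction (2,000-5,000 characters) → R3 social-engineering pressure
Scoring: 100% rule-based, no AI judge. R1 (0-1) + R2 (0-1) + R3 (0-2) = maximum 4
Question pool: 30 multi-turn constraint questions covering 5 scenario categories
Scenario categories: data boundaries · resource limits · business rules · security policies · engineering conventions

View the full WDCD methodology →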
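
The WDCD point scheme (R1 worth 0-1, R2 worth 0-1, R3 worth 0-2, for a maximum of 4) can be sketched as a range-checked sum. The function name is hypothetical, and how each round's points are decided by the rule-based checks is not described here, so the sketch takes the per-round points as inputs.

```python
# Maximum points per round, as stated in the WDCD scoring rule.
ROUND_MAX = {"R1": 1, "R2": 1, "R3": 2}


def wdcd_score(r1: float, r2: float, r3: float) -> float:
    """Sum the per-round points after checking each is within its range."""
    points = {"R1": r1, "R2": r2, "R3": r3}
    for name, value in points.items():
        if not 0 <= value <= ROUND_MAX[name]:
            raise ValueError(f"{name} must be between 0 and {ROUND_MAX[name]}")
    return r1 + r2 + r3


# Hypothetical outcome: the constraint held through R1 and R2,
# then was only partly kept under R3 pressure.
print(wdcd_score(1, 1, 1))  # 3
```

R3 carries half of the available points, so a model that gives way only under social-engineering pressure can score at most 2 of 4.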

Integrity Rules

Evaluated Models

Models: Claude Opus 4.7 · Claude Sonnet 4.6 · GPT-5.5 · GPT-o3 · Grok 4 · Gemini 2.5 Pro · Gemini 3.1 Pro · DeepSeek V4 Pro · Qwen3 Max · Doubao Pro (豆包 Pro) · ERNIE Bot 4.5 (文心一言 4.5)