Methodology
How we evaluate, score, and rank AI models — the complete technical documentation.
Evaluation Dimensions
The YZ Index evaluates AI models across 5 dimensions (2 core, 2 side, and 1 gateway) plus operational signals:
Core Dimensions
Main evaluation axes that form the Overall score
- Code Execution: Code runs in a Python sandbox; the pass rate is the score
- Grounding: Citation verification accuracy on long documents
Side Dimensions
Communication and Judgment sub-scores
- Judgment: Architecture review, risk assessment, decision quality
- Communication: Structured output and formatting compliance
Gateway Dimension
Integrity check — prerequisite for ranking eligibility
Gateway mechanism: 42 probes detect fabrication
Operational Signals
Stability and Availability tracking
- Value: Capability ÷ Price
- Stability: Output consistency across evaluations
- Availability: API uptime and response success rate
Scoring Formula
Core Overall Score
| Dimension | Weight |
|---|---|
| Code Execution | 0.55 (55%) |
| Grounding | 0.45 (45%) |
| Total | 1.00 |
Integrity Gate
≥ 60 → pass
40 – 59 → warn
< 40 → fail
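The thresholds above can be sketched as a small lookup function. This is a minimal illustration, not the index's actual implementation: the function name is hypothetical, and treating fractional scores between 59 and 60 as `warn` is an assumption, since the source lists only the bands.

```python
def integrity_label(integrity_score: float) -> str:
    """Map an integrity score to a gate label (hypothetical helper).

    Bands follow the list above: >= 60 pass, 40-59 warn, < 40 fail.
    Assumption: a fractional score such as 59.5 falls into "warn".
    """
    if integrity_score >= 60:
        return "pass"
    if integrity_score >= 40:
        return "warn"
    return "fail"


for score in (72, 45, 12):
    print(score, integrity_label(score))
```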
Overall Score Formula
if integrity_label = fail → min(core_overall_raw, 74.0)
else → core_overall_raw
pass → recommended (Recommended)
warn → neutral (Neutral)
fail → not_recommended (Not Recommended)
Example Calculation
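As an illustration, the weights, the fail cap of 74.0, and the recommendation mapping can be combined in a few lines. This is a sketch under stated assumptions: the input scores are made up, a 0-100 scale is assumed, and rounding to two decimals is a presentation choice that the source does not specify.

```python
CODE_WEIGHT = 0.55       # Code Execution weight from the table above
GROUNDING_WEIGHT = 0.45  # Grounding weight from the table above
FAIL_CAP = 74.0          # cap applied when the integrity label is "fail"

RECOMMENDATION = {
    "pass": "recommended",
    "warn": "neutral",
    "fail": "not_recommended",
}


def overall_score(code_execution: float, grounding: float, label: str):
    """Return (overall score, recommendation) for one model (hypothetical helper)."""
    core_overall_raw = CODE_WEIGHT * code_execution + GROUNDING_WEIGHT * grounding
    if label == "fail":
        overall = min(core_overall_raw, FAIL_CAP)
    else:
        overall = core_overall_raw
    # Rounding is an assumption made for display purposes only.
    return round(overall, 2), RECOMMENDATION[label]


# Hypothetical inputs: Code Execution 80, Grounding 90.
# Raw core score: 0.55 * 80 + 0.45 * 90 = 44 + 40.5 = 84.5
print(overall_score(80.0, 90.0, "pass"))
print(overall_score(80.0, 90.0, "fail"))
```

With these inputs, a model that fails the integrity gate is held at 74.0 even though its raw core score is 84.5.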
Integrity Gate Mechanism
The Integrity Rating is a prerequisite for ranking eligibility:
| Rating | Score | Outcome |
|---|---|---|
| pass | ≥ 60 pts | Model passes all integrity probes; normal ranking |
| warn | 40-59 pts | Some integrity concerns detected; ranked with warning flag |
| fail | < 40 pts | Serious integrity failures; excluded from recommendations |
Models that fabricate citations, make up data, or forge sources cannot be trusted for professional use, regardless of their raw scores.
Question Pool
The question pool contains questions covering all evaluation dimensions:
| Dimension | Questions | Topics |
|---|---|---|
| execution | ~87 | Algorithm, debugging, SQL, code review |
| grounding | ~59 | Long-form comprehension, cross-paragraph reasoning |
| judgment | ~25 | SWOT analysis, architecture review, risk assessment |
| integrity | ~25 | Citation verification, source checking, fabrication detection |
| communication | ~16 | Structured output, format compliance, clarity |
| Total | 212 | |
Anti-Overfitting Measures
42 canary probe questions detect targeted training against the benchmark.
Sampling Strategy
Each full evaluation randomly samples from the question pool:
| Dimension | Questions per round |
|---|---|
| execution | ~35 |
| grounding | ~25 |
| judgment | ~20 |
| integrity | ~12 (integrity probes are always included in full) |
| communication | ~8 |
| Subtotal | ~100 per round |
- Minimum Exposure: Each question appears at least once every N evaluations
- context_bundle_cap = 3: Category caps prevent dimension imbalance
- Random seed ensures reproducibility
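A seeded, per-dimension draw along these lines might look as follows. This is a minimal sketch, not the index's sampler: the question IDs, the function name, and the seed value are hypothetical, the quotas and pool sizes use the approximate figures from the tables above, and the minimum-exposure rule and `context_bundle_cap` are not modeled.

```python
import random

# Approximate pool sizes and per-round quotas taken from the tables above.
POOL_SIZES = {"execution": 87, "grounding": 59, "judgment": 25,
              "integrity": 25, "communication": 16}
QUOTAS = {"execution": 35, "grounding": 25, "judgment": 20,
          "integrity": 12, "communication": 8}


def sample_round(pool, quotas, seed):
    """Draw each dimension's quota without replacement, reproducibly."""
    rng = random.Random(seed)  # a dedicated generator keeps runs independent
    selected = {}
    for dimension, quota in quotas.items():
        questions = sorted(pool[dimension])  # stable order: the seed alone fixes the draw
        selected[dimension] = rng.sample(questions, min(quota, len(questions)))
    return selected


# Hypothetical question IDs such as "execution-001".
pool = {dim: [f"{dim}-{i:03d}" for i in range(1, n + 1)]
        for dim, n in POOL_SIZES.items()}

round_a = sample_round(pool, QUOTAS, seed=2024)
round_b = sample_round(pool, QUOTAS, seed=2024)
print(sum(len(qs) for qs in round_a.values()), round_a == round_b)
```

Reusing the same seed reproduces the same question set, which is what makes a round auditable after the fact.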
Scoring System
Multiple scoring methods ensure accuracy:
| Method | Description |
|---|---|
| sandbox | Sandbox Execution: code runs in an isolated Python sandbox; test cases verify correctness |
| grounded | Grounding Check: citations must trace back to source material |
| exact_rank | Exact Match: exact string match or rank-order verification |
| AI judge | AI Judge: a secondary AI model evaluates subjective quality with a structured rubric |
| contains_all | Presence check for required keywords/elements |
| regex | Pattern matching for structured output format validation |
| json_structure | JSON schema compliance validation |
| Other | Custom scoring for specialized question types |
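The rule-based methods in the table are simple enough to sketch directly. The functions below are hypothetical illustrations, not the index's scorers: binary 0/1 scoring is an assumption, the exact-match check covers only the string case, and the JSON check verifies required top-level keys instead of a full JSON Schema.

```python
import json
import re


def score_contains_all(answer: str, required: list) -> float:
    """1.0 if every required keyword appears (case-insensitive), else 0.0."""
    text = answer.lower()
    return 1.0 if all(item.lower() in text for item in required) else 0.0


def score_regex(answer: str, pattern: str) -> float:
    """1.0 if the pattern matches somewhere in the answer, else 0.0."""
    return 1.0 if re.search(pattern, answer) else 0.0


def score_json_structure(answer: str, required_keys: list) -> float:
    """1.0 if the answer parses as a JSON object with all required keys."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if all(key in data for key in required_keys) else 0.0


def score_exact(answer: str, expected: str) -> float:
    """1.0 on an exact string match after trimming surrounding whitespace."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


print(score_contains_all("Risk: high. Mitigation: retry.", ["risk", "mitigation"]))
print(score_json_structure('{"name": "x", "score": 1}', ["name", "score"]))
```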
Dimension Mapping (v5 → v6)
Version 6 reorganized evaluation dimensions:
| v5 Dimension | v6 Mapping |
|---|---|
| coding (Code Execution) | Code Execution |
| knowledge (Knowledge Synthesis) | Grounding + Judgment |
| longctx (Long Context) | Grounding |
| value (Value) | Operational Signal: Value (unchanged) |
| stability (Stability) | Operational Signal: Stability (unchanged) |
| availability (Availability) | Operational Signal: Availability (new) |
v6 introduced Code Execution and Grounding as primary dimensions, with Communication and Judgment as side dimensions.
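For anyone reading older v5 results, the mapping can be expressed as a lookup table. This is an illustrative sketch only: the dictionary and function names are hypothetical, and the source does not say how a v5 score that maps to two v6 dimensions would be split, so only the names are mapped.

```python
# v5 dimension key -> v6 dimension(s), following the mapping table above.
V5_TO_V6 = {
    "coding": ["Code Execution"],
    "knowledge": ["Grounding", "Judgment"],
    "longctx": ["Grounding"],
    "value": ["Value"],
    "stability": ["Stability"],
    "availability": ["Availability"],  # listed as new in v6
}


def v5_sources(v6_dimension: str) -> list:
    """Return the v5 dimension keys that feed a given v6 dimension."""
    return [v5 for v5, targets in V5_TO_V6.items() if v6_dimension in targets]


print(v5_sources("Grounding"))
```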
Audit Trail
All evaluation data is fully auditable:
- Execution Logs: Every API call and response is logged
- Source Material: Original reference documents preserved
- Integrity Evidence: Probe question results with full reasoning
- Side Dimension Scoring: AI judge rubrics and score breakdowns
Evaluation Frequency
- Daily Smoke Test (mode: smoke): Quick anomaly detection on a per-dimension set of questions for short-term tracking; does not affect rankings
- Weekly Full Evaluation (mode: full): Complete benchmark run with questions randomly sampled from the pool; covers all dimensions
- Automated weekly change reports
Rolling Average Mechanism
Rankings are based on rolling averages, not single evaluations:
- Why Rolling Average? Reduces impact of lucky/unlucky individual runs
- Window Size: Last 5 full evaluations
- Accumulation Period: New models need a minimum of 3 runs before appearing in rankings
- Anomaly Handling: Statistical outliers are flagged but included in the average
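The window and accumulation rules above can be sketched as follows. This is a minimal illustration under assumptions: the function name is hypothetical, an unweighted arithmetic mean is assumed, and a model with 3 or 4 runs is averaged over the runs it has.

```python
WINDOW_SIZE = 5  # last 5 full evaluations
MIN_RUNS = 3     # runs required before a model appears in rankings


def rolling_score(full_run_scores: list):
    """Average the most recent full-evaluation scores (ordered oldest to newest).

    Returns None while the model is still in its accumulation period.
    Outliers are not dropped, matching the anomaly-handling rule above.
    """
    if len(full_run_scores) < MIN_RUNS:
        return None
    window = full_run_scores[-WINDOW_SIZE:]
    return sum(window) / len(window)


# Hypothetical score histories.
print(rolling_score([80.0, 82.0]))
print(rolling_score([10.0, 70.0, 80.0, 90.0, 100.0, 60.0]))
```

In the second call the oldest score (10.0) has dropped out of the five-run window, so it no longer affects the average.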
Current Versions
| Component | Version | Description |
|---|---|---|
| Formula | v7 | Scoring formula version |
| Judge | v6 | Scoring system version |
| Benchmark | v6 | Question pool version |
Version Summary: For details, see the Changelog.
Version Control
- Configuration
- Dated
- Latest
- Update Policy
- See Changelog for version history.
Current Models
| Model | Model ID |
|---|---|
| Claude Opus 4.7 | claude-opus-4-7 |
| Claude Sonnet 4.6 | claude-sonnet-4-6-20250514 |
| GPT-5.5 | gpt-5.5 |
| GPT-o3 | o3 |
| Grok 4 | grok-4-0709 |
| Gemini 2.5 Pro | gemini-2.5-pro |
| Gemini 3.1 Pro | gemini-3.1-pro-preview |
| DeepSeek V4 Pro | deepseek-v4-pro |
| Qwen3 Max | qwen3-max |
| Doubao (豆包) Pro | doubao-seed-2-0-pro-260215 |
| ERNIE Bot (文心一言) 4.5 | ernie-4.5-8k-preview |
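WDCD · Commitment-Keeping Test (Experimental)
WDCD (Winzheng Dynamic Contextual Decay) is an experimental evaluation dimension added in YZ Index v7. It tests whether a model in a multi-turn conversation can still hold to the constraints agreed at the start after facing long distractor text and social-engineering pressure.
| Item | Details |
|---|---|
| Status | Experimental · not counted in the main leaderboard |
| Dialogue design | Three rounds: R1 constraint planting → R2 long-text distraction (2,000-5,000 characters) → R3 social-engineering pressure |
| Scoring | 100% rule-based, no AI judge. R1 (0-1) + R2 (0-1) + R3 (0-2) = maximum 4 |
| Question pool | 30 multi-turn constraint questions covering 5 scenario categories |
| Scenario categories | Data boundaries · Resource limits · Business rules · Security policies · Engineering conventions |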
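The WDCD point scheme (R1 up to 1, R2 up to 1, R3 up to 2, maximum 4) can be sketched as a small validation-and-sum function. The function name is hypothetical, and treating each round's points as whole numbers is an assumption, since the source gives only the ranges.

```python
# Maximum points per round, from the scoring row above.
ROUND_MAX = {"R1": 1, "R2": 1, "R3": 2}


def wdcd_score(r1: int, r2: int, r3: int) -> int:
    """Sum the per-round points for one question after range-checking them.

    Assumption: points are integers within each round's listed range.
    """
    for name, points in (("R1", r1), ("R2", r2), ("R3", r3)):
        if points not in range(ROUND_MAX[name] + 1):
            raise ValueError(f"{name} must be between 0 and {ROUND_MAX[name]}")
    return r1 + r2 + r3


# Hypothetical outcome: constraint held in R1 and R2, partly held in R3.
print(wdcd_score(1, 1, 1))
```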
Integrity Rules
- No Merging: Different model versions are not merged
- Missing Model: Model not evaluated in current period
- Consistency: Cross-run consistency checks
- Missing Rank: No ranking for unevaluated models
Evaluated Models
Models: Claude Opus 4.7 · Claude Sonnet 4.6 · GPT-5.5 · GPT-o3 · Grok 4 · Gemini 2.5 Pro · Gemini 3.1 Pro · DeepSeek V4 Pro · Qwen3 Max · Doubao (豆包) Pro · ERNIE Bot (文心一言) 4.5