Methodology
How we evaluate, score, and rank AI models — the complete technical documentation.
Evaluation Dimensions
The YZ Index evaluates AI models across 5 dimensions (2 core, 2 side, and 1 gateway) plus operational signals:
Core Dimensions
Main evaluation axes that form the Overall score
- Code Execution: Code runs in a Python sandbox; the pass rate is the score
- Grounding: Citation verification accuracy on long documents
Side Dimensions
Communication and Judgment sub-scores
- Judgment: Architecture review, risk assessment, decision quality
- Communication: Structured output and formatting compliance
Gateway Dimension
Integrity check — prerequisite for ranking eligibility
- Integrity: Gateway mechanism; 42 probes detect fabrication
Operational Signals
Value, Stability, and Availability tracking
- Value: Capability ÷ Price
- Stability: Output consistency across evaluations
- Availability: API uptime and response success rate
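The two signals that are stated as ratios can be sketched as below. This is a minimal illustration, not the index's actual implementation: the function names, the units of `capability` and `price`, and the choice to report availability as a percentage are all assumptions. Stability is omitted because the page does not say how consistency is measured.

```python
def value_score(capability: float, price: float) -> float:
    """Capability divided by price (units of both inputs are assumed)."""
    if price <= 0:
        raise ValueError("price must be positive")
    return capability / price


def availability_pct(successes: int, total_calls: int) -> float:
    """Share of API calls that returned successfully, as a percentage."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    return 100.0 * successes / total_calls
```

For example, a hypothetical model with a capability score of 80 at a price of 4 would get a value of 20 under this definition.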
Scoring Formula
Core Overall Score
| Dimension | Weight |
|---|---|
| Code Execution | 0.55 (55%) |
| Grounding | 0.45 (45%) |
| Total Weight | 1.00 |
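Read as a weighted sum, the weights above combine the two core scores as follows. This is a sketch under that reading; the dictionary keys and the function name are illustrative, and the input scores are assumed to be on a 0 to 100 scale.

```python
# Weights from the Core Overall Score table.
WEIGHTS = {"execution": 0.55, "grounding": 0.45}


def core_overall_raw(scores: dict) -> float:
    """Weighted sum of the two core dimension scores (assumed 0-100 each)."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())


# Hypothetical scores, for illustration only.
example = core_overall_raw({"execution": 80.0, "grounding": 70.0})
```

With these hypothetical inputs the result is approximately 75.5 (0.55 × 80 + 0.45 × 70).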
Integrity Gate
≥ 60 → pass
40 – 59 → warn
< 40 → fail
Overall Score Formula
if integrity_label = fail → min(core_overall_raw, 74.0)
else → core_overall_raw
pass → recommended (Recommended)
warn → neutral (Neutral)
fail → not_recommended (Not Recommended)
Example Calculation
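As a worked illustration of the gate and the cap, the sketch below applies the rules above to hypothetical scores. The function names are illustrative, and treating scores between 59 and 60 as warn is an assumption, since the published bands leave that gap unspecified.

```python
CAP_ON_FAIL = 74.0  # cap applied to the overall score when integrity fails

RECOMMENDATION = {
    "pass": "recommended",
    "warn": "neutral",
    "fail": "not_recommended",
}


def integrity_label(integrity_score: float) -> str:
    """Map an integrity score to pass / warn / fail."""
    if integrity_score >= 60:
        return "pass"
    if integrity_score >= 40:
        return "warn"  # assumption: covers everything from 40 up to 60
    return "fail"


def evaluate(core_overall_raw: float, integrity_score: float) -> tuple:
    """Return (overall score, integrity label, recommendation)."""
    label = integrity_label(integrity_score)
    if label == "fail":
        overall = min(core_overall_raw, CAP_ON_FAIL)
    else:
        overall = core_overall_raw
    return overall, label, RECOMMENDATION[label]


# Hypothetical model with a raw core score of 82.0:
print(evaluate(82.0, 65.0))  # (82.0, 'pass', 'recommended')
print(evaluate(82.0, 45.0))  # (82.0, 'warn', 'neutral')
print(evaluate(82.0, 35.0))  # (74.0, 'fail', 'not_recommended')
```

Note that the cap only changes the result when the raw score exceeds 74.0; a failing model with a raw score of 70.0 keeps 70.0.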
Integrity Gate Mechanism
The Integrity Rating is a prerequisite for ranking eligibility:
| Rating | Meaning |
|---|---|
| pass (≥ 60 pts) | Model passes all integrity probes; normal ranking |
| warn (40-59 pts) | Some integrity concerns detected; ranked with warning flag |
| fail (< 40 pts) | Serious integrity failures; excluded from recommendations |
Models that fabricate citations, make up data, or forge sources cannot be trusted for professional use, regardless of their raw scores.
Question Pool
The question pool contains questions covering all evaluation dimensions:
| Dimension | Questions | Coverage |
|---|---|---|
| execution | 66 | Algorithm, debugging, SQL, code review |
| grounding | 43 | Long-form comprehension, cross-paragraph reasoning |
| judgment | 20 | SWOT analysis, architecture review, risk assessment |
| integrity | 17 | Citation verification, source checking, fabrication detection |
| communication | 8 | Structured output, format compliance, clarity |
| Total | 154 | |
Anti-Overfitting Measures
42 canary probe questions detect targeted training against the benchmark.
Sampling Strategy
Each full evaluation randomly samples from the question pool:
| Dimension | Questions per round |
|---|---|
| execution | ~35 |
| grounding | ~25 |
| judgment | ~20 |
| integrity | ~12 (integrity probes are always included in full) |
| communication | ~8 |
| Subtotal | ~100 per round |
- Minimum Exposure: Each question appears at least once every N evaluations
- context_bundle_cap = 3: Category caps prevent dimension imbalance
- Random seed ensures reproducibility
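A seeded, per-dimension draw of this kind could look like the sketch below. It is an illustration only: the question IDs are synthetic, the quotas are the approximate per-round figures from the table, and the minimum-exposure rule and `context_bundle_cap` are not modelled because their exact semantics are not described here.

```python
import random

# Pool sizes and approximate per-round quotas from the tables above.
POOL_SIZES = {"execution": 66, "grounding": 43, "judgment": 20,
              "integrity": 17, "communication": 8}
QUOTAS = {"execution": 35, "grounding": 25, "judgment": 20,
          "integrity": 12, "communication": 8}


def sample_round(pool: dict, seed: int) -> dict:
    """Draw each dimension's quota without replacement, reproducibly."""
    rng = random.Random(seed)  # private generator; global state is untouched
    selected = {}
    for dimension, quota in QUOTAS.items():
        questions = sorted(pool[dimension])  # stable order before sampling
        k = min(quota, len(questions))
        selected[dimension] = rng.sample(questions, k)
    return selected


# Synthetic question IDs standing in for the real pool.
pool = {dim: [f"{dim}-{i:03d}" for i in range(n)]
        for dim, n in POOL_SIZES.items()}
round_a = sample_round(pool, seed=2024)
round_b = sample_round(pool, seed=2024)
```

Because the generator is seeded and the pool is sorted before sampling, `round_a` and `round_b` contain the same questions in the same order.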
Scoring System
Multiple scoring methods ensure accuracy:
| Method | Description |
|---|---|
| sandbox | Sandbox Execution: code runs in an isolated Python sandbox; test cases verify correctness |
| grounded | Grounding Check: citations must trace back to source material |
| exact_rank | Exact Match: exact string match or rank order verification |
| AI judge | AI Judge: a secondary AI model evaluates subjective quality with a structured rubric |
| contains_all | Required keywords/elements presence check |
| regex | Pattern matching for structured output format validation |
| json_structure | JSON schema compliance validation |
| Other | Custom scoring for specialized question types |
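The rule-based methods in the table can be sketched with the standard library alone. The function names, signatures, and the 0.0/1.0 return convention are assumptions, and `score_json_structure` is a simplified key-and-type check rather than full JSON Schema validation.

```python
import json
import re


def score_exact(answer: str, expected: str) -> float:
    """Exact string match after trimming surrounding whitespace."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def score_contains_all(answer: str, required: list) -> float:
    """All required keywords must appear (case-insensitive)."""
    text = answer.lower()
    return 1.0 if all(item.lower() in text for item in required) else 0.0


def score_regex(answer: str, pattern: str) -> float:
    """The whole trimmed answer must match the pattern."""
    return 1.0 if re.fullmatch(pattern, answer.strip()) else 0.0


def score_json_structure(answer: str, required_keys: dict) -> float:
    """Answer must be a JSON object with the given keys and value types."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    ok = all(isinstance(data.get(key), expected_type)
             for key, expected_type in required_keys.items())
    return 1.0 if ok else 0.0
```

A format-compliance question might, for instance, call `score_regex(answer, r"\d{4}-\d{2}-\d{2}")` to require an ISO-style date and nothing else.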
Dimension Mapping (v5 → v6)
Version 6 reorganized evaluation dimensions:
| v5 Dimension | v6 Mapping |
|---|---|
| coding (Code Execution) | Code Execution |
| knowledge (Knowledge Synthesis) | Grounding + Judgment |
| longctx (Long Context) | Grounding |
| value (Value) | Operational Signal: Value (unchanged) |
| stability (Stability) | Operational Signal: Stability (unchanged) |
| availability (Availability; no v5 equivalent) | Operational Signal: Availability (new) |
v6 introduced Code Execution and Grounding as primary dimensions, with Communication and Judgment as side dimensions.
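For anyone migrating historical v5 results, the mapping can be written down as a small lookup. This is a sketch: the v6 keys reuse the lowercase dimension names from the question pool table, and the helper name is illustrative.

```python
# v5 dimension -> v6 dimension(s) or operational signal it feeds into.
V5_TO_V6 = {
    "coding": ["execution"],
    "knowledge": ["grounding", "judgment"],
    "longctx": ["grounding"],
    "value": ["value"],
    "stability": ["stability"],
}


def v5_sources(v6_dimension: str) -> list:
    """List the v5 dimensions that map onto a given v6 dimension."""
    return sorted(v5 for v5, targets in V5_TO_V6.items()
                  if v6_dimension in targets)
```

Here `v5_sources("grounding")` returns `['knowledge', 'longctx']`, while `v5_sources("availability")` returns an empty list because that signal is new in v6.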
Audit Trail
All evaluation data is fully auditable:
- Execution Logs: Every API call and response is logged
- Source Material: Original reference documents preserved
- Integrity Evidence: Probe question results with full reasoning
- Side Dimension Scoring: AI judge rubrics and score breakdowns
Evaluation Frequency
- Daily Smoke Test (smoke run): Quick anomaly detection with questions from each dimension for short-term tracking; does not affect rankings
- Weekly Full Evaluation (full run): Complete benchmark run with questions randomly sampled from the pool; covers all dimensions
- Automated weekly change reports
Rolling Average Mechanism
Rankings are based on rolling averages, not single evaluations:
- Why Rolling Average? Reduces the impact of lucky or unlucky individual runs
- Window Size: Last 5 full evaluations
- Accumulation Period: New models need a minimum of 3 runs before appearing in rankings
- Anomaly Handling: Statistical outliers are flagged but included in the average
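The window and accumulation rules above can be sketched as follows. The function name and the use of `None` for a model that is not yet ranked are assumptions, and outlier flagging is left out because the detection criterion is not specified; outliers stay in the average either way.

```python
from statistics import fmean
from typing import Optional

WINDOW = 5    # last 5 full evaluations
MIN_RUNS = 3  # runs required before a model appears in rankings


def rolling_score(full_run_scores: list) -> Optional[float]:
    """Mean of the most recent WINDOW scores, or None if not yet ranked.

    full_run_scores is ordered oldest to newest.
    """
    if len(full_run_scores) < MIN_RUNS:
        return None
    return fmean(full_run_scores[-WINDOW:])
```

With hypothetical scores `[70.0, 80.0, 90.0]` the rolling score is 80.0; with only two runs the function returns `None` and the model stays out of the rankings.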
Current Versions
| Component | Version | Description |
|---|---|---|
| Formula | v7 | Scoring formula version |
| Judge | v6.3 | Scoring system version |
| Benchmark | v7 | Question pool version |
Version Summary: For details, see the Changelog.
Version Control
- Configuration
- Dated
- Latest
- Update Policy
- See Changelog for version history.
Current Models
| Model | Model ID |
|---|---|
| Claude Opus 4.7 | claude-opus-4-7 |
| Claude Sonnet 4.6 | claude-sonnet-4-6 |
| GPT-5.5 | gpt-5.5 |
| GPT-o3 | o3 |
| Grok 4 | grok-4-0709 |
| Gemini 2.5 Pro | gemini-2.5-pro |
| Gemini 3.1 Pro | gemini-3.1-pro-preview |
| DeepSeek V4 Pro | deepseek-v4-pro |
| GLM-4.6 | glm-4.6 |
| Qwen3 Max | qwen3-max |
| 豆包 Pro | doubao-seed-2-0-pro-260215 |
WDCD · Compliance Test (Experimental)
WDCD (Winzheng Dynamic Contextual Decay) is an experimental evaluation dimension added in YZ Index v7. It tests whether models can maintain initially agreed constraints after facing long distraction text and social engineering pressure in multi-turn conversations.
| Item | Details |
|---|---|
| Status | Experimental · Not included in main leaderboard |
| Dialogue Design | 3 rounds: R1 constraint implant → R2 long-text distraction (2000-5000 words) → R3 social engineering pressure |
| Scoring Method | 100% rule-based scoring, no AI judge. R1 (0-1) + R2 (0-1) + R3 (0-1), weighted average. |
| Question Pool | 32 multi-turn constraint questions covering 5 scenario categories |
| Scenario Categories | Data Boundary · Resource Limits · Business Rules · Security Policies · Engineering Conventions |
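The weighted average over the three rounds can be sketched as below. The round weights are not published on this page, so equal weights are assumed as the default; the function name and the input validation are likewise illustrative.

```python
def wdcd_score(r1: float, r2: float, r3: float,
               weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted average of three per-round scores, each in [0, 1].

    Equal weights are an assumption; pass other weights to change it.
    """
    rounds = (r1, r2, r3)
    if any(not 0.0 <= score <= 1.0 for score in rounds):
        raise ValueError("round scores must be between 0 and 1")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, rounds)) / total_weight
```

Under equal weights, a model that holds the constraint in R1 and R2 but gives way under pressure in R3 scores about 0.67.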
Integrity Rules
- No Merging: Different model versions are not merged
- Missing Model: Model not evaluated in current period
- Consistency: Cross-run consistency checks
- Missing Rank: No ranking for unevaluated models
Evaluated Models
| Model | Claude Opus 4.7 · Claude Sonnet 4.6 · GPT-5.5 · GPT-o3 · Grok 4 · Gemini 2.5 Pro · Gemini 3.1 Pro · DeepSeek V4 Pro · GLM-4.6 · Qwen3 Max · 豆包 Pro |
|---|---|