YZ Index

Weekly Report

Weekly model performance changes and trend analysis.

Baseline: Run #192 · Formula v7 · Judge v6.3 · Benchmark v7 · 2026-06-22 04:39 SGT

Current: Run #204 · Formula v7 · Judge v6.3 · Benchmark v7 · 2026-06-29 04:56 SGT

Overall Score Changes

Increases first, then decreases, each ranked by absolute change magnitude

Gemini 3.1 Pro +5.3

77.2 → 82.5

Claude Sonnet 4.6 +1.1

81.9 → 83.0

GPT-5.5 -15.4

88.3 → 72.9

ERNIE Bot 4.5 -8.1

81.3 → 73.2

GPT-o3 -7.1

90.5 → 83.4

Qwen3 Max -6.9

87.8 → 81.0

Doubao Pro -6.5

88.1 → 81.6

Grok 4 -4.9

89.9 → 85.0

Gemini 2.5 Pro -4.3

82.2 → 77.9

DeepSeek V4 Pro -3.5

92.3 → 88.8

Claude Opus 4.7 -1.2

90.6 → 89.3
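The ordering above (increases first, then decreases, each by descending magnitude) can be reproduced from the listed baseline and current scores. The following is a minimal Python sketch, not the report's actual pipeline; the function name `ranked_changes` and the dictionary layout are assumptions for illustration. Because it works from the rounded scores shown here, a recomputed delta can differ from the reported one by 0.1 (for example Qwen3 Max and Claude Opus 4.7), presumably because the report subtracts unrounded values.

```python
# (baseline, current) overall scores copied from the list above.
scores = {
    "Gemini 3.1 Pro": (77.2, 82.5),
    "Claude Sonnet 4.6": (81.9, 83.0),
    "GPT-5.5": (88.3, 72.9),
    "ERNIE Bot 4.5": (81.3, 73.2),
    "GPT-o3": (90.5, 83.4),
    "Qwen3 Max": (87.8, 81.0),
    "Doubao Pro": (88.1, 81.6),
    "Grok 4": (89.9, 85.0),
    "Gemini 2.5 Pro": (82.2, 77.9),
    "DeepSeek V4 Pro": (92.3, 88.8),
    "Claude Opus 4.7": (90.6, 89.3),
}


def ranked_changes(scores):
    """Return (model, delta) pairs: increases first, then decreases,
    each group sorted by descending absolute change."""
    deltas = [
        (model, round(current - baseline, 1))
        for model, (baseline, current) in scores.items()
    ]
    ups = sorted((d for d in deltas if d[1] >= 0), key=lambda d: -abs(d[1]))
    downs = sorted((d for d in deltas if d[1] < 0), key=lambda d: -abs(d[1]))
    return ups + downs


ranking = ranked_changes(scores)
for model, delta in ranking:
    print(f"{model} {delta:+.1f}")
```

Splitting by sign before sorting keeps the two gains at the top even though several drops are larger in magnitude, which matches how the list is presented.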

Side Dimension Changes

Communication and Judgment changes

Grok 4 +5.6

Judgment: 82.7 → 88.3

Gemini 2.5 Pro +5.3

Judgment: 80.0 → 85.3

Gemini 3.1 Pro +2.1

Judgment: 86.1 → 88.2

GPT-o3 +1.4

Judgment: 90.8 → 92.2

ERNIE Bot 4.5 +1.0

Judgment: 57.0 → 58.0

Doubao Pro +0.6

Communication: 99.1 → 99.7

Grok 4 -9.7

Communication: 92.2 → 82.5

Claude Sonnet 4.6 -8.8

Communication: 93.4 → 84.6

DeepSeek V4 Pro -2.7

Judgment: 96.5 → 93.8

Claude Sonnet 4.6 -1.6

Judgment: 96.7 → 95.1

Qwen3 Max -1.3

Communication: 80.6 → 79.3

Claude Opus 4.7 -0.6

Judgment: 96.1 → 95.5

Qwen3 Max -0.6

Judgment: 70.6 → 70.0

Operational Signal Changes

Stability, Availability, and Value changes

Gemini 3.1 Pro +6.0

Stability: 30.1 → 36.1

Gemini 2.5 Pro +1.8

Availability: 89.0 → 90.8

Gemini 3.1 Pro +0.9

Value: 27.1 → 28.0

Claude Sonnet 4.6 +0.7

Stability: 42.0 → 42.7

Gemini 2.5 Pro -16.0

Stability: 60.4 → 44.4

Grok 4 -11.5

Stability: 53.0 → 41.5

Qwen3 Max -9.8

Stability: 46.9 → 37.1

Doubao Pro -9.5

Stability: 61.1 → 51.6

GPT-5.5 -9.2

Availability: 100.0 → 90.8

GPT-5.5 -8.7

Stability: 56.6 → 47.9

ERNIE Bot 4.5 -8.3

Stability: 35.0 → 26.7

DeepSeek V4 Pro -8.0

Stability: 63.7 → 55.7

GPT-o3 -6.8

Stability: 57.8 → 51.0

GPT-5.5 -2.9

Value: 21.4 → 18.5

Qwen3 Max -2.3

Value: 56.2 → 53.9

GPT-o3 -2.1

Availability: 98.0 → 95.9

Doubao Pro -2.1

Availability: 98.0 → 95.9

Gemini 2.5 Pro -1.5

Value: 41.7 → 40.2

Grok 4 -1.3

Value: 28.9 → 27.6

DeepSeek V4 Pro -1.2

Value: 50.7 → 49.5

Doubao Pro -1.0

Value: 96.0 → 95.0

Grok 4 -1.0

Availability: 100.0 → 99.0

GPT-o3 -0.6

Value: 10.8 → 10.2

ERNIE Bot 4.5 -0.5

Value: 99.2 → 98.7

Claude Opus 4.7 -0.5

Stability: 54.3 → 53.8

Legacy Dimension Changes

11 models: 6 up, 5 down, 0 stable

Significant Increases

Gemini 2.5 Pro: Code Execution +11.6

execution_raw

ERNIE Bot 4.5: Code Execution +7

execution_raw

Doubao Pro: Code Execution +2.6

execution_raw

Gemini 3.1 Pro: Code Execution +2.4

execution_raw

DeepSeek V4 Pro: Code Execution +2.1

execution_raw

GPT-o3: Code Execution +2

execution_raw

Significant Decreases

Claude Sonnet 4.6: Code Execution -15.6

execution_raw

Claude Opus 4.7: Code Execution -7.9

execution_raw

Qwen3 Max: Code Execution -6.5

execution_raw

Grok 4: Engineering Judgment -5.6

judgment_raw

GPT-5.5: Code Execution -4.5

execution_raw