YZ Index
Weekly Report
Weekly model performance changes and trend analysis.
Baseline: Run #192 · Formula v7 · Judge v6.3 · Benchmark v7 · 2026-06-22 04:39 SGT
Current: Run #204 · Formula v7 · Judge v6.3 · Benchmark v7 · 2026-06-29 04:56 SGT
Overall Score Changes
Increases first, then decreases, each ranked by change magnitude.
Gemini 3.1 Pro · +5.3 · 77.2 → 82.5
Claude Sonnet 4.6 · +1.1 · 81.9 → 83.0
GPT-5.5 · -15.4 · 88.3 → 72.9
ERNIE Bot 4.5 · -8.1 · 81.3 → 73.2
GPT-o3 · -7.1 · 90.5 → 83.4
Qwen3 Max · -6.9 · 87.8 → 81.0
Doubao Pro · -6.5 · 88.1 → 81.6
Grok 4 · -4.9 · 89.9 → 85.0
Gemini 2.5 Pro · -4.3 · 82.2 → 77.9
DeepSeek V4 Pro · -3.5 · 92.3 → 88.8
Claude Opus 4.7 · -1.2 · 90.6 → 89.3
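
The list above shows gains first and then losses, each ordered by size of change. A minimal Python sketch of how such a list could be recomputed from the displayed baseline and current scores follows. The function name `rank_changes` and the choice of four sample rows are illustrative assumptions, not part of the report's actual pipeline.

```python
# Sample rows copied from the report: (model, baseline score, current score).
scores = [
    ("DeepSeek V4 Pro", 92.3, 88.8),
    ("GPT-5.5", 88.3, 72.9),
    ("Claude Sonnet 4.6", 81.9, 83.0),
    ("Gemini 3.1 Pro", 77.2, 82.5),
]


def rank_changes(rows):
    """Return (model, delta) pairs: gains first, then losses,
    each group ordered by magnitude of change, largest first."""
    deltas = [
        (model, round(current - baseline, 1))
        for model, baseline, current in rows
    ]
    # Sort key: losses after gains, then larger magnitudes first.
    return sorted(deltas, key=lambda pair: (pair[1] < 0, -abs(pair[1])))


ranked = rank_changes(scores)
for model, delta in ranked:
    print(f"{model} · {delta:+.1f}")
```

Deltas recomputed this way can differ by 0.1 from the published ones (for example Qwen3 Max, shown as -6.9 against 87.8 → 81.0), which suggests the report computes changes from unrounded scores. That is an inference from the displayed figures, not something the report states.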
Side Dimension Changes
Communication and Judgment changes
Grok 4 · +5.6 · Judgment: 82.7 → 88.3
Gemini 2.5 Pro · +5.3 · Judgment: 80.0 → 85.3
Gemini 3.1 Pro · +2.1 · Judgment: 86.1 → 88.2
GPT-o3 · +1.4 · Judgment: 90.8 → 92.2
ERNIE Bot 4.5 · +1.0 · Judgment: 57.0 → 58.0
Doubao Pro · +0.6 · Communication: 99.1 → 99.7
Grok 4 · -9.7 · Communication: 92.2 → 82.5
Claude Sonnet 4.6 · -8.8 · Communication: 93.4 → 84.6
DeepSeek V4 Pro · -2.7 · Judgment: 96.5 → 93.8
Claude Sonnet 4.6 · -1.6 · Judgment: 96.7 → 95.1
Qwen3 Max · -1.3 · Communication: 80.6 → 79.3
Claude Opus 4.7 · -0.6 · Judgment: 96.1 → 95.5
Qwen3 Max · -0.6 · Judgment: 70.6 → 70.0
Operational Signal Changes
Stability, Availability, and Value changes
Gemini 3.1 Pro · +6.0 · Stability: 30.1 → 36.1
Gemini 2.5 Pro · +1.8 · Availability: 89.0 → 90.8
Gemini 3.1 Pro · +0.9 · Value: 27.1 → 28.0
Claude Sonnet 4.6 · +0.7 · Stability: 42.0 → 42.7
Gemini 2.5 Pro · -16.0 · Stability: 60.4 → 44.4
Grok 4 · -11.5 · Stability: 53.0 → 41.5
Qwen3 Max · -9.8 · Stability: 46.9 → 37.1
Doubao Pro · -9.5 · Stability: 61.1 → 51.6
GPT-5.5 · -9.2 · Availability: 100.0 → 90.8
GPT-5.5 · -8.7 · Stability: 56.6 → 47.9
ERNIE Bot 4.5 · -8.3 · Stability: 35.0 → 26.7
DeepSeek V4 Pro · -8.0 · Stability: 63.7 → 55.7
GPT-o3 · -6.8 · Stability: 57.8 → 51.0
GPT-5.5 · -2.9 · Value: 21.4 → 18.5
Qwen3 Max · -2.3 · Value: 56.2 → 53.9
GPT-o3 · -2.1 · Availability: 98.0 → 95.9
Doubao Pro · -2.1 · Availability: 98.0 → 95.9
Gemini 2.5 Pro · -1.5 · Value: 41.7 → 40.2
Grok 4 · -1.3 · Value: 28.9 → 27.6
DeepSeek V4 Pro · -1.2 · Value: 50.7 → 49.5
Doubao Pro · -1.0 · Value: 96.0 → 95.0
Grok 4 · -1.0 · Availability: 100.0 → 99.0
GPT-o3 · -0.6 · Value: 10.8 → 10.2
ERNIE Bot 4.5 · -0.5 · Value: 99.2 → 98.7
Claude Opus 4.7 · -0.5 · Stability: 54.3 → 53.8
Legacy Dimension Changes
6 Up · 5 Down · 0 Stable · 11 models
Significant Increases
Gemini 2.5 Pro · +11.6 · Code Execution (execution_raw)
ERNIE Bot 4.5 · +7.0 · Code Execution (execution_raw)
Doubao Pro · +2.6 · Code Execution (execution_raw)
Gemini 3.1 Pro · +2.4 · Code Execution (execution_raw)
DeepSeek V4 Pro · +2.1 · Code Execution (execution_raw)
GPT-o3 · +2.0 · Code Execution (execution_raw)
Significant Decreases
Claude Sonnet 4.6 · -15.6 · Code Execution (execution_raw)
Claude Opus 4.7 · -7.9 · Code Execution (execution_raw)
Qwen3 Max · -6.5 · Code Execution (execution_raw)
Grok 4 · -5.6 · Engineering Judgment (judgment_raw)
GPT-5.5 · -4.5 · Code Execution (execution_raw)
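
The Up / Down / Stable tally reported for these dimension changes can be reproduced with a simple threshold rule. The sketch below is only an illustration: the report does not say what cutoff separates "Stable" from a significant change, so `STABLE_THRESHOLD = 1.0` and the helper `classify` are hypothetical.

```python
from collections import Counter

# Deltas copied from the Significant Increases and Significant Decreases lists.
legacy_deltas = {
    "Gemini 2.5 Pro": 11.6,
    "ERNIE Bot 4.5": 7.0,
    "Doubao Pro": 2.6,
    "Gemini 3.1 Pro": 2.4,
    "DeepSeek V4 Pro": 2.1,
    "GPT-o3": 2.0,
    "Claude Sonnet 4.6": -15.6,
    "Claude Opus 4.7": -7.9,
    "Qwen3 Max": -6.5,
    "Grok 4": -5.6,
    "GPT-5.5": -4.5,
}

# Hypothetical cutoff: the report does not state its actual threshold.
STABLE_THRESHOLD = 1.0


def classify(delta, threshold=STABLE_THRESHOLD):
    """Label a change as Up, Down, or Stable relative to the threshold."""
    if delta >= threshold:
        return "Up"
    if delta <= -threshold:
        return "Down"
    return "Stable"


counts = Counter(classify(delta) for delta in legacy_deltas.values())
summary = {label: counts.get(label, 0) for label in ("Up", "Down", "Stable")}
print(summary, "·", len(legacy_deltas), "models")
```

With this assumed cutoff the tally is 6 Up, 5 Down, 0 Stable across 11 models, matching the reported counts. Any threshold between 0 and 2.0 would give the same result for this week's data.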