YZ Index
Weekly Report
Weekly model performance changes and trend analysis.
Baseline: Run #112 · Formula v7 · Judge v6 · Benchmark v6 · 2026-05-11 04:21 SGT
Current: Run #122 · Formula v7 · Judge v6 · Benchmark v6 · 2026-05-18 04:18 SGT
Overall Score Changes
Ranked by absolute change magnitude
Grok 4: +31.8 (49.2 → 81.0)
GPT-5.5: +3.8 (73.2 → 77.0)
GPT-o3: +2.6 (75.7 → 78.3)
Qwen3 Max: +1.8 (77.2 → 79.0)
Gemini 2.5 Pro: +0.6 (78.5 → 79.0)
文心一言 4.5: -11.1 (78.2 → 67.1)
Gemini 3.1 Pro: -1.6 (79.2 → 77.7)
DeepSeek V4 Pro: -1.4 (77.7 → 76.4)
豆包 Pro: -1.4 (82.6 → 81.3)
Claude Opus 4.7: -1.1 (81.1 → 80.0)
Claude Sonnet 4.6: -0.5 (83.5 → 83.0)
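As a rough illustration of how the Overall Score changes relate to the baseline and current values, the sketch below recomputes each change as current minus baseline and sorts by absolute magnitude. It uses only the rounded scores displayed in this report; the function name `rank_changes` is hypothetical and this is not the index's actual pipeline. Because it works from rounded endpoints, a few results differ by 0.1 from the published figures (for example 0.5 rather than +0.6 for Gemini 2.5 Pro), presumably because the report computes changes from unrounded scores. Note also that the report lists gains before losses, whereas a strict sort by absolute magnitude interleaves them.

```python
# Baseline and current Overall Scores as displayed in the report (rounded to one decimal).
scores = {
    "Grok 4": (49.2, 81.0),
    "GPT-5.5": (73.2, 77.0),
    "GPT-o3": (75.7, 78.3),
    "Qwen3 Max": (77.2, 79.0),
    "Gemini 2.5 Pro": (78.5, 79.0),
    "文心一言 4.5": (78.2, 67.1),
    "Gemini 3.1 Pro": (79.2, 77.7),
    "DeepSeek V4 Pro": (77.7, 76.4),
    "豆包 Pro": (82.6, 81.3),
    "Claude Opus 4.7": (81.1, 80.0),
    "Claude Sonnet 4.6": (83.5, 83.0),
}


def rank_changes(scores):
    """Return (model, delta) pairs sorted by absolute change, largest first.

    delta = current - baseline, rounded to one decimal place.
    """
    deltas = [(model, round(cur - base, 1)) for model, (base, cur) in scores.items()]
    return sorted(deltas, key=lambda pair: abs(pair[1]), reverse=True)


ranked = rank_changes(scores)
for model, delta in ranked:
    print(f"{model}: {delta:+.1f}")
```

Sorting on `abs(delta)` puts the 文心一言 4.5 drop directly after the Grok 4 gain, which makes the two largest movements of the week easy to spot.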
Side Dimension Changes
Communication and Judgment changes
Grok 4: Communication +8.7 (16.3 → 25.0)
Grok 4: Judgment +7.5 (37.7 → 45.2)
Gemini 2.5 Pro: Communication +5.0 (25.0 → 30.0)
Gemini 2.5 Pro: Judgment +3.7 (39.5 → 43.2)
Qwen3 Max: Judgment +3.7 (41.5 → 45.2)
Gemini 3.1 Pro: Judgment +3.6 (45.2 → 48.8)
Claude Opus 4.7: Judgment +2.1 (53.7 → 55.8)
GPT-o3: Judgment -8.1 (51.3 → 43.2)
GPT-5.5: Judgment -5.2 (48.4 → 43.2)
DeepSeek V4 Pro: Communication -5.0 (30.0 → 25.0)
文心一言 4.5: Communication -5.0 (30.0 → 25.0)
豆包 Pro: Judgment -4.8 (52.8 → 48.0)
DeepSeek V4 Pro: Judgment -2.9 (45.2 → 42.3)
Claude Sonnet 4.6: Judgment -2.0 (54.9 → 52.9)
文心一言 4.5: Judgment -1.7 (42.0 → 40.3)
Integrity Rating Changes
Changes in model integrity status
文心一言 4.5: Restored (⚠ warn → ✔ pass)
Grok 4: Restored (⚠ warn → ✔ pass)
Operational Signal Changes
Stability, Availability, and Value changes
Grok 4: Availability +38.0 (62.0 → 100.0)
Grok 4: Value +9.7 (15.1 → 24.8)
Qwen3 Max: Value +3.0 (47.7 → 50.7)
Gemini 3.1 Pro: Stability +2.7 (36.8 → 39.5)
Gemini 2.5 Pro: Value +2.1 (36.0 → 38.1)
Gemini 2.5 Pro: Availability +2.0 (98.0 → 100.0)
GPT-5.5: Value +1.1 (16.2 → 17.3)
Claude Sonnet 4.6: Value +1.0 (25.0 → 26.0)
GPT-5.5: Stability +1.0 (34.4 → 35.4)
Gemini 3.1 Pro: Value +0.9 (24.1 → 25.0)
DeepSeek V4 Pro: Value +0.6 (39.8 → 40.4)
豆包 Pro: Value +0.5 (92.4 → 92.9)
GPT-o3: Value +0.5 (8.4 → 8.9)
文心一言 4.5: Stability -6.5 (33.2 → 26.7)
Grok 4: Stability -5.7 (36.3 → 30.6)
豆包 Pro: Stability -3.1 (41.3 → 38.2)
GPT-o3: Stability -2.4 (35.9 → 33.5)
Claude Sonnet 4.6: Stability -2.2 (39.7 → 37.5)
DeepSeek V4 Pro: Stability -2.2 (36.1 → 33.9)
Claude Opus 4.7: Stability -1.9 (38.7 → 36.8)
文心一言 4.5: Availability -1.0 (100.0 → 99.0)
Gemini 3.1 Pro: Availability -1.0 (100.0 → 99.0)
文心一言 4.5: Value -0.7 (98.6 → 97.9)
Gemini 2.5 Pro: Stability -0.7 (35.0 → 34.3)
Legacy Dimension Changes
10 up · 8 down · 0 stable · 18 models
Significant Increases
文心一言 4.5: +72 · first included in the evaluation, Overall Score 72.0 · Overall (v5)
DeepSeek V4 Pro: +65.2 · first included in the evaluation, Overall Score 65.2 · Overall (v5)
Qwen3 Max: +64.9 · first included in the evaluation, Overall Score 64.9 · Overall (v5)
Gemini 3.1 Pro: +63.6 · first included in the evaluation, Overall Score 63.6 · Overall (v5)
Claude Opus 4.7: +62.5 · first included in the evaluation, Overall Score 62.5 · Overall (v5)
GPT-5.5: +59.6 · first included in the evaluation, Overall Score 59.6 · Overall (v5)
Grok 4: +41.5 · first included in the evaluation, Overall Score 41.5 · Overall (v5)
GPT-o3: Grounding +20.9 · grounding_raw
Claude Sonnet 4.6: Engineering Judgment +10.2 · judgment_raw
豆包 Pro: Engineering Judgment +10.1 · judgment_raw
Significant Decreases
DeepSeek V3: -75.1 · exited the evaluation this week · Overall (v5)
DeepSeek R1: -74 · exited the evaluation this week · Overall (v5)
文心一言 4.0: -71 · exited the evaluation this week · Overall (v5)
Grok 3: -65.6 · exited the evaluation this week · Overall (v5)
Qwen Max: -64.8 · exited the evaluation this week · Overall (v5)
Claude Opus 4.6: -61.6 · exited the evaluation this week · Overall (v5)
GPT-4o: -59.8 · exited the evaluation this week · Overall (v5)
Gemini 2.5 Pro: Code Execution -5.4 · execution_raw
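The legacy summary counts (10 up, 8 down, 0 stable, 18 in total) can be cross-checked against the entries listed under Significant Increases and Significant Decreases. The sketch below is one hedged way to do that: it classifies each listed change by sign. The `summarize` function and its `tolerance` parameter are hypothetical, since the report does not state what threshold, if any, it uses to call an entry stable.

```python
# Legacy-dimension changes as listed under Significant Increases and Significant Decreases.
increases = [72.0, 65.2, 64.9, 63.6, 62.5, 59.6, 41.5, 20.9, 10.2, 10.1]
decreases = [-75.1, -74.0, -71.0, -65.6, -64.8, -61.6, -59.8, -5.4]


def summarize(deltas, tolerance=0.0):
    """Count entries that moved up, moved down, or stayed within +/- tolerance.

    tolerance is an assumed parameter; the report does not document one.
    """
    up = sum(1 for d in deltas if d > tolerance)
    down = sum(1 for d in deltas if d < -tolerance)
    stable = len(deltas) - up - down
    return {"up": up, "down": down, "stable": stable, "total": len(deltas)}


summary = summarize(increases + decreases)
print(summary)  # {'up': 10, 'down': 8, 'stable': 0, 'total': 18}
```

Most of the large swings here come from models entering or leaving the evaluation rather than from score movement, so the up and down counts say more about roster turnover than about performance.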