YZ Index

Weekly Report

Weekly model performance changes and trend analysis.

Baseline: Run #192 · Formula v7 · Judge v6.3 · Benchmark v7 · 2026-06-22 04:39 SGT
Current: Run #204 · Formula v7 · Judge v6.3 · Benchmark v7 · 2026-06-29 04:56 SGT

Overall Score Changes

Gains listed first, then declines, each group ranked by magnitude of change.

Gemini 3.1 Pro +5.3
77.2 → 82.5
Claude Sonnet 4.6 +1.1
81.9 → 83.0
GPT-5.5 -15.4
88.3 → 72.9
ERNIE Bot 4.5 -8.1
81.3 → 73.2
GPT-o3 -7.1
90.5 → 83.4
Qwen3 Max -6.9
87.8 → 81.0
Doubao Pro -6.5
88.1 → 81.6
Grok 4 -4.9
89.9 → 85.0
Gemini 2.5 Pro -4.3
82.2 → 77.9
DeepSeek V4 Pro -3.5
92.3 → 88.8
Claude Opus 4.7 -1.2
90.6 → 89.3
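
The change figures above are the current score minus the baseline score. A minimal Python sketch of that arithmetic and of the ordering visible in the list (gains first, then declines, each by magnitude) follows. The `Change` type and `rank_changes` helper are hypothetical names introduced for illustration and are not part of the YZ Index; the sample values are copied from the list above. Because the sketch works from the displayed one-decimal scores, a recomputed delta can differ by 0.1 from the published one (for example Qwen3 Max, 87.8 → 81.0, shown as -6.9), which would be consistent with the report computing deltas from unrounded scores. That is an inference, not something the report states.

```python
from typing import NamedTuple


class Change(NamedTuple):
    """One model's score in the baseline run and the current run (hypothetical type)."""
    model: str
    baseline: float
    current: float

    @property
    def delta(self) -> float:
        # Current minus baseline, rounded to one decimal like the report's figures.
        return round(self.current - self.baseline, 1)


def rank_changes(changes):
    """Order gains first (largest first), then declines (largest drop first)."""
    gains = sorted((c for c in changes if c.delta >= 0), key=lambda c: -c.delta)
    declines = sorted((c for c in changes if c.delta < 0), key=lambda c: c.delta)
    return gains + declines


# Sample rows taken from the Overall Score Changes list.
changes = [
    Change("GPT-5.5", 88.3, 72.9),
    Change("Gemini 3.1 Pro", 77.2, 82.5),
    Change("Claude Sonnet 4.6", 81.9, 83.0),
    Change("ERNIE Bot 4.5", 81.3, 73.2),
]

for c in rank_changes(changes):
    print(f"{c.model} {c.delta:+.1f} ({c.baseline} → {c.current})")
```

The same calculation applies to the per-dimension figures in the sections below, with the dimension name carried alongside each row.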

Side Dimension Changes

Communication and Judgment changes.

Grok 4 +5.6
Judgment: 82.7 → 88.3
Gemini 2.5 Pro +5.3
Judgment: 80.0 → 85.3
Gemini 3.1 Pro +2.1
Judgment: 86.1 → 88.2
GPT-o3 +1.4
Judgment: 90.8 → 92.2
ERNIE Bot 4.5 +1.0
Judgment: 57.0 → 58.0
Doubao Pro +0.6
Communication: 99.1 → 99.7
Grok 4 -9.7
Communication: 92.2 → 82.5
Claude Sonnet 4.6 -8.8
Communication: 93.4 → 84.6
DeepSeek V4 Pro -2.7
Judgment: 96.5 → 93.8
Claude Sonnet 4.6 -1.6
Judgment: 96.7 → 95.1
Qwen3 Max -1.3
Communication: 80.6 → 79.3
Claude Opus 4.7 -0.6
Judgment: 96.1 → 95.5
Qwen3 Max -0.6
Judgment: 70.6 → 70.0

Operational Signal Changes

Stability, Availability, and Value changes.

Gemini 3.1 Pro +6.0
Stability: 30.1 → 36.1
Gemini 2.5 Pro +1.8
Availability: 89.0 → 90.8
Gemini 3.1 Pro +0.9
Value: 27.1 → 28.0
Claude Sonnet 4.6 +0.7
Stability: 42.0 → 42.7
Gemini 2.5 Pro -16.0
Stability: 60.4 → 44.4
Grok 4 -11.5
Stability: 53.0 → 41.5
Qwen3 Max -9.8
Stability: 46.9 → 37.1
Doubao Pro -9.5
Stability: 61.1 → 51.6
GPT-5.5 -9.2
Availability: 100.0 → 90.8
GPT-5.5 -8.7
Stability: 56.6 → 47.9
ERNIE Bot 4.5 -8.3
Stability: 35.0 → 26.7
DeepSeek V4 Pro -8.0
Stability: 63.7 → 55.7
GPT-o3 -6.8
Stability: 57.8 → 51.0
GPT-5.5 -2.9
Value: 21.4 → 18.5
Qwen3 Max -2.3
Value: 56.2 → 53.9
GPT-o3 -2.1
Availability: 98.0 → 95.9
Doubao Pro -2.1
Availability: 98.0 → 95.9
Gemini 2.5 Pro -1.5
Value: 41.7 → 40.2
Grok 4 -1.3
Value: 28.9 → 27.6
DeepSeek V4 Pro -1.2
Value: 50.7 → 49.5
Doubao Pro -1.0
Value: 96.0 → 95.0
Grok 4 -1.0
Availability: 100.0 → 99.0
GPT-o3 -0.6
Value: 10.8 → 10.2
ERNIE Bot 4.5 -0.5
Value: 99.2 → 98.7
Claude Opus 4.7 -0.5
Stability: 54.3 → 53.8