YZ Index

Recent Comparison

Comparison of two evaluation runs taken one week apart.

Baseline: Run #112 · Formula v7 · Judge v6 · Benchmark v6 · 2026-05-11 04:21 SGT
Current: Run #122 · Formula v7 · Judge v6 · Benchmark v6 · 2026-05-18 04:18 SGT

Overall Score Changes
Increases first, then decreases, each ranked by magnitude of change

Grok 4 +31.8
49.2 → 81.0
GPT-5.5 +3.8
73.2 → 77.0
GPT-o3 +2.6
75.7 → 78.3
Qwen3 Max +1.8
77.2 → 79.0
Gemini 2.5 Pro +0.6
78.5 → 79.0
文心一言 4.5 -11.1
78.2 → 67.1
Gemini 3.1 Pro -1.6
79.2 → 77.7
DeepSeek V4 Pro -1.4
77.7 → 76.4
豆包 Pro -1.4
82.6 → 81.3
Claude Opus 4.7 -1.1
81.1 → 80.0
Claude Sonnet 4.6 -0.5
83.5 → 83.0
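
The change figures above can be reproduced approximately from the displayed scores. The following is a minimal sketch, not the site's actual formula: the `scores` dictionary and the `ranked_changes` helper are hypothetical names introduced here for illustration. It recomputes each delta as current minus baseline and sorts by absolute magnitude across all models. Because it works from the rounded values shown on the page, a recomputed delta can differ from the displayed one by 0.1 (for example, Gemini 2.5 Pro is listed as +0.6 while 79.0 minus 78.5 gives 0.5), presumably because the page computes changes from unrounded scores.

```python
# Displayed (rounded) overall scores copied from the list above: (baseline, current).
scores = {
    "Grok 4": (49.2, 81.0),
    "GPT-5.5": (73.2, 77.0),
    "GPT-o3": (75.7, 78.3),
    "Qwen3 Max": (77.2, 79.0),
    "Gemini 2.5 Pro": (78.5, 79.0),
    "文心一言 4.5": (78.2, 67.1),
    "Gemini 3.1 Pro": (79.2, 77.7),
    "DeepSeek V4 Pro": (77.7, 76.4),
    "豆包 Pro": (82.6, 81.3),
    "Claude Opus 4.7": (81.1, 80.0),
    "Claude Sonnet 4.6": (83.5, 83.0),
}


def ranked_changes(scores):
    """Return (model, delta) pairs sorted by |delta|, largest first.

    Hypothetical helper: delta is current minus baseline, rounded to one
    decimal place to match the precision shown on the page.
    """
    deltas = {model: round(cur - base, 1) for model, (base, cur) in scores.items()}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)


ranking = ranked_changes(scores)
for model, delta in ranking:
    print(f"{model:<18} {delta:+.1f}")
```

Sorting on `abs(delta)` places the largest drop directly after the largest gain, whereas the list above groups all increases before all decreases.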

Side Dimension Changes
Communication and Judgment changes

Grok 4 +8.7
Communication: 16.3 → 25.0
Grok 4 +7.5
Judgment: 37.7 → 45.2
Gemini 2.5 Pro +5.0
Communication: 25.0 → 30.0
Gemini 2.5 Pro +3.7
Judgment: 39.5 → 43.2
Qwen3 Max +3.7
Judgment: 41.5 → 45.2
Gemini 3.1 Pro +3.6
Judgment: 45.2 → 48.8
Claude Opus 4.7 +2.1
Judgment: 53.7 → 55.8
GPT-o3 -8.1
Judgment: 51.3 → 43.2
GPT-5.5 -5.2
Judgment: 48.4 → 43.2
DeepSeek V4 Pro -5.0
Communication: 30.0 → 25.0
文心一言 4.5 -5.0
Communication: 30.0 → 25.0
豆包 Pro -4.8
Judgment: 52.8 → 48.0
DeepSeek V4 Pro -2.9
Judgment: 45.2 → 42.3
Claude Sonnet 4.6 -2.0
Judgment: 54.9 → 52.9
文心一言 4.5 -1.7
Judgment: 42.0 → 40.3

Integrity Rating Changes
Changes in model integrity status

文心一言 4.5 Restored
⚠ warn → ✔ pass
Grok 4 Restored
⚠ warn → ✔ pass

Operational Signal Changes
Stability, Availability, and Value changes

Grok 4 +38.0
Availability: 62.0 → 100.0
Grok 4 +9.7
Value: 15.1 → 24.8
Qwen3 Max +3.0
Value: 47.7 → 50.7
Gemini 3.1 Pro +2.7
Stability: 36.8 → 39.5
Gemini 2.5 Pro +2.1
Value: 36.0 → 38.1
Gemini 2.5 Pro +2.0
Availability: 98.0 → 100.0
GPT-5.5 +1.1
Value: 16.2 → 17.3
Claude Sonnet 4.6 +1.0
Value: 25.0 → 26.0
GPT-5.5 +1.0
Stability: 34.4 → 35.4
Gemini 3.1 Pro +0.9
Value: 24.1 → 25.0
DeepSeek V4 Pro +0.6
Value: 39.8 → 40.4
豆包 Pro +0.5
Value: 92.4 → 92.9
GPT-o3 +0.5
Value: 8.4 → 8.9
文心一言 4.5 -6.5
Stability: 33.2 → 26.7
Grok 4 -5.7
Stability: 36.3 → 30.6
豆包 Pro -3.1
Stability: 41.3 → 38.2
GPT-o3 -2.4
Stability: 35.9 → 33.5
Claude Sonnet 4.6 -2.2
Stability: 39.7 → 37.5
DeepSeek V4 Pro -2.2
Stability: 36.1 → 33.9
Claude Opus 4.7 -1.9
Stability: 38.7 → 36.8
文心一言 4.5 -1.0
Availability: 100.0 → 99.0
Gemini 3.1 Pro -1.0
Availability: 100.0 → 99.0
文心一言 4.5 -0.7
Value: 98.6 → 97.9
Gemini 2.5 Pro -0.7
Stability: 35.0 → 34.3

8 Up
3 Down
0 Stable
11 models

Significant Increases

Significant Decreases