YZ Index

Recent Comparison

Comparison of two evaluation runs taken one week apart.


Baseline: Run #112 · Formula v7 · Judge v6 · Benchmark v6 · 2026-05-11 04:21 SGT
Current: Run #122 · Formula v7 · Judge v6 · Benchmark v6 · 2026-05-18 04:18 SGT

Overall Score Changes

Ranked by change magnitude, gains first, then losses

Grok 4: +31.8 (49.2 → 81.0)
GPT-5.5: +3.8 (73.2 → 77.0)
GPT-o3: +2.6 (75.7 → 78.3)
Qwen3 Max: +1.8 (77.2 → 79.0)
Gemini 2.5 Pro: +0.6 (78.5 → 79.0)
文心一言 4.5: -11.1 (78.2 → 67.1)
Gemini 3.1 Pro: -1.6 (79.2 → 77.7)
DeepSeek V4 Pro: -1.4 (77.7 → 76.4)
豆包 Pro: -1.4 (82.6 → 81.3)
Claude Opus 4.7: -1.1 (81.1 → 80.0)
Claude Sonnet 4.6: -0.5 (83.5 → 83.0)
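The ordering used in this list (gains first, largest to smallest, then losses, largest drop first) can be sketched in a few lines of Python. This is an illustrative sketch only, not the index's own code: `ScoreChange` and `rank_changes` are hypothetical names, and only a subset of the models is included. Deltas here are recomputed from the displayed, already-rounded scores, so they can differ by 0.1 from the published figures (for example, Gemini 2.5 Pro is listed as +0.6 for 78.5 → 79.0), presumably because the published deltas come from unrounded values.

```python
from typing import NamedTuple


class ScoreChange(NamedTuple):
    """One model's overall score in the baseline and current runs."""
    model: str
    baseline: float
    current: float

    @property
    def delta(self) -> float:
        # Recomputed from displayed (rounded) scores; may differ by 0.1
        # from the published delta.
        return round(self.current - self.baseline, 1)


def rank_changes(changes):
    """Gains first (largest gain first), then losses (largest drop first)."""
    gains = sorted((c for c in changes if c.delta >= 0), key=lambda c: -c.delta)
    losses = sorted((c for c in changes if c.delta < 0), key=lambda c: c.delta)
    return gains + losses


# Subset of the overall scores listed above, deliberately out of order.
changes = [
    ScoreChange("Claude Sonnet 4.6", 83.5, 83.0),
    ScoreChange("GPT-5.5", 73.2, 77.0),
    ScoreChange("文心一言 4.5", 78.2, 67.1),
    ScoreChange("Grok 4", 49.2, 81.0),
    ScoreChange("Claude Opus 4.7", 81.1, 80.0),
]

ranked = rank_changes(changes)
for c in ranked:
    print(f"{c.model}: {c.delta:+.1f} ({c.baseline} → {c.current})")
```

Splitting gains from losses before sorting keeps the two groups visually separate, which a single sort on absolute value would not.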

Side Dimension Changes

Communication and Judgment changes

Grok 4, Communication: +8.7 (16.3 → 25.0)
Grok 4, Judgment: +7.5 (37.7 → 45.2)
Gemini 2.5 Pro, Communication: +5.0 (25.0 → 30.0)
Gemini 2.5 Pro, Judgment: +3.7 (39.5 → 43.2)
Qwen3 Max, Judgment: +3.7 (41.5 → 45.2)
Gemini 3.1 Pro, Judgment: +3.6 (45.2 → 48.8)
Claude Opus 4.7, Judgment: +2.1 (53.7 → 55.8)
GPT-o3, Judgment: -8.1 (51.3 → 43.2)
GPT-5.5, Judgment: -5.2 (48.4 → 43.2)
DeepSeek V4 Pro, Communication: -5.0 (30.0 → 25.0)
文心一言 4.5, Communication: -5.0 (30.0 → 25.0)
豆包 Pro, Judgment: -4.8 (52.8 → 48.0)
DeepSeek V4 Pro, Judgment: -2.9 (45.2 → 42.3)
Claude Sonnet 4.6, Judgment: -2.0 (54.9 → 52.9)
文心一言 4.5, Judgment: -1.7 (42.0 → 40.3)

Integrity Rating Changes

Changes in model integrity status

文心一言 4.5: Restored (⚠ warn → ✔ pass)
Grok 4: Restored (⚠ warn → ✔ pass)

Operational Signal Changes

Stability, Availability, and Value changes

Grok 4, Availability: +38.0 (62.0 → 100.0)
Grok 4, Value: +9.7 (15.1 → 24.8)
Qwen3 Max, Value: +3.0 (47.7 → 50.7)
Gemini 3.1 Pro, Stability: +2.7 (36.8 → 39.5)
Gemini 2.5 Pro, Value: +2.1 (36.0 → 38.1)
Gemini 2.5 Pro, Availability: +2.0 (98.0 → 100.0)
GPT-5.5, Value: +1.1 (16.2 → 17.3)
Claude Sonnet 4.6, Value: +1.0 (25.0 → 26.0)
GPT-5.5, Stability: +1.0 (34.4 → 35.4)
Gemini 3.1 Pro, Value: +0.9 (24.1 → 25.0)
DeepSeek V4 Pro, Value: +0.6 (39.8 → 40.4)
豆包 Pro, Value: +0.5 (92.4 → 92.9)
GPT-o3, Value: +0.5 (8.4 → 8.9)
文心一言 4.5, Stability: -6.5 (33.2 → 26.7)
Grok 4, Stability: -5.7 (36.3 → 30.6)
豆包 Pro, Stability: -3.1 (41.3 → 38.2)
GPT-o3, Stability: -2.4 (35.9 → 33.5)
Claude Sonnet 4.6, Stability: -2.2 (39.7 → 37.5)
DeepSeek V4 Pro, Stability: -2.2 (36.1 → 33.9)
Claude Opus 4.7, Stability: -1.9 (38.7 → 36.8)
文心一言 4.5, Availability: -1.0 (100.0 → 99.0)
Gemini 3.1 Pro, Availability: -1.0 (100.0 → 99.0)
文心一言 4.5, Value: -0.7 (98.6 → 97.9)
Gemini 2.5 Pro, Stability: -0.7 (35.0 → 34.3)

Legacy Dimension Changes

8 Up · 3 Down · 0 Stable · 11 models

Significant Increases

文心一言 4.0: execution_raw +6.8
GPT-o3: grounding_raw +6.3
Claude Sonnet 4.6: communication_raw +5
DeepSeek V3: communication_raw +5
豆包 Pro: communication_raw +5
Gemini 2.5 Pro: communication_raw +5
Qwen Max: communication_raw +5
DeepSeek R1: execution_raw +3.8

Significant Decreases

GPT-4o: grounding_raw -10.3
Grok 3: judgment_raw -10.2
Claude Opus 4.6: judgment_raw -6
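The eight increases and three decreases listed for the legacy dimensions are consistent with the Up / Down / Stable tally shown for the 11 models. A minimal Python sketch of such a tally follows. The page does not state how "Stable" is defined, so the `STABLE_BAND` threshold and the `classify` helper are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical threshold: changes within ±0.5 count as "Stable".
# The source does not state the actual rule.
STABLE_BAND = 0.5


def classify(delta: float, band: float = STABLE_BAND) -> str:
    """Label a change as Up, Down, or Stable relative to the band."""
    if delta > band:
        return "Up"
    if delta < -band:
        return "Down"
    return "Stable"


# Largest legacy raw-dimension change per model, as listed above.
legacy_deltas = {
    "文心一言 4.0": 6.8,
    "GPT-o3": 6.3,
    "Claude Sonnet 4.6": 5.0,
    "DeepSeek V3": 5.0,
    "豆包 Pro": 5.0,
    "Gemini 2.5 Pro": 5.0,
    "Qwen Max": 5.0,
    "DeepSeek R1": 3.8,
    "GPT-4o": -10.3,
    "Grok 3": -10.2,
    "Claude Opus 4.6": -6.0,
}

tally = Counter(classify(d) for d in legacy_deltas.values())
print(
    f"{tally['Up']} Up · {tally['Down']} Down · "
    f"{tally['Stable']} Stable · {len(legacy_deltas)} models"
)
```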