YZ Index

Weekly Report

Weekly model performance changes and trend analysis.

Baseline: Run #112 · Formula v7 · Judge v6 · Benchmark v6 · 2026-05-11 04:21 SGT

Current: Run #122 · Formula v7 · Judge v6 · Benchmark v6 · 2026-05-18 04:18 SGT

Overall Score Changes

Ranked by absolute change magnitude

Grok 4 +31.8

49.2 → 81.0

GPT-5.5 +3.8

73.2 → 77.0

GPT-o3 +2.6

75.7 → 78.3

Qwen3 Max +1.8

77.2 → 79.0

Gemini 2.5 Pro +0.6

78.5 → 79.0

文心一言 4.5 -11.1

78.2 → 67.1

Gemini 3.1 Pro -1.6

79.2 → 77.7

DeepSeek V4 Pro -1.4

77.7 → 76.4

豆包 Pro -1.4

82.6 → 81.3

Claude Opus 4.7 -1.1

81.1 → 80.0

Claude Sonnet 4.6 -0.5

83.5 → 83.0
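As a rough illustration of how the deltas in this list relate to the baseline and current scores, the following minimal sketch recomputes each change and sorts by absolute magnitude. The names `ScoreChange` and `rank_by_magnitude` are hypothetical and not part of YZ Index; only a few of the rows above are used as sample data.

```python
from typing import NamedTuple


class ScoreChange(NamedTuple):
    """One row of the report: a model with its baseline and current score."""
    model: str
    baseline: float
    current: float

    @property
    def delta(self) -> float:
        # Change from baseline to current, rounded to one decimal like the report.
        return round(self.current - self.baseline, 1)


def rank_by_magnitude(changes):
    """Return the changes sorted by absolute delta, largest first (hypothetical helper)."""
    return sorted(changes, key=lambda c: abs(c.delta), reverse=True)


# Sample rows copied from the Overall Score Changes list above.
changes = [
    ScoreChange("Grok 4", 49.2, 81.0),
    ScoreChange("GPT-5.5", 73.2, 77.0),
    ScoreChange("文心一言 4.5", 78.2, 67.1),
    ScoreChange("Claude Sonnet 4.6", 83.5, 83.0),
]

for c in rank_by_magnitude(changes):
    print(f"{c.model} {c.delta:+.1f} ({c.baseline} → {c.current})")
```

Deltas recomputed from the displayed scores can differ from the published delta by 0.1 (for example, Gemini 2.5 Pro shows +0.6 for 78.5 → 79.0), presumably because the report works from unrounded scores. Note also that this sketch interleaves gains and losses in a single ranking, whereas the list above shows gains first and losses second.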

Side Dimension Changes

Communication and Judgment changes

Grok 4 +8.7

Communication: 16.3 → 25.0

Grok 4 +7.5

Judgment: 37.7 → 45.2

Gemini 2.5 Pro +5.0

Communication: 25.0 → 30.0

Gemini 2.5 Pro +3.7

Judgment: 39.5 → 43.2

Qwen3 Max +3.7

Judgment: 41.5 → 45.2

Gemini 3.1 Pro +3.6

Judgment: 45.2 → 48.8

Claude Opus 4.7 +2.1

Judgment: 53.7 → 55.8

GPT-o3 -8.1

Judgment: 51.3 → 43.2

GPT-5.5 -5.2

Judgment: 48.4 → 43.2

DeepSeek V4 Pro -5.0

Communication: 30.0 → 25.0

文心一言 4.5 -5.0

Communication: 30.0 → 25.0

豆包 Pro -4.8

Judgment: 52.8 → 48.0

DeepSeek V4 Pro -2.9

Judgment: 45.2 → 42.3

Claude Sonnet 4.6 -2.0

Judgment: 54.9 → 52.9

文心一言 4.5 -1.7

Judgment: 42.0 → 40.3

Integrity Rating Changes

Changes in model integrity status

文心一言 4.5 Restored

⚠ warn → ✔ pass

Grok 4 Restored

⚠ warn → ✔ pass

Operational Signal Changes

Stability, Availability, and Value changes

Grok 4 +38.0

Availability: 62.0 → 100.0

Grok 4 +9.7

Value: 15.1 → 24.8

Qwen3 Max +3.0

Value: 47.7 → 50.7

Gemini 3.1 Pro +2.7

Stability: 36.8 → 39.5

Gemini 2.5 Pro +2.1

Value: 36.0 → 38.1

Gemini 2.5 Pro +2.0

Availability: 98.0 → 100.0

GPT-5.5 +1.1

Value: 16.2 → 17.3

Claude Sonnet 4.6 +1.0

Value: 25.0 → 26.0

GPT-5.5 +1.0

Stability: 34.4 → 35.4

Gemini 3.1 Pro +0.9

Value: 24.1 → 25.0

DeepSeek V4 Pro +0.6

Value: 39.8 → 40.4

豆包 Pro +0.5

Value: 92.4 → 92.9

GPT-o3 +0.5

Value: 8.4 → 8.9

文心一言 4.5 -6.5

Stability: 33.2 → 26.7

Grok 4 -5.7

Stability: 36.3 → 30.6

豆包 Pro -3.1

Stability: 41.3 → 38.2

GPT-o3 -2.4

Stability: 35.9 → 33.5

Claude Sonnet 4.6 -2.2

Stability: 39.7 → 37.5

DeepSeek V4 Pro -2.2

Stability: 36.1 → 33.9

Claude Opus 4.7 -1.9

Stability: 38.7 → 36.8

文心一言 4.5 -1.0

Availability: 100.0 → 99.0

Gemini 3.1 Pro -1.0

Availability: 100.0 → 99.0

文心一言 4.5 -0.7

Value: 98.6 → 97.9

Gemini 2.5 Pro -0.7

Stability: 35.0 → 34.3

Legacy Dimension Changes

10 Up

8 Down

0 Stable

18 models

Significant Increases

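文心一言 4.5: first included in the evaluation, Overall Score 72.0

DeepSeek V4 Pro: first included in the evaluation, Overall Score 65.2

Qwen3 Max: first included in the evaluation, Overall Score 64.9

Gemini 3.1 Pro: first included in the evaluation, Overall Score 63.6

Claude Opus 4.7: first included in the evaluation, Overall Score 62.5

GPT-5.5: first included in the evaluation, Overall Score 59.6

Grok 4: first included in the evaluation, Overall Score 41.5

GPT-o3: Grounding +20.9

Claude Sonnet 4.6: Engineering Judgment +10.2

豆包 Pro: Engineering Judgment +10.1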

Significant Decreases

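DeepSeek V3: removed from the evaluation this week

DeepSeek R1: removed from the evaluation this week

文心一言 4.0: removed from the evaluation this week

Grok 3: removed from the evaluation this week

Qwen Max: removed from the evaluation this week

Claude Opus 4.6: removed from the evaluation this week

GPT-4o: removed from the evaluation this week

Gemini 2.5 Pro: Code Execution -5.4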