YZ Index
Model Incident Reports
Auto-detected: overall crash / dimension collapse / strict task zeroed · updated weekly
8
GPT-o3 Strict Question: "Reservoir Sampling" from Full Score to 0
8
Claude Sonnet 4.6 Strict Question: "SQL: Suspected Duplicate Payment Identification" from Full Score to 0
10
Gemini 2.5 Pro Availability dropped 24 points
10
Gemini 2.5 Pro Grounding (v5) dropped 29 points
10
Gemini 2.5 Pro Code Execution (v5) dropped 33.4 points
10
Gemini 2.5 Pro Overall Score dropped 19.5 points
8
GPT-o3 Strict Question: "SQL: Consecutive Login Days" from Full Score to 0
8
GPT-o3 Strict Question: "Debug: Matrix Rotation" from Full Score to 0
8
Claude Sonnet 4.6 Strict Question: "SQL: Suspected Duplicate Payment Identification" from Full Score to 0
8
Claude Opus 4.6 Strict Question: "SQL: Suspected Duplicate Payment Identification" from Full Score to 0
10
GPT-4o Code Execution (v5) dropped 23.7 points
10
GPT-4o Overall Score dropped 10.5 points
10
Qwen Max Stability dropped 22.8 points
10
Claude Opus 4.6 Stability dropped 22.5 points
10
Grok 3 Stability dropped 22.5 points
10
GPT-o3 Availability dropped 31 points
10
GPT-o3 Stability dropped 25 points
10
GPT-o3 Grounding dropped 33.5 points
10
GPT-4o Availability dropped 35 points
10
GPT-4o Stability dropped 20.6 points
10
GPT-4o Grounding dropped 21.9 points
10
Gemini 2.5 Pro Stability dropped 22.8 points
10
ERNIE Bot 4.0 Stability dropped 22.1 points
10
DeepSeek V3 Stability dropped 21.4 points
10
DeepSeek R1 Stability dropped 22.1 points
10
Claude Sonnet 4.6 Stability dropped 23 points
9