The most glaring issue is not that GPT-o3 scored 0, but that it went from full marks to zero on a basic debugging problem, while the main board still rose by 2.1.
The problem at the center of this incident is "Debug: Matrix Rotation." In the previous run, GPT-o3 scored 100 on it; this time, it scored 0. The problem itself is not obscure: rotate an N×N matrix 90 degrees clockwise, in place. The standard solution is to transpose along the main diagonal, then reverse each row. GPT-o3 wrote exactly this approach, but the final step was never actually executed:
for i in range(n): matrix[i].reverse
The issue is in that one line: reverse is missing its parentheses, so the expression merely looks up the list method object without calling it. As a result, the function only completes the transpose; the matrix is never rotated clockwise. For a strictly graded problem this is not "almost correct" but a functional error, so the drop from 100 to 0 is a reasonable judgment.
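For reference, here is a minimal sketch of the corrected rotation; the function name rotate is my own label, and the only functional change from the submitted code is that reverse is actually called:

def rotate(matrix):
    # Rotate an N×N matrix 90 degrees clockwise, in place.
    n = len(matrix)
    # Step 1: transpose along the main diagonal.
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]
    # Step 2: reverse each row; the parentheses are the whole fix.
    for i in range(n):
        matrix[i].reverse()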
It's Not a Lack of Knowledge, but a Broken Execution Chain
What's more notable is that GPT-o3's overall numbers did not collapse at the same time. The v6 main board rose from 73.62 to 75.69, up 2.1; material constraints rose from 66.80 to 73.10, up 6.3; code execution only dropped from 79.20 to 77.80, down 1.4. In other words, this incident is not an across-the-board degradation of the model but a typical "local hard failure": the headline metrics look better, yet a single point of failure is enough to trip up developers.
This kind of failure is more dangerous than "not knowing at all." The model explains the correct approach and its comments read like a model answer, so a reviewer's first impression is that the task is done. The actual error hides in a missing pair of parentheses, and without test cases a manual review can easily miss it.
The Main Board Rise Masks the Risk of Strict Problems
Under YZ Index v6, the main board only looks at two auditable dimensions: code execution and material constraints. This run's main-board rise comes mainly from the improvement in material constraints, from 66.80 to 73.10, an increase of 6.3. That shows GPT-o3 is better at aligning with the problem description and using the supplied materials this time, but it does not guarantee that every code path actually executes correctly.
Code execution dropped from 79.20 to 77.80, a decline of just 1.4, which looks minor; yet "Debug: Matrix Rotation" falling from 100 to 0 shows how averages dilute the severity of individual incidents. For engineering teams, an average score is not an insurance policy for production. A trivial API-call mistake can make an apparently correct algorithm return wrong results in production.
Side Index Signals Show Divergence
Engineering judgment (a side index, AI-assisted evaluation) rose from 43.50 to 51.30, up 7.8; task expression (also a side index, AI-assisted evaluation) dropped from 40.00 to 30.00, down 10. This pairing is telling: the model may be getting better at judging direction, but its ability to deliver complete, verifiable work has weakened.
Applied to this problem, GPT-o3's directional judgment was not wrong: transpose, then reverse each row. But it failed at delivery: it never actually called reverse(), and it attached no minimal test to verify the result. This is the classic incident pattern of "looks like it understands, but runs incorrectly."
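A minimal check of the kind that was missing would have caught the bug immediately. A sketch, assuming the submitted function is named rotate:

# Hypothetical smoke test; the name rotate stands in for the model's function.
m = [[1, 2],
     [3, 4]]
rotate(m)
# Rotating [[1, 2], [3, 4]] 90 degrees clockwise gives [[3, 1], [4, 2]].
assert m == [[3, 1], [4, 2]], m
# The submitted version (transpose only, rows never reversed) leaves [[1, 3], [2, 4]] and fails here.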
Low Stability Does Not Equal Low Accuracy
Stability dropped again this run, from 37.4 to 35.9, and remains low. But it must be emphasized that stability measures response consistency: it is computed from the standard deviation of scores as max(0, 100 - stddev × 2), and it is not accuracy. A value of 35.9 means that when the model answers similar problems several times, its scores fluctuate widely and the output is not consistent enough; it cannot be read as "35.9% accuracy."
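To illustrate the stated formula (the sample scores below are hypothetical, and whether the index uses sample or population standard deviation is not specified; statistics.stdev is used here purely for illustration):

import statistics

def stability(scores):
    # Stability as defined above: max(0, 100 - stddev * 2).
    return max(0, 100 - statistics.stdev(scores) * 2)

# A single 100 -> 0 swing on similar problems drives the standard deviation up sharply.
print(stability([100, 100, 95, 100, 0]))   # wide spread, low stability
print(stability([92, 95, 90, 94, 93]))     # consistent scores, high stability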
This matters for GPT-o3's risk profile: it is not unusable. Availability remains 100.0, meaning it responds normally; cost-effectiveness went from 8.5 to 8.4, essentially stable. But on strict code problems it produces high-confidence, low-level errors. Such errors are not fixed by longer explanations; they can only be mitigated by execution, testing, and per-problem grading.
Conclusion: GPT-o3 Should Be Treated as a Candidate Programmer, Not a Final Compiler
The conclusion from this incident is clear: GPT-o3's ability to align with materials is improving and the main board is rising, but its code delivery still breaks at single points. Low-level API-call mistakes like this one expose the model's core shortcoming most clearly: it never actually runs its code.
My usage recommendation is straightforward: let GPT-o3 write drafts, let it explain ideas, but whenever it enters strict logic, array transformations, or boundary-condition-heavy code scenarios, it must be paired with tests. Model code without tests is essentially just "syntactically plausible answer text."
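One low-cost way to pair model code with tests is to check it against an independent reference before accepting it. A sketch, where rotate is the hypothetical model-written function and the reference builds the rotation a different way:

def rotate_reference(matrix):
    # Independent reference: clockwise rotation by zipping the rows in reverse order.
    return [list(row) for row in zip(*matrix[::-1])]

def check_rotate(rotate, sizes=(1, 2, 3, 5)):
    # Compare the in-place candidate against the reference on a few small matrices.
    for n in sizes:
        m = [[i * n + j for j in range(n)] for i in range(n)]
        expected = rotate_reference(m)
        rotate(m)
        assert m == expected, f"candidate disagrees with reference for n={n}"

A candidate that only transposes, like the one in this incident, fails this check for every n greater than 1.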
Remember this: a rise in the main board does not mean incidents disappear; real engineering risk is often hidden in that missing pair of parentheses.
Data source: YZ Index | Run #112
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original when reprinting.