GPT-o3's Reservoir Sampling Score Plummets from 100 to 0: The Truth About Code Execution Is in the Details

In the v6 evaluation, GPT-o3's main score rose from 75.86 to 82.82, and its material constraint score jumped from 66.40 to 80.40, a 14-point increase. On the surface, the model's overall ability is improving. However, the strictly scored question "Reservoir Sampling" received 0 points, down from 100, and this single failure undermines the credibility of the model's code execution results.

Fatal Flaws Exposed by the Original Answer

The code fragment provided for the question that lost points is as follows:

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    reservoir = []
    if k <= 0:
        return reservoir
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i + 1)
            if j < k:

The code breaks off after its 11th line, "if j < k:", without completing the random replacement logic or reaching a return statement. The core of reservoir sampling is that once the reservoir is full (zero-based index i >= k), the incoming item replaces a uniformly chosen reservoir element with probability k/(i+1). The answer never finished this key branch, so it cannot even be parsed as valid Python, which directly produced the score of zero under strict scoring.
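For reference, the missing branch can be sketched as follows. This is a minimal completion in the style of the classic Algorithm R, not the benchmark's reference answer: the replacement line, the final return, the import, and the comments are additions, and the assumption is that the grader expects a uniform sample of up to k items.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    if k <= 0:
        return reservoir
    for i, item in enumerate(stream):
        if i < k:
            # Fill phase: the first k items are kept unconditionally.
            reservoir.append(item)
        else:
            # j is uniform on [0, i], so j < k occurs with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                # Added line (assumed intent): overwrite slot j with the new item.
                reservoir[j] = item
    # Added line: return the sample once the stream is exhausted.
    return reservoir


if __name__ == "__main__":
    # Same seed, same sample; the concrete values depend on the RNG stream.
    print(reservoir_sample(range(1000), 5, seed=42))
```

Reusing j as both the acceptance test and the slot index is what keeps the sample uniform: each reservoir slot is overwritten with probability 1/(i+1), so every item seen so far remains in the reservoir with probability k/(i+1).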

Disconnect Between Engineering Judgment Surge and Code Execution

In the same batch of data, engineering judgment (side score, AI-assisted evaluation) surged from 41.20 to 91.50, and task expression also rose from 40.00 to 87.50. The model appears comfortable when describing system failure scenarios and providing architectural suggestions, but immediately reveals its weaknesses when it comes to strict questions requiring precise implementation of classic algorithms.

This contrast suggests that current optimization may favor "explaining clearly" over "writing correctly." The code execution dimension rose by only 1.2 points to 84.80, against a roughly 50-point increase in engineering judgment.

True Meaning of Stability Improvement

Stability rose from 33.8 to 58.0. According to the formula max(0, 100-stddev×2), this means the standard deviation of the model's scores across repeated answers to similar questions has narrowed, indicating improved consistency. However, the single question that scored 0 is a reminder that improved consistency does not equal improved accuracy: on algorithmic questions that demand mathematical rigor, the model may still fail outright on a given attempt.
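To make the stability figures concrete, the formula can be inverted to recover the standard deviation each score implies. The sketch below assumes the score was not clamped at 0 and that stddev is measured on the same 0 to 100 scale as the question scores; the function names are hypothetical and do not come from the YZ Index.

```python
def stability(stddev):
    """Stability score as given in the article: max(0, 100 - stddev * 2)."""
    return max(0.0, 100.0 - 2.0 * stddev)


def implied_stddev(score):
    """Invert the formula. Only valid for 0 < score <= 100, because a score
    clamped to 0 no longer determines the standard deviation."""
    if not 0 < score <= 100:
        raise ValueError("score must be in (0, 100] to be invertible")
    return (100.0 - score) / 2.0


if __name__ == "__main__":
    for label, score in (("before", 33.8), ("after", 58.0)):
        print(f"{label}: stability {score} implies stddev of about "
              f"{implied_stddev(score):.1f}")
```

Under these assumptions the implied standard deviation falls from about 33.1 to 21.0. That is a clear narrowing, but a spread of 21 points still leaves room for individual answers to land far from the mean.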

  • In the legacy dimension, knowledge synthesis rose from 53.9 to 91.2, indicating a significant enhancement in the model's ability to restate concepts.
  • However, v5 code execution slipped from 84.5 to 82.2, suggesting that the new version did not fully retain its strength in strict code implementation.

The integrity rating rose from 73.90 to 90.60, cost-effectiveness rose from 8.5 to 10.5, and usability remains at a perfect score. These are all positive signals, yet they cannot mask the weakness in algorithmic implementation.

Core Assessment

GPT-o3 is currently better at describing solutions in language, yet still carries systemic risk in code implementation that requires zero errors. The score of 0 on the reservoir sampling question is not an accidental mistake, but a persistent implementation gap in the model's handling of precise probabilistic algorithms.

If future versions truly aim to raise the ceiling of code execution, they must be refined repeatedly on classic algorithms like this one under strict scoring, rather than merely inflating side scores with points earned for engineering descriptions.


Data Source: YZ Index | Run #154