Claude Sonnet 4.6 Drops from 100 to 0 on Strict SQL Question, Yet Main Leaderboard Rises by 9.3

In the v6 evaluation, Claude Sonnet 4.6's score on a single strict SQL question, "Suspected Duplicate Payment Identification," fell from 100 to 0, while its main leaderboard score rose from 77.98 to 87.24 (+9.26, reported as +9.3). The two numbers point in opposite directions: overall capability improves, yet code correctness collapses in exactly the scenario that demands the most precise logic.

Fatal Defects Exposed by the Original Answer

The original SQL provided in the evaluation is as follows:

SELECT p1.id AS first_id, p2.id AS second_id, ...
FROM payments p1
JOIN payments p2 ON p1.user_id = p2.user_id
AND p1.merchant_id = p2.merchant_id
AND p1.amount = p2.amount
AND p1.status = 'paid' AND p2.status = 'paid'

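This query is missing two necessary conditions: first, p1.id < p2.id, which stops a record from matching itself and keeps each pair once instead of in both orders; and second, a filter on the time difference between the two payments. As a result, every paid payment is paired with itself and with every other paid payment that shares the same user, merchant, and amount, in both orders. The result set is flooded with spurious rows, and the answer received a score of 0.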
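To make the two missing conditions concrete, here is a minimal sketch that runs the quoted join and a repaired variant against an in-memory SQLite table. The source shows only the id, user_id, merchant_id, amount, and status columns; the paid_at column, the 600-second window, and the sample rows are assumptions made for illustration, not part of the evaluation.

```python
import sqlite3

# Hypothetical schema: paid_at (epoch seconds) is an assumed column,
# the article does not show how payment time is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "merchant_id INTEGER, amount REAL, status TEXT, paid_at INTEGER)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, 7, 25.0, "paid", 1000),   # first payment
        (2, 10, 7, 25.0, "paid", 1030),   # same user/merchant/amount, 30 s later
        (3, 10, 7, 25.0, "paid", 90000),  # same keys, but much later
        (4, 11, 7, 25.0, "paid", 1010),   # different user, no duplicate
    ],
)

# Join conditions exactly as quoted in the article.
JOIN_KEYS = """
FROM payments p1
JOIN payments p2 ON p1.user_id = p2.user_id
AND p1.merchant_id = p2.merchant_id
AND p1.amount = p2.amount
AND p1.status = 'paid' AND p2.status = 'paid'
"""

# Quoted query: no id ordering, no time window.
faulty_rows = conn.execute(
    "SELECT p1.id, p2.id " + JOIN_KEYS + " ORDER BY p1.id, p2.id"
).fetchall()

# Repaired sketch: id ordering plus an assumed 600-second window.
fixed_rows = conn.execute(
    "SELECT p1.id, p2.id " + JOIN_KEYS +
    " AND p1.id < p2.id"
    " AND ABS(p2.paid_at - p1.paid_at) <= 600"
    " ORDER BY p1.id, p2.id"
).fetchall()

print(len(faulty_rows))  # 10: every row pairs with itself and with its group, both orders
print(fixed_rows)        # [(1, 2)]
```

On this toy data the quoted join returns ten rows, including a self-match for payment 4, which has no duplicate at all. The repaired variant returns only the pair made 30 seconds apart. The right window length depends on the business rule and would have to come from the question itself.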

The Coexistence of Main Leaderboard Improvement and Strict Question Collapse

In the same version, code execution rose from 82.70 to 87.60, material constraints rose from 72.20 to 86.80, and engineering judgment (side leaderboard, AI-assisted evaluation) jumped by 42.3 points. The main leaderboard's overall +9.3 reflects more complete and better-formatted outputs on routine tasks. The strict question, however, demands logic that is exactly right on the first attempt, and there the model fails at the most basic deduplication filter.

This suggests that v6 was optimized for "coverage" rather than "rigor." When the task shifts from open-ended Q&A to a query that must return an exact result set, the model does not handle boundary conditions reliably.

Volatility Risks Masked by Stability Improvement

Stability rose from 36.5 to 62.7, meaning the standard deviation of the model's scores on similar questions narrowed. However, the stability formula, max(0, 100 - stddev × 2), measures output consistency, not accuracy. The model now more "stably" produces SQL that lacks the id filter: an answer that used to be right some of the time has hardened into a systematic error.
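The gap between consistency and accuracy can be checked directly from the quoted formula. The sketch below is illustrative only: the function names are hypothetical, the score lists are made up, and the use of the population standard deviation is an assumption, since the source does not say which variant the index uses.

```python
import statistics

def stability(scores):
    """Stability as quoted in the article: max(0, 100 - stddev * 2).

    Assumption: population standard deviation (statistics.pstdev).
    """
    return max(0.0, 100.0 - 2.0 * statistics.pstdev(scores))

def implied_stddev(stability_score):
    """Invert the formula for a stability score strictly between 0 and 100."""
    return (100.0 - stability_score) / 2.0

# Made-up score lists, not data from the evaluation.
always_wrong = [0, 0, 0, 0]      # consistently failing
mixed = [100, 0, 100, 0]         # right half of the time

print(stability(always_wrong))   # 100.0
print(stability(mixed))          # 0.0

# Standard deviations implied by the reported stability scores.
print(round(implied_stddev(36.5), 2))  # 31.75
print(round(implied_stddev(62.7), 2))  # 18.65
```

A model that scores 0 every time earns perfect stability under this formula, while one that alternates between 100 and 0 earns none. Read this way, the reported move from 36.5 to 62.7 corresponds to the implied standard deviation falling from 31.75 to 18.65, which says nothing about whether the answers became more correct.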

Engineering Judgment and Actual Failure Scenarios

Although engineering judgment (side leaderboard, AI-assisted evaluation) improved, this question maps directly to a real risk-control scenario: identifying suspected duplicate payments. If this query were run as-is in a production environment, merchants would receive a flood of false alarms, sharply increasing manual review costs. The model may "look smarter" in open evaluations, but it fails at the very point where defensive programming is most needed.
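The scale of the false alarms follows from simple counting. As a rough sketch (the function names and the example volumes are hypothetical): if k paid payments share the same user, merchant, and amount, the join without an id filter returns k × k rows, whereas the number of distinct pairs worth reviewing is at most k(k - 1)/2 before any time window is applied.

```python
def rows_from_unfiltered_join(group_size):
    """Rows a self-join returns for one group when no id filter is applied."""
    return group_size * group_size

def distinct_pairs(group_size):
    """Unordered pairs of different payments in the group (p1.id < p2.id)."""
    return group_size * (group_size - 1) // 2

# Hypothetical day of traffic: 1000 one-off payments plus one true double charge.
group_sizes = [1] * 1000 + [2]

alerts_without_filter = sum(rows_from_unfiltered_join(k) for k in group_sizes)
alerts_with_id_filter = sum(distinct_pairs(k) for k in group_sizes)

print(alerts_without_filter)  # 1004
print(alerts_with_id_filter)  # 1
```

In this made-up example, every one-off payment is flagged because it matches itself, so a single genuine double charge is buried under more than a thousand rows.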

On the legacy dimensions, knowledge synthesis jumped from 57.8 to 92.9, which suggests that the model has absorbed more SQL syntax knowledge but has not turned that knowledge into correct implementations that handle boundary conditions.

When a model treats "it looks like it might run" as the passing bar, strict questions are what expose its flaws most sharply.

The core of this incident is not the points lost on a single question, but the systematic bias it reveals in the current optimization path: main leaderboard scores can be inflated by increasing coverage, while strict questions expose the logical gaps without mercy. If the next version does not build the defensive constraint of "must be correct on the first try" into the training signal, similar zero-score incidents are likely to recur.


Data source: Winzheng Index (YZ Index) | Run #154