11 Models Tested on Bracket Matching: 7 Full Scores, 4 Zero Scores

11 mainstream models faced the same bracket-matching debugging problem, and the results showed clear polarization: 7 models scored 100, while 4 models scored 0. The core finding is that the real fatal bug in the original code is that the bare "return" at the end of the function returns None, instead of an explicit True or len(stack)==0.

The Real Problem of the Original Code

The provided code uses three if-continue structures after successful matching, and finally returns directly. This approach returns None when the stack is empty. In Python, None evaluates to False in boolean contexts, causing unexpected results for the caller. 豆包Pro, Qwen3 Max, 文心一言4.5, Grok 4, DeepSeek V4 Pro, Claude Opus 4.7, and GPT-5.5 all identified this issue and uniformly rewrote it as return len(stack)==0.

In contrast, the four models Gemini 2.5 Pro, Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-o3 failed to reflect the fix for the return value in their output, or directly failed to produce valid code, resulting in a score of 0.

Common Practices of the Full-Score Models

All 7 full-score models adopted a dictionary mapping approach to refactor the matching logic:

  • Use mapping = {')':'(', '}':'{', ']':'['}
  • Push left brackets onto the stack, pop and compare for right brackets
  • Uniformly return len(stack)==0

This approach reduces the three repetitive if statements in the original code to a single table lookup, while also handling non-bracket characters that were missing in the original code. GPT-5.5 additionally added an else branch to directly return False upon encountering invalid characters, making the code more robust.

Shortcomings Exposed by the Zero-Score Models

Claude Sonnet 4.6 argued in detail that the original logic was "actually correct", but did not output a corrected code. The Gemini series and GPT-o3 failed to present a fully runnable version in their output snippets. The common characteristic of the zero-score models is: they either stayed at the analysis stage, or performed incomplete fixes, failing to simultaneously address both the return None issue and the invalid character handling.

Practical Significance of Engineering Judgment

This test once again proves that the code execution dimension not only examines the ability to write correct results, but also to discover subtle return type errors. Using continue to skip return False may be effective in the short term, but it has poor maintainability and is prone to introducing new bugs when adding logic in the future. The full-score models, by using a mapping table for one-time judgment, significantly reduced the risk of subsequent maintenance.

Only when a model can proactively upgrade from "it works" to "easy to maintain with clear boundaries" does it truly cross the passing line of code execution.

In this evaluation, 7 models crossed this line, while 4 remained at the surface analysis level. The stability dimension will later track the score fluctuations of the same model when answering similar debugging questions multiple times; the current results have already shown clear divergence.


Data source: YZ Index | Run #154 | View raw data