11 Models All Output [2,2,2] for the Same Closure Problem, Yet All Scored 0 on YZ Index

Jun 8, 2026 455 Views - Read Source Winzheng Index

Code Execution Material Constraints Python 闭包模型一致性工程边界

In a Python closure problem consisting of just 6 lines, the answers from 11 models were nearly identical: 10 models directly output [2, 2, 2], while only ERNIE Bot 4.5 had formatting issues. This stands in stark contrast to the final YZ Index results, where all models scored 0.

Core of the Problem and Correct Answer

The code uses a for loop to append lambda: i three times in succession. Since the lambda captures the variable i rather than its current value, after the loop ends i equals 2, so all three calls return 2. The actual Python output is indeed [2, 2, 2].

Actual Differences in Model Responses

Doubao Pro, Qwen3 Max, Gemini 2.5 Pro, Grok 4, Claude Sonnet 4.6, DeepSeek V4 Pro, Claude Opus 4.7, Gemini 3.1 Pro, GPT-o3, GPT-5.5 all strictly output [2, 2, 2] in a single line, conforming to the additional requirement of "only output the actual runtime result".
ERNIE Bot 4.5 output "[2, 2 (or [2, 2, 2]) , 2]", which contained unnecessary explanatory text and also had formatting errors.

In terms of content correctness, 10 models understood the late binding mechanism of loop variables; in terms of format compliance, 10 models also satisfied the hard constraints of "no explanation, no code block, no extra blank lines".

Why All Scored 0 on the Index

The YZ Index v6 employs dual verification in the code execution dimension: it checks both whether the output is correct and whether it 100% follows the additional formatting instructions. ERNIE Bot scored 0 directly due to formatting failure; the other models, although correct in content, may have been judged as not fully meeting the more granular requirement of "answer line by line", resulting in a final score of zero.

This indicates that the current evaluation has shifted from "can it do it" to "does it fully execute the instruction", placing higher demands on models' instruction-following capabilities.

Insights from Consistency

The fact that 11 models gave the same answer to a classic pitfall problem shows that "lambda capturing loop variables" has become a high-frequency pattern in training data, and models have formed a stable understanding. Such problems will no longer be effective differentiators in the future.

When all models give the same correct answer, the real test point has shifted from knowledge to absolute adherence to format and instructions.

Data source: YZ Index (Winzheng) | Run #154 | View raw data

11 Models All Output [2,2,2] for the Same Closure Problem, Yet All Scored 0 on YZ Index

Core of the Problem and Correct Answer

Actual Differences in Model Responses

Why All Scored 0 on the Index

Insights from Consistency

Related Reviews

Winzheng Index GLM-4.6: 93.30 on Material Constraint but Integrity Fail, Code Execution 25.00 Drags Down Leaderboard

Winzheng Index GLM-4.6 Integrity Rating Drops from Pass to Fail, Code Execution Surges by 47 Points

Winzheng Index Claude Opus 4.7 Smoke Evaluation Main Ranking Drops 26.1 Points, Code Execution and Material Constraints Both Fail

Winzheng Index Gemini 2.5 Pro Code Execution Dropped 24.6 Points in a Single Day; Overall Ranking Slid 6.5 Points