In a Python closure problem consisting of just 6 lines, the answers from 11 models were nearly identical: 10 models directly output [2, 2, 2], while only 文心一言 4.5 had formatting issues. This stands in stark contrast to the final YZ Index results, where all models scored 0.
Core of the Problem and Correct Answer
The code uses a for loop to append lambda: i three times in succession. Since the lambda captures the variable i rather than its current value, after the loop ends i equals 2, so all three calls return 2. The actual Python output is indeed [2, 2, 2].
Actual Differences in Model Responses
- 豆包 Pro, Qwen3 Max, Gemini 2.5 Pro, Grok 4, Claude Sonnet 4.6, DeepSeek V4 Pro, Claude Opus 4.7, Gemini 3.1 Pro, GPT-o3, GPT-5.5 all strictly output [2, 2, 2] in a single line, conforming to the additional requirement of "only output the actual runtime result".
- 文心一言 4.5 output "[2, 2 (or [2, 2, 2]) , 2]", which contained unnecessary explanatory text and also had formatting errors.
In terms of content correctness, 10 models understood the late binding mechanism of loop variables; in terms of format compliance, 10 models also satisfied the hard constraints of "no explanation, no code block, no extra blank lines".
Why All Scored 0 on the Index
The YZ Index v6 employs dual verification in the code execution dimension: it checks both whether the output is correct and whether it 100% follows the additional formatting instructions. 文心一言 scored 0 directly due to formatting failure; the other models, although correct in content, may have been judged as not fully meeting the more granular requirement of "answer line by line", resulting in a final score of zero.
This indicates that the current evaluation has shifted from "can it do it" to "does it fully execute the instruction", placing higher demands on models' instruction-following capabilities.
Insights from Consistency
The fact that 11 models gave the same answer to a classic pitfall problem shows that "lambda capturing loop variables" has become a high-frequency pattern in training data, and models have formed a stable understanding. Such problems will no longer be effective differentiators in the future.
When all models give the same correct answer, the real test point has shifted from knowledge to absolute adherence to format and instructions.
Data source: YZ Index (Winzheng) | Run #154 | View raw data
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接