In Smoke Quick Test Run #214 on 2026-07-05, GLM-4.6 scored 60.04 on the main leaderboard, 88.70 on code execution, and 25.00 on material constraint, with an integrity rating of fail and a probe score of 0.00.
Striking Contrast in Score Structure
The code execution score of 88.70 indicates a high pass rate for code run in real Python sandbox environments. The material constraint score of only 25.00 suggests a weak ability to answer strictly from the given materials and to cite sources correctly in long-document reference verification tasks. The gap between these two auditable main leaderboard dimensions is 63.70 points (88.70 minus 25.00), the most prominent structural feature of this test.
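The gap figure can be checked with a few lines of Python. This is a minimal sketch using only the two scores quoted above; the dictionary keys and the `dimension_gap` helper are illustrative names, not part of any YZ Index tooling.

```python
# Scores reported for GLM-4.6 in Run #214, as quoted in the text.
scores = {"code_execution": 88.70, "material_constraint": 25.00}


def dimension_gap(a: float, b: float) -> float:
    """Absolute gap between two dimension scores, rounded to two decimals."""
    return round(abs(a - b), 2)


gap = dimension_gap(scores["code_execution"], scores["material_constraint"])
print(f"gap = {gap:.2f} points")
```

Rounding to two decimals matches the precision of the published scores and avoids floating-point noise in the difference.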
Integrity Probe Trigger Mechanism
The integrity rating of fail indicates that the model treated fictional entities as real reference sources during the canary probe. GLM-4.6 scored 0.00 on this probe. Among the other 10 models tested on the same day, GPT-5.5 and GPT-o3 scored 90.00; six models (Doubao Pro, Gemini 3.1 Pro, Gemini 2.5 Pro, Claude Sonnet 4.6, Claude Opus 4.7, and Qwen3 Max) each scored 80.00; DeepSeek V4 Pro scored 65.00; and Grok 4 scored 45.00, which earned a warn rating. GLM-4.6 is the only model that received a fail rating.
Probe scores belong exclusively to the integrity dimension and are unrelated to material constraint scores.
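For readers who want to compare the same-day probe scores side by side, the figures above can be tabulated as follows. This is a sketch built only from the numbers in the text; the variable names are illustrative, and ratings are included only for the two models whose ratings are stated, since the rating thresholds themselves are not given here.

```python
# Probe scores from the same-day run, as listed in the text.
probe_scores = {
    "GPT-5.5": 90.00,
    "GPT-o3": 90.00,
    "Doubao Pro": 80.00,
    "Gemini 3.1 Pro": 80.00,
    "Gemini 2.5 Pro": 80.00,
    "Claude Sonnet 4.6": 80.00,
    "Claude Opus 4.7": 80.00,
    "Qwen3 Max": 80.00,
    "DeepSeek V4 Pro": 65.00,
    "Grok 4": 45.00,
    "GLM-4.6": 0.00,
}

# Only these two ratings are stated explicitly in the text.
stated_ratings = {"Grok 4": "warn", "GLM-4.6": "fail"}

# Sort from highest to lowest probe score (stable for ties).
ranked = sorted(probe_scores.items(), key=lambda kv: kv[1], reverse=True)

for model, score in ranked:
    rating = stated_ratings.get(model, "not stated")
    print(f"{model:<18} {score:6.2f}  rating: {rating}")

lowest_model, lowest_score = ranked[-1]
failed = [m for m, r in stated_ratings.items() if r == "fail"]
```

The table makes the distribution easy to see: ten models sit between 45.00 and 90.00, and GLM-4.6 is alone at 0.00.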
Historical Run Records
GLM-4.6 triggered an integrity fail in both Run #212 on 2026-07-04 and Run #214 on 2026-07-05, with a probe score of 0.00 in each. In Run #213 on 2026-07-04, every dimension scored 0 because of an evaluation failure, so that run is treated as invalid data and is not used as a baseline for comparison. Both valid runs therefore ended in an integrity fail, and the model requires continued observation.
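The exclusion of the failed run can be sketched as a simple filter. The `Run` record and its `evaluation_failed` flag are hypothetical, introduced here only for illustration; the text does not describe how the YZ Index stores run data. An explicit flag is assumed because a probe score of 0.00 alone cannot separate an invalid run from a genuine fail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    run_id: int
    date: str
    integrity: Optional[str]  # None: rating not stated for the failed run
    probe: float
    evaluation_failed: bool   # hypothetical flag, not a documented field


# Values taken from the text; Run #213 is the evaluation failure.
runs = [
    Run(212, "2026-07-04", "fail", 0.00, False),
    Run(213, "2026-07-04", None, 0.00, True),
    Run(214, "2026-07-05", "fail", 0.00, False),
]

# Keep only runs that completed evaluation, then collect integrity fails.
valid_runs = [r for r in runs if not r.evaluation_failed]
fail_ids = [r.run_id for r in valid_runs if r.integrity == "fail"]

print(f"valid runs: {[r.run_id for r in valid_runs]}")
print(f"integrity fails among valid runs: {fail_ids}")
```

With only two valid data points, the filter supports a statement about those runs and nothing more.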
Dimensional Independence Note
Code execution, material constraint, and integrity rating are three independent dimensions. The material constraint score of 25.00 reflects the model's citation accuracy when answering from given materials, while the integrity fail separately reflects the model treating fictional entities as real sources. The two should not be conflated. The available data supports analysis of the current Smoke Quick Test results only and does not support trend extrapolation.
Based on data from Run#214 on 2026-07-05, GLM-4.6 exhibits notable weaknesses in both the material constraint and integrity dimensions, requiring continued monitoring of its future Smoke Test performance.
Data source: YZ Index | Run #214
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article.