GLM-4.6 Scores 25 in Material Constraint, 88.7 in Code Execution, Zero on Integrity Probe

Jul 5, 2026 | Winzheng Index

GLM-4.6 | Material Constraints | Integrity Rating | Smoke Quick Test | Citation Verification

In Smoke Quick Test Run#214 on 2026-07-05, GLM-4.6 scored 60.04 on the main leaderboard, 88.70 on code execution, and 25.00 on material constraint. Its integrity rating was fail, with a probe score of 0.00.

Striking Contrast in Score Structure

A score of 88.70 on the code execution dimension indicates a high pass rate when the model's code is run in a real Python sandbox. A score of only 25.00 on the material constraint dimension, by contrast, suggests a weak ability to answer strictly from the given materials and to cite sources correctly in long-document reference verification tasks. The gap between these two auditable main leaderboard dimensions is 63.70 points, the most prominent structural feature of this test.
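
The gap quoted above is simple arithmetic on the two reported dimension scores. A minimal sketch of that check follows; the dictionary keys and the `dimension_gap` helper are illustrative names chosen here, not part of the Winzheng Index tooling.

```python
# Scores reported for GLM-4.6 in Run#214 (2026-07-05).
# Key names are hypothetical labels for the dimensions named in the article.
scores = {
    "main_leaderboard": 60.04,
    "code_execution": 88.70,
    "material_constraint": 25.00,
}


def dimension_gap(a: float, b: float) -> float:
    """Absolute difference between two dimension scores, rounded to 2 decimals."""
    return round(abs(a - b), 2)


gap = dimension_gap(scores["code_execution"], scores["material_constraint"])
print(f"Gap between code execution and material constraint: {gap:.2f}")
```

Rounding to two decimals matches the precision the scores are reported at and avoids floating-point noise in the printed difference.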

Integrity Probe Trigger Mechanism

The integrity rating of fail indicates that the model treated fictional entities as real reference sources during the canary probe. GLM-4.6 scored 0.00 on this probe. Among the other 10 models tested on the same day, GPT-5.5 and GPT-o3 scored 90.00; six models (Doubao Pro, Gemini 3.1 Pro, Gemini 2.5 Pro, Claude Sonnet 4.6, Claude Opus 4.7, and Qwen3 Max) each scored 80.00; DeepSeek V4 Pro scored 65.00; and Grok 4 received a warn rating with 45.00. GLM-4.6 is the only model to receive a fail rating.
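
The same-day probe scores listed above can be grouped to make the distribution easier to read. This is a rough sketch using only the figures reported in this article. The article does not state the score thresholds that separate fail, warn, and other ratings, so the code groups by score and makes no attempt to derive ratings.

```python
from collections import defaultdict

# Probe scores as reported for the 2026-07-05 Smoke Quick Test.
probe_scores = {
    "GPT-5.5": 90.00,
    "GPT-o3": 90.00,
    "Doubao Pro": 80.00,
    "Gemini 3.1 Pro": 80.00,
    "Gemini 2.5 Pro": 80.00,
    "Claude Sonnet 4.6": 80.00,
    "Claude Opus 4.7": 80.00,
    "Qwen3 Max": 80.00,
    "DeepSeek V4 Pro": 65.00,
    "Grok 4": 45.00,
    "GLM-4.6": 0.00,
}

# Group model names by their probe score.
by_score = defaultdict(list)
for model, score in probe_scores.items():
    by_score[score].append(model)

# Print the groups from highest to lowest score.
for score in sorted(by_score, reverse=True):
    print(f"{score:6.2f}  {', '.join(by_score[score])}")

# Sanity checks against the counts stated in the text.
others = [m for m in probe_scores if m != "GLM-4.6"]
lowest = min(probe_scores, key=probe_scores.get)
```

Here `others` holds the 10 comparison models and `lowest` identifies the model with the minimum probe score.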

Probe scores belong exclusively to the integrity dimension and are unrelated to material constraint scores.

Historical Run Records

GLM-4.6 triggered an integrity fail in both Run#214 on 2026-07-05 and Run#212 on 2026-07-04, with a probe score of 0.00 in each. In Run#213 on 2026-07-04, all dimensions scored 0 because of an evaluation failure, so that run is treated as invalid data and is not used as a baseline for comparison. Both valid runs resulted in an integrity fail, which calls for continued observation.
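
The exclusion of the failed evaluation run can be sketched as a filter over run records. The `Run` record and its `valid` flag are assumptions made for illustration; the article does not describe how the index marks invalid runs internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical record for one Smoke Quick Test run of GLM-4.6."""
    run_id: int
    date: str
    integrity: Optional[str]  # None where the article reports no rating
    probe_score: float
    valid: bool  # assumed flag: False when the evaluation itself failed


runs = [
    Run(212, "2026-07-04", "fail", 0.00, True),
    Run(213, "2026-07-04", None, 0.00, False),  # all dimensions 0: evaluation failure
    Run(214, "2026-07-05", "fail", 0.00, True),
]

# Keep only runs that count as comparison baselines.
valid_runs = [r for r in runs if r.valid]
all_valid_failed = all(r.integrity == "fail" for r in valid_runs)

print([r.run_id for r in valid_runs], all_valid_failed)
```

An explicit validity flag is needed because the probe score alone cannot separate the cases: the invalid Run#213 and the two genuine integrity fails all show 0.00.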

Dimensional Independence Note

Code execution, material constraint, and integrity rating are three independent dimensions. The material constraint score of 25.00 reflects the model's citation accuracy under given materials, while an integrity fail independently points to behavior of fabricating sources. The two should not be conflated. The current data supports analysis of the current Smoke Quick Test results only and does not support trend extrapolation.

Based on data from Run#214 on 2026-07-05, GLM-4.6 exhibits notable weaknesses in both the material constraint and integrity dimensions, requiring continued monitoring of its future Smoke Test performance.

Data source: YZ Index | Run #214

This article is from the Winzheng Index blog, translated in full by Winzheng (winzheng.com). When republishing the translation, please credit the source. Thank you!
