Eleven mainstream models diverged significantly on the same engineering judgment question: 8 models output A>B>D>C and received 60 points, while 3 models output A>B>C>D and received 0 points. The two rankings differ only in the relative order of D and C.
Logical Basis for Correct Ordering
The question requires ordering the four responses from best to worst. Option A explicitly acknowledges "insufficient assessment of technical complexity during requirements review" and adds specific improvement measures, making it a fully accountable and verifiable response. Option B attributes the issue to "the development team discovering technical difficulties." It still focuses on the technology itself, but does not directly admit to omissions during the review stage. Option D blames the delay on "time was already tight," which evades its own review responsibility and is a typical external attribution. Option C shifts blame directly to "the client changed requirements mid-way," deflecting without supporting evidence, which makes it the worst of the four.
Therefore, the correct order should be A>B>D>C. Placing C before D means the model considers "blaming the client" as more acceptable than "complaining about time constraints," which completely contradicts the scoring criteria given in the question.
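The gap between the two orderings can be made concrete by listing the pairs they rank differently. The snippet below is a minimal sketch; the helper names `parse_ranking` and `discordant_pairs` are hypothetical and are not part of the benchmark's tooling.

```python
from itertools import combinations


def parse_ranking(text):
    """Split a ranking string such as 'A>B>D>C' into ['A', 'B', 'D', 'C']."""
    return [item.strip() for item in text.split(">")]


def discordant_pairs(reference, candidate):
    """Return the pairs that the candidate orders differently from the reference.

    Both arguments are best-first lists over the same set of options.
    """
    position = {option: index for index, option in enumerate(candidate)}
    return [
        (better, worse)
        for better, worse in combinations(reference, 2)
        if position[better] > position[worse]
    ]


reference = parse_ranking("A>B>D>C")
candidate = parse_ranking("A>B>C>D")
swapped = discordant_pairs(reference, candidate)
print(swapped)  # [('D', 'C')]
```

Of the six possible pairs, only (D, C) is reversed, which matches the observation that the two outputs differ in a single adjacent swap.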
Comparison of Responses Between Scored Models and 0-Point Models
Eight models consistently chose A>B>D>C: 豆包Pro, Gemini 2.5 Pro, Grok 4, Claude Sonnet 4.6, DeepSeek V4 Pro, Claude Opus 4.7, Gemini 3.1 Pro, and GPT-o3. All of them ordered D and C correctly.
On the other hand, the three models Qwen3 Max, 文心一言4.5, and GPT-5.5 placed C before D, forming A>B>C>D. Between "shifting responsibility to the client" and "complaining about insufficient time," they chose the former as the relatively better answer.
Placing C before D amounts to treating "blaming the client without evidence" as more acceptable than "shifting responsibility to time pressure." This conflicts with a basic requirement of engineering judgment: conclusions must stay within what the given material supports.
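The split described above can be tallied directly from the reported outputs. This is an illustrative sketch: the `results` dictionary is transcribed from the model lists in this section, and the variable names are assumptions rather than the benchmark's own data format.

```python
from collections import defaultdict

# Reported ranking per model, transcribed from the text of this section.
results = {
    "豆包Pro": "A>B>D>C",
    "Gemini 2.5 Pro": "A>B>D>C",
    "Grok 4": "A>B>D>C",
    "Claude Sonnet 4.6": "A>B>D>C",
    "DeepSeek V4 Pro": "A>B>D>C",
    "Claude Opus 4.7": "A>B>D>C",
    "Gemini 3.1 Pro": "A>B>D>C",
    "GPT-o3": "A>B>D>C",
    "Qwen3 Max": "A>B>C>D",
    "文心一言4.5": "A>B>C>D",
    "GPT-5.5": "A>B>C>D",
}

# Group model names by the ranking they produced.
groups = defaultdict(list)
for model, ranking in results.items():
    groups[ranking].append(model)

for ranking, models in groups.items():
    print(f"{ranking}: {len(models)} models")
```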
Actual Performance Under the Execution Dimension
The execution dimension focuses on whether the model strictly follows the given rules when performing the ordering task. The 8 models scoring 60 points followed the "best to worst" instruction and ranked B, D, and C in descending order after A. The 3 models scoring 0 points also output a complete ordering, but reversed the relative positions of C and D, indicating a deviation in the final step of rule execution.
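The distinction drawn here, between producing a well-formed ordering and producing the correct one, can be sketched as a small check. The function `check_output` and its three labels are hypothetical, and the exact-match comparison is an assumption made for illustration. The actual scoring rule behind the 60 and 0 point results is not specified in this write-up.

```python
OPTIONS = {"A", "B", "C", "D"}


def check_output(text, expected="A>B>D>C"):
    """Classify a model output as 'malformed', 'wrong_order', or 'match'.

    An output is well-formed only if it lists each of the four options once.
    """
    items = [item.strip() for item in text.split(">")]
    if len(items) != len(OPTIONS) or set(items) != OPTIONS:
        return "malformed"
    return "match" if items == expected.split(">") else "wrong_order"


print(check_output("A>B>D>C"))  # match
print(check_output("A>B>C>D"))  # wrong_order
print(check_output("A>B>D"))    # malformed
```

Under this sketch, the three 0-point models fall into the `wrong_order` case rather than `malformed`: they completed the task format but not the ranking itself.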
Differences in Material Usage Under the Grounding Dimension
The grounding dimension measures whether the model anchors its judgment firmly in the four original passages provided in the question. The models that answered correctly treated A's "added technical pre-review step" as a positive point and C's "client changed requirements mid-way" as an unsubstantiated deflection that deserves a deduction. The 0-point models instead favored C over D, which suggests they gave too little weight to the key detail in the question that C's claim comes "without evidence."
Engineering Judgment (Side Ranking, AI-Assisted Evaluation) Observations
The engineering judgment side ranking shows that most models can identify A as the best option, but a few models confuse "blaming the client" with "complaining about time" in the subsequent ordering. This suggests that some models still have a systematic bias in how they prioritize responsibility attribution.
Looking at the results, the two main rankings, execution and grounding, are sufficient to separate the models clearly. The 8 models were highly consistent on both dimensions, while the 3 models lost points on both at once.
This test again confirms: when the question explicitly requires "ordering from best to worst," the differences in model outputs mainly concentrate on the relative ordering of negative options, rather than the identification of positive options.
In the future, running similar questions repeatedly and observing how much the same model's ordering of D and C fluctuates would give a clearer picture of how stable it really is on the grounding dimension.
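One simple way to quantify that fluctuation is the share of repeated runs in which D is ranked ahead of C. The sketch below uses made-up runs purely for illustration. The function name `d_before_c_rate` and the sample data are hypothetical and do not come from Run #154.

```python
def d_before_c_rate(rankings):
    """Return the fraction of runs in which D is ranked ahead of C.

    Each ranking is a best-first list such as ['A', 'B', 'D', 'C'].
    """
    if not rankings:
        raise ValueError("at least one run is required")
    hits = sum(1 for ranking in rankings if ranking.index("D") < ranking.index("C"))
    return hits / len(rankings)


# Hypothetical repeated outputs from a single model, not measured data.
runs = ["A>B>D>C", "A>B>C>D", "A>B>D>C", "A>B>D>C"]
parsed = [run.split(">") for run in runs]
rate = d_before_c_rate(parsed)
print(rate)  # 0.75
```

A rate near 1.0 or 0.0 would indicate a stable preference, while a value near 0.5 would suggest the model's ordering of D and C is close to arbitrary.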
Data source: YZ Index | Run #154
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.