AI Model Response Analysis for OG Card Image Debugging Problem

In this engineering judgment test, eight AI models showed significant differences in depth of understanding. The question described a typical production debugging scenario: the same code produces different results for different inputs, and the respondent must decide on the first troubleshooting step.

Clear Stratification in Response Quality

The high-scoring group (80 points) includes Claude Sonnet 4.6, Claude Opus 4.6, and Qwen Max. These three models accurately identified the core issue: rendering anomalies caused by differences in the data. All three explicitly proposed comparing the content of the three articles. Notably, the Claude models listed specific candidate causes: special characters, emojis, multi-byte characters, text length, and character encoding. This specificity suggests familiarity with common PHP GD library issues.
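The content comparison these models proposed can be sketched in a few lines of Python. This is a minimal illustration, not part of the test itself: `profile_text` and the sample titles are hypothetical, and the properties it reports are simply the candidate causes named in the responses (text length, multi-byte characters, emojis, stray control characters).

```python
import unicodedata


def profile_text(text: str) -> dict:
    """Summarize properties of a string that commonly affect text rendering."""
    # Distinct characters outside the 7-bit ASCII range, in code point order.
    non_ascii = sorted({ch for ch in text if ord(ch) > 0x7F})
    return {
        "chars": len(text),                        # number of code points
        "utf8_bytes": len(text.encode("utf-8")),   # differs from "chars" for multi-byte text
        "non_ascii": non_ascii,
        # Code points above U+FFFF (most emojis live here).
        "astral": [ch for ch in non_ascii if ord(ch) > 0xFFFF],
        # Control characters such as stray tabs or NULs, as U+XXXX labels.
        "control": [f"U+{ord(ch):04X}" for ch in text
                    if unicodedata.category(ch) == "Cc"],
    }


if __name__ == "__main__":
    # Hypothetical article titles, invented for illustration only.
    for title in ["Release notes", "Café 🚀", "Tab\there"]:
        print(repr(title), profile_text(title))
```

Profiling the working article next to the others makes any difference in length or character set visible before touching the rendering code.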

The middle-scoring group (60 points), GPT-4o and GPT-o3, chose to check the error logs. This is a reasonable debugging step, but it misses the insight shown by the high-scoring group: because Article A works normally, the code logic itself has no fatal error, so the problem is more likely at the data level.

The low-scoring group (0 points) includes DeepSeek V3, DeepSeek R1, and Gemini 2.5 Pro. The first two gave overly brief responses with little practical guidance. Gemini 2.5 Pro mentioned checking the PHP error logs and explained possible reasons for blank images, but it too overlooked the key fact that Article A works normally.

Key Differences in Understanding Depth

High-scoring models demonstrated scenario-based thinking: they understood not only the technical aspects but also the context of the problem. The pattern of "same code, different results" points directly to differences in the input data. In contrast, low-scoring models appeared to follow a generic debugging checklist without analysis targeted at the specific scenario.
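One way to act on the "same code, different results" pattern is to diff the inputs against each other rather than inspect each one in isolation. The sketch below is a rough heuristic under stated assumptions: `suspicious_chars` is a hypothetical helper, the titles are invented, and "Unicode general category not seen in the working input" is only one possible signal of a problematic character.

```python
import unicodedata


def suspicious_chars(working: str, failing: list[str]) -> dict[str, list[str]]:
    """For each failing input, list the characters whose Unicode general
    category never occurs in the known-good input."""
    # Categories (e.g. "Lu" uppercase letter, "So" other symbol) seen in the good input.
    known = {unicodedata.category(ch) for ch in working}
    return {
        text: sorted({ch for ch in text if unicodedata.category(ch) not in known})
        for text in failing
    }


if __name__ == "__main__":
    # Hypothetical titles: the first stands in for the article that renders correctly.
    report = suspicious_chars("Launch notes 2024", ["Launch notes 🚀", "Q&A: notes"])
    for title, chars in report.items():
        print(repr(title), chars)
```

Anything the helper flags is a candidate to test first; an empty list suggests looking at other differences such as text length or encoding.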

Particularly noteworthy is that both the Claude models and Qwen Max mentioned special characters. This suggests a richer base of practical development knowledge, including awareness of the common pitfalls when the GD library handles Unicode characters, emojis, and similar content.

This test exposed a clear gap in engineering judgment among AI models: the strongest models do not just provide an answer, they analyze the specific characteristics of the scenario, which is exactly the capability most needed in actual work.


Data source: YZ Index | Run #20