11 AIs Answer the Same Debugging Question: 5 Score Zero, Where's the Fatal Gap?

When your code breaks, what advice will an AI assistant give you? I tested 11 mainstream models on a real debugging scenario, and the results were jaw-dropping: 5 of the 11 (45%) scored zero, including the newly released DeepSeek V3.

A Question That Reveals AI's True Capabilities

The question was simple: a PHP script uses the GD library to generate share cards for articles. After a code change, Article A still renders normally, while Articles B and C come out as blank images. What should the first step be? Every engineer runs into this scenario: same code, different behavior, a typical edge-case problem.

The result? The 11 models gave wildly different answers, and the gap in quality was ridiculously large.

5 Zero-Score Answers: AI's "Correct Nonsense"

DeepSeek V3's response was a single sentence: "Check the image generation paths and permission settings for Articles B and C." This answer completely missed the point. The question had already established that all three articles go through "the same generation logic." If paths or permissions were the problem, how could Article A render normally?

DeepSeek R1, Wenxin Yiyan 4.0, and Qwen Max gave essentially the same response: check input parameters, check for special characters, check data validity. All of this is "correct nonsense": true but useless, like a doctor telling a patient "you need to get checked."

"Check if parameters are abnormal" - The problem with this type of answer is that it doesn't tell you how to check, what to check, or why to check this way.

Common Features of 80-Point Answers: Specific and Actionable

Now look at the high-scoring answers. 豆包 Pro directly said: "Check PHP error logs, or temporarily enable error output." Claude Sonnet suggested: "Compare the data differences between Articles A, B, and C, especially title length, special characters, and encoding formats."

What do these answers have in common? They're specific, actionable, and prioritized. They don't speak in generalities but provide clear operational steps.

More importantly, high-scoring models all recognized the essence of the problem: since A works normally while B and C don't, the difference must be in the data, not in the code logic. This reasoning ability is what distinguishes excellent engineers from average ones.
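
To make "compare the data differences" concrete, here is a minimal sketch in Python of what such a comparison could look like. It is not taken from any model's answer or from the original PHP code. The article titles, the `profile` and `diff_against_working` helpers, and the choice of properties to inspect are all assumptions made for illustration.

```python
import unicodedata

def profile(title: str) -> dict:
    """Summarize properties of a title that commonly affect text rendering."""
    return {
        "chars": len(title),
        "bytes_utf8": len(title.encode("utf-8")),
        "non_ascii": sorted({ch for ch in title if ord(ch) > 127}),
        # Unicode category "Cc" covers control characters such as newlines and tabs.
        "control_chars": sum(1 for ch in title if unicodedata.category(ch) == "Cc"),
    }

def diff_against_working(working: dict, failing: dict) -> dict:
    """Keep only the properties where the failing record differs from the working one."""
    return {k: (working[k], failing[k]) for k in working if working[k] != failing[k]}

# Hypothetical titles; the real article data is not part of this write-up.
articles = {
    "A": "Hello World",
    "B": "Caf\u00e9 menu, explained",
    "C": "Line one\nLine two",
}

baseline = profile(articles["A"])
for name in ("B", "C"):
    print(name, diff_against_working(baseline, profile(articles[name])))
```

The design choice is to treat the working article as a baseline and print only what differs, so the suspicious properties (length, non-ASCII characters, stray control characters) stand out instead of being buried in a full dump.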

The Middle Ground of 60 Points: Not Deep Enough

Gemini 2.5 Pro suggested checking git diff, GPT-o3 recommended examining the modified code sections. These answers aren't wrong, but they're too inefficient. In actual work, if you dive into code first instead of checking specific errors, you might waste a lot of time.

It's like solving a case - you could ask eyewitnesses first, but instead choose to review surveillance footage. The direction isn't wrong, but it's not the optimal solution.

Three Fatal Blind Spots of AI Models

Through this question, I found three fatal blind spots in how current AI models handle engineering problems:

  • No debugging intuition: a real engineer who sees "A works, B and C don't" thinks first of comparing the differences, not of generically "checking parameters"
  • No sense of priority: checking logs, comparing data, and reviewing code diffs are all valid steps, but they do not deserve equal priority
  • Overly safe responses: to avoid being wrong, many models give the most conservative answer, which is correct but also the least useful
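
The point about priorities can be sketched as a small triage loop in Python: run the checks in a deliberate order and stop at the first one that reports something. The ordering below (error log, then data comparison, then code diff) is one plausible reading of the scores in this test, not a rule stated by any of the models, and the three check functions are placeholders with made-up return values.

```python
from typing import Callable, Optional

# Each check returns a finding string, or None if it found nothing.
Check = Callable[[], Optional[str]]

def triage(checks: list[tuple[str, Check]]) -> Optional[tuple[str, str]]:
    """Run checks in the given order and stop at the first one that reports a finding."""
    for name, check in checks:
        finding = check()
        if finding is not None:
            return name, finding
    return None

# Placeholder checks standing in for real work (hypothetical results).
def read_error_log() -> Optional[str]:
    return None  # pretend the log showed nothing

def compare_article_data() -> Optional[str]:
    return "B and C titles differ from A"  # hypothetical finding

def review_code_diff() -> Optional[str]:
    return "diff reviewed"  # not reached in this run

ordered = [
    ("error log", read_error_log),
    ("data comparison", compare_article_data),
    ("code diff", review_code_diff),
]
print(triage(ordered))
```

Because the loop returns early, the slow code review never runs when a faster check has already found something, which is the practical meaning of "priority" here.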

What Does This Mean?

This test result made me realize a harsh truth: when handling real engineering problems, nearly half of these AI models (5 of 11) aren't even as good as a programmer with 2 years of experience.

More ironically, models that excel in benchmarks (like DeepSeek V3) performed terribly on this practical question. What does this tell us? It suggests that the way we currently evaluate AI might be fundamentally wrong.

Of course, there's good news too. The performance of 豆包 Pro, Claude, and Grok proves that an AI can indeed be an excellent debugging assistant, provided you choose the right model.

In the future, only AIs that can provide truly valuable advice on actual engineering problems deserve the title "intelligent." As for those models that only speak correct nonsense, let them stick to benchmarks.


Data source: YZ Index | Run #33