Engineering Judgment Test: Comparative Analysis of Database Deletion Recovery Solutions from 8 AI Models

In this engineering judgment question about database deletion recovery, 8 mainstream AI models differed significantly in how they understood the problem and how they responded. The question tested one core point: after an accidental deletion in a production database, what should the engineer's first step be?

Understanding Deviation: Two Distinct Camps

The score distribution is clearly polarized: 5 models received 40 points, while 3 models received 0 points. The gap stems from divergent readings of what the "first step" should be.

The 0-point camp (DeepSeek V3, DeepSeek R1, Gemini 2.5 Pro) emphasized only "stopping write operations." While this is a correct emergency measure, these models ignored the key information explicitly stated in the question: "confirmed to have a complete backup from last night." Their responses stayed at the level of general incident response and did not provide a complete solution for the specific scenario.

The 40-point camp (Claude Sonnet, Claude Opus, Qwen Max, GPT-4o, GPT-o3) demonstrated a more comprehensive understanding. Their answers explicitly pointed out the need to restore from the backup, and most also mentioned stopping writes. Among them, Claude Opus gave the most detailed answer, offering 3 specific ways to stop writes, which reflects a solid grounding in engineering practice.

Key Insights: Details Make the Difference

Notably, both DeepSeek versions (V3 and R1) gave nearly identical answers, both limited to "stopping writes." In contrast, the Claude series and GPT series models all identified the full intent of the question: not just damage control, but recovery as well.

GPT-o3's answer was the most concise and direct: "Immediately restore the user table data from last night's backup." Although it skipped the step of stopping writes, it captured the core solution to the problem. Claude Sonnet also specifically mentioned "notifying team leaders" and "recording the time point," reflecting awareness of team coordination and of what post-incident analysis requires in real-world work.
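The steps the higher-scoring answers combined (record the time point, stop writes, restore the user table from the backup) can be sketched as follows. This is a minimal illustration, not any model's actual answer: it uses two in-memory SQLite databases to stand in for production and last night's backup, and the names make_db, recover_users, and the log entries are hypothetical. Stopping writes is represented only by a log entry, since how writes are actually blocked depends on the system.

```python
import sqlite3
from datetime import datetime, timezone


def make_db(rows):
    """Create an in-memory database with a small users table (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def recover_users(prod, backup, log):
    """Hypothetical recovery order: record the time, stop writes, restore from backup."""
    # 1. Record the time point for later post-incident analysis.
    log.append(("incident_recorded", datetime.now(timezone.utc).isoformat()))

    # 2. Stop writes. Assumption: only a log entry here; a real system would
    #    block writes through whatever mechanism it provides.
    log.append(("writes_stopped", True))

    # 3. Restore the users table from the backup copy in one transaction.
    rows = backup.execute("SELECT id, name FROM users").fetchall()
    with prod:  # commits on success, rolls back on error
        prod.execute("DELETE FROM users")
        prod.executemany("INSERT INTO users (id, name) VALUES (?, ?)", rows)
    log.append(("rows_restored", len(rows)))
    return len(rows)


# Simulate the scenario: the backup is intact, production lost its rows.
backup = make_db([(1, "alice"), (2, "bob")])
prod = make_db([(1, "alice"), (2, "bob")])
prod.execute("DELETE FROM users")  # the accidental deletion
prod.commit()

log = []
restored = recover_users(prod, backup, log)
```

The restore runs inside a single transaction so that a failure partway through leaves the table as it was rather than half-restored.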

Conclusion: The Watershed of Engineering Judgment

This question effectively distinguished how well AI models understand engineering practice. Stronger models not only recognize general emergency measures but also provide a complete solution based on the specific conditions given (here, an available backup). In this test, the Claude series, the GPT series, and Qwen Max showed more mature engineering judgment, while the DeepSeek series and Gemini still have room for improvement in scenarios that require weighing all the available information.


Data source: YZ Index | Run #20