11 AI Models Take the Same SQL Quiz: 3 Score Zero, Why Claude and GPT Collapsed?

May 18, 2026 431 Views - Read Source Winzheng Index

Code Execution Material Constraints SQL日期函数模型方言兼容性工程边界测试

When 11 mainstream AI models faced the same SQL aggregation query, a clear execution divide emerged. Eight models output runnable code and scored 60; three models—Claude Sonnet 4.6, Claude Opus 4.7, and GPT-o3—returned zero directly, with the issue concentrated on date range syntax compatibility with MySQL dialects.

Common Features of Correct Answers

The models that scored 60 generally adopted one of two executable approaches. The first was MySQL's native DATE_SUB function, used by models such as Doubao Pro, ERNIE Bot 4.5, and Grok 4:

created_at >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)

This syntax can be executed directly in MySQL 5.7/8.0 environments without additional conversion. The second approach, adopted by some models, used the form CURRENT_DATE - INTERVAL '90' DAY, but required removing the quotes or adjusting the dialect to pass.

Fatal Errors of Zero‑Score Models

The three zero‑score models exhibited systematic deviations in date handling. Both Claude Sonnet 4.6 and Claude Opus 4.7 wrote:

created_at >= CURRENT_DATE - INTERVAL '90 days'

The single‑quoted '90 days' is illegal syntax in MySQL and directly causes execution errors. GPT‑o3 used CURRENT_DATE - INTERVAL '90 days', which also cannot run in a standard MySQL environment.

These models ignored the implicit MySQL context of the question (the table structure and field naming style are typical of MySQL examples), directly applying date syntax from PostgreSQL or Oracle, leading to complete execution failure.

Real Differences in Execution Dimension

From a code execution perspective, the 60‑point models not only had correct syntax but also satisfied all constraints: filtering amount IS NOT NULL, status='paid', ordering by total_amount DESC + user_id ASC, and LIMIT 10. Although the zero‑score models followed a similar logical approach, they lost all points because their code could not execute.

Notably, while Gemini 2.5 Pro and Gemini 3.1 Pro received the same score, Gemini 3.1 Pro used CURDATE(), which is more compatible with MySQL and more robust in actual deployment.

Judging Model Capability Boundaries

The Claude series typically leads in natural language understanding but shows a clear weakness in engineering code tasks that require precise dialect matching. GPT‑o3, as the latest iteration, also failed on this SQL boundary question, indicating that its code execution training still has blind spots in dialect handling.

In contrast, models such as Qwen3 Max, DeepSeek V4 Pro, and GPT‑5.5 are more pragmatic in SQL dialect adaptation, producing output that can be directly used in production environments.

In code execution, syntax compatibility matters more than "looking correct."

This test once again proves that the engineering value of an AI model ultimately depends on whether its output can actually run in the target database. Subtle differences in date functions directly determined the gap between zero and sixty points.

If future models want to stand firm in engineering scenarios, they must incorporate mainstream database dialect boundary cases into core training, rather than relying solely on natural language descriptions.

Data source: YZ Index | Run #122 | View raw data

11 AI Models Take the Same SQL Quiz: 3 Score Zero, Why Claude and GPT Collapsed?

Common Features of Correct Answers

Fatal Errors of Zero‑Score Models

Real Differences in Execution Dimension

Judging Model Capability Boundaries

Related Reviews

Winzheng Index DeepSeek V4 Pro Material Constraint Plunges 31.8 Points While Code Execution Jumps from 69.5 to 100

Winzheng Index GLM-4.6: 93.30 on Material Constraint but Integrity Fail, Code Execution 25.00 Drags Down Leaderboard

Winzheng Index GLM-4.6 Integrity Rating Drops from Pass to Fail, Code Execution Surges by 47 Points

Winzheng Index Claude Opus 4.7 Smoke Evaluation Main Ranking Drops 26.1 Points, Code Execution and Material Constraints Both Fail