When 11 mainstream AI models faced the same SQL aggregation query, a clear execution divide emerged. Eight models output runnable code and scored 60; three models—Claude Sonnet 4.6, Claude Opus 4.7, and GPT-o3—returned zero directly, with the issue concentrated on date range syntax compatibility with MySQL dialects.
Common Features of Correct Answers
The models that scored 60 generally adopted one of two executable approaches. The first was MySQL's native DATE_SUB function, used by models such as 豆包 Pro, 文心一言 4.5, and Grok 4:
created_at >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
This syntax can be executed directly in MySQL 5.7/8.0 environments without additional conversion. The second approach, adopted by some models, used the form CURRENT_DATE - INTERVAL '90' DAY, but required removing the quotes or adjusting the dialect to pass.
Fatal Errors of Zero‑Score Models
The three zero‑score models exhibited systematic deviations in date handling. Both Claude Sonnet 4.6 and Claude Opus 4.7 wrote:
created_at >= CURRENT_DATE - INTERVAL '90 days'
The single‑quoted '90 days' is illegal syntax in MySQL and directly causes execution errors. GPT‑o3 used CURRENT_DATE - INTERVAL '90 days', which also cannot run in a standard MySQL environment.
These models ignored the implicit MySQL context of the question (the table structure and field naming style are typical of MySQL examples), directly applying date syntax from PostgreSQL or Oracle, leading to complete execution failure.
Real Differences in Execution Dimension
From a code execution perspective, the 60‑point models not only had correct syntax but also satisfied all constraints: filtering amount IS NOT NULL, status='paid', ordering by total_amount DESC + user_id ASC, and LIMIT 10. Although the zero‑score models followed a similar logical approach, they lost all points because their code could not execute.
Notably, while Gemini 2.5 Pro and Gemini 3.1 Pro received the same score, Gemini 3.1 Pro used CURDATE(), which is more compatible with MySQL and more robust in actual deployment.
Judging Model Capability Boundaries
The Claude series typically leads in natural language understanding but shows a clear weakness in engineering code tasks that require precise dialect matching. GPT‑o3, as the latest iteration, also failed on this SQL boundary question, indicating that its code execution training still has blind spots in dialect handling.
In contrast, models such as Qwen3 Max, DeepSeek V4 Pro, and GPT‑5.5 are more pragmatic in SQL dialect adaptation, producing output that can be directly used in production environments.
In code execution, syntax compatibility matters more than "looking correct."
This test once again proves that the engineering value of an AI model ultimately depends on whether its output can actually run in the target database. Subtle differences in date functions directly determined the gap between zero and sixty points.
If future models want to stand firm in engineering scenarios, they must incorporate mainstream database dialect boundary cases into core training, rather than relying solely on natural language descriptions.
Data source: YZ Index | Run #122 | View raw data
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接