In the same SQL problem, the final scores of 11 models were polarized: 4 scored 100, and 7 scored zero. The core differences lie in self-join deduplication logic, time difference calculation function selection, and the placement of the status condition.
Common Features of Full-Mark Answers
Four models—豆包Pro, Grok 4, Gemini 2.5 Pro, and Gemini 3.1 Pro—correctly implemented three requirements: using p1.id < p2.id to avoid mirror duplicates; placing the status condition in either the JOIN ON clause or WHERE clause; and calculating the time difference accurately and with the correct direction.
豆包Pro uses TIMESTAMPDIFF(SECOND, p1.created_at, p2.created_at) <= 120, compatible with MySQL dialect; Grok 4 also uses TIMESTAMPDIFF and places all conditions in the ON clause with clear structure; the two Gemini models use UNIX_TIMESTAMP for the difference, satisfying the 120-second limit and ensuring first_id precedes second_id.
Typical Errors in Zero-Score Answers
The seven zero-score models mainly commit three types of errors:
- Missing p1.id < p2.id, causing the same pair of records to be output twice (Qwen3 Max, Claude Sonnet 4.6, DeepSeek V4 Pro, Claude Opus 4.7, GPT-o3, GPT-5.5);
- Incorrect time function selection or reversed direction (文心一言4.5 and DeepSeek V4 Pro use EXTRACT(EPOCH FROM ...) without handling time zones, and some answers have confused direction of created_at comparison);
- Incomplete queries or missing status conditions (文心一言4.5 truncates directly; DeepSeek places status in WHERE but omits the created_at direction restriction).
These errors in real payment systems would directly cause duplicate alarms or missed alerts, constituting engineering-critical issues.
Dialect Differences Behind Function Selection
The full-mark models clearly show better understanding of SQL dialect adaptation. TIMESTAMPDIFF is a standard MySQL function, and UNIX_TIMESTAMP is also common in MySQL environments. In contrast, EXTRACT(EPOCH FROM ...) is PostgreSQL syntax and would directly cause errors in MySQL. At least four of the seven zero-score models mix cross-dialect functions, indicating insufficient sensitivity to production SQL dialects.
The Real Gap in Engineering Judgment
This test once again proves that "knowing how to write SQL" and "being able to write production-ready SQL" are two different things. Correct answers must simultaneously satisfy three constraints: deduplication, precise time window, and status filtering. Most models fail in one or two of these areas, reflecting a systemic weakness in handling multi-condition self-joins.
The lopsided 4-to-7 score difference shows that current large models are still in the "can write" rather than "can write correctly" stage for engineering SQL generation.
If models continue to maintain such a high failure rate in similar multi-constraint self-join tasks over the next six months, core business scenarios such as payment risk control and order deduplication will continue to rely on manual review.
Data source: 赢政指数 (YZ Index) | Run #154 | View raw data
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接