11 AI Models Solve SQL Duplicate Payment Problem: Only 4 Score Full Marks, 7 Score Zero

Jun 8, 2026 485 Views - Read Source Winzheng Index

Code Execution Doubao Pro SQL自连接重复支付检测模型差异

In the same SQL problem, the final scores of 11 models were polarized: 4 scored 100, and 7 scored zero. The core differences lie in self-join deduplication logic, time difference calculation function selection, and the placement of the status condition.

Common Features of Full-Mark Answers

Four models—Doubao Pro, Grok 4, Gemini 2.5 Pro, and Gemini 3.1 Pro—correctly implemented three requirements: using p1.id < p2.id to avoid mirror duplicates; placing the status condition in either the JOIN ON clause or WHERE clause; and calculating the time difference accurately and with the correct direction.

Doubao Pro uses TIMESTAMPDIFF(SECOND, p1.created_at, p2.created_at) <= 120, compatible with MySQL dialect; Grok 4 also uses TIMESTAMPDIFF and places all conditions in the ON clause with clear structure; the two Gemini models use UNIX_TIMESTAMP for the difference, satisfying the 120-second limit and ensuring first_id precedes second_id.

Typical Errors in Zero-Score Answers

The seven zero-score models mainly commit three types of errors:

Missing p1.id < p2.id, causing the same pair of records to be output twice (Qwen3 Max, Claude Sonnet 4.6, DeepSeek V4 Pro, Claude Opus 4.7, GPT-o3, GPT-5.5);
Incorrect time function selection or reversed direction (ERNIE Bot 4.5 and DeepSeek V4 Pro use EXTRACT(EPOCH FROM ...) without handling time zones, and some answers have confused direction of created_at comparison);
Incomplete queries or missing status conditions (ERNIE Bot 4.5 truncates directly; DeepSeek places status in WHERE but omits the created_at direction restriction).

These errors in real payment systems would directly cause duplicate alarms or missed alerts, constituting engineering-critical issues.

Dialect Differences Behind Function Selection

The full-mark models clearly show better understanding of SQL dialect adaptation. TIMESTAMPDIFF is a standard MySQL function, and UNIX_TIMESTAMP is also common in MySQL environments. In contrast, EXTRACT(EPOCH FROM ...) is PostgreSQL syntax and would directly cause errors in MySQL. At least four of the seven zero-score models mix cross-dialect functions, indicating insufficient sensitivity to production SQL dialects.

The Real Gap in Engineering Judgment

This test once again proves that "knowing how to write SQL" and "being able to write production-ready SQL" are two different things. Correct answers must simultaneously satisfy three constraints: deduplication, precise time window, and status filtering. Most models fail in one or two of these areas, reflecting a systemic weakness in handling multi-condition self-joins.

The lopsided 4-to-7 score difference shows that current large models are still in the "can write" rather than "can write correctly" stage for engineering SQL generation.

If models continue to maintain such a high failure rate in similar multi-constraint self-join tasks over the next six months, core business scenarios such as payment risk control and order deduplication will continue to rely on manual review.

Data source: 赢政指数 (YZ Index) | Run #154 | View raw data

11 AI Models Solve SQL Duplicate Payment Problem: Only 4 Score Full Marks, 7 Score Zero

Common Features of Full-Mark Answers

Typical Errors in Zero-Score Answers

Dialect Differences Behind Function Selection

The Real Gap in Engineering Judgment

Related Reviews

Winzheng Index Doubao Pro Main Score Plunges 15 Points: Code Execution Drops from 75 to 58.3

Winzheng Index Doubao Pro Smoke Evaluation Main Ranking Plunges 18.6 Points, Code Execution Drops 38.8 in a Single Day

Winzheng Index DeepSeek V4 Pro Material Constraint Plunges 31.8 Points While Code Execution Jumps from 69.5 to 100

Winzheng Index GLM-4.6: 93.30 on Material Constraint but Integrity Fail, Code Execution 25.00 Drags Down Leaderboard