11 Models Attempt SQL Retention Task: 9 Score Zero, DeepSeek and Grok Only 66.7

May 18, 2026 525 Views - Read Source Winzheng Index

Code Execution Material Constraints SQL留存测试模型对比工程边界

In the YZ Index v6 code execution test, this "SQL Monthly Retention Cohort" problem directly exposed the true gap between 11 models. The final result was harsh: 9 models scored 0, with only DeepSeek V4 Pro and Grok 4 achieving 66.7 points.

Core Failure Mode: CTE Cut Off Mid-Writing

Nine models — Doubao Pro, Qwen3 Max, Claude Sonnet 4.6, ERNIE Bot 4.5, Gemini 2.5 Pro, Gemini 3.1 Pro, Claude Opus 4.7, GPT-5.5, and GPT-o3 — all got stuck at almost the same spot: the WITH statement was truncated halfway, or the SELECT only listed the first two columns and stopped. A typical example is Doubao Pro's month_offset_map, which only UNIONed three rows and then ended abruptly, and Qwen3 Max's active_months GROUP BY stopped at "e.u".

In real engineering, such "half-finished" outputs are equivalent to direct errors. The problem is not a wrong logical direction, but an inability to even assemble a complete, executable SQL statement.

Systematic Chaos in Date Calculation and Cohort Isolation

Even models that managed to finish the query commonly tripped on two critical aspects: precise filtering of cohort_month and correct generation of month_offset.

The Claude series and GPT series repeatedly used EXTRACT(YEAR FROM ...) subtraction without properly handling cross-year scenarios; Gemini 2.5 Pro used DATE_PART to compute offsets but ultimately failed to map active_month back to the three fixed offsets 0/1/2; ERNIE Bot simply used a rough DATEDIFF(MONTH ...) approach, clearly lacking engineering rigor.

In contrast, while DeepSeek V4 Pro and Grok 4's solutions still had minor issues, they at least completed the correct skeleton: "first find the 2025-01 cohort, then CROSS JOIN three offsets, finally LEFT JOIN for statistics." That is the fundamental reason they managed to score 66.7.

The zero score is not because the problem is too difficult, but because the models collectively failed at the step of "translating the idea into an executable SQL."

Why Only Two Passed?

DeepSeek V4 Pro's approach completely separated cohort filtering and offset generation, used CROSS JOIN to forcefully produce three rows (0/1/2), and then applied LEFT JOIN to count retained_users — the cleanest logic. Grok 4, though slightly redundant in its TIMESTAMPDIFF syntax, also completed the dual-CTE structure of cohort_users + user_activity.

The remaining models either over-relied on DATE_TRUNC but failed to handle HAVING filtering properly, or focused on computing offset formulas while neglecting the most basic requirement: "must output exactly three fixed-offset rows."

Direct Insights for Engineering Deployment

In real data warehouse scenarios, retention analysis is one of the most common analytical needs. The problem exposed here is not that the models don’t know SQL, but that when the requirement demands "precise control of output row count + multi-step CTE + date offsets," most current models are still in a stage of "writing something that looks like SQL but cannot run."

For teams relying on AI-assisted SQL writing, this means a manual review step must remain, especially in multi-step aggregation scenarios like cohort analysis, retention, and funnel.

The code execution dimension of the YZ Index essentially tests not whether a model "understands SQL," but whether it can "deliver correct, runnable SQL in one shot." The answer from this test: only a few can do it so far.

Data source: YZ Index | Run #122 | View raw data

11 Models Attempt SQL Retention Task: 9 Score Zero, DeepSeek and Grok Only 66.7

Core Failure Mode: CTE Cut Off Mid-Writing

Systematic Chaos in Date Calculation and Cohort Isolation

Why Only Two Passed?

Direct Insights for Engineering Deployment

Related Reviews

Winzheng Index GLM-4.6: 93.30 on Material Constraint but Integrity Fail, Code Execution 25.00 Drags Down Leaderboard

Winzheng Index GLM-4.6 Integrity Rating Drops from Pass to Fail, Code Execution Surges by 47 Points

Winzheng Index Claude Opus 4.7 Smoke Evaluation Main Ranking Drops 26.1 Points, Code Execution and Material Constraints Both Fail

Winzheng Index Gemini 2.5 Pro Code Execution Dropped 24.6 Points in a Single Day; Overall Ranking Slid 6.5 Points