In the YZ Index v6 code execution test, this "SQL Monthly Retention Cohort" problem directly exposed the true gap between 11 models. The final result was harsh: 9 models scored 0, with only DeepSeek V4 Pro and Grok 4 achieving 66.7 points.
Core Failure Mode: CTE Cut Off Mid-Writing
Nine models — 豆包 Pro, Qwen3 Max, Claude Sonnet 4.6, 文心一言 4.5, Gemini 2.5 Pro, Gemini 3.1 Pro, Claude Opus 4.7, GPT-5.5, and GPT-o3 — all got stuck at almost the same spot: the WITH statement was truncated halfway, or the SELECT only listed the first two columns and stopped. A typical example is 豆包 Pro's month_offset_map, which only UNIONed three rows and then ended abruptly, and Qwen3 Max's active_months GROUP BY stopped at "e.u".
In real engineering, such "half-finished" outputs are equivalent to direct errors. The problem is not a wrong logical direction, but an inability to even assemble a complete, executable SQL statement.
Systematic Chaos in Date Calculation and Cohort Isolation
Even models that managed to finish the query commonly tripped on two critical aspects: precise filtering of cohort_month and correct generation of month_offset.
The Claude series and GPT series repeatedly used EXTRACT(YEAR FROM ...) subtraction without properly handling cross-year scenarios; Gemini 2.5 Pro used DATE_PART to compute offsets but ultimately failed to map active_month back to the three fixed offsets 0/1/2; 文心一言 simply used a rough DATEDIFF(MONTH ...) approach, clearly lacking engineering rigor.
In contrast, while DeepSeek V4 Pro and Grok 4's solutions still had minor issues, they at least completed the correct skeleton: "first find the 2025-01 cohort, then CROSS JOIN three offsets, finally LEFT JOIN for statistics." That is the fundamental reason they managed to score 66.7.
The zero score is not because the problem is too difficult, but because the models collectively failed at the step of "translating the idea into an executable SQL."
Why Only Two Passed?
DeepSeek V4 Pro's approach completely separated cohort filtering and offset generation, used CROSS JOIN to forcefully produce three rows (0/1/2), and then applied LEFT JOIN to count retained_users — the cleanest logic. Grok 4, though slightly redundant in its TIMESTAMPDIFF syntax, also completed the dual-CTE structure of cohort_users + user_activity.
The remaining models either over-relied on DATE_TRUNC but failed to handle HAVING filtering properly, or focused on computing offset formulas while neglecting the most basic requirement: "must output exactly three fixed-offset rows."
Direct Insights for Engineering Deployment
In real data warehouse scenarios, retention analysis is one of the most common analytical needs. The problem exposed here is not that the models don’t know SQL, but that when the requirement demands "precise control of output row count + multi-step CTE + date offsets," most current models are still in a stage of "writing something that looks like SQL but cannot run."
For teams relying on AI-assisted SQL writing, this means a manual review step must remain, especially in multi-step aggregation scenarios like cohort analysis, retention, and funnel.
The code execution dimension of the YZ Index essentially tests not whether a model "understands SQL," but whether it can "deliver correct, runnable SQL in one shot." The answer from this test: only a few can do it so far.
Data source: YZ Index | Run #122 | View raw data
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接