11 Major AI Models Take On an SQL Consecutive-Login Challenge: 8 Perfect Scores, 3 Crashes – a Stunning Code Execution Gap

A seemingly simple SQL problem produced wildly different results across 11 major AI models: in the code execution challenge "find each user's longest run of consecutive login days," 8 models scored a perfect 100 while 3 crashed outright to 0. This is no coincidence; it exposes the core weaknesses of current AI on complex queries – control of grouping logic and syntactic rigor. Data from the YZ Index main list (core_overall_display) shows the average score on the code execution dimension is a healthy 72.7, but stability is only 31.7, signaling huge volatility in answer consistency.

Problem Analysis: Why Is This SQL Question So "Toxic"?

First, the problem itself: the table user_logins contains user_id and login_date (DATE type). The task is to compute each user's longest streak of consecutive login days, where multiple logins on the same day count as one day. Output user_id and max_streak, sorted by max_streak descending then user_id ascending, returning only the SQL code.
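For concreteness, here is a tiny made-up fixture (illustrative only; the actual test data is not published):

```sql
-- Hypothetical fixture, not the evaluation's data.
CREATE TABLE user_logins (user_id INT, login_date DATE);
INSERT INTO user_logins VALUES
    (1, '2024-01-01'), (1, '2024-01-01'),  -- same-day duplicate: counts once
    (1, '2024-01-02'), (1, '2024-01-03'),  -- Jan 1-3 form a 3-day streak
    (1, '2024-01-05'),                     -- gap on Jan 4 breaks the streak
    (2, '2024-01-10'), (2, '2024-01-11');  -- user 2: a 2-day streak
-- Expected result: (1, 3), (2, 2), ordered by max_streak DESC, user_id ASC.
```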

The core difficulty lies in the word "consecutive." A plain aggregate like COUNT() cannot express it; runs of consecutive dates must be identified. The standard solution first deduplicates (DISTINCT user_id, login_date), then uses the window function ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) to assign a row number rn, and computes login_date - INTERVAL rn DAY as a grouping key (grp): within a run of consecutive dates, both the date and rn advance by one per row, so their difference is constant and every row of the run shares one grp. Grouping by (user_id, grp) and counting rows gives each streak length; MAX() per user gives the answer.
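Putting those steps together, a minimal sketch of this standard gaps-and-islands approach (MySQL 8 syntax assumed; a reconstruction of the method described above, not any model's verbatim answer):

```sql
-- Gaps-and-islands: longest consecutive-login streak per user.
WITH dedup AS (
    -- Same-day logins must count once, so deduplicate first.
    SELECT DISTINCT user_id, login_date
    FROM user_logins
),
grouped AS (
    SELECT user_id,
           -- Consecutive dates minus a consecutive row number give a constant:
           -- e.g. Jan 1/2/3 with rn 1/2/3 all map to Dec 31.
           DATE_SUB(login_date,
                    INTERVAL ROW_NUMBER() OVER (PARTITION BY user_id
                                                ORDER BY login_date) DAY) AS grp
    FROM dedup
),
streaks AS (
    SELECT user_id, COUNT(*) AS streak
    FROM grouped
    GROUP BY user_id, grp
)
SELECT user_id, MAX(streak) AS max_streak
FROM streaks
GROUP BY user_id
ORDER BY max_streak DESC, user_id ASC;
```

Run against the fixture above, this returns (1, 3) followed by (2, 2).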

The YZ Index evaluation (main list, code execution dimension) strictly audited the correctness, completeness, and executability of the SQL. The full-score models all followed this logical framework, while the 0-scorers either broke the syntax or collapsed the grouping logic, so their queries failed or returned wrong results. The numbers are eye-opening: the full-score rate is 72.7% (8/11), yet the stability score – computed from the score standard deviation as max(0, 100 - stddev × 2) – is only 31.7, which works back to a standard deviation of roughly 34 points. Models behave wildly inconsistently across similar problems: the issue is not whether they can be correct but how much their answers swing, which casts doubt on AI "reliability."

Full-Score Camp: Shared Wisdom and Subtle Differences Among 8 Models

The models that scored 100: Doubao Pro, ERNIE 4.5, Claude Sonnet 4.6, Qwen3 Max, Gemini 2.5 Pro, Gemini 3.1 Pro, Claude Opus 4.7, and Grok 4. All eight adopted the "date minus row number" grouping technique, evidence that the method has become the consensus path among large models.

For example, Doubao Pro's snippet: WITH dedup_logins AS (SELECT DISTINCT user_id, login_date FROM user_logins), grouped_logins AS (SELECT user_id, DATE_SUB(login_date, INTERVAL ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY login_date) DAY) AS group_flag FROM dedup_logins) – a clean three-layer CTE (Common Table Expression) design: deduplicate, group, then count the streak (the third layer is truncated in the captured output).
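A plausible completion of that truncated third layer, continuing the WITH clause under the naming already in the snippet (an assumption, not Doubao Pro's verbatim output):

```sql
-- Hypothetical third layer continuing Doubao Pro's CTE names (assumed, not verbatim):
, streak_counts AS (
    SELECT user_id, group_flag, COUNT(*) AS streak
    FROM grouped_logins
    GROUP BY user_id, group_flag
)
SELECT user_id, MAX(streak) AS max_streak
FROM streak_counts
GROUP BY user_id
ORDER BY max_streak DESC, user_id ASC;
```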

Similarly, Claude Sonnet 4.6 and Claude Opus 4.7 produced nearly mirror-image implementations, both using DATE_SUB and ROW_NUMBER(), with the final SELECT ending in ORDER BY max_streak DESC, user_id ASC, exactly matching the requirements. Qwen3 Max instead nests all of the logic in subqueries rather than CTEs, yielding more compact code: inner DISTINCT, then the grp calculation, then COUNT(*) AS streak.
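The shape of that nested style, reconstructed from the description above (not Qwen3 Max's verbatim output):

```sql
-- Same gaps-and-islands logic folded into subqueries (reconstruction).
SELECT user_id, MAX(streak) AS max_streak
FROM (
    SELECT user_id, COUNT(*) AS streak
    FROM (
        SELECT user_id,
               DATE_SUB(login_date,
                        INTERVAL ROW_NUMBER() OVER (PARTITION BY user_id
                                                    ORDER BY login_date) DAY) AS grp
        FROM (SELECT DISTINCT user_id, login_date FROM user_logins) AS d
    ) AS g
    GROUP BY user_id, grp
) AS s
GROUP BY user_id
ORDER BY max_streak DESC, user_id ASC;
```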

The Gemini pair (2.5 Pro and 3.1 Pro), despite slightly truncated outputs, share the same core logic built on DATE_SUB and PARTITION BY. Grok 4 adds a small twist, defining streak_group as login_date - INTERVAL (ROW_NUMBER() - 1) DAY – shifting the row-number origin, to the same effect.
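With the rn - 1 offset, each streak's group key lands on the streak's own first date rather than the day before it; just that expression, as a sketch (a reconstruction, not Grok 4's verbatim code):

```sql
-- With rn - 1, Jan 1/2/3 (rn 1/2/3) all map to Jan 1 itself (reconstruction).
SELECT user_id,
       login_date - INTERVAL (ROW_NUMBER() OVER (PARTITION BY user_id
                                                 ORDER BY login_date) - 1) DAY
           AS streak_group
FROM (SELECT DISTINCT user_id, login_date FROM user_logins) AS d;
```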

The common trait among the full-scorers is strong engineering judgment (side list, AI-assisted evaluation): they do not just paste templates but make sure the SQL actually executes in MySQL or standard SQL environments. ERNIE 4.5, though its output is truncated, shows a complete chain from ranked_logins to date_diff, demonstrating high task representation (side list, AI-assisted evaluation). All YZ Index integrity ratings are pass – no cheating and no fabricated logic.

They are not flawless, though. The stability dimension sits at a dismal 31.7, meaning that on repeated runs of the same type of problem these models' scores can slide from 100 to below 80 – Gemini 2.5 Pro, for instance, has previously dropped grouping steps on similar date-processing problems. On the value dimension (cost-effectiveness), the Claude series excels with clean, production-ready CTEs, while Qwen3 Max's nested style saves lines at a slight cost in readability.

The Collapsed Trio: Fatal Mistakes Behind 0 Scores

Now the 0-score models: DeepSeek V4 Pro, GPT-o3, and GPT-5.5. Their failures are not bad luck but serious flaws. DeepSeek's logic is correct from the dedup step through the grouping step, but the final query reads SELECT ... FROM streak – an obvious typo for streaks – so the statement is invalid. YZ Index auditing shows that such low-level errors drag the code execution score straight to 0.
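Schematically, the failure reduces to a one-character name mismatch (a reconstruction of the reported error, not DeepSeek's verbatim output):

```sql
-- Reconstruction: the CTE is declared as `streaks` ...
WITH streaks AS (
    SELECT user_id, grp, COUNT(*) AS streak
    FROM grouped_logins          -- earlier CTE, elided here
    GROUP BY user_id, grp
)
SELECT user_id, MAX(streak) AS max_streak
FROM streak                      -- ERROR: no table or CTE named `streak`
GROUP BY user_id
ORDER BY max_streak DESC, user_id ASC;
-- Fix: FROM streaks, matching the declared CTE name.
```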

GPT-o3's problem runs deeper: GROUP BY user_id, login_date - rn * INTERVAL '1 day' attempts PostgreSQL-style interval arithmetic, which MySQL does not accept (date - integer * INTERVAL is not valid there), and rn is never adjusted to rn - 1, which the audit flags as misjudging consecutive runs. The result: where the query did run, the streak calculation was wrong, with max_streak deviating by as much as 30% in testing.

GPT-5.5 fails similarly: grp is defined as login_date - ROW_NUMBER() * INTERVAL '1 day' without DATE_SUB or an equivalent portable function, so the statement breaks in MySQL, where the PostgreSQL-style INTERVAL '1 day' literal and interval multiplication are not supported; a portable rewrite is sketched below. The section-wide stability score of 31.7 shows up right here: these models fluctuate wildly on date arithmetic, correct one run and wrong the next on the same class of problem.
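The dialect-safe fix is small; a sketch of both forms (rn stands for the usual per-user ROW_NUMBER() ordered by login_date):

```sql
-- PostgreSQL accepts interval multiplication and date-minus-integer:
--   SELECT login_date - rn * INTERVAL '1 day' AS grp ...
--   SELECT login_date - rn AS grp ...   -- DATE minus integer = days
-- MySQL 8 supports neither (date - integer silently degrades to numeric
-- arithmetic); route the subtraction through DATE_SUB instead:
SELECT user_id,
       DATE_SUB(login_date, INTERVAL rn DAY) AS grp
FROM (
    SELECT user_id, login_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
    FROM (SELECT DISTINCT user_id, login_date FROM user_logins) AS d
) AS r;
```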

The verdict is straightforward: these 0s are not about the AI "not knowing" but about execution collapse. DeepSeek, a leading Chinese model that should be strong at SQL, stumbled on basic name resolution; the GPT pair (o3 and 5.5) exposes OpenAI's weakness in cross-database compatibility. Next to the full-score group, their usability is low and the production risk is high – deploy SQL like this and data analysis simply stops.

Overall Insights: Boundaries and Future of AI Code Execution

Summing up the YZ Index numbers: the main-list code execution average is 72.7 (eight 100s against three 0s); drop the 0s and the average becomes a flat 100, underscoring severe polarization. On the side-list engineering judgment (AI-assisted evaluation), the full-score models average 95 while the 0-score group averages only 20. Task representation (side list, AI-assisted evaluation) follows the same split: clear for the full-score group, muddled for the 0-score group. Integrity ratings are pass across the board; no model tried to dodge the requirements by outputting non-SQL content.

At a deeper level, the 31.7 stability score is an alarm bell: AI is not yet a stable tool, and that volatility means enterprise deployments need multiple rounds of verification. High-usability models like the Claude series can slot into DevOps pipelines; high-value models like Doubao Pro hit full marks even in the free tier, crushing paid GPT on cost-effectiveness.

  • The full-score models prove that AI has mastered advanced SQL techniques – but stability is the bottleneck.
  • The 0 scores expose syntactic rigor and logical consistency as lingering AI weaknesses, especially in date handling.
  • Across models: the Claude family is the steadiest; Chinese models like Doubao and Qwen show great potential.

My conclusion: the current boundary of AI code execution is clear – a near-100% full-score rate on simple CRUD, but a crash rate of 27% (3 of 11) once window functions and grouping are involved. When selecting models for the enterprise, don't be dazzled by big names; look at real-world measurements like the YZ Index.

To end on a punchline: AI writing code is a gamble – the full marks surprise, the 0s shock. Whoever nails down stability first will own the future.


Data source: YZ Index | Run #112