AI Coding Benchmarks

44 articles
Which AI model writes the best code? HumanEval and MBPP are common benchmarks, but they test only function-level completion, which is far from real-world development. The YZ Index Execution dimension runs model-generated programs in isolated sandboxes and verifies compilation, runtime correctness, and edge-case handling. It is one of the few independent benchmarks that scores models by actually executing their code rather than by model-as-judge scoring. This topic tracks coding capability rankings, programming tool updates, and AI-assisted development practices.
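To make the idea of execution-based scoring concrete, here is a minimal sketch in Python. It is not the YZ Index harness, whose implementation this page does not describe. The function names `run_candidate` and `score`, the stdin/stdout test-case format, and the 5-second timeout are all assumptions chosen for illustration. A plain subprocess is also not a real sandbox; a production harness would add container or VM isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(source: str, stdin_text: str, timeout_s: float = 5.0) -> tuple[str, str]:
    """Run candidate source in a separate Python process; return (status, stdout).

    Hypothetical helper: a subprocess only separates processes, it does not
    sandbox the code. Real harnesses add OS-level isolation on top.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: Python's isolated mode
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return "timeout", ""
    if proc.returncode != 0:
        return "runtime_error", proc.stdout
    return "ok", proc.stdout


def score(source: str, cases: list[tuple[str, str]]) -> float:
    """Return the percentage of (stdin, expected_stdout) cases the program passes."""
    if not cases:
        raise ValueError("at least one test case is required")
    # Stage 1: "compilation" check. For Python this is a syntax check only.
    try:
        compile(source, "candidate.py", "exec")
    except SyntaxError:
        return 0.0
    # Stage 2: run every case, including edge cases, and compare actual output.
    passed = 0
    for stdin_text, expected in cases:
        status, out = run_candidate(source, stdin_text)
        if status == "ok" and out.strip() == expected.strip():
            passed += 1
    return 100.0 * passed / len(cases)
```

Calling `score(program_text, [("1 2 3", "6"), ("", "0")])` runs the program once per case and counts a pass only when the process exits cleanly and its output matches. The empty-input case is the kind of edge case that function-level completion benchmarks tend to miss. Because the score comes from observed behavior, no second model is needed to judge the answer.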
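The Future of Programming Has Arrived: Anthropic Uses Claude to Showcase a New Paradigm for AI Coding
At Code with Claude, Anthropic's developer event in London, the company showcased its latest work in AI-assisted programming. Attendees were asked whether they had ever used AI to generate code, and the answer revealed an irreversible trend: whether we like it or not, AI is reshaping the foundations of software development. This article analyzes Claude's coding capabilities, the industry impact, and the technical challenges behind them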
May 22, 2026
Review Gemini 3.1 Pro Drops 8.5 Points on Main Leaderboard, Code Execution Plummets 9.5 – Lottery or Degradation?
In today's Smoke evaluation, Gemini 3.1 Pro saw a sharp 8.5-point drop on the main leaderboard, with code execution falling from 66.70 to 57.20 and ma
May 22, 2026
Review Smoke Quick Test: Doubao Pro Scores 100 in Execution, 9 Models Plunge Over 30 Points on Main Leaderboard
Doubao Pro achieved 91.23 points with a perfect 100 in code execution and a pass in integrity, while most other models saw their execution scores drop
May 22, 2026
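Anthropic's Code with Claude: The Future of Programming Is Here. Are You Ready?
Anthropic held Code with Claude, a two-day developer event in London, to showcase the latest progress in AI-assisted programming. The event ran at the same time as Google I/O, and that was no coincidence. As a programming assistant, Claude is changing developer workflows, raising efficiency while also prompting deeper questions about the human role, code quality, and more. This article is compiled from MIT Technology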
May 22, 2026
Review Doubao Pro main index plummets 18.4 points, code execution drops 30.8 in one day: real degradation or sampling luck?
Doubao Pro's main index in the Smoke evaluation dropped sharply by 18.4 points in a single day, with code execution falling 30.8 points. This could be
May 21, 2026
Review Grok 4 Tops with 98.34 Points, Claude Opus Plunges 31.3 Points on Main Leaderboard
In today's 10-question quick test by Smoke, Grok 4 ranked first with 98.34 points, while Claude Opus 4.7 saw a sharp drop of 31.3 points on the main l
May 21, 2026
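Google's Gemini 3.5 Flash: Betting on AI Agents, Not Chatbots
At its annual developer conference, Google released Gemini 3.5 Flash, its most powerful AI model to date for coding and agentic work. The model can carry out complex tasks autonomously and build software from scratch, marking Google's formal shift toward a new wave of AI centered on agents rather than stopping at chatbots.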
May 20, 2026
Review Claude Opus 4.7 Main Ranking Plummets 22.6 Points, Code Execution Halved from 100
Claude Opus 4.7's main ranking in today's Smoke evaluation dropped from 93.48 to 70.93, a single-day decline of 22.6 points. The code execution dimens
May 19, 2026
Review Grok 4 Tops with 97.44 Points, GPT-o3 Plunges 28 Points on Main Leaderboard
In Smoke's latest 10-question quick test, execution weaknesses of AI models were laid bare. Grok 4 reached the top with 97.44 points, while GPT-o3's m
May 19, 2026
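Can Ordinary People Vibe Code Too? I Built a Database with Claude
These days it seems anyone can create anything through "Vibe Code." As a non-technical newcomer, the author worked with the AI assistant Claude to try building a database that records people's small everyday grievances. This article explores how feasible this emerging programming paradigm is and reflects on what AI-assisted programming means for ordinary people.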
May 18, 2026
Review 11 AI Models Solve Consecutive Login SQL Problem: 8 Full Scores, 3 Failed Outright
The same classic SQL problem of consecutive logins split 11 mainstream models into two camps: 8 gave completely correct answers, and 3 completely collap
May 18, 2026
Review 11 Models Attempt SQL Retention Task: 9 Score Zero, DeepSeek and Grok Reach Only 66.7
In the YZ Index v6 code execution test, the "SQL Monthly Retention Cohort" problem laid bare the true capabilities of 11 models. The result was brutal
May 18, 2026
Review 11 AI Models Take the Same SQL Quiz: 3 Score Zero. Why Did Claude and GPT Collapse?
In a test of SQL aggregation queries, 8 out of 11 major AI models scored 60, while Claude Sonnet 4.6, Claude Opus 4.7, and GPT-o3 scored 0 due to date
May 18, 2026
Review This Week's 11-Model Overhaul: Newcomer Qwen3 Max Enters with 68.5, Veterans at 75 Exit En Masse
This week’s YZ Index v6 main leaderboard saw six legacy models removed and five new ones added simultaneously, reshuffling the top ten within a single
May 18, 2026
Review Gemini 3.1 Pro Main Score Plunges 11.1 Points, Code Execution Falls from 100 to 75
In today's Smoke quick test, Gemini 3.1 Pro's main score dropped 11.1 points, primarily due to code execution falling from 100 to 75, while material c
May 18, 2026
Review Qwen3 Max Main Index Plummets 10.9 Points, Code Execution Drops 25 Points in a Single Day
Qwen3 Max's main index dropped 10.9 points in today's Smoke test, with the code execution dimension falling from a perfect 100 to 75. This one-day flu
May 18, 2026
Review GPT-5.5's Main Ranking Plunges 28 Points: Is It Real Degradation?
GPT-5.5's code execution score dropped from 100 to 50, causing a 28-point drop in the main ranking. But is this degradation or just sampling noise?
May 16, 2026
Review Three Models Plunge by 28 Points, Claude Still Near Perfect Score
Today's YZ Index Smoke lightweight test reveals that three leading models suffered significant drops, while Claude models held near-perfect scores
May 16, 2026
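OpenAI Announces Codex Is Coming to Phones, Keeping the Coding Assistant Always Online
OpenAI recently announced that a mobile version of its AI coding assistant Codex is coming soon, letting users access code generation and completion directly from their phones. The move aims to remove the constraints of desktop devices so developers can manage their programming workflows efficiently even when away from a computer. The update will bring greater flexibility, with support for voice input and lightweight task handling, and could change the mobile programming ecosystem.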
May 15, 2026
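Clawdmeter: A Small Desktop Dashboard That Shows Claude Code Usage Data in Real Time
Clawdmeter, a small open-source tool, turns Claude Code usage statistics into a pocket-sized desktop dashboard designed for heavy users of AI coding tools. It displays key metrics such as API call counts, token consumption, and cost in real time and is highly customizable, helping developers manage the cost and performance of their AI coding assistant efficiently. This article analyzes the tool's features, technical background, and industry significance, and discusses A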
May 15, 2026