AI Coding Benchmarks

Which AI model writes the best code? HumanEval and MBPP are common benchmarks, but they test only function-level completion, which is far from real-world development. The YZ Index Execution dimension runs model-generated programs in isolated sandboxes and verifies compilation, runtime correctness, and edge-case handling. It is one of the few independent benchmarks that score by actually executing the code rather than by model-as-judge grading. This topic tracks coding capability rankings, programming tool updates, and AI-assisted development practices.
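To make execution-based scoring concrete, here is a minimal Python sketch of the general idea: check that a generated program compiles, run it in a separate process with a timeout, and compare its output against expected results, including an edge case. This is not the YZ Index's actual harness. The names `run_candidate` and `execution_score`, the test-case format, and the pass-rate formula are all assumptions for illustration, and a plain subprocess is only a stand-in for a real sandbox (it does not restrict file, network, or memory access).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(source: str, stdin_text: str, timeout_s: float = 5.0) -> dict:
    """Run one generated program in a child process (hypothetical helper)."""
    # Step 1: "compilation" check. For Python this is only a syntax check.
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return {"compiled": False, "ok": False, "stdout": "", "error": str(exc)}

    # Step 2: run in a throwaway directory with a time limit.
    # NOTE: this is process isolation only, not a security sandbox.
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"compiled": True, "ok": False, "stdout": "", "error": "timeout"}

    return {
        "compiled": True,
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "error": proc.stderr,
    }


def execution_score(source: str, cases: list[tuple[str, str]]) -> float:
    """Percentage of (stdin, expected_stdout) cases passed (assumed formula)."""
    passed = 0
    for stdin_text, expected in cases:
        result = run_candidate(source, stdin_text)
        # Step 3: runtime correctness means a clean exit and matching output.
        if result["ok"] and result["stdout"].strip() == expected.strip():
            passed += 1
    return 100.0 * passed / len(cases)
```

Under these assumptions, a program that handles the normal input but crashes on an empty line would pass only part of its cases, so its score would fall below 100 even though it compiles. A model-as-judge approach could miss that kind of failure because the code is never run.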

Review Qwen3 Max Main Leaderboard Plummets 12.9 Points, Code Execution Drops 26.8 in a Single Day

In the June 2026 Smoke evaluation of the YZ Index, Qwen3 Max's main leaderboard score fell from 84.92 to 72.02, a drop of 12.9 points, with the code e

Review Qwen3 Max Main Board Plunges 12.9 Points, Gemini 2.5 Pro Leads Smoke Lite List with 96.99 Points

In the YZ Index Smoke Lite evaluation of 11 models on July 4, 2026, Gemini 2.5 Pro ranked first with a Main Board score of 96.99, while Qwen3

Review GPT-5.5 Leads Smoke Benchmark with Main Score of 86.95 and Perfect Execution Score of 100, Exposing Constraint Weakness

In the Smoke lightweight benchmark on July 3, 2026, GPT-5.5 ranked first with a main score of 86.95, driven by a perfect code execution score of 100,

Review Gemini 3.1 Pro Tops with 82.97 Points, Execution Score of 75 Points Widens Gap with Second Place

In the YZ Index Smoke lightweight evaluation on July 2, 2026, Gemini 3.1 Pro achieved first place on the main leaderboard with 82.97 points (Execution

Review Doubao Pro Smoke Evaluation Main Ranking Plunges 18.6 Points, Code Execution Drops 38.8 in a Single Day

In the YZ Index June 2026 live test of 11 models, Doubao Pro’s Smoke Evaluation main ranking fell from 85.91 yesterday to 67.32 today, a drop of 18.6

Review Grok 4 Smoke Evaluation Main Score Plummets 15.3 Points, Code Execution Drops 31.4 in a Single Day

In today's YZ Index Smoke evaluation, Grok 4's main score dropped from 97.98 to 82.73, a decrease of 15.3 points, and code execution fell from 100.00

Review Claude Opus 4.7 Tops with 94.82 Points, Gemini 3.1 Pro Plunges 32.2 Points

In the Smoke lightweight evaluation on July 1, 2026, Claude Opus 4.7 ranked first on the main leaderboard with a score of 94.82, while Gemini 3.1 Pro

Review Claude Sonnet 4.6 Smoke Main Ranking Plunges 15.3 Points, Code Execution Drops 25 Points in a Single Day

In the June 2026 Smoke evaluation of the YZ Index, Claude Sonnet 4.6 saw its main ranking score drop from 97.84 to 82.52 points, a single-day decline

Review Claude Opus 4.7 Main Score Plunges 16 Points in Smoke Test, Code Execution Drops 27.2 in a Single Day

In the YZ Index June 2026 Smoke Evaluation, Claude Opus 4.7's main score dropped from 100.00 yesterday to 84.01 today, and its code execution dimensio

Review Gemini 3.1 Pro Tops with 98.47 Points, Claude's Execution Score Plunges 27.2 to 72.8

In the June 30, 2026 Smoke Lite evaluation of the YZ Index, Gemini 3.1 Pro ranked first with a main score of 98.47 points. Multiple models saw signifi

Cursor now has a mobile app for guiding your coding agent on the go

Cursor has launched a new mobile app for remote oversight of coding agents.

Review Doubao Pro Smoke Evaluation Main Ranking Drops 13.8 Points, Code Execution Falls from 100 to 75

In the June 2026 YZ Index evaluation of 11 models, Doubao Pro's main ranking score dropped from 98.61 yesterday to 84.77 today, a decline of 13.8 points.

Review Claude Sonnet 4.6 Smoke Evaluation Main Score Plummets 25.9 Points, Code Execution Drops from 100 to 50

In the June 2026 Smoke evaluation of the YZ Index, Claude Sonnet 4.6's main score fell from 96.45 to 70.52, code execution dropped from 100.00 to 50.00, w

Review Claude Opus 4.7 Code Execution Plummets from 100 to 50, Main Score Drops 25.7 Points in a Single Day

In today's Smoke evaluation of the YZ Index, Claude Opus 4.7 saw its main score drop from 97.12 to 71.47, a decline of 25.7 points, driven entirely by

Review Claude Opus 4.7 Leads with 97.12 Points, Perfect Execution but Material Constraint Score of 93.6 Drags Down Overall

In the YZ Index Smoke lightweight evaluation on June 27, 2026, Claude Opus 4.7 ranked first on the main leaderboard with 97.12 points, achieving a

Review Qwen3 Max Code Execution Plunges 50 Points, Main Ranking Only Drops 1.5 Points

In the June 2026 YZ Index evaluation of 11 models, Qwen3 Max's code execution score plummeted from 100.00 to 50.00 in a single day. However, the main

Review Claude Opus 4.7 Smoke Evaluation Main Benchmark Drops 27.5 Points, Code Execution Falls from 100 to 50

In the June 2026 YZ Index test of 11 models, Claude Opus 4.7's Smoke main benchmark score dropped from 100.00 yesterday to 72.50 today, with the code

Review Execution Scores of 4 Models Plunge to 50, ERNIE Bot (文心一言) Main Leaderboard Score Drops 34.1 Points

In the YZ Index Smoke lightweight evaluation on June 24, 2026, the main ranking score of ERNIE Bot (文心一言) 4.5 plummeted 34.1 points from yesterday to 64.63, and the executio

Review Gemini 2.5 Pro Plunges 28 Points on Main Leaderboard, Code Execution Halved from 100

Gemini 2.5 Pro's main leaderboard score on the YZ Index June 2026 Smoke Benchmark dropped from 99.28 to 71.33, a single-day decline of 28 points, driv

Review Qwen3 Max Main Score Plummets 19.2 Points, Code Execution Drops 31.2 Points in a Single Day

In the YZ Index's June 2026 test of 11 models, Qwen3 Max's main score dropped from 100 points yesterday to 80.82 points today, a decrease of 19.2 poin