Winzheng — AI Model Benchmarking · Change Intelligence · Selection Guide

"We pissed off a lot of people": Giant data center plan cut 50% amid protests

Developer felt "beaten up," with "no choice" but to shrink data center.

2026-06-06 06:01

Overall Top 5

#1 Gemini 2.5 Pro 79 ▲29.7 · #2 Claude Opus 4.7 78.8 ▼3.1 · #3 豆包 Pro 78.8 ▼2.8 · #4 Grok 4 78.4 ▼5.3 · #5 GPT-5.5 78.2 ▼1.2 · #6 Claude Sonnet 4.6 78 ▼3.2 · #7 Qwen3 Max 77.7 ▼3.1 · #8 Gemini 3.1 Pro 77.1 ▲24.3 · #9 DeepSeek V4 Pro 76.9 ▼4.2 · #10 GPT-o3 75.9 ▼2.6 · #11 文心一言 4.5 61.7 ▼12.5 · ▲ Qwen3 Max +66.5 · ▼ DeepSeek V3 -75.1

Latest News

News 06-06 06:00 TC

Startup Battlefield 200 applications officially close in 3 days

Applications for Startup Battlefield 200 officially close on June 8, 11:59 p.m. PT. Don't wait any longer. Secure your s

News 06-06 04:02 TC

The ‘together tech’ wave might be the most intriguing startup bet of 2026

While the AI fundraising machine keeps breaking its own records, some founders are building in the other direction.

News 06-06 04:01 ARS

S&P 500 rejects SpaceX, also blocking entry for OpenAI and Anthropic

SpaceX won’t get easy access to billions of dollars from passive investors.

News 06-06 04:00 TC

Google will pay SpaceX $920M per month for compute

The companies announced the deal on Friday, just one week ahead of SpaceX's historic IPO.

News 06-06 02:00 TC

The most interesting startups right now want to get you off your phone

While the AI fundraising machine keeps breaking its own records, some founders are building in the other direction.

News 06-06 00:02 TC

The token bill comes due: Inside the industry scramble to manage AI’s runaway costs

"The whole conversation shifted from tokenmaxxing and 'go fast' to 'we need guardrails, how do we control this?'"

News 06-06 00:01 WD

Has Microsoft Lost Its Mojo (Again)?

Microsoft’s AI products aren’t selling and GitHub’s been plagued with troubles. WIRED spoke with VP Scott Hanselman abou

News 06-06 00:00 ARS

The Fitbit Air is a good wearable weighed down by a chatty AI "coach"

The Air succeeds as a minimalist, reliable fitness tracker, but Google's AI Health Coach feels unnecessary.

News 06-05 22:01 MIT

The Download: AI hacking beyond Mythos, and chatbots’ impact on our brains

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going o

News 06-05 22:00 TC

AirTrunk commits $30B to build 5GW of AI data centers in India

The Australian data center operator plans to set up 5GW of capacity in India.

News 06-05 20:02 MIT

The Meta hack shows there’s more to AI security than Mythos

On June 5, 404 Media reported that attackers had been using Meta’s AI customer support agent to steal Instagram accounts

News 06-05 20:01 WD

Why Apple Might Put Cameras Into Its Next AirPods

From battery life to privacy, there are many hurdles to the idea taking off.

Reviews

9 Models Tie at 77.5 on Main Leaderboard; Code Execution Hits Full Score but Material Constraint Only 50

The results of the Smoke Lite evaluation on June 5, 2026, show that 9 out of 11 models tied at 77.5 on the main leaderbo

Smoke Quick Test: 文心一言 4.5 and Grok 4 Tie at 99.24, GPT-5.5's Execution Score Only 50

Smoke's quick test results today clearly show that the code execution dimension is nearly saturated. Ten out of eleven m

Grok 4 Surges 10.8 Points to Dominate, Qwen3 Max Plunges 10.8 Points – Major Shuffle in WDCD Cycle

Run #141 data shows that Grok 4 improved by 10.8 points in a single round, GPT-5.5 improved by 9.2 points, while Qwen3 M

WDCD Compliance

#1 Claude Opus 4.7 70 · #2 GPT-5.5 70 · #3 GPT-o3 70 · #4 Claude Sonnet 4.6 67.5 · #5 Gemini 2.5 Pro 67.5 · #6 豆包 Pro 62.5 · #7 Gemini 3.1 Pro 62.5

Research Lab

WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top

WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a

Three-Model Translation Showdown: Week 23 Quality Evaluation, GPT-o3 Leads with a Score of 9

This week, 270 translation tasks were completed by 3 models. Two samples were selected for multi-mod

WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%

WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding
