Winzheng — AI Model Benchmarking · Change Intelligence · Selection Guide

Five things you need to know about AI

At SXSW London last week I gave a talk called “Five things you need to know about AI,” in which I shared what I think are the biggest themes in AI right now. I

2026-06-09 18:00

Overall Top 5

#1 Grok 4 89.9 ▲11.5 · #2 Claude Opus 4.7 89 ▲10.2 · #3 豆包 Pro 88.8 ▲10 · #4 Claude Sonnet 4.6 87.2 ▲9.2 · #5 Gemini 2.5 Pro 86.4 ▲7.4 · #6 Qwen3 Max 86.2 ▲8.5 · #7 Gemini 3.1 Pro 84.8 ▲7.7 · #8 DeepSeek V4 Pro 83.3 ▲6.4 · #9 GPT-o3 82.8 ▲6.9 · #10 GPT-5.5 80.9 ▲2.7 · #11 文心一言 4.5 76.9 ▲15.2 · ▲ Qwen3 Max +80.9 · ▼ DeepSeek V3 -75.1

Latest News

News 06-10 06:02 TC

Anthropic’s Claude Fable 5 is a version of Mythos the public can access today

Anthropic is releasing Claude Fable 5, its first Mythos-class model available to the public. The model comes with guardr

News 06-10 06:01 TC

Anthropic’s Fable 5 can make weirdly fun video games with the click of a button

Anthropic's Claude Fable 5 is going to be a big hit with the web's vibe coders.

News 06-10 06:00 TC

Hey Siri, here’s what I actually want from AI

I'm desperate for a personal AI assistant, but do I really want to become the kind of person who can't function without

News 06-10 05:01 Winzheng Lab

WDCD Run #157: Average Instruction Decay Hits 47.7% Across 11 Models, Three-Way Tie at the Top

WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude Sonnet 4.6, Gemini 2.

News 06-10 04:03 TC

WWDC 2026: Everything announced on Siri AI, iOS 27, Apple Intelligence, and more

Apple primarily made the case for an improved experience with its long-standing Siri assistant, which like most other an

News 06-10 04:02 TC

Can tech companies learn to love cheaper AI models?

If those same AI workloads can be handled by cheaper models without affecting quality, it would mean a massive shift in

News 06-10 04:01 ARS

Google announces Gemini 3.5 Live Translate for instant voice-to-voice translation

Voice translations preserve speaker's tone, pacing, pitch—with SynthID watermarks for security.

News 06-10 04:00 ARS

Anthropic says these topics are too dangerous to let its Fable 5 model talk about

New frontier model refuses cybersecurity, biology, and chemistry queries.

Review 06-10 03:10

Claude Sonnet 4.6 Leads with 97.53 Points, Material Constraints Drag 文心一言 40 Points Behind

Smoke's quick test today directly concludes that code execution has become the passing line, while material constraints

News 06-10 02:02 TC

It’s not FAANG anymore. It’s MANGOS.

With SpaceX, Anthropic, and OpenAI all eyeing massive public debuts, the tech industry may soon have a new class of corp

News 06-10 02:00 WD

Anthropic Offers Mythos Upgrade for Cyber Partners and a ‘Safe’ Version for the Rest of You

Anthropic is releasing Claude Mythos 5 to trusted organizations and Claude Fable 5 to the public, a version it says can’

Reviews

Smoke Daily: GPT-5.5 tops with 92.58 points, material constraint gap of 19 points decides the outcome

Smoke's latest data shows that code execution is no longer the dividing line, and material constraints have become the r

11 Models Answer the Same Blame-Shifting Problem: 8 Rank A>B>D>C, 3 Score 0 Points

11 mainstream models showed significant divergence on the same engineering judgment question: 8 models output A>B>D>C an

WDCD Compliance

#1 Claude Sonnet 4.6 67.5 #2 Gemini 2.5 Pro 67.5 #3 Qwen3 Max 67.5 #4 GPT-o3 65 #5 Claude Opus 4.7 62.5 #6 Gemini 3.1 Pro 60 #7 GPT-5.5 57.5

Research Lab

3 Major Models Translation Showdown: Week 24 Quality Evaluation, passthrough Leads with a Score of 9

This week, 2425 translation tasks were completed by 3 models.

WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top

WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a
