Winzheng — AI Model Benchmarking · Change Intelligence

Anthropic’s Claude Fable 5 is a version of Mythos the public can access today

Anthropic is releasing Claude Fable 5, its first Mythos-class model available to the public. The model comes with guardrails that block responses in high-risk a

2026-06-10 06:02

Overall Top 5

#1 Grok 4 89.9 ▲11.5 · #2 Claude Opus 4.7 89 ▲10.2 · #3 豆包 Pro 88.8 ▲10 · #4 Claude Sonnet 4.6 87.2 ▲9.2 · #5 Gemini 2.5 Pro 86.4 ▲7.4 · #6 Qwen3 Max 86.2 ▲8.5 · #7 Gemini 3.1 Pro 84.8 ▲7.7 · #8 DeepSeek V4 Pro 83.3 ▲6.4 · #9 GPT-o3 82.8 ▲6.9 · #10 GPT-5.5 80.9 ▲2.7 · #11 文心一言 4.5 76.9 ▲15.2 · ▲ Qwen3 Max +80.9 · ▼ DeepSeek V3 -75.1

Latest News

News 06-10 10:01 TC

How Justin Ernest invested nearly $400M into hot startups without a traditional VC fund

Instead of spending a year raising a formal venture fund, the Sabertooth VC founder used a captive network of LPs to inv

News 06-10 10:00 TC

Google just fired a warning shot in the AI subscription price wars

Google just made it significantly cheaper to enjoy its budget AI subscription tier.

News 06-10 08:01 TC

How Justin Ernest invested nearly $400M into hot startups without a traditional VC fund

Instead of spending a year raising a formal venture fund, the Sabertooth VC founder used a captive network of LPs to inv

News 06-10 06:01 TC

Anthropic’s Fable 5 can make weirdly fun video games with the click of a button

Anthropic's Claude Fable 5 is going to be a big hit with the web's vibe coders.

News 06-10 06:00 TC

Hey Siri, here’s what I actually want from AI

I'm desperate for a personal AI assistant, but do I really want to become the kind of person who can't function without

News 06-10 05:01 Winzheng Lab

WDCD Run #157: Average Instruction Decay Hits 47.7% Across 11 Models, Three-Way Tie at the Top

WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude Sonnet 4.6, Gemini 2.

Review 06-10 05:01

WDCD Compliance Test Shakes: 5 Models Plunge Up to 12.5 Points, Qwen3 Max Rallies

In the latest WDCD cycle compared to Run #146, five mainstream models experienced significant declines, with a maximum d

Review 06-10 05:01

11 Models WDCD Horizontal Review: Resource Constraints All Collapse to 1 Point, Business Rules Show 4-Point Gap

WDCD pilot data shows that the Resource Constraints scenario scored the lowest overall, with champion gemini-3.1-pro onl

Review 06-10 05:00

R3 Integrity Rate Plunges to 24.5%, 72 Crashes Reveal True Colors of 11 Models

The WDCD test's most striking finding is that while models perform well in R1 and R2 stages, their overall integrity rat

Review 06-10 05:00

67.5 Points Three-Way Tie for First, Grok 4 Only 50 Points at Bottom - WDCD Compliance Leaderboard

The first results of the WDCD Compliance Test are out, with three models tied for first at 67.50 points, while Grok 4 an

News 06-10 04:03 TC

WWDC 2026: Everything announced on Siri AI, iOS 27, Apple Intelligence, and more

Apple primarily made the case for an improved experience with its long-standing Siri assistant, which like most other an

News 06-10 04:02 TC

Can tech companies learn to love cheaper AI models?

If those same AI workloads can be handled by cheaper models without affecting quality, it would mean a massive shift in

Reviews

Review 06-10

WDCD Compliance Test Shakes: 5 Models Plunge Up to 12.5 Points, Qwen3 Max Rallies

In the latest WDCD cycle compared to Run #146, five mainstream models experienced significant declines, with a maximum d

Review 06-10

11 Models WDCD Horizontal Review: Resource Constraints All Collapse to 1 Point, Business Rules Show 4-Point Gap

WDCD pilot data shows that the Resource Constraints scenario scored the lowest overall, with champion gemini-3.1-pro onl

Review 06-10

R3 Integrity Rate Plunges to 24.5%, 72 Crashes Reveal True Colors of 11 Models

The WDCD test's most striking finding is that while models perform well in R1 and R2 stages, their overall integrity rat

WDCD Compliance

#1 Claude Sonnet 4.6 67.5 · #2 Gemini 2.5 Pro 67.5 · #3 Qwen3 Max 67.5 · #4 GPT-o3 65 · #5 Claude Opus 4.7 62.5 · #6 Gemini 3.1 Pro 60 · #7 GPT-5.5 57.5

Research Lab

WDCD Run #157: Average Instruction Decay Hits 47.7% Across 11 Models, Three-Way Tie at the Top

WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude S

3 Major Models Translation Showdown: Week 24 Quality Evaluation, passthrough Leads with a Score of 9

This week, 2425 translation tasks were completed by 3 models.

WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top

WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a

YZ Index — AI Model Benchmarks, News & Research