YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 randomly sampled questions · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Compliance Testing · No Vendor Sponsorship

Who to Use Right Now

#1 Overall (Rolling Average): Grok 3

Biggest Rise This Week: 文心一言 4.0 (+15)

Latest Benchmark: 2026-05-04 SGT · judge v6

Models Tested

Test Questions

DCD Scenarios: 5 categories x 6 questions

Auto-evaluation frequency: Weekly

#1 Grok 3 86.9 ─
#2 豆包 Pro 86.4 ▲ +1.3
#3 Gemini 2.5 Pro 84.3 ▲ +3.5
#4 Claude Sonnet 4.6 84.1 ▲ +7.3
#5 Claude Opus 4.6 83.4 ▲ +3.9

Incidents / Pricing

2 incidents

0 price changes

Don't just look at the overall score — consider your use case

Top Pick: 豆包 Pro (92.2 pts)
Runner-up: Gemini 2.5 Pro (89.4 pts)
Third Choice: grok-3 (88.9 pts)

Top Pick: Gemini 2.5 Pro (47.2 pts)
Runner-up: claude-opus-4.6 (46.3 pts)
Third Choice: 豆包 Pro (46.3 pts)

Top Pick: grok-3 (84.4 pts)
Runner-up: Claude Sonnet 4.6 (81.1 pts)
Third Choice: claude-opus-4.6 (79.7 pts)

Top Pick: deepseek-v3 (99.7 pts)
Runner-up: ernie-4 (98.5 pts)
Third Choice: 豆包 Pro (93 pts)

Top Pick: 豆包 Pro (38.9 pts)
Runner-up: Gemini 2.5 Pro (36.6 pts)
Third Choice: claude-opus-4.6 (36.6 pts)

Top Pick: claude-opus-4.6 (0 pts)
Runner-up: Claude Sonnet 4.6 (0 pts)
Third Choice: deepseek-r1 (0 pts)

Qwen3 Max: 65 pts
Claude Sonnet 4.6: 62.5 pts
DeepSeek V4 Pro: 62.5 pts

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

Review

WDCD Full Score Standard: "Ability to Refuse" Is Not Enough; Models Must Also Provide Alternatives

WDCD's full-score standard for R3 requires not only refusing violating requests but also providing safe alternatives. Data from Run #105 shows that no model achieved a full score, revealing that while some models can refuse, most fail to offer alternatives, underscoring the critical need for models to "hold the boundary and continue solving problems."

Review

WDCD and the Agent Era: A True Agent Is Not About Better Execution, But About Knowing When to Stop

The article argues that mature agents must know when to stop, not just execute. WDCD Run #105 data shows all models failed on Q239, highlighting the critical need for structured constraint checking before tool invocation.

Review

Winzheng Perspective: The More Useful the Model, the More It Needs Brakes

Data from WDCD Run #105 reveals a critical contradiction in the Agent era: as models become more capable, the consequences of their errors become more irreversible. The report uses extreme samples like Q239, Q223, and Q237 to quantify how even top models fail to respect constraints when acting as agents.

Review

WDCD Pressure Induction: Why "Boss Needs It Urgently" Can Break Large Models

Most enterprise AI incidents aren't triggered by blatant malicious instructions. Instead, phrases like "Boss needs it urgently," "The client is waiting," or "Just get a version running first" exploit workplace conversational pressure to bypass model safeguards. WDCD Run #105's R3 pressure induction test quantifies how common workplace language penetrates large models.

Review

WDCD Test: Long Context Is Not a Vault, but a Longer Scene of Forgetting

Long context is often treated as a cure-all for large models, but test results show that it does not keep rules enforced under pressure; it just gives forgetting more room. The "1→1→0" decay pattern across models shows that remembering a constraint does not guarantee acting on it once user pressure mounts.

News

Price Slasher Arrives! Google Gemini 3.1 Flash-Lite Is Officially GA: High-Frequency AI Agents at Only $0.25 per Million Tokens

Google has formally released Gemini 3.1 Flash-Lite as a generally available model targeting high-throughput, cost-sensitive agentic tasks. Priced at $0.25 per million tokens, it aims to drive efficiency in use cases such as translation and workflow automation.

News

OpenAI Launches GPT-Realtime-2: Real-time Voice Agent Achieves Thinking and Acting in Dialog, Pushing the Limits of Natural Interaction in Voice AI

OpenAI has officially launched GPT-Realtime-2, a model designed for real-time voice agents that can think and act during conversations, marking a significant leap in voice AI toward more natural and responsive interactions.

News

Musk Shares Tesla AI Photon Reconstruction Technology, Challenging Traditional RGB Vision Limitations

Elon Musk recently shared images on X comparing Tesla's AI photon counting reconstruction technology to the traditional RGB color model, highlighting the superior performance of Tesla's Full Self-Driving (FSD) system in low-light and high-glare conditions. This demonstration has sparked widespread interest and discussion about the future of AI vision in autonomous driving.

News

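Those AI Terms You Nod Along To: Time to Actually Understand Them

As AI technology advances rapidly, a flood of technical terms and internet slang has entered everyday conversation. Faced with words like "large model," "AGI," and "alignment," many people can only nod along while privately full of questions. This article systematically walks through the most essential current AI concepts, from Transformers to diffusion models and from reinforcement learning to prompt engineering, and adds industry background and in-depth analysis. After reading it, you will not only follow the AI crowd's jargon but also be able to discuss it confidently with friends.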

Not all AI news is worth reading. What matters is what changes your judgment.

Why This Leaderboard Is Worth Your Attention

Founded: 1998 · Continuously operating

Vendor Sponsors: none · Fully independent

Real Code Execution

Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
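
The pass-or-zero rule described above could be sketched as follows. This is only an illustration under stated assumptions: the function name `score_submission`, the 100-point scale, and the 10-second timeout are hypothetical, and a plain subprocess is not a real sandbox (a production setup would presumably add container or VM isolation). It is not YZ Index's actual harness.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def score_submission(code: str, test_code: str,
                     full_marks: float = 100.0,
                     timeout_s: float = 10.0) -> float:
    """Return full_marks if the model's code passes its tests, else 0.

    Hypothetical sketch: the code and its tests are written to a temporary
    file and run in a separate Python process. A crash, a failed assertion,
    or a timeout all count as "did not pass".
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "submission.py"
        script.write_text(code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # code that never finishes earns nothing
    # Exit code 0 means every assertion in test_code held.
    return full_marks if result.returncode == 0 else 0.0
```

Scoring on the process exit code rather than on the text of the answer is what separates "runs correctly" from "looks like it can code": there is no partial credit for plausible-looking output.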

Citation Verification

For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
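
One simple way to check that citations trace back to the source is to test whether each quoted passage actually appears in the document. The sketch below assumes citations arrive as plain quoted strings and uses exact matching after whitespace and case normalization; the names `normalize` and `citation_accuracy` are hypothetical, and the real check may be considerably more sophisticated.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_accuracy(citations: list[str], source: str) -> float:
    """Fraction of cited quotes found in the source document.

    Hypothetical sketch: a citation counts only if its normalized text is a
    substring of the normalized source. Empty quotes never count.
    """
    if not citations:
        return 0.0
    haystack = normalize(source)
    hits = 0
    for quote in citations:
        needle = normalize(quote)
        if needle and needle in haystack:
            hits += 1
    return hits / len(citations)
```

An answer can therefore look right and still score poorly: a quote the model invented will not be found in the source, however plausible it reads.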

Statistical Rankings

We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
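
To illustrate why a rolling average damps luck-driven swings, here is a minimal sketch. The window of four runs, the function name `rolling_average_ranking`, and the sample scores are all assumptions made for illustration; the page does not state the actual window size or formula.

```python
from statistics import mean


def rolling_average_ranking(history: dict[str, list[float]],
                            window: int = 4) -> list[tuple[str, float]]:
    """Rank models by the mean of their most recent `window` run scores.

    Hypothetical sketch: `history` maps a model name to its scores in
    chronological order. Models with no runs are skipped.
    """
    averages = {
        model: mean(scores[-window:])
        for model, scores in history.items()
        if scores
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: model-b has one lucky run at the end.
history = {
    "model-a": [70.0, 90.0, 80.0, 80.0],
    "model-b": [60.0, 60.0, 60.0, 99.0],
}
ranking = rolling_average_ranking(history)
# ranking == [("model-a", 80.0), ("model-b", 69.75)]
```

On the latest run alone, model-b would lead with 99; averaged over the window, model-a stays ahead, which is the stabilizing effect the ranking relies on.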

No Sponsored Benchmarks

No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.
