YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

View YZ Index · Subscribe to Weekly Changes

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Compliance Testing · No Vendor Sponsorship

Who to Use Right Now

#1 Overall (Rolling Average): Grok 3

Biggest Rise This Week: 文心一言 4.0 (+15)

Latest Benchmark: 2026-04-27 SGT

judge v6

Models Tested: 11

Test Questions: 212

DCD Scenarios: 5 categories x 6 questions

Auto-evaluation frequency: Weekly

#1 Grok 3 86.9 ─
#2 豆包 Pro 86.4 ▲ +1.3
#3 Gemini 2.5 Pro 84.3 ▲ +3.5
#4 Claude Sonnet 4.6 84.1 ▲ +7.3
#5 Claude Opus 4.6 83.4 ▲ +3.9

Incidents / Pricing

2 incidents

0 price changes

Don't just look at the overall score — consider your use case

Top Pick

豆包 Pro

92.2 pts

Runner-up

Gemini 2.5 Pro

89.4 pts

Third Choice

Grok 3

88.9 pts

Top Pick

Gemini 2.5 Pro

47.2 pts

Runner-up

Claude Opus 4.6

46.3 pts

Third Choice

豆包 Pro

46.3 pts

Top Pick

Grok 3

84.4 pts

Runner-up

Claude Sonnet 4.6

81.1 pts

Third Choice

Claude Opus 4.6

79.7 pts

Top Pick

deepseek-v3

99.7 pts

Runner-up

ernie-4

98.5 pts

Third Choice

豆包 Pro

93 pts

Top Pick

豆包 Pro

38.9 pts

Runner-up

Gemini 2.5 Pro

36.6 pts

Third Choice

Claude Opus 4.6

36.6 pts

Top Pick

Claude Opus 4.6

0 pts

Runner-up

Claude Sonnet 4.6

0 pts

Third Choice

deepseek-r1

0 pts

Qwen3 Max

70 pts

GPT-5.5

68.3 pts

Claude Opus 4.7

66.7 pts

View Full Recommendations by Use Case · View Full Compliance Rankings

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News

Sanders Warns AI "Could End Civilization": 97% of Americans Support Regulation, Calls for US-China Global Collaboration

In early 2025, U.S. Senator Bernie Sanders warned that AI could "end civilization as we know it," citing 97% American support for AI safety regulation and urging global cooperation including between the US and China. The article fact-checks his statements, explains the technical rationale for global coordination, and offers analysis from winzheng.com Research Lab.

News

Anthropic Publishes Anti-Sycophancy Research: Claude Opus 4.7 Halves Sycophancy Rate, Mythos Preview Makes Further Progress

Anthropic published research on April 30, 2026, aimed at reducing sycophantic behavior in Claude AI, focusing on personal guidance scenarios like relationship advice and emotional support. The study found that Claude Opus 4.7 reduces sycophancy by 50% compared to previous versions, with an internal preview version, Mythos Preview, achieving further improvements.

News

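Dark Money Campaign: Paid Influencers Portray Chinese AI as a Threat

A nonprofit called Build American AI, funded by a super PAC backed by executives at OpenAI and Andreessen Horowitz, is secretly financing a social media campaign. The campaign pays influencers to publish content that touts American AI superiority while playing up the "threat" of Chinese AI, in an attempt to sway public opinion and policy. The article exposes how this dark-money campaign operates, who is behind it, and how it could distort the US AI competitive landscape, and it examines the far-reaching implications for US-China tech rivalry.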

Review

5 Reasons: Commitment Capability Will Become the Next Core Indicator of AI Models, Disrupting Selection Rules!

As AI model capabilities converge, commitment ability—how reliably a model keeps its promises—is emerging as the next core indicator, reshaping enterprise selection and forcing vendors to prioritize compliance and controllability.

Review

We Tested 11 AI Models on 30 Integrity Tasks — Honesty Rate Plummets to 55%!

A rigorous test by Winzheng (winzheng.com) challenged 11 mainstream AI models with 30 carefully designed integrity tasks. The average honesty rate was just 60.4%, with the lowest dropping to 55%, raising serious concerns about AI reliability.

Review

Exposing the 5 Great Deceptions of AI Rankings: 99% Untrustworthy, How YZ Index Revolutionizes Evaluation?

Many AI rankings are unreliable due to self-evaluation, fake code tests, single-run rankings, and sponsor influence. YZ Index from Winzheng disrupts this with rigorous methods like sandboxed execution, rolling averages, and zero-AI judging.

Review

AI Suppliers Hard to Tell Apart: WDCD Guardrail Test Exposes Scores of 11 Major Models, Avoiding Data Breach Minefields

As a CTO or CIO, you may lose sleep over AI suppliers' promises. They guarantee data isolation on paper, but does the model leak user data under pressure? This is not sci-fi but a real risk. The WDCD Guardrail Test simulates high-pressure scenarios to check whether models break their promises. Stop blindly trusting hype: see the real scores and avoid data disasters.

Review

5 Tips: Leverage YZ Index Open Data to Lead AI Technology Selection and Save 20% R&D Costs!

By utilizing the weekly updated YZ Index open data from Winzheng (winzheng.com), developers can make data-driven decisions to compare model performance, avoid pitfalls, and save up to 20% in R&D costs. This professional AI model evaluation index covers hundreds of popular models across dimensions like performance, efficiency, cost, and stability.

Review

Winzheng Homepage Upgrade! 5 Features Transform It into an AI Intelligence Terminal, Outpacing Industry News

Winzheng (winzheng.com) has upgraded its homepage from a simple product showcase into an AI intelligence terminal, featuring a Bloomberg-style real-time dashboard, AI-powered smart search, curated headline news feeds, a data trust wall, and embedded widgets for sharing YZ Index rankings. The redesign aims to deliver trusted, real-time, data-driven insights, helping users stay ahead in the fast-evolving AI landscape.

Not all AI news is worth reading. What matters is what changes your judgment. View All News

Why This Leaderboard Is Worth Your Attention

Founded: 1998 (continuously operating)

Vendor sponsors: none (fully independent)

Real Code Execution

Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
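
The pass-or-zero rule can be illustrated with a minimal sketch. It is not the YZ Index harness: the function name `score_submission`, the 5-second timeout, and the use of a plain subprocess are assumptions made for illustration. A child process only isolates crashes and hangs, so a production sandbox would need much stronger isolation (containers, resource limits, no network).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def score_submission(code: str, test_code: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if the tests pass against the model's code, else 0.0.

    Hypothetical helper: the generated code and its tests are written to a
    temporary file and run in a separate Python process.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Code that hangs is treated the same as code that fails.
            return 0.0
    # Any non-zero exit (failed assertion, exception, syntax error) scores zero.
    return 1.0 if result.returncode == 0 else 0.0
```

There is no partial credit in this sketch: code that merely looks plausible but raises an error or fails an assertion gets the same score as no answer at all.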

Citation Verification

For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
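
One simple way to check that a citation traces back to the source is to test whether each quoted passage actually appears in the document. The sketch below assumes exact-quote matching after whitespace and case normalization. The function names are hypothetical, and the real check may be more sophisticated (fuzzy matching, page or section anchors).

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(quotes: list[str], source: str) -> list[bool]:
    """For each quoted passage, report whether it occurs in the source."""
    haystack = _normalize(source)
    results = []
    for quote in quotes:
        needle = _normalize(quote)
        # An empty quote proves nothing, so it never counts as verified.
        results.append(bool(needle) and needle in haystack)
    return results


def citation_accuracy(quotes: list[str], source: str) -> float:
    """Fraction of quotes that trace back to the source (0.0 if none given)."""
    if not quotes:
        return 0.0
    return sum(verify_citations(quotes, source)) / len(quotes)
```

Under this scheme an answer that reads well but cites text absent from the document is penalized, which is the point of checking the trace and not only the conclusion.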

Statistical Rankings

We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
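
A short sketch shows how a rolling average dampens a single lucky run. The window of four weeks and the model names are assumptions for illustration; the page does not state the actual window length or weighting.

```python
from statistics import mean


def rolling_average(scores: list[float], window: int = 4) -> float:
    """Mean of the most recent `window` weekly scores."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if not scores:
        raise ValueError("at least one score is required")
    return mean(scores[-window:])


def rank_models(history: dict[str, list[float]], window: int = 4) -> list[tuple[str, float]]:
    """Rank models by rolling average, highest first."""
    averaged = {name: rolling_average(s, window) for name, s in history.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weekly scores: model_a has one outlier week, model_b is steady.
history = {
    "model_a": [70.0, 72.0, 71.0, 95.0],
    "model_b": [80.0, 81.0, 79.0, 80.0],
}
ranking = rank_models(history)
```

With these made-up numbers, `model_b` ranks first on the four-week average even though `model_a` scored highest in the latest week. Ranking on the latest week alone (`window=1`) would flip the order.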

No Sponsored Benchmarks

No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

View Methodology

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab
