YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 randomly sampled questions · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Compliance Testing · No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average) Grok 3
Biggest Rise This Week ERNIE Bot 4.0 +15
Latest Benchmark 2026-04-27 SGT
judge v6
11 Models Tested
212 Test Questions
30 DCD Scenarios (5 categories x 6 questions)
Weekly Auto-evaluation

Don't just look at the overall score — consider your use case

Top Pick: Doubao Pro (92.2 pts)
Runner-up: Gemini 2.5 Pro (89.4 pts)
Third Choice: grok-3 (88.9 pts)

Top Pick: Gemini 2.5 Pro (47.2 pts)
Runner-up: claude-opus-4.6 (46.3 pts)
Third Choice: Doubao Pro (46.3 pts)

Top Pick: grok-3 (84.4 pts)
Runner-up: Claude Sonnet 4.6 (81.1 pts)
Third Choice: claude-opus-4.6 (79.7 pts)

Top Pick: deepseek-v3 (99.7 pts)
Runner-up: ernie-4 (98.5 pts)
Third Choice: Doubao Pro (93 pts)

Top Pick: Doubao Pro (38.9 pts)
Runner-up: Gemini 2.5 Pro (36.6 pts)
Third Choice: claude-opus-4.6 (36.6 pts)

Top Pick: claude-opus-4.6 (0 pts)
Runner-up: Claude Sonnet 4.6 (0 pts)
Third Choice: deepseek-r1 (0 pts)

Qwen3 Max: 70 pts
GPT-5.5: 68.3 pts
Claude Opus 4.7: 66.7 pts

View Full Recommendations by Use Case

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

Review
5 Reasons: Commitment Capability Will Become the Next Core Indicator of AI Models, Disrupting Selection Rules!
As AI model capabilities converge, commitment ability—how reliably a model keeps its promises—is emerging as the next core indicator, reshaping enterprise selection and forcing vendors to prioritize compliance and controllability.
Review
We Tested 11 AI Models on 30 Integrity Tasks — Honesty Rate Plummets to 55%!
A rigorous test by Winzheng (winzheng.com) challenged 11 mainstream AI models with 30 carefully designed integrity tasks. The average honesty rate was just 60.4%, with the lowest dropping to 55%, raising serious concerns about AI reliability.
Review
Exposing the 5 Great Deceptions of AI Rankings: 99% Untrustworthy, How YZ Index Revolutionizes Evaluation?
Many AI rankings are unreliable due to self-evaluation, fake code tests, single-run rankings, and sponsor influence. YZ Index from Winzheng disrupts this with rigorous methods like sandboxed execution, rolling averages, and zero-AI judging.
Review
AI Suppliers Hard to Tell Apart: WDCD Guardrail Test Exposes Scores of 11 Major Models, Avoiding Data Breach Minefields
As a CTO or CIO, you may lose sleep over AI suppliers' promises. They guarantee data isolation, but will their models leak user data under pressure? This is not science fiction; it is a real risk. The WDCD Guardrail Test simulates high-pressure scenarios to check whether models break their promises. Stop trusting hype blindly: see the real scores and avoid data disasters.
Review
5 Tips: Leverage YZ Index Open Data to Lead AI Technology Selection and Save 20% R&D Costs!
By utilizing the weekly updated YZ Index open data from Winzheng (winzheng.com), developers can make data-driven decisions to compare model performance, avoid pitfalls, and save up to 20% in R&D costs. This professional AI model evaluation index covers hundreds of popular models across dimensions like performance, efficiency, cost, and stability.
Review
Winzheng Homepage Upgrade! 5 Features Transform It into an AI Intelligence Terminal, Outpacing Industry News
Winzheng (winzheng.com) has upgraded its homepage from a simple product showcase into an AI intelligence terminal, featuring a Bloomberg-style real-time dashboard, AI-powered smart search, curated headline news feeds, a data trust wall, and embedded widgets for sharing YZ Index rankings. The redesign aims to deliver trusted, real-time, data-driven insights, helping users stay ahead in the fast-evolving AI landscape.
Review
AI Model Showdown: 5-Dimensional Radar Chart – Claude Opus 4.7 vs GPT-5.5, Who Will Prevail?
This article compares Claude Opus 4.7 and GPT-5.5 using the YZ Index AI model comparison tool from Winzheng, providing data-driven insights across five dimensions with radar charts, bar charts, API pricing, and scenario recommendations.
Review
Grok 3 Unexpectedly Tops the Charts with 86.88 Points! Which AI Models Are Rising and Which Are Declining This Week?
This week, Grok 3 shockingly tops the YZ Index with a score of 86.88, edging out Doubao Pro by just 0.44 points. Dive into the analysis of which models are surging and which are slipping.
Review
Unveiling the WDCD Commitment Test: 3 Rounds, 30 Questions Targeting AI’s “Breach of Trust” Pain Points, Disrupting the Evaluation Landscape!
The YZ Index WDCD Commitment Test, launched by Winzheng (winzheng.com), uses a 3-round, 30-question design to precisely dissect AI’s “credibility crisis.” It exposes the hidden danger of AI failing to honor its promises, urging enterprises to move beyond flashy benchmark scores and focus on true reliability.

Not all AI news is worth reading. What matters is what changes your judgment.

View All News

Why This Leaderboard Is Worth Your Attention

Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
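
As a rough illustration of the pass-or-zero rule, here is a minimal sketch. It assumes the model's answer and the test cases are both Python source, and it uses a separate interpreter process with a timeout as a stand-in for a real sandbox. The function name `score_submission` and the time limit are hypothetical and are not taken from the YZ Index pipeline.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def score_submission(code: str, tests: str, timeout_s: float = 5.0) -> int:
    """Return 1 if the submitted code passes its tests, otherwise 0.

    Assumption: a child process with a timeout only approximates a sandbox.
    A production setup would add container or VM isolation on top.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "submission.py"
        # The tests are plain assert statements appended after the model's code.
        script.write_text(code + "\n\n" + tests, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0  # Code that hangs scores zero.
        # Any crash or failed assertion gives a non-zero exit code.
        return 1 if result.returncode == 0 else 0
```

A call such as `score_submission("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` would count as a pass, while the same tests against a wrong `add` would score zero.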
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
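
One simple way to picture a citation check is sketched below, assuming citations are verbatim quotes and that "tracing back" means the quote appears in the source after whitespace and case normalization. The helper names `verify_citations` and `citation_accuracy` are hypothetical, and the actual YZ Index check may be stricter or work differently.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(citations: list[str], source: str) -> dict[str, bool]:
    """Map each quoted citation to whether it appears in the source document."""
    haystack = normalize(source)
    results = {}
    for quote in citations:
        needle = normalize(quote)
        # An empty quote never counts as verified.
        results[quote] = bool(needle) and needle in haystack
    return results


def citation_accuracy(citations: list[str], source: str) -> float:
    """Fraction of citations found in the source (0.0 when there are none)."""
    if not citations:
        return 0.0
    results = verify_citations(citations, source)
    return sum(results[quote] for quote in citations) / len(citations)
```

Exact matching is deliberately conservative: a paraphrased or invented quote fails even if the answer built on it happens to look right.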
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
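
The idea of ranking on a rolling average can be sketched as follows. The window of four weekly runs, the function name `rolling_average_ranking`, and the model names in the usage note are illustrative assumptions; the page does not state the actual window size or formula.

```python
from collections import defaultdict


def rolling_average_ranking(
    weekly_scores: list[dict[str, float]], window: int = 4
) -> list[tuple[str, float]]:
    """Rank models by their mean score over the most recent `window` runs.

    `weekly_scores` is ordered oldest to newest; each entry maps a model
    name to its score for that run. The window size is an assumption.
    """
    recent = weekly_scores[-window:]
    history: dict[str, list[float]] = defaultdict(list)
    for week in recent:
        for model, score in week.items():
            history[model].append(score)
    # Average only over the runs in which a model actually appeared.
    averages = {m: sum(s) / len(s) for m, s in history.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

With hypothetical data such as `[{"model-a": 80.0, "model-b": 90.0}, {"model-a": 90.0, "model-b": 70.0}, {"model-a": 85.0, "model-b": 75.0}]` and `window=2`, one strong early week for `model-b` no longer decides the ranking, because only the last two runs are averaged.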
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

View Methodology

The AI world changes daily — you need a reliable source

3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.

  • Daily Picks: from the flood of AI news, we pick the 3 that truly matter
  • YZ Index Weekly: who's up and who's down, all in one email
  • Model Incident Alerts: when a model you use has an issue, know immediately
  • Price Change Notifications: hear about API price changes before the bill arrives
Free | No Ads | No Sponsored Content | Unsubscribe Anytime

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab