YZ Index (23 articles)

R1 Answers Well, R3 Completely Collapses: 63% Defeat Rate Revealed in Commitment Decay Test of 11 Models

The WDCD three-round decay test reveals a sobering reality for technical decision-makers: the R1 confirmation rate is 95% and the R2 resistance rate is 91%, but the R3 integrity rate plummets to 29%. Of 330 R3 pressure tests, 209 ended in complete collapse (0 points), a breakdown rate of 63.3%. Models that confidently commit to constraints in the first round abandon them more than 60% of the time when pressured directly in the third round.

WDCD, Commitment Test, Model Decay
250

Unveiling the WDCD Commitment Test: 3 Rounds, 30 Questions Targeting AI’s “Breach of Trust” Pain Points, Disrupting the Evaluation Landscape!

The YZ Index WDCD Commitment Test, launched by Winzheng (winzheng.com), uses a 3-round, 30-question design to precisely dissect AI’s “credibility crisis.” It exposes the hidden danger of AI failing to honor its promises, urging enterprises to move beyond flashy benchmark scores and focus on true reliability.

AI Evaluation, YZ Index, WDCD Test
291

YZ Index Major Overhaul: 7 New Models Including GPT-5.5, Claude Opus 4.7, and DeepSeek V4 Launch Simultaneously as 9 Veterans Retire

On May 1, 2026, YZ Index completed its largest evaluation roster update since its launch last year, retiring 9 models and introducing 7 new flagships in a single sweep. This generational overhaul reflects the rapid pace of the AI industry, where an evaluation system now has to keep up with monthly rather than yearly iterations.

YZ Index, AI Evaluation, GPT-5
1,396

DeepSeek V4 Open-Source Model Released: 1.6 Trillion Parameters, Million-Token Context – Can It Overthrow Closed-Source Dominance?

On April 25, 2026, Chinese AI company DeepSeek officially open-sourced its V4 series of large models. The Pro version has 1.6 trillion parameters and supports a 1-million-token context window; the release also includes a low-compute Flash variant and a 75% API discount running until May 5, 2026. Winzheng.com's evaluation, based on the YZ Index v6 methodology, finds that V4 is the first open-source model to match closed-source leaders in key dimensions such as code execution and grounding, while offering better cost-effectiveness.

DeepSeek V4, Open-Source LLM, AI Product Review
1,452

YZ Index Weekly Report: Collective Leap in Task Expression Capabilities, Claude Series Pioneers Material Constraint Track

This week's YZ Index evaluation captures a rare synchronous improvement in the "task expression" dimension across 10 out of 11 mainstream AI models, while Claude Opus 4.6 uniquely breaks through in the "material constraint" dimension. The report analyzes these developments and offers developer selection advice for different application scenarios.

YZ Index, AI Evaluation
436

Engineering Judgment Test: Comparative Analysis of Database Deletion Recovery Solutions from 8 AI Models

In a database deletion recovery engineering judgment test, 8 mainstream AI models showed significant differences in understanding and response strategies. The models split into two distinct camps: 5 models scored 40 points by providing comprehensive solutions, while 3 models scored 0 by only addressing partial aspects of the problem.

YZ Index, Model Comparison, Engineering Judgment: Accidental Database Deletion Recovery
459