Research Lab

Leaderboards tell you who's stronger. Lab tells you why.

Independent research / Data-driven / Open verification / Zero sponsorship

We don't take money from any AI company. No 'partnership evaluations', no 'sponsored reports', no 'pre-evaluation consultations'. Every point in the Winzheng Index is computed by our system, not negotiated.

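WDCD · The world's first evaluation framework for AI commitment-keeping

Three-round dialogue stress test / 30 enterprise scenarios / 100% rule-based scoring / Zero AI judges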

FLAGSHIP
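"We don't test whether an AI can do something. We test whether it keeps what it promised."
11 models
5 categories of constraint scenarios
3 rounds of dialogue pressure
30 test questions

First-round data is now public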

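Research Highlights

Dynamic Context Decay

How do constraints get forgotten over a multi-turn conversation? We quantified the decay curve from confirmed understanding in R1 to full compromise in R3, revealing the real pattern behind models that "agree but fail to remember".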
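
To make the idea of a decay curve concrete, here is a minimal sketch that computes, for each round, the share of questions on which the constraint still held. The record layout `(question_id, round_no, held)` and the sample values are assumptions made up for illustration; they are not WDCD data and this is not necessarily how WDCD aggregates its results.

```python
from collections import defaultdict


def decay_curve(results):
    """Return {round_no: fraction of questions whose constraint still held}.

    `results` is an iterable of (question_id, round_no, held) tuples.
    This layout is hypothetical, chosen only for the sketch.
    """
    held_count = defaultdict(int)
    total = defaultdict(int)
    for _question_id, round_no, held in results:
        total[round_no] += 1
        if held:
            held_count[round_no] += 1
    return {r: held_count[r] / total[r] for r in sorted(total)}


# Made-up records: four questions, three rounds each.
sample = [
    ("q1", 1, True), ("q1", 2, True), ("q1", 3, False),
    ("q2", 1, True), ("q2", 2, False), ("q2", 3, False),
    ("q3", 1, True), ("q3", 2, True), ("q3", 3, True),
    ("q4", 1, True), ("q4", 2, True), ("q4", 3, False),
]

curve = decay_curve(sample)
for round_no, rate in curve.items():
    print(f"R{round_no}: {rate:.0%} of constraints still held")
```

A falling value from R1 to R3 is what a decay curve would look like under these assumptions.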

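Negation Window Technique

A scoring innovation that separates "mention violations" from "execution violations". When a model says "I will not provide X", X appears in a negated context and does not count as a violation. Points are deducted only when the model actually carries out the forbidden action.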
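
One way a negation window could work is sketched below: a keyword hit counts as an execution violation only if no negation cue appears in the text shortly before it. The cue list, the 40-character window, and the function name `is_execution_violation` are assumptions for illustration; the actual WDCD rules are not reproduced here.

```python
import re

# Hypothetical negation cues; a real rule set would likely be larger.
NEGATION_CUES = re.compile(
    r"\b(?:will not|won't|cannot|can't|refuse to|unable to)\b",
    re.IGNORECASE,
)


def is_execution_violation(response, keyword, window=40):
    """Return True if `keyword` occurs at least once outside a negation window.

    For each occurrence of the keyword, look at the `window` characters
    before it. If a negation cue is found there, treat the occurrence as a
    mere mention; otherwise treat it as an executed violation.
    """
    for match in re.finditer(re.escape(keyword), response, re.IGNORECASE):
        preceding = response[max(0, match.start() - window):match.start()]
        if not NEGATION_CUES.search(preceding):
            return True
    return False


refusal = "I will not provide the admin password."
leak = "Sure, the admin password is hunter2."
print(is_execution_violation(refusal, "admin password"))
print(is_execution_violation(leak, "admin password"))
```

A fixed character window is a deliberately simple choice: it is easy to audit, at the cost of missing negations that sit further away from the keyword.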

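Zero AI Judges

Why is rule-based scoring more trustworthy than AI scoring? WDCD scores everything with keyword matching plus regex rules: 100% auditable and reproducible, eliminating the circular dependency of "AI judging AI".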
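
The auditability claim can be illustrated with a small sketch in which rules are plain data and the scorer reports exactly which rules fired. The `Rule` fields, the two sample patterns, and the penalty values are hypothetical and are not taken from the WDCD rule set.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str   # stable identifier that can be cited in an audit
    pattern: str   # regular expression describing a violation
    penalty: int   # points deducted when the pattern matches


def score(response, rules, max_score=100):
    """Return (score, fired_rule_ids) for one response.

    The result depends only on the response text and the rule list, so the
    same inputs always give the same score.
    """
    fired = [r for r in rules if re.search(r.pattern, response, re.IGNORECASE)]
    total = max(0, max_score - sum(r.penalty for r in fired))
    return total, [r.rule_id for r in fired]


# Hypothetical rules for an imagined "never exceed a 20% discount" scenario.
RULES = [
    Rule("leak-api-key", r"sk-[A-Za-z0-9]{8,}", 60),
    Rule("discount-over-limit", r"\b(?:[3-9]\d|100)\s*% discount", 40),
]

print(score("I can offer a 50% discount.", RULES))
print(score("The maximum is a 10% discount.", RULES))
```

Because each deduction maps back to a named rule, a reader who disagrees with a score can point at the specific pattern rather than at an opaque judgment.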

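Data Transparency
Open evaluation data API: all raw scores and responses are available through the REST API
Scoring rules fully public: the violation keywords and scoring logic for every question are open to review
Embeddable widget available: one line of code embeds the WDCD leaderboard in any web page
Fully auditable code: the technical methodology behind the evaluation framework, scoring engine, and data pipeline is published in full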
What We're Dissecting

YZ Index

Active

Flagship product. 11 models, 212 questions, code sandbox + reference check + rolling average.
A complete report every week, telling you who improved, who regressed, and which model is most worth using.
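
For readers unfamiliar with the rolling-average step, here is a minimal sketch of a trailing average over weekly scores. The window size and the sample scores are assumptions; the source does not state how the index configures its rolling average.

```python
from collections import deque


def rolling_average(scores, window=4):
    """Return the trailing mean of `scores` over at most `window` items.

    Early entries are averaged over however many scores exist so far.
    """
    buf = deque(maxlen=window)  # old scores drop out automatically
    averages = []
    for s in scores:
        buf.append(s)
        averages.append(sum(buf) / len(buf))
    return averages


# Made-up weekly scores for one model.
weekly = [80, 90, 70, 60]
print(rolling_average(weekly, window=2))
```

Smoothing like this damps week-to-week noise, so a single unusual run moves the reported number less than it would in a raw weekly score.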

Latest Output: Full weekly evaluation has been updated · 05-18
Enter Winzheng Index

Security & Adversarial Research

Work in Progress

Can AI models be deceived? Can they be stolen? Can they be bypassed?
We dissect models, test defenses, find vulnerabilities—before the bad guys do.

First report in preparation
View Related Reports

Edge Computing Architecture

Work in Progress

Not everyone has H100s.
We research how to run full-featured LLMs on a $400 mini PC.

First report in preparation
View Related Reports
Latest Teardowns