Research Reports - Winzheng Research Lab

WDCD Run #233: GPT-o3 Leads with Zero Instruction Decay, Gemini 3.1 Pro Collapses Completely

WDCD Run #233 (2026-07-15) evaluated 11 frontier models on multi-turn commitment integrity, recording an average instruction decay of 27.3% between Round 1 and Round 3. GPT-o3 topped the leaderboard with 94 points and zero decay, while Gemini 3.1 Pro collapsed completely (100% decay).

Research Lab

3-Model Translation Showdown: Week 29 Quality Evaluation, gpt-o3 Leads with 9 Points

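This week, 361 articles were translated across 3 AI models. In a sampled blind evaluation, gpt-o3 earned the highest overall score (9/10). The report compares the models in detail on accuracy, fluency, and terminology consistency.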

Research Lab

WDCD Run #227: Grok 4 and DeepSeek V4 Pro Tie at 91.4 as Instruction Decay Averages -2.8% Across 11 Models

WDCD Run #227 (2026-07-12) evaluated 11 frontier models on multi-turn commitment integrity, with Grok 4 and DeepSeek V4 Pro tying at 91.4 points and average instruction decay measured at -2.8% between Round 1 and Round 3.

Research Lab

WDCD Run #221: Average Instruction Decay Hits -36.4% as Grok 4 Leads 11-Model Field

WDCD Run #221 (2026-07-08) measured instruction decay across 11 frontier models over three dialogue rounds, recording an average commitment decay of -36.4% from Round 1 to Round 3. Grok 4 topped the ranking with 95 points.

Research Lab

4-Model Translation Showdown: Week 28 Quality Evaluation, gpt-o3 Leads with 9 Points

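This week, 318 articles were translated across 4 AI models. In a sampled blind evaluation, gpt-o3 earned the highest overall score (9/10). The report compares the models in detail on accuracy, fluency, and terminology consistency.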

Research Lab

WDCD Run #211: Grok 4 Leads with Just -13% Instruction Decay as GPT-o3 Collapses at -75%

WDCD Run #211 (2026-07-03) benchmarked 11 models on multi-turn commitment integrity, with Grok 4 taking the top spot at 91.2 points and only -13% decay, while GPT-o3 posted the worst decay rate at -75%.

Research Lab

WDCD Run #207: Average Instruction Decay Hits -66.3% Across 11 Models, Grok 4 Leads Field

WDCD Run #207 (2026-07-01) measured multi-turn commitment across 11 frontier models, recording an average commitment decay of -66.3% from Round 1 to Round 3. Grok 4 took the top score at 100 points, while 豆包 Pro showed the strongest decay resistance.

Research Lab

4-Model Translation Showdown: Week 27 Quality Evaluation, claude-sonnet-4.6 Leads with 9 Points

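This week, 376 articles were translated across 4 AI models. In a sampled blind evaluation, claude-sonnet-4.6 earned the highest overall score (9/10). The report compares the models in detail on accuracy, fluency, and terminology consistency.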

Research Lab

WDCD Run #202: Average Instruction Decay Hits -73.2% Across 11 Models, Gemini 3.1 Pro Leads

WDCD Run #202 (2026-06-28) measured multi-turn commitment integrity across 11 frontier models, recording an average instruction decay of -73.2% between Round 1 and Round 3. Gemini 3.1 Pro topped the leaderboard at 93.6 points.

Research Lab

WDCD Run #196: Average Instruction Decay Hits -39.9%, Qwen3 Max Leads Despite -90% Drop

WDCD Run #196 (2026-06-24) tested 11 leading models across three dialogue rounds, recording an average commitment decay of -39.9% from Round 1 to Round 3. Qwen3 Max topped the leaderboard at 92.5 points despite posting a -90% decay.

Research Lab

4-Model Translation Showdown: Week 26 Quality Evaluation, claude-sonnet-4.6 Leads with 9 Points

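This week, 393 articles were translated across 4 AI models. In a sampled blind evaluation, claude-sonnet-4.6 earned the highest overall score (9/10). The report compares the models in detail on accuracy, fluency, and terminology consistency.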

Research Lab

WDCD Run #185: Average Instruction Decay Hits -57.5% Across 11 Models, Qwen3 Max Leads at 92.5 Points

WDCD Run #185 (2026-06-17) measured multi-turn commitment across 11 models, recording an average instruction decay of -57.5% from Round 1 to Round 3. Qwen3 Max topped the run at 92.5 points, while 文心一言 4.5 showed the strongest decay resistance.