Research Reports - Winzheng Research Lab

3-Model Translation Showdown: Week 32 Quality Evaluation, deepseek-v4-pro Leads with 9 Points

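This week, 3 AI models translated a total of 423 articles. In a sampled blind review, deepseek-v4-pro earned the highest overall score (9/10). The report compares the models in detail on accuracy, fluency, and terminology consistency.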

WDCD Run #253: Grok 4 Leads with 94.8 Points as Average Instruction Decay Holds at 4.5%

WDCD Run #253 (2026-07-29) tested 11 models across three dialogue rounds, recording an average commitment decay of 4.5%. Grok 4 topped the ranking at 94.8 points, while GPT-5.5 registered the worst instruction decay at -100%.

Research Lab

3-Model Translation Showdown: Week 31 Quality Evaluation, gpt-o3 Leads with 8.3 Points

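This week, 3 AI models translated a total of 381 articles. In a sampled blind review, gpt-o3 earned the highest overall score (8.3/10). The report compares the models in detail on accuracy, fluency, and terminology consistency.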

WDCD Run #247: Grok 4 Leads with Negative Decay as Average Instruction Decay Narrows to -1.8%

WDCD Run #247 (2026-07-26) evaluated 11 models across three dialogue rounds, recording an average commitment decay of -1.8%. Grok 4 led the field at 94.2 points with a -63% decay figure, indicating strengthened rather than weakened adherence over the session.

WDCD Run #242: Grok 4 and GLM-4.6 Hold Zero Instruction Decay as Gemini 3.1 Pro Collapses at -100%

WDCD Run #242 (2026-07-22) evaluated 11 models across three-round multi-turn dialogues, recording an average commitment decay of 18.2% between Round 1 and Round 3. Grok 4 and GLM-4.6 held perfect decay resistance, while Gemini 3.1 Pro collapsed at -100%.

4-Model Translation Showdown: Week 30 Quality Evaluation, claude-sonnet-4.6 Leads with 8.5 Points

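This week, 4 AI models translated a total of 368 articles. In a sampled blind review, claude-sonnet-4.6 earned the highest overall score (8.5/10). The report compares the models in detail on accuracy, fluency, and terminology consistency.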

WDCD Run #233: GPT-o3 Leads with Zero Instruction Decay, Gemini 3.1 Pro Collapses Completely

WDCD Run #233 (2026-07-15) evaluated 11 frontier models on multi-turn commitment integrity, recording an average instruction decay of 27.3% between Round 1 and Round 3. GPT-o3 topped the leaderboard with 94 points and zero decay, while Gemini 3.1 Pro suffered a complete 100% collapse.

3-Model Translation Showdown: Week 29 Quality Evaluation, gpt-o3 Leads with 9 Points

WDCD Run #227: Grok 4 and DeepSeek V4 Pro Tie at 91.4 as Instruction Decay Averages -2.8% Across 11 Models

WDCD Run #221: Average Instruction Decay Hits -36.4% as Grok 4 Leads 11-Model Field

4-Model Translation Showdown: Week 28 Quality Evaluation, gpt-o3 Leads with 9 Points

WDCD Run #211: Grok 4 Leads with Just -13% Instruction Decay as GPT-o3 Collapses at -75%