Four-Model Translation Showdown: Week 20 Quality Evaluation, claude-sonnet-4.6 Leads with 9 Points

This week, 215 translation tasks were completed by 4 models. In a blind multi-model comparison of 3 sampled articles, claude-sonnet-4.6 performed best overall with an average score of 9/10.

This Week’s Translation Statistics

| Model | Language | Translation Volume | Average Time | Average Quality Score |
| --- | --- | --- | --- | --- |
| deepseek-v4-flash | en | 45 | 31.8s | Not evaluated |
| claude-sonnet-4.6 | ja | 169 | 38.3s | Not evaluated |
| native-english | en | 1 | - | Not evaluated |
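
The volumes in the table can be checked against the 215 tasks reported for the week. The sketch below does that and also derives a volume-weighted average time. The weighted figure is not stated in the report, and the row with no recorded time is simply skipped, which is an assumption. The function names are illustrative only.

```python
# Rows transcribed from the statistics table:
# (model, language, volume, average seconds or None when the table shows "-").
rows = [
    ("deepseek-v4-flash", "en", 45, 31.8),
    ("claude-sonnet-4.6", "ja", 169, 38.3),
    ("native-english", "en", 1, None),
]


def total_volume(rows):
    """Sum the translation volume over all rows."""
    return sum(volume for _, _, volume, _ in rows)


def weighted_average_time(rows):
    """Volume-weighted mean time, ignoring rows without a recorded time."""
    timed = [(volume, seconds) for _, _, volume, seconds in rows if seconds is not None]
    timed_volume = sum(volume for volume, _ in timed)
    return sum(volume * seconds for volume, seconds in timed) / timed_volume


print(total_volume(rows))
print(round(weighted_average_time(rows), 2))
```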

Sample Comparison Evaluation

Evaluation 1: WDCD Stress Induction: Why "The Boss Needs It Urgently" Can Break Through Large Models

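| Model | Accuracy | Fluency | Terminology | Readability | Total Score |
| --- | --- | --- | --- | --- | --- |
| deepseek-v4-flash | 9 | 8 | 9 | 8 | 8 |
| deepseek-v4-pro | 9 | 9 | 9 | 9 | 9 |
| gpt-o3 | 6 | 8 | 8 | 8 | 7 |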

deepseek-v4-flash

✓ Biggest strength: When translating the effect of stress induction, it accurately captured the logic of the source text. For example, “They wrote UPDATE products SET price = price * 0.3—not 30% off, not 50% off, but 70% off” clearly explains the discount calculation error and improves comprehensibility.

✗ Biggest flaw: The title was translated as “WDCD pressure induced,” where “induced” should be “induction,” making the terminology imprecise and slightly awkward.

deepseek-v4-pro

✓ Biggest strength: The overall structure is smooth. The title “WDCD Pressure Induction: Why "Boss Urgently Needs" Can Break Through Large Models” is faithful to the source text, natural in translation, and avoids stiff phrasing.

✗ Biggest flaw: The content is truncated at “Why can the four words "client urgently needs" break through a numerical constraint?”, causing some information to be left untranslated and affecting completeness.

gpt-o3

✓ Biggest strength: When describing model failure, it uses “8 out of 11 models directly generated non-compliant SQL,” maintaining consistent terminology and highlighting the quantitative effect of the data.

✗ Biggest flaw: The subsection title “"The client urgently needs a 70% discount"” mistranslates the original 30% as 70%, distorting the core scenario of stress induction.

Conclusion: Version B (deepseek-v4-pro) is the best overall, with the highest accuracy and fluency. Version C (gpt-o3) contains a clear mistranslation and is not recommended. Versions A (deepseek-v4-flash) and B are similar, but B is more complete.
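
The SQL quoted in this evaluation multiplies the price by 0.3, which the translation describes as 70% off rather than 30% off. As a small worked check of that arithmetic, the sketch below converts between a price multiplier and the percent discount it implies. The function names are hypothetical and chosen only for illustration.

```python
def percent_off(multiplier: float) -> float:
    """Percent discount implied by setting price = price * multiplier."""
    return round((1 - multiplier) * 100, 6)


def multiplier_for(discount_percent: float) -> float:
    """Price multiplier that yields the given percent discount."""
    return round(1 - discount_percent / 100, 6)


# price * 0.3 keeps 30% of the price, so the discount is 70%.
print(percent_off(0.3))
# A genuine 30% discount would instead multiply the price by 0.7.
print(multiplier_for(30))
```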

Evaluation 2: Hantavirus Outbreak on a Cruise Ship: Key Information at a Glance

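| Model | Accuracy | Fluency | Terminology | Readability | Total Score |
| --- | --- | --- | --- | --- | --- |
| claude-sonnet-4.6 | 9 | 9 | 9 | 9 | 9 |
| deepseek-v4-pro | 8 | 8 | 8 | 7 | 8 |
| gpt-o3 | 9 | 9 | 9 | 7 | 8 |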

claude-sonnet-4.6

✓ Biggest strength: Strong terminology consistency. For example, “ハンタウイルス心肺症候群” accurately matches the professional term, maintaining consistency with the technical nature of the source text.

✗ Biggest flaw: Some sentences are slightly lengthy. For example, “これは異例の事件です。クルーズ船でのハンタウイルスの集団感染は極めて稀だからです” has good logical flow but could be more concise, causing slight reading fatigue.

deepseek-v4-pro

✓ Biggest strength: Good fluency. For example, “クルーズ船でのハンタウイルス発生は極めてまれであり、異常な出来事です” is natural and idiomatic, avoiding a stiff translation style.

✗ Biggest flaw: The text is incomplete. For example, the ending is truncated at “特にハンタウ,” resulting in missing paragraph structure and affecting the overall logical flow.

gpt-o3

✓ Biggest strength: High accuracy. For example, “ハンタウイルス心肺症候群へ進行した” faithfully conveys the meaning of symptom progression in the source text, with no additions or omissions.

✗ Biggest flaw: Readability is limited because the text is incomplete. For example, the ending is truncated at “今回ハンタウイルスが登場したことで、クルー,” so the paragraph logic is not fully presented.

Conclusion: The three versions are similar in overall quality. Version A (claude-sonnet-4.6) is slightly better in completeness and readability and is recommended as the first choice. Versions B (deepseek-v4-pro) and C (gpt-o3) are accurate, but truncation lowers their overall performance.

Evaluation 3: Perplexity AI Agent Desktop App Officially Arrives on Mac

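| Model | Accuracy | Fluency | Terminology | Readability | Total Score |
| --- | --- | --- | --- | --- | --- |
| claude-sonnet-4.6 | 9 | 8 | 9 | 9 | 9 |
| deepseek-v4-pro | 8 | 9 | 8 | 8 | 8 |
| gpt-o3 | 9 | 9 | 9 | 9 | 9 |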

claude-sonnet-4.6

✓ Biggest strength: It handles quoted passages naturally and smoothly. For example, “私たちは、AIが『問い-答え』ツールであるという限界を打破したいと考えています” faithfully conveys the intent of the source text without adding unnecessary explanation.

✗ Biggest flaw: Some sentences are slightly lengthy. For example, “このアプリは、私たちとコンピュータの対話方法を根本的に変えるものだ——単なる問答ボットではなく、文脈を理解し、複雑な操作を主体的に実行できるエージェントシステムである” causes a slight pause when reading.

deepseek-v4-pro

✓ Biggest strength: Consistent use of terminology. For example, “AIエージェント” is used consistently throughout, avoiding confusion, and is naturally integrated into sentences such as “AIエージェントアプリ「Personal Computer」,” enhancing the professional feel.

✗ Biggest flaw: Some sentence structures are slightly awkward. For example, “PerplexityのCEOであるAravind Srinivas氏はブログで次のように述べている:「私たちはAIを「質問-回答」ツールの限界を超えさせたいと考えている” uses quotation marks inconsistently, affecting fluency.

gpt-o3

✓ Biggest strength: Strong readability. The title is handled independently and attractively, for example “PerplexityのAIエージェント・デスクトップアプリがMacに正式登場,” making the overall structure clearer and helping readers quickly grasp the topic.

✗ Biggest flaw: Some parts are translated slightly too literally. For example, “私たちは、AIが『質問と回答』のツールにとどまる限界を打ち破りたいと考えています” has a mild translationese feel, affecting idiomatic naturalness.

Conclusion: The three versions are comparable in overall quality. Versions A (claude-sonnet-4.6) and C (gpt-o3) are slightly better in accuracy and readability and are suitable for official publication. If fluency is the priority, Version B (deepseek-v4-pro) can also be considered.
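
The headline average can be reproduced from the Total Score column of the three evaluation tables. The sketch below assumes the headline figure is a plain mean of each model's per-article totals, which the report does not state explicitly. The data structure and function name are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Total Score per model, transcribed from the three evaluation tables.
evaluations = [
    {"deepseek-v4-flash": 8, "deepseek-v4-pro": 9, "gpt-o3": 7},
    {"claude-sonnet-4.6": 9, "deepseek-v4-pro": 8, "gpt-o3": 8},
    {"claude-sonnet-4.6": 9, "deepseek-v4-pro": 8, "gpt-o3": 9},
]


def average_totals(evaluations):
    """Mean total score per model over the articles it was evaluated on."""
    scores = defaultdict(list)
    for table in evaluations:
        for model, total in table.items():
            scores[model].append(total)
    return {model: mean(values) for model, values in scores.items()}


averages = average_totals(evaluations)
# Print models from highest to lowest average total.
for model, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {avg:.2f}")
```

Note that the models were not all scored on the same articles (claude-sonnet-4.6 on two, deepseek-v4-flash on one), so these averages are not strictly like-for-like.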