Three-Model Translation Showdown: Week 21 Quality Evaluation, gpt-o3 Leads with 8.7/10

This week, three models completed 242 translation tasks. Three articles were sampled for a blind multi-model comparison, and gpt-o3 came out best overall with an average score of 8.7/10.

This Week's Translation Statistics

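| Model | Language | Translation Volume | Average Time | Average Quality Score |
| --- | --- | --- | --- | --- |
| deepseek-v4-flash | en | 57 | 27s | Not Rated |
| claude-sonnet-4.6 | ja | 182 | 36.5s | Not Rated |
| native-english | en | 2 | - | Not Rated |
| deepseek-v4-flash | zh | 1 | 8.8s | Not Rated |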
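
As a quick consistency check on the statistics above, the per-row volumes can be summed and grouped by model. This is only a sketch: the rows are transcribed by hand from the table, and the variable names are illustrative rather than part of any reporting tool.

```python
from collections import defaultdict

# Rows transcribed from the weekly statistics table: (model, language, volume).
rows = [
    ("deepseek-v4-flash", "en", 57),
    ("claude-sonnet-4.6", "ja", 182),
    ("native-english", "en", 2),
    ("deepseek-v4-flash", "zh", 1),
]

# Group volumes by model; deepseek-v4-flash appears under two target languages.
volume_by_model = defaultdict(int)
for model, _language, volume in rows:
    volume_by_model[model] += volume

total_tasks = sum(volume_by_model.values())

print(dict(volume_by_model))
print(total_tasks)
```

The four rows cover three distinct models and add up to the 242 tasks quoted in the summary.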

Sampled Comparative Evaluation

Evaluation 1: Cruise Ship Hantavirus Outbreak & Musk vs. Altman Week Two

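| Model | Accuracy | Fluency | Terminology | Readability | Total Score |
| --- | --- | --- | --- | --- | --- |
| claude-sonnet-4.6 | 8 | 9 | 9 | 8 | 8 |
| deepseek-v4-pro | 9 | 7 | 8 | 8 | 8 |
| gpt-o3 | 9 | 8 | 9 | 9 | 9 |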

claude-sonnet-4.6

✓ Best fluency, e.g., "The cruise ship is like a drifting 'virus petri dish'" is vivid and natural.

✗ The title departs from a literal translation, e.g., "Silent Threat" is an addition not present in the original.

deepseek-v4-pro

✓ Highest accuracy, mostly faithful to the original text with no obvious additions or omissions.

✗ Slightly lower fluency, e.g., "A fireless war is progressing" feels a bit stiff.

gpt-o3

✓ Best readability, with smooth paragraph transitions and clear logic, e.g., the policy section transitions naturally.

✗ Some expressions are slightly verbose, e.g., "being a closed environment" could be more concise.

Conclusion: Version C (gpt-o3) is best overall, balancing accuracy and readability; Version A (claude-sonnet-4.6) is fluent but paraphrases slightly; Version B (deepseek-v4-pro) is the most faithful but slightly stiff.

Evaluation 2: ChatGPT Enters Personal Finance: Can Connect Bank Accounts, View Full Financial Picture

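| Model | Accuracy | Fluency | Terminology | Readability | Total Score |
| --- | --- | --- | --- | --- | --- |
| claude-sonnet-4.6 | 9 | 7 | 8 | 8 | 8 |
| deepseek-v4-pro | 8 | 8 | 9 | 8 | 8 |
| gpt-o3 | 9 | 9 | 9 | 9 | 9 |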

claude-sonnet-4.6

✓ High accuracy: the sentence "Users can ask in natural language questions like 'How much did I spend on dining out this month?' or 'How are my investment returns?'" retains every example question from the original.

✗ Slightly weaker fluency: "Financial Butler" reads as stiff next to the more natural "Financial Concierge" used in the other versions.

deepseek-v4-pro

✓ Good terminology consistency: "AI Financial Management Assistant" stays consistent with the later "Financial Management", with no mixing of terms.

✗ Readability is average, e.g., the transition "However, the confidentiality of financial data also brings greater privacy challenges" feels slightly abrupt.

gpt-o3

✓ Best fluency and readability: the subtitle "From Dialogue to Financial Concierge" is translated naturally and appropriately, and the logical flow is clear.

✗ Some expressions are slightly conservative: "Personal Asset Management" appears frequently throughout the text and reads as somewhat repetitive.

Conclusion: Version C (gpt-o3) is best overall, leading on fluency and readability with consistent terminology, and is suitable for direct use; Version A (claude-sonnet-4.6) is strong in accuracy but lacks fluency; Version B (deepseek-v4-pro) is balanced but has no obvious highlights.

Evaluation 3: Who Still Trusts Sam Altman?

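| Model | Accuracy | Fluency | Terminology | Readability | Total Score |
| --- | --- | --- | --- | --- | --- |
| claude-sonnet-4.6 | 9 | 8 | 9 | 9 | 9 |
| deepseek-v4-pro | 8 | 7 | 8 | 7 | 7 |
| gpt-o3 | 9 | 9 | 8 | 8 | 8 |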

claude-sonnet-4.6

✓ Natural paragraph connections, e.g., the subtitle "Self-Defense in Court: An Honest and Trustworthy Merchant?" echoes the body content closely with clear logic.

✗ The ending is cut off abruptly at "Altman testified in court that OpenAI," which leaves the content incomplete and hurts overall readability.

deepseek-v4-pro

✓ Well-handled citations, e.g., "I believe I am an honest and trustworthy businessperson" closely matches the original tone.

✗ Some expressions are slightly stiff, e.g., "Cover-up work" carries more negative connotations than the original "opaque operations," slightly over-paraphrased.

gpt-o3

✓ The language is natural and fluent, e.g., "the communication was not straightforward" retains the original meaning while conforming to Japanese expression habits.

✗ Some sentences are slightly long, and logical transitions are less clear than in Version A (claude-sonnet-4.6), e.g., the long sentence in the second paragraph feels a bit cumbersome.

Conclusion: Version A (claude-sonnet-4.6) has the highest overall quality, excelling in structure, accuracy, and readability, but its truncated ending needs to be completed; Version C (gpt-o3) is next best, with natural language; Version B (deepseek-v4-pro) has minor terminology and fluency issues.
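
The headline 8.7/10 for gpt-o3 is consistent with a plain mean of its Total Score across the three evaluations. The sketch below assumes that this is how the weekly average is derived (the report does not state the formula); the scores are copied from the three tables above, and the variable names are illustrative.

```python
from statistics import mean

# Total Score per model, in order: Evaluation 1, Evaluation 2, Evaluation 3.
total_scores = {
    "claude-sonnet-4.6": [8, 8, 9],
    "deepseek-v4-pro": [8, 8, 7],
    "gpt-o3": [9, 9, 8],
}

# Assumed aggregation: unweighted mean over the sampled articles, rounded to one decimal.
averages = {model: round(mean(scores), 1) for model, scores in total_scores.items()}
best_model = max(averages, key=averages.get)

print(averages)
print(best_model)
```

Under this assumption gpt-o3 has the highest weekly average even though claude-sonnet-4.6 wins Evaluation 3.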