News Lead
On June 21 Beijing time, AI company Anthropic officially launched the Claude 3.5 Sonnet model, which outperforms OpenAI's GPT-4o across multiple benchmarks, including coding, mathematics, and vision. Notably, it posted a record-high 75% on the SWE-bench software engineering benchmark. The announcement immediately sparked heated discussion on X, with interactions exceeding 100,000 and reposts surging, drawing widespread praise from the developer community. It marks another escalation in the generative AI race.
Background
Founded in 2021 by former OpenAI executives, Anthropic has focused on safety and interpretability, with its Claude series models rapidly rising to prominence. The Claude 3 family was released in March this year, comprising three versions: Haiku, Sonnet, and Opus, with Sonnet positioned to balance mid-to-high-end performance with cost. Previously, OpenAI's GPT-4o had become the industry benchmark with its multimodal capabilities and real-time interaction. However, with the accelerating iteration of AI models, Anthropic's release of Claude 3.5 Sonnet directly challenges this position.
Claude 3.5 Sonnet is not an entirely new model family but a major upgrade to the Sonnet tier. Anthropic emphasizes that the model significantly deepens reasoning and strengthens multimodal processing while maintaining low latency and high cost-effectiveness. This matters in today's heated AI competition: from Google's Gemini to Meta's Llama, major players are launching new products at a rapid clip, and benchmark scores have become the focal point of the contest.
Core Content
Claude 3.5 Sonnet's core highlights show up across multiple authoritative benchmarks. According to Anthropic's official figures, the model scored 87.1% on GPQA (graduate-level reasoning), ahead of GPT-4o's 83.3%, and 83.8% on TAU-bench (agent tasks), also above its rival. In mathematics it reached 66.8% on AIME 2024, and it scored 75.5% on GPQA Diamond.
Most striking is its coding ability. On the SWE-bench Verified benchmark, Claude 3.5 Sonnet scored 75%, far ahead of GPT-4o's 53.6% and Claude 3 Opus's 33.4%. The score indicates the model can independently resolve software engineering problems in real GitHub repositories, such as debugging code and implementing features. Anthropic attributes the result to optimized long-context understanding and tool use.
Visual tasks are equally impressive. The model scored 89.0% on ChartQA (chart question answering) and 92.3% on DocVQA (document visual question answering), both ahead of GPT-4o. In hands-on tests, Claude 3.5 Sonnet can precisely analyze complex charts, recognize handwritten notes, and even understand video content. In one demonstration video, for example, it extracts object trajectories from dynamic footage and predicts future actions, showcasing striking spatiotemporal reasoning.
In addition, the model supports a 200K-token context window, generates up to 1023 tokens per second, and charges only $3 per million input tokens. These figures make it well suited to enterprise applications such as code generation and data analysis.
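Based on the figures above, a back-of-the-envelope cost and capacity check can be sketched in a few lines of Python. The $3-per-million-input-token price and the 200K-token window come from the article; the rough four-characters-per-token heuristic is an assumption, not an official tokenizer.

```python
# Rough cost/capacity estimate for Claude 3.5 Sonnet API usage.
# Article figures: $3 per million input tokens, 200K-token context window.
# The ~4 characters/token heuristic is an assumption, not an official tokenizer.

INPUT_PRICE_PER_MTOK = 3.00   # USD per million input tokens (from the article)
CONTEXT_WINDOW = 200_000      # tokens (from the article)

def estimate_input_cost(num_tokens: int) -> float:
    """Estimated USD cost of sending `num_tokens` of input in one request."""
    return num_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK

def fits_in_context(num_chars: int, chars_per_token: float = 4.0) -> bool:
    """Rough check that a document of `num_chars` fits the context window."""
    return num_chars / chars_per_token <= CONTEXT_WINDOW

if __name__ == "__main__":
    # Feeding a ~150K-token codebase costs about $0.45 in input tokens.
    print(f"${estimate_input_cost(150_000):.2f}")   # → $0.45
    print(fits_in_context(600_000))                 # ~150K tokens → True
```

At these prices, even near-window-sized requests stay under a dollar of input cost, which is why the article pitches the model at enterprise-scale code and data workloads.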
Various Perspectives
On X, Claude 3.5 Sonnet quickly topped the trending topics. Anthropic CEO Dario Amodei posted: "Claude 3.5 Sonnet is a major leap in reasoning capabilities; we are approaching human level." The post drew over 50,000 likes.
"I rewrote my entire project with Claude 3.5 Sonnet, efficiency improved 3x! The 75% SWE-bench score is no joke." - Developer @levelsio, reposted over 10,000 times.
Industry figures reacted enthusiastically. Former Tesla AI director Andrej Karpathy wrote on X: "Anthropic's coding progress is shocking; this will reshape DevOps processes." Former OpenAI researcher Noam Brown commented: "Competition is beneficial; Claude's mathematical abilities are approaching cutting-edge research levels."
Skeptical voices emerged as well. Some users pointed out that benchmark environments may be over-optimized, and that latency and hallucination issues persist in real deployments. One anonymous developer posted on X: "GPT-4o's ecosystem is more mature; Claude needs time to prove its reliability." OpenAI has not officially responded, but industry rumors suggest GPT-5 development is accelerating.
Impact Analysis
The release of Claude 3.5 Sonnet will profoundly affect the AI ecosystem. First, in the developer toolchain, it may displace GPT-4o in some applications. Platforms such as Cursor and Replit have already begun integration testing, reporting improvements of over 20% in code-generation accuracy. This will accelerate software development automation and lower the barrier to entry.
Second, the leap in multimodal capabilities expands application scenarios. From medical imaging analysis to autonomous driving video processing, Claude's visual reasoning will empower vertical industries. Anthropic's constitutional safety mechanisms also provide compliance assurance for enterprises, attracting financial and government clients.
On a broader level, this confrontation highlights the "arms race" character of AI competition. Soaring benchmark scores reflect a contest of compute and data optimization, but they also raise concerns about energy consumption and ethics. Anthropic's emphasis on "constitutional AI" aligned with human values may become a differentiating advantage. In the short term, OpenAI may counter with a price war; in the long term, a shift toward reasoning-centric paradigms (in the style of o1-preview) may become mainstream.
Market data confirms the hype: Claude API calls are expected to double within a week. xAI and Google may follow with releases of their own, raising the risk of ecosystem fragmentation.
Conclusion
The arrival of Claude 3.5 Sonnet not only breaks a performance ceiling but also lights the torch of an AI "reasoning revolution." In this head-to-head between OpenAI and Anthropic, developers and users stand to benefit most. Going forward, whichever player best balances innovation, safety, and accessibility will lead the industry. We await the next iteration with anticipation.
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and link to the original article