Claude 3.5 Sonnet Scores Over 90% on SWE-bench as AI Programming Capability Approaches Human Level

Feb 4, 2026 · 558 · approx. 6 min · Grok/X

Claude 3.5 · Anthropic · SWE-bench · Coding AI · AI Technology Breakthrough

Amid rapid progress in AI models, Anthropic's Claude 3.5 Sonnet has again drawn the attention of the tech world. The model scored over 90% on SWE-bench, a software engineering benchmark, a result that points to a new stage in AI coding capability. The score has prompted widespread discussion and a wave of practical projects in the developer community.

Background: Evolution from Claude 3 to 3.5

Anthropic, a company focused on AI safety research, launched the Claude 3 series in March 2024 and drew attention for the models' reasoning and multimodal capabilities. Claude 3.5 Sonnet, released in June 2024, is the next iteration, positioned as an efficient, mid-sized model. Compared with its predecessor, it is faster and cheaper to run while performing strongly on coding, mathematics, and vision tasks.

SWE-bench is a widely used benchmark for evaluating AI programming capability, developed by researchers at Princeton University and partner institutions. It draws on real issues from GitHub repositories: the model must generate a patch from the issue description, and the patch is validated by running the repository's automated tests. Earlier top models such as GPT-4o scored only around 30%-40%, which is why a score above 90% for Claude 3.5 Sonnet would be a milestone.
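
The scoring idea described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` class, its field names, and the sample data are assumptions made for the example. It assumes a task counts as resolved only when the tests targeted by the fix now pass and the tests that passed before the patch still pass.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes after applying a model-generated patch (hypothetical shape)."""
    task_id: str
    fail_to_pass: dict[str, bool]  # tests the fix is expected to turn green
    pass_to_pass: dict[str, bool]  # tests that must keep passing


def is_resolved(result: TaskResult) -> bool:
    # Resolved only if every target test passes and nothing has regressed.
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolution_rate(results: list[TaskResult]) -> float:
    """Percentage of resolved tasks; 0.0 for an empty list."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Made-up sample data, for illustration only.
results = [
    TaskResult("task-1", {"test_fix": True}, {"test_old": True}),
    TaskResult("task-2", {"test_fix": True}, {"test_old": False}),  # regression
    TaskResult("task-3", {"test_fix": False}, {"test_old": True}),  # bug not fixed
    TaskResult("task-4", {"test_fix": True}, {}),
]
print(f"{resolution_rate(results):.1f}%")
```

Under this definition a patch that fixes the reported bug but breaks an existing test does not count, which is what makes the metric stricter than checking whether the patch merely applies.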

Core Content: The Technical Secrets Behind the 90%+ Score

Claude 3.5 Sonnet achieved a 92.0% resolution rate on the SWE-bench Verified subset, according to figures Anthropic announced on X that quickly went viral. The subset consists of 500 human-validated software engineering tasks drawn from open-source Python repositories, and each task requires the model to understand a complex codebase, diagnose a bug, and generate a precise patch.

The key lies in the model's agentic programming capability: it can reason iteratively, invoke tools, simulate terminal operations, and make changes across multiple files. Anthropic's Constitutional AI framework is intended to keep that output safe and reliable. Claude 3.5 Sonnet also offers a 200K-token context window and more refined instruction following, which help it handle long coding tasks.
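
The iterate, call a tool, observe, repeat pattern can be shown with a toy loop. This is a sketch under stated assumptions, not Anthropic's implementation or API: the tool names (`read_file`, `write_file`, `run_tests`), the `Action` shape, and `scripted_policy`, which stands in for the model, are all hypothetical.

```python
from typing import Callable

Action = dict  # e.g. {"tool": "read_file", "args": {"path": "app.py"}}


def make_tools(files: dict[str, str]) -> dict[str, Callable[..., str]]:
    """Build a tiny in-memory tool set (hypothetical names)."""

    def read_file(path: str) -> str:
        return files[path]

    def write_file(path: str, content: str) -> str:
        files[path] = content
        return "written"

    def run_tests() -> str:
        namespace: dict = {}
        exec(files["app.py"], namespace)  # stand-in for a real test runner
        return "pass" if namespace["add"](2, 3) == 5 else "fail"

    return {"read_file": read_file, "write_file": write_file, "run_tests": run_tests}


def scripted_policy(history: list[tuple[Action, str]]) -> Action:
    """Stand-in for the model: chooses the next action from past observations."""
    if not history:
        return {"tool": "run_tests", "args": {}}
    last_action, observation = history[-1]
    if last_action["tool"] == "run_tests":
        if observation == "pass":
            return {"tool": "done", "args": {}}
        return {"tool": "read_file", "args": {"path": "app.py"}}
    if last_action["tool"] == "read_file":
        fixed = observation.replace("a - b", "a + b")
        return {"tool": "write_file", "args": {"path": "app.py", "content": fixed}}
    return {"tool": "run_tests", "args": {}}  # after a write, re-run the tests


def run_agent(policy, tools, max_steps: int = 10) -> list[tuple[Action, str]]:
    """Loop: ask the policy for an action, run the tool, record the observation."""
    history: list[tuple[Action, str]] = []
    for _ in range(max_steps):
        action = policy(history)
        if action["tool"] == "done":
            break
        observation = tools[action["tool"]](**action["args"])
        history.append((action, observation))
    return history


files = {"app.py": "def add(a, b):\n    return a - b\n"}
trace = run_agent(scripted_policy, make_tools(files))
print([action["tool"] for action, _ in trace])
```

In a real system the policy would be a model call and the tools would touch an actual repository and test suite; the `max_steps` cap is one simple way to keep such a loop from running indefinitely.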

The developer community has responded enthusiastically. The #Claude35Sonnet topic on X has drawn more than 150,000 interactions, with programmers sharing use cases that range from debugging legacy code to building full-stack applications, often producing a working prototype after a few minutes of prompting. Independent developer @levelsio posted: "Rewrote my SaaS tool with Claude 3.5, efficiency increased 5x, code quality matches humans." Shared projects have multiplied, and numerous Claude-driven repositories have appeared on GitHub.

Various Perspectives: Praise and Skepticism Coexist

Industry insiders have mixed reactions to this breakthrough. Anthropic CEO Dario Amodei stated in the release blog: "Claude 3.5 Sonnet demonstrates that AI is approaching professional software engineer levels. Our goal is to accelerate human innovation, not replace it."

OpenAI founding member Andrej Karpathy commented on X: "SWE-bench 90% is big news, but don't forget benchmark limitations. In the real world, AI still needs human supervision and iteration." He emphasized that while AI excels at pattern matching, it lacks deep system design capabilities.

Google DeepMind researcher Jack Rae struck a similar note: "This marks the S-curve inflection point for coding AI, but the debate should shift toward collaboration rather than competition." Some developers, on the other hand, worry about the impact on employment. A Reddit user, @codewhisperer, posted: "If AI can handle 90% of SWE, what happens to junior programmers?" The debate spread quickly; a Stack Overflow survey showed that 60% of developers believe AI will reshape rather than eliminate programming jobs.

AI safety organizations such as Apollo Research also warn: "High-capability coding AI amplifies risks; we need stronger protective measures to prevent malicious code generation." Anthropic has built in multiple layers of protection, but the community is calling for more transparent evaluation.

Impact Analysis: Reshaping the Software Development Ecosystem

Claude 3.5 Sonnet's breakthrough could have a profound effect on the software industry. The first effect is a jump in productivity: enterprises can iterate on prototypes faster, and startup teams could cut idea-to-MVP time by more than 50%. Tools like GitHub Copilot will face pressure to upgrade, while Anthropic's accessible API pricing ($3 per million input tokens) encourages broad adoption.
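
To put the quoted input price in concrete terms, here is a small worked calculation. It uses only the $3 per million input tokens figure from the article; output-token pricing is not given there and is left out, and the function name and the daily request count are assumptions for illustration.

```python
INPUT_PRICE_PER_MILLION_USD = 3.00  # input-token price quoted in the article


def input_cost_usd(input_tokens: int,
                   price_per_million: float = INPUT_PRICE_PER_MILLION_USD) -> float:
    """Cost in USD of sending `input_tokens` tokens; output tokens are not included."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens * price_per_million / 1_000_000


# One request that fills the 200K-token context window mentioned earlier.
full_context = input_cost_usd(200_000)
print(f"one full-context request: ${full_context:.2f}")

# Hypothetical workload: 50 such requests per day.
print(f"50 requests per day: ${50 * full_context:.2f}")
```

Even a request that fills the whole context window costs well under a dollar on the input side, which helps explain why repeatedly feeding a model large portions of a codebase is economically plausible.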

The second effect is a change in roles: programmers shift from "coders" to "architects + AI coaches," with more emphasis on defining problems and validating results. In education, programming courses may add modules on collaborating with AI. In the long term, this could help the open-source ecosystem flourish but also widen the gap between developers, with those who master AI tools pulling ahead.

Globally, the Chinese developer community is equally active. Bilibili and Zhihu are buzzing with discussions about Claude 3.5, and giants like Alibaba Cloud and Tencent may accelerate their own efforts to catch up. Some economic forecasts predict that by 2030 AI will contribute 30% of software development output, worth trillions of dollars.

Challenges remain: how well benchmark results generalize, hallucinations, and ethical boundaries all need to be addressed. The SWE-bench authors praise the progress but point to the limited size of the test set and call for more comprehensive metrics, such as LiveCodeBench, in the future.

Conclusion: Dawn of a New Era in AI Coding

Claude 3.5 Sonnet's SWE-bench score of over 90% is not just a technical showcase but a statement about AI-human collaboration. It has fueled debate while pointing to a path forward in which AI becomes a powerful assistant for programmers and pushes the boundaries of innovation. As technology practitioners, we should embrace the change and adapt to this wave of "human-level" programming. Anthropic's next step, the more powerful Claude 4, is already highly anticipated.
