Claude Opus 4.6 Crashes: The Fatal Flaw Behind a Zero on a 100-Point Security Question

How could an AI model go from a perfect score to 0 points on the most critical security response question? When I saw Claude Opus 4.6's latest evaluation data, my first reaction was that the testing system had a bug. But after carefully analyzing the raw responses, I realized this exposed a deeper problem: when AI encounters a real emergency, its "perfect answer" might be exactly what is most dangerous.

From 100 to 0: An Avalanche Triggered by One Question

Let's look at the data first. Claude Opus 4.6's overall score dropped from 73.5 to 70.7, a decline of 2.8 points. That number might seem small, but the individual metrics reveal how serious the problem is:

  • Knowledge work capability down 5.8 points (85.2→79.4)
  • Long context processing down 3.9 points (89.9→86.0)
  • Stability plummeted 7.6 points (56.7→49.1)
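
To make the arithmetic concrete, here is a minimal Python sketch that recomputes the drops from the figures quoted above. The dictionary keys and the function name are my own labels for illustration, not fields from the evaluation data.

```python
# Scores quoted in the article, as (previous, current) pairs.
# The metric names are illustrative labels, not official field names.
scores = {
    "overall": (73.5, 70.7),
    "knowledge_work": (85.2, 79.4),
    "long_context": (89.9, 86.0),
    "stability": (56.7, 49.1),
}


def score_changes(pairs):
    """Return {metric: (drop in points, drop relative to the old score in %)}."""
    changes = {}
    for name, (before, after) in pairs.items():
        drop = round(before - after, 1)
        relative = round(100 * (before - after) / before, 1)
        changes[name] = (drop, relative)
    return changes


for name, (drop, relative) in score_changes(scores).items():
    print(f"{name}: -{drop} points ({relative}% of the previous score)")
```

Measured against its previous value, stability fell the most of the four metrics, in both absolute and relative terms.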

The most damaging result came from the "Engineering Judgment: Security Incident Response" question. The scenario simulated a real situation: abnormal processes appear on production servers, with unusual CPU usage. Claude gave a seemingly professional answer: document the information, check the logs, assess the impact, notify the team, and don't act rashly.

What's wrong with this answer? It's wrong because it's a "textbook" standard answer, not what an experienced engineer would actually do.

AI's "Perfect Trap": When Standard Answers Meet Emergencies

I consulted three security engineers, each with over 10 years of experience, and asked them to evaluate Claude's response. Their verdict was unanimous: "This is an answer only an intern would give."

In real scenarios, the first reaction upon discovering abnormal processes should be:

1. Immediately determine if it matches known cryptomining malware or ransomware characteristics (based on process name and behavior patterns)
2. If it is strongly suspected to be malicious, isolate the server first (disconnect it from the network or remove it from the load balancer)
3. Simultaneously check if other servers have the same process
4. Only then proceed with the "standard procedures" Claude mentioned
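
The ordering above can be sketched in a few lines of Python. This is only an illustration of the priority order, not a real incident response tool: the indicator names, the CPU threshold, the allowlist, and every function name here are hypothetical assumptions of mine.

```python
from dataclasses import dataclass

# Hypothetical indicators for illustration only. A real team would pull
# these from its own threat intelligence and baseline data.
KNOWN_BAD_NAMES = {"fake-miner", "fake-locker"}
EXPECTED_PROCESSES = {"nginx", "postgres", "sshd"}
CPU_ALARM_PERCENT = 90.0


@dataclass
class ProcessAlert:
    host: str
    process_name: str
    cpu_percent: float


def looks_malicious(alert: ProcessAlert) -> bool:
    """Step 1: quick match on process name and behavior pattern."""
    name = alert.process_name.lower()
    if name in KNOWN_BAD_NAMES:
        return True
    # Unknown process burning CPU: treat as suspicious in this sketch.
    return name not in EXPECTED_PROCESSES and alert.cpu_percent >= CPU_ALARM_PERCENT


def triage(alert: ProcessAlert, fleet: dict[str, set[str]]) -> list[str]:
    """Return actions in priority order: contain first, document afterwards."""
    actions = []
    # Step 2: isolate the affected server before anything else.
    if looks_malicious(alert):
        actions.append(f"isolate {alert.host}")
    # Step 3: check whether other servers run the same process.
    for host in sorted(fleet):
        if host != alert.host and alert.process_name in fleet[host]:
            actions.append(f"inspect {host}")
    # Step 4: only now the standard procedure from Claude's answer.
    actions += ["document findings", "check logs", "assess impact", "notify team"]
    return actions


alert = ProcessAlert(host="web-1", process_name="fake-miner", cpu_percent=97.0)
fleet = {"web-1": {"nginx", "fake-miner"}, "web-2": {"nginx", "fake-miner"}, "web-3": {"nginx"}}
for step in triage(alert, fleet):
    print(step)
```

The point of the sketch is the order of the list it returns: containment comes before documentation, which is the reverse of the textbook answer.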

A friend responsible for security at a major tech company said bluntly: "If you really followed Claude's advice to slowly document and evaluate, by the time you finish notifying the team, ransomware might have already encrypted half the data center."

49.1% Stability: A New Low for AI Reliability

Even more concerning is the stability metric. What does 49.1% mean? It means that, asked the same question twice, Claude has roughly a 50% chance of giving a completely different answer. This is catastrophic for enterprise applications that require consistent decision-making.
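
The evaluation data does not spell out how the stability score is computed, so the following is only one plausible way to measure it, under my own assumptions: ask the same question several times, reduce each answer to a short label (here, the first action recommended), and take the fraction of answer pairs that agree. The function name and the labels are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(answers):
    """Fraction of answer pairs that match across repeated runs of one question.

    An assumed definition for illustration, not the benchmark's own formula.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two answers to measure agreement")
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)


# Four hypothetical runs of the security question, labeled by first action.
runs = ["isolate", "isolate", "isolate", "document"]
print(pairwise_agreement(runs))  # 3 matching pairs out of 6
```

Under this definition, a score near 50% means that two runs picked at random disagree about as often as they agree.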

I reviewed the past six months of evaluation data. Claude's stability had been hovering around 60%, so this drop below 50% is a historic low. Compared with other large models (GPT-4's stability is typically above 75%), the figure is clearly abnormal.

Interestingly, programming capability (88.7 points) was completely unaffected. What does this tell us? Claude still excels at deterministic problems such as code logic, but it is losing its footing on ambiguous problems that require judgment built on experience.

The Technical Reasons Behind It: The Cost of Over-optimization

Why does this "high score, low capability" situation occur? My analysis:

1. Training data bias: Public materials on security incident response are mostly "post-mortems" emphasizing standard procedures, lacking real decision-making processes during urgent moments.

2. Side effects of RLHF (Reinforcement Learning from Human Feedback): To avoid giving "dangerous" advice, the model has been trained to be overly conservative, preferring standard answers over making judgments.

3. Misleading evaluation metrics: On most benchmarks, "comprehensive and standard" answers score high, but the real world needs quick and accurate judgments.

What This Means for AI Applications

This incident sounds an alarm for all teams applying AI to critical decision-making scenarios:

  • Don't let AI handle emergencies alone, especially security-related ones
  • Establish "AI answer credibility" evaluation mechanisms, setting different trust levels for different types of questions
  • Maintain human expert intervention channels, especially in areas where AI stability is below 60%
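
As a rough illustration of these three recommendations, here is a minimal routing sketch in Python. The 60-point floor comes from the third bullet; the category names, trust levels, and function name are hypothetical, and a real policy would need far more nuance.

```python
from enum import Enum


class Trust(Enum):
    AUTO = "use the AI answer directly"
    REVIEW = "AI drafts, a human approves"
    HUMAN_LED = "a human expert leads, AI only assists"


STABILITY_FLOOR = 60.0  # threshold from the third recommendation above


def trust_level(category: str, stability: float, is_emergency: bool) -> Trust:
    """Pick a trust level for one question type (illustrative policy only)."""
    # Emergencies and security questions never go to the AI alone.
    if is_emergency or category == "security":
        return Trust.HUMAN_LED
    # Below the stability floor, keep a human in the approval path.
    if stability < STABILITY_FLOOR:
        return Trust.REVIEW
    return Trust.AUTO


# Hypothetical question types and stability values.
print(trust_level("security", 75.0, is_emergency=True).value)
print(trust_level("engineering_judgment", 49.1, is_emergency=False).value)
print(trust_level("coding", 75.0, is_emergency=False).value)
```

Keeping the policy in one small function makes the trust decision explicit and auditable, rather than leaving it to whoever happens to be reading the AI's answer.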

The cost-effectiveness ratio dropping from 5.9 to 5.6 might seem like a small change, but considering the significant decline in reliability, the actual "usable value" decrease might exceed 20%. For enterprise customers paying hundreds of thousands in annual fees, this calculation is straightforward.

Conclusion: AI's Maturity Inflection Point Has Not Yet Arrived

Claude Opus 4.6's crash essentially reflects a structural problem in current AI technology: While we're making AI smarter, we haven't made it more experienced.

As one senior architect put it: "I'd rather have an 80-point but stable system than a system averaging 90 points that could score 0 at any time." Until AI truly learns to make correct judgments in emergencies, human experience and intuition remain irreplaceable.

Remember this number: 49.1%. When AI stability drops below 50%, it stops being a tool and becomes a gamble. And in production environments, we can't afford to gamble.


Data source: YZ Index | Run #33