The automotive industry doesn't put a car on the road just because it accelerates quickly. What actually determines safety is braking, steering, crash behavior, and structural integrity under extreme conditions. AI agents are entering exactly the same phase. WDCD Run#105 put 11 mainstream models through a triple-round stress test on 10 constraint-based problems, in effect a genuine "crash test", and the results show that even the smartest models have clear breaking points.
Crash Test Scorecard: No Vehicle Passed All Rounds
First, the overall rankings, out of a maximum score of 3.0. Qwen3-Max leads at 2.6; Claude Sonnet 4.6, DeepSeek V4 Pro, ERNIE 4.5, and GPT-o3 tie at 2.5; Claude Opus 4.7, Gemini 2.5 Pro, and Gemini 3.1 Pro follow closely at 2.4; Doubao Pro and GPT-5.5 score 2.2; Grok-4 sits at the bottom with 2.0. No model achieved a perfect score; the best reached only 87% of the maximum. Translated into car-crash ratings, the best vehicle would earn four stars, and the five-star slot remains empty.
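The report doesn't fully specify WDCD's scoring formula, but the published numbers are consistent with a simple sum of three per-round scores, each in [0, 1]. Here is a minimal sketch of that arithmetic in Python; only Grok-4's and ERNIE 4.5's R1/R3 values are stated in this article, so the R2 values below are derived from their published totals and are an assumption, not reported data.

```python
# Sketch: aggregate a WDCD-style scorecard, assuming total = R1 + R2 + R3,
# with each round scored in [0, 1] (so the maximum total is 3.0).
# Only the R1/R3 values for these two models appear in the article;
# R2 is back-derived from the published totals (2.0 and 2.5).

ROUND_SCORES = {
    #             R1    R2    R3
    "Grok-4":    (1.0, 0.8, 0.2),   # R2 = 2.0 - 1.0 - 0.2 (derived)
    "ERNIE 4.5": (0.8, 0.9, 0.8),   # R2 = 2.5 - 0.8 - 0.8 (derived)
}

MAX_TOTAL = 3.0

for model, rounds in ROUND_SCORES.items():
    total = sum(rounds)
    print(f"{model}: total {total:.1f} / {MAX_TOTAL} "
          f"({total / MAX_TOTAL:.0%} of maximum)")
```

The same arithmetic explains the "87% of maximum" figure for the leader: 2.6 / 3.0 ≈ 87%.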
More critically, each model has its own crash vulnerability. Grok-4 scored a perfect 1.0 in R1: it understood every constraint, like a car with a top-tier engine. But its R3 score was only 0.2, meaning it almost completely collapsed under pressure. The fastest-accelerating car can be the one that crumples worst in a crash. Traditional capability evaluations only read the acceleration numbers; a crash test like WDCD exposes the structural defects.
Q239: Every Car Crashes Here
The most valuable findings in a crash test are often the items where all models fail together. In WDCD, Q239 is exactly such an item. Its constraint is simple: the project must use only the FastAPI framework. Yet after three rounds of inducement, all 11 models violated the constraint, a 100% failure rate; every single one generated Flask code. This is not a defect of one particular model but a universal structural weakness. It is like discovering in a crash test that the A-pillar of every make on the market deforms: the problem lies in the industry's shared design philosophy, not in one manufacturer's workmanship.
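Q239's exact prompt isn't reproduced in the report, but the failure mode is easy to picture. The hypothetical snippet below contrasts a constraint-compliant FastAPI endpoint with the Flask equivalent the models drifted toward; the route and handler names are invented for illustration, and running it requires both packages installed.

```python
# Hypothetical illustration of Q239's failure mode (names are invented).
# Constraint: the project must use only FastAPI.

# What the constraint demands:
from fastapi import FastAPI

fastapi_app = FastAPI()

@fastapi_app.get("/health")
def health_fastapi():
    return {"status": "ok"}

# What every model eventually produced instead: the Flask default path.
from flask import Flask, jsonify

flask_app = Flask(__name__)

@flask_app.route("/health")
def health_flask():
    return jsonify(status="ok")
```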
Q239's 100% failure rate reveals a deeper mechanism: when a constraint pits two options that are equally familiar in the model's training data against each other (FastAPI vs. Flask), the model regresses under pressure to the more "convenient" default path. This is not a knowledge gap but behavioral inertia. In enterprise deployment, that inertia can surface in any scenario involving technology-selection constraints.
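The article doesn't say how WDCD graded Q239, but a violation of a "FastAPI only" constraint is mechanically detectable. A minimal sketch, assuming the generated project is plain Python source: walk the AST and flag any Flask import.

```python
import ast

def uses_forbidden_framework(source: str, forbidden: str = "flask") -> bool:
    """Return True if the generated Python source imports the forbidden
    framework (e.g. 'import flask' or 'from flask import Flask')."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == forbidden for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] == forbidden:
                return True
    return False

# Example: the Flask code the models regressed to fails the check.
generated = "from flask import Flask\napp = Flask(__name__)\n"
assert uses_forbidden_framework(generated)  # constraint violated
```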
Crash Mechanics: How the R1→R3 Degradation Occurs
Crash tests don't just record outcomes; they analyze crash mechanics: how energy travels through the structure and at which node deformation begins. WDCD's triple-round design provides exactly this kind of degradation analysis. In Run#105, 59 cases exhibited the complete degradation curve R1=1 → R2=1 → R3=0. The model "fastens its seatbelt" in R1 (confirms the constraint), holds its course through the "complex road conditions" of R2 (long-document interference), then meets "sudden danger" in R3 (user pressure) and its safety structure fails instantly.
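The degradation curve is straightforward to operationalize. A sketch under one assumption: each round yields a binary pass (1) / fail (0) per case, which is what the 59-case count implies. The case data below is placeholder, not WDCD results.

```python
# Sketch: classify three-round trajectories, assuming each round yields a
# binary pass (1) / fail (0) per case. The cases list is placeholder data.

def classify(r1: int, r2: int, r3: int) -> str:
    if (r1, r2, r3) == (1, 1, 0):
        return "complete degradation"  # holds through R2, collapses in R3
    if (r1, r2, r3) == (1, 0, 0):
        return "early degradation"     # lost under long-document interference
    if (r1, r2, r3) == (1, 1, 1):
        return "held"
    return "other"

cases = [(1, 1, 0), (1, 1, 1), (1, 1, 0), (1, 0, 0)]  # placeholder data
curve_counts = {}
for case in cases:
    label = classify(*case)
    curve_counts[label] = curve_counts.get(label, 0) + 1
print(curve_counts)  # Run#105 reported 59 "complete degradation" cases
```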
ERNIE 4.5 provides an intriguing counterexample. Its R1 score of 0.8 was the lowest of the 11 models, the loosest seatbelt; but its R3 score of 0.8 was the highest. It is like a car with a rough exterior and an extremely sturdy steel frame: the first impression trails the competition, yet it stays the most intact in a real collision. The data suggests that a model's "commitment ability" in R1 and its "adherence ability" in R3 may come from entirely different internal mechanisms.
The value of a crash test is not to torment vehicles but to tell buyers, before they ever drive, where the car will break.
From Crash Test to Road Standards
Automotive crash tests changed the entire industry. The rating systems of Euro NCAP and IIHS taught consumers to look beyond horsepower and styling to side-impact performance and pedestrian protection. WDCD is establishing the same evaluation dimension for AI agents. Traditional benchmarks (MMLU, HumanEval, MATH) measure horsepower: what a model can do. WDCD measures crash performance: whether a model loses control under pressure.
When enterprises procure models, they should look not only at the "acceleration scores" in demos but also at WDCD-style crash reports. When Q239 shows that every model fails on technology-selection constraints, enterprises know that position needs external protection, just as every car needs airbags. When Grok-4's R3 is only 0.2, enterprises know that model is not suited to direct execution, no matter how stunning its demo. Only after passing crash tests can an agent be trusted to move from the advisory layer to the execution layer. The crash test does not deny speed; it proves that the speed is controllable.
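The "airbag" here is an ordinary guardrail pattern: validate the agent's output against the declared constraint before anything executes, and route failures back to the advisory layer. A minimal sketch; the validator stands in for any constraint check (such as the Flask-import detector sketched earlier), and the routing policy is an assumption, not part of WDCD.

```python
# Sketch of an "airbag" guardrail: validate agent output before execution.
# validate() stands in for any constraint check; the routing policy
# (execute vs. escalate to human review) is an assumption for illustration.

from typing import Callable

def guarded_execute(agent_output: str,
                    validate: Callable[[str], bool],
                    execute: Callable[[str], None],
                    escalate: Callable[[str], None]) -> None:
    if validate(agent_output):
        execute(agent_output)    # constraint holds: execution layer
    else:
        escalate(agent_output)   # constraint broken: back to advisory layer

# Usage: reject anything that mentions Flask when the project is FastAPI-only.
guarded_execute(
    "from flask import Flask",
    validate=lambda src: "flask" not in src.lower(),
    execute=lambda src: print("deploying:", src),
    escalate=lambda src: print("blocked, routed to human review:", src),
)
```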
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original when republishing.