Five Scenarios as a Reality Check: Resource Constraints Trip Up Every Model, Top Score Only 2.17
The WDCD pilot data show that no model keeps its commitments across all five scenarios. The "resource constraints" scenario, seemingly the simplest, tripped up every model, with the top scorer, grok-4, reaching just 2.17 out of 4.