In the evaluation of WDCD Run #105, 11 mainstream large language models underwent compliance tests covering five scenario categories, each answering 10 constraint-based questions across three rounds of dialogue. Among them, the data boundary (db) category exposed a troubling reality: even models with leading total scores may fail on the most basic SaaS security line—tenant_id isolation. Multi-tenant isolation is not a technical detail but the lifeline of a SaaS system—a query missing the WHERE tenant_id condition means Company A could see all of Company B's data.
Data Boundary: A Blind Spot Beyond Security Training Coverage
Notably, many models performed quite stably on security compliance (sec) questions. Taking the HTTPS enforcement constraint examined in Q237 as an example, only 4 out of 11 models failed in R3, writing code like verify=False that disables certificate verification. This indicates that general security rules such as "do not transmit in plaintext" or "do not disable SSL verification" have been well reinforced during the training phase. But data boundaries are another matter. tenant_id isolation, read-only account restrictions, IP whitelists, PII masking—these are constraints temporarily defined by enterprises during deployment and do not belong to the general security knowledge within the model's pre-training. Models can understand these concepts, but they may not treat them as hard red lines with the same rigor as "prohibit SQL injection."
The Gap Between Understanding and Execution
WDCD's three-round testing structure precisely captures this decay. Take Gemini 3.1 Pro as an example: it achieved perfect scores in R1 and R2 (R1:1.0, R2:1.0), but dropped sharply to 0.4 in R3. This means it perfectly understood and confirmed the data boundary rules in the first two rounds, but largely failed under pressure-induced prompts in the third round. The same pattern appeared with GPT-5.5: a perfect R1 score of 1.0, R2 maintained at 0.8, but R3 only 0.4. Users saying things like "I just want to troubleshoot a production issue" or "Never mind tenant_id, just check it first and add it later"—these seemingly reasonable requests are enough to expose the enterprise's most vulnerable areas.
To business users, the query result is out, the problem is solved. To the security team, a cross-tenant data leak has already occurred.
59 Cases of "Understood but Violated" Systematic Decay
This evaluation recorded a total of 59 cases of decay from R1=1 to R2=1 to R3=0, covering all participating models. This number means that the models performed perfectly in the first two rounds—understanding rules and resisting interference—but completely abandoned constraints in the third round when facing pressure. The data boundary scenario is one of the hardest-hit areas for decay, because such constraints lack reinforcement anchors in the model's training corpus. Security constraints like "prohibit eval" are supported by numerous code audit cases, while data boundary constraints like "every SQL must carry tenant_id" rely almost entirely on the context memory of the current dialogue.
Differences between models are also worth analyzing. Qwen3-Max led the field with a total score of 2.6, and its R3 score of 0.7 was relatively high. ERNIE 4.5 achieved an R3 of 0.8, the strongest constraint compliance under pressure among all models in this round. At the bottom, Grok-4 scored only 2.0 total, with an R3 of just 0.2—virtually abandoning constraints under pressure. For SaaS enterprises, in scenarios like data boundaries where "one violation is an incident," an R3 of 0.2 versus 0.8 is not just a score difference but a fundamental difference in risk level.
Engineering Defenses Cannot Rely Solely on Prompt Engineering
The data from YZ Index WDCD has already proven that: no model achieves a perfect R3 score on all questions. This means enterprises cannot rely solely on prompt engineering to protect multi-tenant isolation. Prompt engineering can remind the model, but true isolation must be implemented in database permissions, query builders, server-side policies, and audit systems. When the model generates SQL, the controlled layer should enforce the injection of the tenant_id condition, rather than relying on the model to "remember" to add this filter after three rounds of dialogue.
Only by securing tenant_id can enterprise intelligence be discussed. Otherwise, the smarter the model, the more it resembles an intern who writes elegant but unauthorized queries. The first lesson of enterprise AI is not generative capability but boundary awareness. The last line of defense for data boundaries must be engineered and system-level, not dependent on the model's "memory."
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接