AI Safety Topic

207 articles · Page 1 of 11

AI Safety encompasses alignment, controllability, robustness, and ethical governance. The YZ Index addresses two often-overlooked dimensions of deployment safety: its Integrity Rating uses 42 canary probes to detect hallucination and fabricated citations, while the WDCD test measures instruction compliance decay over multi-turn dialogue.

Alibaba Bans Claude Code, Accuses Backdoor; Anthropic Claims Anti-Distillation

Alibaba banned employees from using Claude Code on July 10, 2025, switching to self-developed Qoder, after discovering that the tool had been embeddin

Anthropic Accuses Alibaba of Using 25,000 Fake Accounts to Extract Claude Model Capabilities

Anthropic accused Alibaba's Qwen lab of using nearly 25,000 fake accounts to interact with Claude over 28.8 million times, aiming to distill its agent

Meta's Kenyan Contractors Disguise as Minors to Test ChatGPT and Other AI Safety, Sparking Ethical Controversy

In 2026, Wired exposed details of Meta's "Cannes" project, which hired hundreds of Kenyan contractors to create fake minor accounts and send prompts i

Anthropic Restores Fable Model and Releases Claude Sonnet 5, Export Control Relaxation Sparks New Discussions in Global AI Industry

Following the recent relaxation of US export controls on certain AI technologies, Anthropic announced the restoration of its previously restricted Fab

You Can Now Sound the Alarm on AI Behaving Badly

Are you worried your AI chatbot is trying to build a bomb or leak personal information about you? There’s a website for that.

After spooking Trump into safety testing, Anthropic AI models get global release

US lifts curbs on Anthropic’s advanced Fable and Mythos models.

Anthropic Added a New Security Measure to Get Back Into the Trump Administration’s Good Graces

The government has removed restrictions on Anthropic’s Fable 5 and Mythos 5 AI models—but there were strings attached.

Anthropic deploys Claude Sonnet 5, Fable and Mythos restored

Anthropic has launched Claude Sonnet 5 and restored access to its Fable and Mythos frontier models following a federal export control review. The deci

Claude Helped a Hacker Find a Way to Issue Tickets to Almost Every US Music Festival

A researcher found that using Anthropic’s Claude Opus 4.7, he could break into the website of Front Gate—used by every festival from Lollapalooza to B

Alibaba Accused of Distilling Claude with 25,000 Fake Accounts in Largest Known Model Theft Case

On June 10, 2026, Anthropic sent a letter to the U.S. Senate Committee on Banking, Housing, and Urban Affairs, accusing Alibaba of大规模 distilling the C

AI browsers can be lulled into a dream world where guardrails no longer apply

Telling an LLM that 2 + 2 = 5 is enough to make it follow forbidden instructions.

Meta Contractors Posed as Teens to Prompt Rival Chatbots About Suicide, Sex, and Drugs

Hundreds of contractors working on a project for Meta pretended to be kids in order to see how other chatbots like Gemini and ChatGPT would respond to

Review The patch model is breaking. AI evaluation needs a new way to disclose what it finds.

MLCommons指出，AI系统与传统软件不同，其评估发现具有双重用途、无法通过补丁修复，且开放权重模型的危害会永久存在。协调漏洞披露（CVD）模式因此失效。文章分析了三大核心挑战：发现易被滥用、过度反馈会污染测试、无法集中修复模型。MLCommons正推动ISO/IEC TS 42119-8标准制

Anthropic Fable 5 Subject to US Government Export Controls, Global Ban Raises Security vs. Innovation Conflict

In June 2026, Anthropic's new model Fable 5 was placed under US export controls due to jailbreak risks that could unlock powerful cyber capabilities,

OpenAI Has New AI Models. Here’s Why You Can’t Use Them

The White House asked OpenAI to delay the rollout of its GPT-5.6 AI models, two weeks after Anthropic had to take its most advanced AI models offline.

Anthropic Thinks Its Own Success Is Key to Making AI Safe

Anthropic's critics argue it's rapidly accumulating power. The company says that's what responsible AI development looks like.

The White House is asking OpenAI to slow roll the release of its new model over safety concerns

penAI reportedly plans to share its newest model, GPT 5.6, with a select group of partners instead of to the broader public. The reason: the Trump adm

Anthropic claims Alibaba defied Trump to attack Claude and steal capabilities

Alibaba allegedly used 25,000 accounts to mine Claude over 28.8 million exchanges.

I Met With China’s Top AI Experts. They’re Freaking Out, Too

The AI arms race between China and the US has researchers on both sides worried about a “Chernobyl moment.”

Five Eyes Warns: AI Models Could Launch Devastating Attacks Within Months, Global Cybersecurity Faces New Challenges

The Five Eyes alliance has issued a warning that AI models could be used to launch devastating cyberattacks within months, sparking global concern ove