Original AI News | Winzheng

Nine Models Tie at 77.5 on Main Leaderboard: Perfect Code Execution Scores, but Only 50 on Material Constraint

The Smoke Lite evaluation of June 5, 2026, shows that 9 of 11 models tied at 77.5 on the main leaderboard, a rare result. All nine scored a perfect 100 on the Code Execution dimension but only 50 on the Material Constraint dimension.
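
The two Smoke ties reported on this page are consistent with a simple weighted average of the two dimensions, although the page never states the formula. As a rough sketch under that assumption (the function names and the weighted-average form are hypothetical, not Smoke's documented scoring), the implied Code Execution weight can be backed out from the 77.5 tie (100 and 50) and cross-checked against the 87.76 tie listed further down (100 and 72.8):

```python
def implied_weight(main: float, execution: float, constraint: float) -> float:
    """Solve main = w * execution + (1 - w) * constraint for w.

    Assumes (hypothetically) that the main score is a two-dimension
    weighted average; the source does not confirm this formula.
    """
    return (main - constraint) / (execution - constraint)


def combine(execution: float, constraint: float, w: float) -> float:
    """Recompute a main score from the two dimensions under the same assumption."""
    return w * execution + (1.0 - w) * constraint


# Figures taken from the two Smoke items on this page.
w_lite = implied_weight(77.5, 100.0, 50.0)     # Smoke Lite tie
w_quick = implied_weight(87.76, 100.0, 72.8)   # Smoke quick-test tie

print(round(w_lite, 4), round(w_quick, 4))
print(round(combine(100.0, 50.0, w_lite), 2))
```

Both pairs yield a weight of about 0.55 for Code Execution, which would explain why a perfect execution score cannot lift a model past 77.5 when its constraint score is 50. This is an inference from two data points, not a confirmed scoring rule.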

AI Agents Become a Hot Topic in Tech: Excitement Over Multimodality and Enterprise Automation Meets a Reality Gap

Discussions around AI agents have surged on X (formerly Twitter), with participants including developers, investors, and corporate executives. While excitement grows over multimodal integration and enterprise automation, concerns about the gap between hype and practical deployment remain significant.

Alphabet Raises $85 Billion to Boost AI, Setting a New Capital Record for Google's Business

Alphabet recently announced a massive $85 billion financing round to expand its Google AI business, setting a company record and marking one of the largest single AI investments in the global tech industry.

Anthropic Files for IPO, Marking AI's Shift from Hype to Enterprise Utility

In 2025, Anthropic filed an IPO prospectus with the SEC, marking the AI industry's transition from early hype to mature commercialization. The move is seen as a bellwether for the sector.

xAI Sued by UK MP Over Grok-Generated Sexualized Images, Sparking AI Content Safety Controversy

A British MP has filed a lawsuit against xAI, alleging that its chatbot Grok generated sexualized images, igniting intense debates about AI ethics, content safety, and regulatory responsibility.

TSMC CEO Optimistic About AI Chip Demand as Semiconductor Industry Enters Strong Growth Cycle

TSMC's CEO publicly stated that AI chip demand remains robust, driving the company's growth and lifting the semiconductor sector. However, challenges such as geopolitical factors and intensified competition persist.

AI Bubble Fears Echo the Internet Bubble: Microsoft and OpenAI's Circular Revenue Model Sparks Market Debate

A contentious debate in the tech community holds that the circular revenue flow between Microsoft and OpenAI mirrors the business logic of the dot-com bubble era. The debate has coincided with notable stock market fluctuations and renewed scrutiny of whether AI investment is sustainable.

Cloudflare Report: AI Agent Traffic Surpasses Human Traffic for the First Time, and the Internet's Future May Change Dramatically

Cloudflare's latest report reveals that traffic generated by AI agents has exceeded human user traffic for the first time, marking a significant shift in the internet landscape and sparking widespread discussion in the tech community.

Smoke Quick Test: Ernie Bot 4.5 and Grok 4 Tie at 99.24, GPT-5.5 Scores Only 50 on Execution

Smoke's quick test results today clearly show that the code execution dimension is nearly saturated. Ten out of eleven models scored 100, while GPT-5.5 dropped to 50, dragging its main leaderboard score down to 59.99.

Deep Dive into Project Solara: Why Microsoft Built a New "Agent-First Device" Platform and Bypassed Windows

Microsoft's Project Solara is a chip-to-cloud device platform designed exclusively for AI agents, not traditional apps. By basing it on AOSP instead of Windows, Microsoft makes a pragmatic bet on a future where interaction shifts from apps to intents.

Research Lab

WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models; Claude Opus 4.7, GPT-5.5, and GPT-o3 Tie at Top

WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording an average instruction decay of 24.7% from Round 1 to Round 3. Claude Opus 4.7, GPT-5.5, and GPT-o3 tied for first place at 70 points with only -10% decay each.

Grok 4 Surges 10.8 Points to Dominate, Qwen3 Max Plunges 10.8 Points – Major Shuffle in WDCD Cycle

Run #141 data shows that Grok 4 improved by 10.8 points in a single round, GPT-5.5 improved by 9.2 points, while Qwen3 Max plummeted by 10.8 points. The divergence in adherence capabilities has become clearly visible.

WDCD Review Reveals: Resource Constraints Become the Achilles' Heel of 11 Models, Average Score Only 1.7

The most brutal finding of the WDCD compliance test is that resource constraints crippled all models, with an average score of only 1.7 across 11 models, far below the other four scenarios.

11-Model WDCD Three-Round Test: R1 95% Commitment, R3 65 Direct Collapses

The core findings of the WDCD three-round test are clear: nearly all models scored high in the constraint establishment phase, but after two rounds of interference, over 60% of models completely abandoned their original commitments under direct pressure.

Three Tied at 70 on WDCD Commitment List, Ernie Bot 4.5 Collapses to 50 at Bottom

The WDCD commitment test exposes model weaknesses through a three-round dialogue design. Only three of the eleven models scored 70, while Ernie Bot 4.5 finished last at 50, a clear gap behind the rest.

Three Models Tie for Top Spot on Smoke Leaderboard with Perfect Execution Scores but Constraint Warnings

In today's Smoke quick test results, Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.5 all scored 87.76 on the main leaderboard, tying for first place. All three achieved a perfect 100 in code execution but scored only 72.8 on the material constraint dimension, which triggered a warn signal.

Anthropic Files for IPO, Contrast with OpenAI's Unlisted Status Sparks Market Divergence

Anthropic officially submitted its IPO application on June 1, 2026. The decision highlights valuation uncertainties, profitability comparisons, and a stark contrast with OpenAI, which remains unlisted, fueling debate over market expectations for AI startups.

Senator Warren Calls for Closing AI Chip Export Loophole to China, Trump Policies in Spotlight

Senator Elizabeth Warren has publicly urged the government to close potential loopholes in AI chip exports to China, criticizing a possible loosening of policy under the Trump administration. She is calling for stricter controls to prevent Chinese companies from bypassing restrictions through overseas subsidiaries.

Jensen Huang's GTC Taipei Speech: Agentic AI Leads the New Era of Physical Intelligence

At the NVIDIA GTC conference in Taipei, CEO Jensen Huang delivered a keynote officially heralding the arrival of the agentic AI era. He highlighted the mass production plan for the Vera Rubin platform, announced a chip designed for AI PCs, and emphasized the convergence of physical AI and robotics.

HPE Q2 Results Beat Expectations on AI Data Center Demand Surge, Achieving 2028 Target Two Years Early

HPE's second-quarter earnings exceeded Wall Street expectations, driven by strong demand for AI servers and networking products, allowing the company to reach its 2028 financial goals two years ahead of schedule.