Gemini 2.5 Pro Drops from 100 to 0 on Time Zone Reasoning: How Terrifying Are LLMs' Common-Sense Blind Spots?

A time zone question that an elementary school student could answer correctly earned Gemini 2.5 Pro, Google's most powerful model, a score of zero. More terrifying still, this looks less like an accidental slip than a systematic weakness in how the model handles basic real-world common sense.

From Perfect Score to Zero: A Trust Crisis Triggered by One Question

First, let's look at the question: When it's Saturday 15:00 Beijing time, what time is it in New York, London, Tokyo, and Sydney? This is a standard time zone reasoning question that tests the model's understanding of basic real-world knowledge.

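Gemini 2.5 Pro's recorded answer: New York Saturday 2:00, London Saturday 7:00, Tokyo Saturday 16:00, Sydney Saturday 18:00. The evaluation scored the response zero. Only Tokyo is unambiguous, because Japan does not observe daylight saving and always sits one hour ahead of Beijing. The other three cities shift by an hour with the season, and the question names no date, so a correct answer has to state which offsets it assumes. One sanity check never changes: Sydney is east of Beijing, so its clock must read later than Beijing's, never earlier.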

This isn't a simple calculation error. Relative to Beijing, New York is 13 hours behind (12 during US daylight saving), London is 8 hours behind (7 during British Summer Time), Tokyo is 1 hour ahead, and Sydney is 2 hours ahead (3 during Australian daylight saving). The zero score suggests the model has not reliably internalized the basic principle of time zones: Earth rotates from west to east, so clocks to the east run ahead of clocks to the west, up to the International Date Line.
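The offsets can be checked mechanically rather than from memory. The following is a minimal sketch using Python's standard `zoneinfo` module. The two Saturdays are arbitrary illustrative dates (the original question names none), the helper name `convert_from_beijing` is made up for this example, and the code assumes the system has IANA time zone data available.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# IANA zone names for the four cities in the question.
CITIES = {
    "New York": "America/New_York",
    "London": "Europe/London",
    "Tokyo": "Asia/Tokyo",
    "Sydney": "Australia/Sydney",
}


def convert_from_beijing(year, month, day, hour):
    """Map each city to (weekday, 'HH:MM', hours relative to Beijing)."""
    beijing = datetime(year, month, day, hour, tzinfo=ZoneInfo("Asia/Shanghai"))
    result = {}
    for city, zone in CITIES.items():
        local = beijing.astimezone(ZoneInfo(zone))
        # Difference between the two UTC offsets, in hours.
        delta = (local.utcoffset() - beijing.utcoffset()).total_seconds() / 3600
        result[city] = (local.strftime("%A"), local.strftime("%H:%M"), delta)
    return result


# Illustrative dates only: one Saturday in northern winter, one in northern summer.
winter = convert_from_beijing(2025, 1, 18, 15)
summer = convert_from_beijing(2025, 7, 19, 15)

for label, table in (("January", winter), ("July", summer)):
    for city, (weekday, clock, delta) in table.items():
        print(f"{label:8} {city:9} {weekday} {clock} ({delta:+.0f} h)")
```

For these two dates the sketch yields offsets of -13, -8, +1 and +3 hours in January and -12, -7, +1 and +2 hours in July, which is why the date matters for every city except Tokyo.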

Behind the Plummeting Scores: Systematic Collapse of Knowledge Work Capabilities

In this run, Gemini 2.5 Pro's scores dropped across the board. The knowledge work dimension fell 4.6 points (80.9→76.3), the largest decline of any indicator. Long context processing dropped 4.3 points and stability dropped 3.5 points. The overall score fell from 76.6 to 73.7. In the fierce competition among LLMs, a 2.9-point difference is enough to change the rankings.

More alarming, time zone reasoning belongs to the "strict questions" category: questions with exactly one correct answer and no room for subjective judgment. How can users trust a model that drops from 100 to 0 on a strict question to handle more complex real-world problems reliably?
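To make the all-or-nothing nature of a strict question concrete, here is a hypothetical grader sketch. It is not the YZ Index's actual scoring code, which the source does not describe; the function names, the answer format ("Saturday 16:00") and the 100-or-0 rule are all assumptions made for illustration.

```python
def normalize(value: str) -> str:
    """Canonicalize 'Saturday 2:00' or 'saturday 02:00' to 'saturday 02:00'."""
    day, clock = value.strip().lower().split()
    hours, minutes = clock.split(":")
    return f"{day} {int(hours):02d}:{minutes}"


def grade_strict(expected: dict[str, str], answer: dict[str, str]) -> int:
    """Hypothetical strict scoring: 100 only if every city matches, else 0."""
    if expected.keys() != answer.keys():
        return 0  # a missing or extra city fails the whole question
    all_match = all(
        normalize(answer[city]) == normalize(expected[city]) for city in expected
    )
    return 100 if all_match else 0


# Tokyo is always one hour ahead of Beijing, so 15:00 Beijing is 16:00 Tokyo.
reference = {"Tokyo": "Saturday 16:00"}
print(grade_strict(reference, {"Tokyo": "saturday 16:00"}))  # formatting differences are tolerated
print(grade_strict(reference, {"Tokyo": "Saturday 17:00"}))  # a one-hour miss scores nothing
```

Under a rule like this there is no partial credit: a single wrong city, or a single wrong hour, takes the score from 100 to 0.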

According to the evaluation data, this isn't Gemini's first failure on basic common sense. Its stability score is only 44.6 out of 100, which points to inconsistent behavior from run to run. When a model billed as "Pro" can't get time zones right, can we still expect it to handle more complex business decisions?

The Achilles' Heel of LLMs: When Intelligence Meets Common Sense

This incident exposes a fundamental problem with current LLMs: they may excel at complex reasoning but stumble on the most basic common sense judgments. This "high IQ, low common sense" characteristic is precisely the most dangerous aspect of AI systems.

Imagine your AI assistant making such errors when scheduling international meetings, or getting the time wrong when handling cross-timezone financial transactions. The consequences could be catastrophic. The irony is that Gemini 2.5 Pro scores 86.9 in programming ability: it can write complex algorithms but stumbles on simple time zone arithmetic.

The cost-effectiveness indicator dropped from 42.6 to 41.0, with the already low score continuing to decline. When users pay a premium for the "Pro" version but receive a service that can't even guarantee basic common sense, this gap directly affects users' willingness to pay.

Looking Beyond the Surface: The Value of Evaluation Systems

This incident also validates the necessity of strict question evaluation. Many question why we test AI with these "tricky" questions. The answer is simple: if a model can't solve clearly defined problems, how can we trust it to handle ambiguous real-world scenarios?

Time zone reasoning seems simple but actually tests the model's depth of understanding of the real world. It requires the model to have comprehensive capabilities in geographical knowledge (city locations), physical common sense (Earth's rotation), and social knowledge (time zone divisions). Gemini's failure shows that even the most advanced models still have huge deficiencies in knowledge integration and common sense reasoning.

The deeper question is whether this error stems from the training data or from a limitation of the model architecture. If the former, it indicates flaws in Google's data quality control; if the latter, it means the current Transformer architecture has fundamental defects in handling certain types of reasoning.

Final Thoughts

When the smartest AI can't even figure out what time it is, we might be further from true artificial general intelligence than we imagine.

Gemini 2.5 Pro's failure sounds an alarm for the entire industry: while pursuing parameter scale and benchmark scores, don't overlook the most basic common sense capabilities. A model that can't even calculate time zones correctly, no matter how well it performs on other tasks, will struggle to earn users' trust. This might explain why, despite various vendors constantly claiming breakthroughs, enterprises truly willing to fully deploy AI in critical business operations remain few and far between.


Data source: YZ Index | Run #33