Why AI Thinks 4.11 Is Bigger Than 4.9 — And What That Reveals About LLMs

It feels like a “gotcha” moment. You’re testing out a state-of-the-art Large Language Model (LLM), a piece of software that can write poetry, debug code, and summarize legal documents, and you throw it a curveball: “Which is larger: 4.11 or 4.9?”

To your surprise, the AI confidently asserts that 4.11 is greater than 4.9.

In the world of pure mathematics, we know that $4.9 > 4.11$ because $4.9$ is equivalent to $4.90$, and $4.90 > 4.11$. But the AI isn’t necessarily “bad at math”; it’s just not doing math the way you think it is. This specific error pulls back the curtain on how LLMs actually perceive the world.

1. The Tokenization Trap

The most fundamental reason for this error is tokenization. LLMs don’t see words or numbers the way humans do; they see “tokens”—chunks of characters that the model processes as units.

When an AI encounters 4.11 and 4.9, it doesn’t automatically convert them into floating-point numbers on a number line. Instead, it breaks them down:

4.9 might be seen as [4] [.] [9]
4.11 might be seen as [4] [.] [11]

Because “11” is numerically larger than “9” in almost every other context (counting objects, ages, or years), the model’s internal probability weights nudge it toward the bigger string of digits. It is performing a pattern match, not a calculation.
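The chunking described above can be imitated with a toy example. This is only a sketch: `toy_tokenize` and `chunk_compare` are hypothetical helpers invented for illustration, and a real tokenizer may split these strings differently. The point is just to contrast comparing digit chunks with comparing actual decimal values.

```python
import re
from decimal import Decimal


def toy_tokenize(text: str) -> list[str]:
    """Split text into runs of digits and single non-digit characters.

    A hypothetical stand-in for a real tokenizer, not any model's actual one.
    """
    return re.findall(r"\d+|\D", text)


def chunk_compare(a: str, b: str) -> str:
    """Pattern-match style: pick the string whose last digit chunk is bigger."""
    last_a = int(toy_tokenize(a)[-1])
    last_b = int(toy_tokenize(b)[-1])
    return a if last_a > last_b else b


def numeric_compare(a: str, b: str) -> str:
    """Calculation style: parse both strings as decimals, then compare."""
    return a if Decimal(a) > Decimal(b) else b


print(toy_tokenize("4.9"))             # ['4', '.', '9']
print(toy_tokenize("4.11"))            # ['4', '.', '11']
print(chunk_compare("4.11", "4.9"))    # 4.11  (11 beats 9 as a chunk)
print(numeric_compare("4.11", "4.9"))  # 4.9   (4.90 beats 4.11 as a value)
```

Same two inputs, two different answers: the only thing that changed is whether the digits after the dot were treated as a standalone number or as part of a decimal.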

2. The “Version Number” Confusion

AI models are trained on a huge share of the public internet's text. In the world of software development and documentation, dot-separated numbers don’t always follow decimal logic.

Consider Semantic Versioning (SemVer):

In software, Version 4.11 comes after Version 4.9.
In this context, each part is its own integer: minor version 11 is released after minor version 9.

Because the AI has “read” millions of lines of GitHub repositories and tech release notes, it has learned a conflicting rule: In a sequence of dot-separated numbers, the larger number on the right usually indicates a later (and thus “greater”) version.
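To see how the same labels order differently under the two rules, here is a small sketch. The helper `as_version` is an illustrative name, and it simplifies real version schemes (full SemVer has three numeric parts plus optional pre-release tags), so treat it as a demonstration rather than a version parser.

```python
from decimal import Decimal


def as_version(label: str) -> tuple[int, ...]:
    """Read a dotted label as a tuple of integers, e.g. '4.11' -> (4, 11)."""
    return tuple(int(part) for part in label.split("."))


labels = ["4.9", "4.11", "4.10", "4.2"]

# Version logic: each dot-separated part is compared as a whole integer.
print(sorted(labels, key=as_version))  # ['4.2', '4.9', '4.10', '4.11']

# Decimal logic: the digits after the dot are a fraction of one.
print(sorted(labels, key=Decimal))     # ['4.10', '4.11', '4.2', '4.9']
```

Both orderings are "correct" for their own domain, which is exactly why a model that has seen plenty of each can land on the wrong one when the question gives no context.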

3. Linguistic Reasoning vs. Numeric Computation

LLMs are, at their core, prediction engines. They predict the next most likely token in a sequence.

Linguistic Logic: “11” comes after “9.”
Mathematical Logic: $0.11 < 0.90$.

Unless the model is specifically prompted to use a “Chain of Thought” (explaining its steps) or to call an external tool (like a Python interpreter), it defaults to linguistic logic. It isn’t “thinking” about the value; it’s producing the most probable answer based on how humans write about numbers.
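The explicit steps that a step-by-step comparison walks through can be written out as code. This is a minimal sketch, assuming well-formed, non-negative decimal strings; `compare_step_by_step` is a hypothetical name, not something any model runs internally.

```python
def compare_step_by_step(a: str, b: str) -> str:
    """Compare two non-negative decimal strings the way you would on paper."""
    int_a, _, frac_a = a.partition(".")
    int_b, _, frac_b = b.partition(".")

    # Step 1: compare the whole-number parts first.
    if int(int_a) != int(int_b):
        return a if int(int_a) > int(int_b) else b

    # Step 2: pad the fractional parts to the same width (4.9 becomes 4.90).
    width = max(len(frac_a), len(frac_b))
    frac_a = frac_a.ljust(width, "0")
    frac_b = frac_b.ljust(width, "0")

    # Step 3: at equal width, the fractional digits compare like integers.
    if frac_a == frac_b:
        return "equal"
    return a if int(frac_a) > int(frac_b) else b


print(compare_step_by_step("4.11", "4.9"))  # 4.9
```

Step 2 is the one that pure pattern matching skips: once 9 is written as 90, comparing it with 11 gives the mathematically correct result.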

The Reality Check: Why This Matters

This isn’t just a funny quirk to post on social media; it highlights a critical boundary in AI implementation.

The “Calculator” Fallacy: Never assume an LLM is a calculator. If your workflow requires high-precision financial or engineering data, “raw” LLM outputs are a risk.
System Design: This is why modern AI assistants are increasingly integrated with Code Interpreters. When you ask a modern version of Gemini or ChatGPT a math question, you’ll often see it “working”—it’s actually writing and running a script to verify the math externally.
Prompt Engineering: You can often “fix” this by telling the AI: “Treat these as decimal values and compare them step-by-step.” This forces the model to move past simple pattern matching.
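The idea behind handing math to code can be sketched in a few lines. Everything here is an assumption made for illustration: `larger_of` is a hypothetical helper, the regular expression only handles plain non-negative decimals, and real assistants wire up their tools in their own ways.

```python
import re
from decimal import Decimal
from typing import Optional

# Matches plain non-negative numbers such as 4, 4.9, or 4.11.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def larger_of(question: str) -> Optional[str]:
    """Answer a two-number comparison with real arithmetic.

    Returns None when the question does not contain exactly two numbers,
    signalling that it should be handled some other way.
    """
    numbers = NUMBER.findall(question)
    if len(numbers) != 2:
        return None
    # Decimal parses the text as an exact base-10 value before comparing.
    return max(numbers, key=Decimal)


print(larger_of("Which is larger: 4.11 or 4.9?"))  # 4.9
print(larger_of("Write me a poem about decimals"))  # None
```

The comparison itself never passes through next-token prediction, so the answer does not depend on which pattern the model happens to favor.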

Conclusion

The “4.11 vs 4.9” glitch is a perfect reminder of what AI really is: A brilliant linguist, but a distracted mathematician.

It reminds us that while AI can mimic human reasoning, it doesn’t possess human intuition. It sees the world through the lens of text. So, the next time an AI gives you a confident but slightly “off” answer, remember: it’s not failing to count; it’s just following the patterns we’ve left behind in our data.

What’s your most surprising “AI math fail” moment? Let’s discuss in the comments.