This is a guest post by one of our most esteemed clients, Manuel Herranz, CEO of Pangeanic. We collect, classify, and supply data for AI training at NLP Consultancy, and working with data allows us to test models and understand what the market wants. Our close relationship with Pangeanic as a vendor has helped us learn and improve in our own journey.
– – – – – – – – – – – – – – – – – – – – – – – – – – – – –

The much-anticipated release of GPT-5 was supposed to mark a watershed moment in artificial intelligence: the long-predicted leap toward artificial general intelligence (AGI). Instead, as noted by AI expert Gary Marcus, it became something entirely different: a stark revelation that our current approach to AI development has hit a formidable wall. Within hours of its debut, researchers and users alike discovered baffling shortcomings: elementary math errors, counting failures, and absurd responses to simple riddles. Despite billions in investment and nearly three years of development, GPT-5 proved to be merely an incremental improvement rather than the revolutionary system many had promised.

This disappointment transcends mere technical failure; it represents a crisis of paradigm. The relentless pursuit of scaling (ever-larger models trained on ever-more data) has revealed itself as scientifically limited, economically unsustainable, and increasingly divorced from how actual intelligence works. As Marcus astutely observes in his New York Times article and on Substack, “Large language models, which power systems like GPT-5, are nothing more than souped-up statistical regurgitation machines, so they will continue to stumble into problems around truth, hallucinations and reasoning.”

Bigger is not better in AI; it’s just a bigger fish tank

Our own research journey has increasingly moved away from the bigger-is-better dogma toward a more nuanced, cognitively inspired approach to artificial intelligence.

Meanwhile, DeepSeek is delaying new releases, and Claude 4 seems to offer modest improvements over Claude 3.5 (well, it’s an improvement on itself!) but is not miles better than ChatGPT 3.5.

A fish jumps from one fish tank into another, but nothing changes; it’s just a bigger fish tank. The analogy: bigger LLMs are just bigger, but the situation for the fish is the same. We’re still in a fish tank.

Let’s explore why the scaling paradigm has failed to deliver on its promises, how insights from cognitive science can illuminate a better path forward, and why task-specific small language models—particularly in areas such as translation—represent not just an alternative approach, but arguably the only sustainable way forward for enterprise-ready AI.

The Scaling Fallacy: When Diminishing Returns Become Apparent

The doctrine of scaling has dominated AI research and investment for nearly a decade. Its premise is seductively simple: if we continue to increase model size, training data, and computational resources, we will inevitably progress toward human-level intelligence. This belief has fueled an arms race characterized by astronomical investments and increasingly grandiose claims.

However, as Marcus noted in his 2022 essay “Deep Learning Is Hitting a Wall,” so-called scaling laws aren’t physical laws of the universe like gravity, but hypotheses based on historical trends. The disappointing reality of GPT-5 merely confirms what skeptics have long suspected: there are fundamental limits to what scaling alone can achieve. Despite its massive training dataset and parameter count, GPT-5 continues to struggle with core challenges that have plagued earlier models, for example:

1. Brittle Reasoning and Logic: GPT-5, like its predecessors, remains fundamentally a pattern-matching system. It excels at identifying statistical relationships within its training data but lacks genuine understanding or causal reasoning. This explains its persistent failures in tasks requiring step-by-step logical deduction, problem-solving, or handling novel situations outside its exact training distribution. It can flawlessly write a sonnet in the style of Shakespeare but stumble on a basic physics problem a high schooler could solve. It beats humans at what spin doctors like to call “reasoning” because it is an impressive library of knowledge, math, code, literature, and essays (not much of it sourced ethically, by the way).

2. Persistent Hallucinations and Factual Inaccuracy: The larger the model, the more convincing its “confabulations” become. GPT-5 generates elaborate, often grammatically perfect, but entirely false information with unwavering confidence. This isn’t a bug; it’s a feature of its design. Without an internal model of reality or a mechanism for truth-checking beyond statistical correlation, every output sounds plausible, but it is a guess rather than a verified fact. For businesses relying on AI for critical information, this is a significant liability.

3. Data Gluttony and Environmental Impact: The sheer scale of data required to train these colossal models is staggering. GPT-5 likely consumed petabytes of text and code, a process that is not only economically prohibitive for most organizations but also carries a substantial environmental footprint. The energy consumption of training and running these models contributes significantly to carbon emissions, raising serious ethical questions about the sustainability of this “bigger is better” paradigm.

4. Lack of Explainability and Interpretability: As models grow in complexity, their internal workings become increasingly opaque. Anthropic has done some work toward explainability for Claude, but it remains limited and insufficient (see “Tracing the thoughts of a large language model”). The “black box” problem prevents us from understanding why a model makes a particular decision or arrives at a specific answer. For industries requiring transparency, accountability, and the ability to audit AI systems, such as healthcare, finance, or legal, this lack of explainability is a deal-breaker. Regulators are increasingly demanding transparency, and current LLMs simply cannot provide it.

The Scaling Fallacy rests on a fundamental misunderstanding of intelligence itself. We’ve been building larger and larger statistical engines, hoping they would spontaneously ignite into genuine reasoning. They haven’t. And they won’t. Scientists have marveled at “emergent capabilities” that appeared unexpectedly past certain depths of training and certain amounts of data. This shows that we simply don’t grasp, and can’t predict, what may happen at those depths of training or what unexpected patterns neural networks may find.

Cognitive Science: The Blueprint for True Intelligence

While the AI world was obsessed with scale, another field has spent decades meticulously unraveling the mysteries of the human mind: cognitive science. This discipline offers a wealth of insights that could provide the “wake-up call” the AI community desperately needs.

Plenty of spin doctors have fostered the confusion between chatbots powered by LLMs and AGI. Productivity can increase, but within controlled environments and with defined use cases.

Human intelligence is not about processing vast amounts of data; it’s about efficiently learning, reasoning, and adapting with surprisingly little of it (as we described earlier this year in our company post “Why abstract thinking is AI’s insurmountable wall”). We don’t need to read the entire internet to understand a new concept or solve a novel problem. Cognitive science highlights several key principles that are conspicuously absent in large language models. These are just a few concepts that come to mind, and they show how far we are from what Big Tech (or Bug Tech) is trying to sell:


• Compositionality: Humans can combine existing concepts in novel ways to understand new ones. We grasp “red car” even if we’ve only seen “red” and “car” separately, and we understand that our next-door neighbor is doing rather well because he’s just bought a Ferrari. LLMs struggle with this, often treating phrases as atomic units rather than compositions of meaning.

• Causal Reasoning: We understand cause and effect. We know that pushing a glass off a table causes it to break. LLMs can describe the correlation between pushing and breaking but don’t inherently understand the underlying physical laws. Advanced systems trained on audio will recognize the sound of a glass breaking; robotic systems will calculate how to place the glass on the table. But try getting them to reason that the large glass goes to grandad, the family patriarch who’s 80 today, and not to John and Mary, who don’t drink alcohol, while the kids get smaller glasses because they only drink soda (or healthier drinks like water!).

• Symbolic Representation: Humans use abstract symbols and rules to manipulate knowledge. Logic, mathematics, and language itself are symbolic systems. While LLMs process text, they don’t operate on underlying symbolic representations of meaning the way humans do. They’re very helpful tools for speeding up manual, repetitive tasks.

• Common Sense: We possess a vast, intuitive understanding of the world: what objects are for, how people behave, basic physics. This “common sense” is largely missing in LLMs, leading to absurd errors in seemingly simple scenarios.

• Learning with Less Data: Children learn language and complex concepts from limited exposure, driven by intrinsic curiosity and interaction with the world. LLMs, in contrast, are data-hungry monsters requiring astronomical datasets. An eight-year-old has processed so much visual, physical, and speech data from the world that you would need half a data center and trillions of data points to match it. Over the following 8-10 years, that same kid will acquire vast amounts of knowledge, nuances in speech, word games, innuendos, and world knowledge that would take massive computing to replicate. Instruct-tuned LLM chatbots will beat the kid, and the kid’s adult version, in sheer breadth of processing (more languages, humongous numbers of books processed), though perhaps not in cultural references. LLMs will be very useful recall machines that save time on basic facts.

Imagine an AI system designed not just to mimic surface-level linguistic patterns, but to build internal models of the world, reason about cause and effect, and learn new concepts hierarchically, much like a child does. This is the promise of cognitively inspired AI, a path that emphasizes depth of understanding over breadth of data. It will require a complete overhaul of current architectures, one in which the confusion between what we’re being sold (LLMs) and real AI is finally cleared up: we need real-time input data that modifies training parameters quickly.

Short-Term: The Future Is Small, Task-Specific Language Models for Enterprise Solutions

For enterprises, the GPT-5 debacle is not just a scientific curiosity; it’s a clear signal to pivot away from generic, “one-size-fits-all” large language models. The future of enterprise AI lies not in bigger, but in smarter, more efficient, and, critically, task-specific Small Language Models (SLMs).

Here’s why SLMs, especially those focused on specific domains like translation, are the only sustainable way forward:


• Efficiency and Cost-Effectiveness: Training and running SLMs requires significantly less computational power, data, and energy. This translates directly to lower operational costs, making advanced AI accessible to a much broader range of businesses, not just tech giants.

• Accuracy and Reliability: By focusing on a narrow domain, SLMs can achieve superior accuracy and reduce hallucinations within their specific task. For example, a translation SLM trained exclusively on legal documents will outperform a general LLM in translating legal jargon, simply because it has a deeper, more specialized understanding of that specific linguistic context (see the evaluation sketch after this list).

• Explainability and Control: Smaller models are inherently more interpretable. Businesses can better understand how they arrive at conclusions, facilitating auditing, compliance, and fine-tuning. This level of control is crucial for applications in regulated industries.

• Data Privacy and Security: Training and deploying SLMs often involve smaller, more controlled datasets, making it easier to manage data privacy and security concerns. For enterprises handling sensitive information, this is paramount.

• Customization and Specialization: Businesses don’t need a model that can write poetry and code and translate Sanskrit. They need a model that excels at their specific business needs. SLMs allow for deep customization, creating highly optimized tools for tasks like customer service, internal knowledge management, code generation for specific languages, or, as we specialize in, highly accurate domain-specific translation.
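How would the accuracy claim above be verified in practice? The standard move is to score each candidate model on a held-out, in-domain test set. Below is a minimal sketch in Python, assuming the open-source sacrebleu library is installed; the sentences are hypothetical placeholders, not real evaluation data:

    # Minimal sketch: BLEU-scoring a model's output on an in-domain test set.
    # Assumes: pip install sacrebleu. All sentences below are placeholders.
    import sacrebleu

    # Candidate translations produced by the model under test, one per segment.
    hypotheses = [
        "El presente acuerdo se regirá por la legislación española.",
        "Las partes acuerdan los siguientes términos.",
    ]

    # Human reference translations; sacrebleu expects a list of reference
    # streams, each aligned segment-by-segment with the hypotheses.
    references = [[
        "El presente acuerdo se regirá por la legislación española.",
        "Las partes acuerdan por la presente los siguientes términos.",
    ]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"In-domain BLEU: {bleu.score:.1f}")

Running the same script over the outputs of a domain SLM and a general LLM on the same legal test set makes the comparison concrete rather than anecdotal.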

Translation Technologies Started Everything: A Case Study for SLMs

Consider the field of machine translation. A general LLM might produce grammatically correct but contextually awkward or factually inaccurate translations, especially for specialized terminology. A dedicated translation Small Language Model (SLM), however, can be trained on massive parallel corpora, or on relatively small amounts of parallel text within a specific industry (e.g., medical, financial, technical); a fine-tuning sketch follows the list below. This focused training allows it to:


• Grasp nuanced terminology: It learns the specific meaning of terms within that domain, avoiding common translation errors.

• Understand stylistic conventions: It can accurately reproduce the formal tone of a legal document or the precise language of a technical manual.

• Deliver higher fidelity: The translations are not just fluent; they are accurate, reliable, and fit for purpose.

• Integrate seamlessly: SLMs can be more easily integrated into existing enterprise workflows and legacy systems.
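To make the case study concrete, here is a minimal sketch of such domain fine-tuning in Python, assuming the Hugging Face transformers and datasets libraries. The model name, hyperparameters, and the two toy legal-domain sentence pairs are illustrative assumptions, not a production recipe:

    # Minimal sketch: adapting a compact pretrained MT model to a legal domain.
    # Assumes: pip install transformers datasets sentencepiece sacremoses
    from datasets import Dataset
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    MODEL_NAME = "Helsinki-NLP/opus-mt-en-es"  # a compact open MT model

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    # Toy in-domain pairs; a real run would use thousands of aligned segments.
    pairs = [
        {"src": "The parties hereby agree to the following terms.",
         "tgt": "Las partes acuerdan por la presente los siguientes términos."},
        {"src": "This agreement shall be governed by Spanish law.",
         "tgt": "El presente acuerdo se regirá por la legislación española."},
    ]

    def preprocess(example):
        # Tokenize the source text and, via text_target, the reference labels.
        return tokenizer(example["src"], text_target=example["tgt"],
                         truncation=True, max_length=128)

    train_dataset = Dataset.from_list(pairs).map(
        preprocess, remove_columns=["src", "tgt"])

    training_args = Seq2SeqTrainingArguments(
        output_dir="legal-mt-slm",  # checkpoint directory (placeholder name)
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-5,
        logging_steps=1,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    trainer.save_model("legal-mt-slm")

The point of the sketch is scale: the entire loop fits on a single commodity GPU, which is precisely the efficiency and cost argument made above.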

The GPT-5 wake-up call wasn’t a failure; it was an opportunity. An opportunity to abandon a scientifically limited paradigm and embrace a future where AI is smart, not just big.

– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –

At NLPConsultancy.com, together with our partners, we are pioneering this shift. Our focus is on developing and implementing sophisticated, cognitively inspired SLMs that solve real-world enterprise challenges. We believe that by moving beyond the “bigger is better” fallacy, we can unlock the true potential of AI: creating intelligent systems that are not just powerful, but also reliable, sustainable, and genuinely useful.

What are your thoughts? Is bigger really better, or is it time for a more nuanced approach to AI? Share your views below!
