The Eloquence Penalty
Why your most expensive AI model might be your dumbest
Key Takeaways
- On 7.7% of benchmark problems, larger AI models underperform smaller ones by 28.4 percentage points. The mechanism: spontaneous verbosity that accumulates errors through overelaboration.
- The fix is almost absurdly simple: tell the model to be brief. Result: 26-point accuracy gain. Capability rankings reverse completely on mathematical reasoning and scientific knowledge tasks.
- AI is in its megapixel era. Vendors compete on parameter count (legible, marketable). The actual differentiator is constraint architecture (illegible, unsold).
- The Eloquence Penalty: the compounding cost of unconstrained AI output across enterprise workflows. Per-token costs fell more than 99% in two years. Enterprise AI bills rose 320% in the same period. The paradox is volume, not price.
- Constraint Intelligence - the discipline of designing AI output environments - is currently invisible, underpriced, and more valuable than the models themselves.
You are paying a premium for AI systems that overthink, over-elaborate, and over-explain their way into worse answers.
Not occasionally. Not on edge cases.
On 7.7% of benchmark problems - spanning mathematics, science, and reasoning - larger language models underperform smaller ones by 28.4 percentage points. The mechanism: spontaneous scale-dependent verbosity. Your most powerful model generates responses 59% longer than smaller alternatives. Not because the task demands it. Because the model cannot help itself.
In that unconstrained elaboration, errors accumulate. Context drifts. The model reasons its way past the correct answer and into an elaborate wrong one.
Think about that. The $20K/month enterprise plan might be producing measurably worse output than the $2K/month plan. Not from a capability deficit. From a constraint deficit. By design.
The Megapixel Era of AI
The camera industry spent a decade in a war nobody needed.
Every manufacturer competed on megapixel count. Marketing departments produced charts showing 12MP was better than 8MP was better than 5MP. Consumers paid premiums for higher numbers on a spec sheet. Camera reviews centred on resolution comparisons. The megapixel number became the proxy for quality in the public mind.
The problem: cramming more megapixels onto the same sensor size actually degraded image quality. More pixels per square millimetre meant each individual pixel received less light. Low-light capability suffered. Noise increased. The photographs from a 20MP phone camera were measurably worse than those from a 12MP camera with a larger sensor and better image processing.
The legible metric - the one you could put in a comparison chart - was pointing in the opposite direction from actual quality. The industry optimised for what was sellable rather than what was real.
The answer was not “fewer megapixels.” It was a different axis entirely. Sensor design. Image processing algorithms. Lens quality. Computational photography. The factors that determined whether you got a good photograph were invisible to the marketing comparison chart because they were harder to quantify and harder to sell.
The camera industry eventually grew past the megapixel war. It took a decade and required consumers to learn that the number on the box was not the number that mattered.
AI is in its megapixel era right now.
Vendors compete on parameter count. 7 billion. 70 billion. 405 billion. The marketing charts look compelling. The enterprise sales pitch is structured around capability tiers: the bigger the model, the higher the price, the better the output. The assumption is structural and unquestioned: more parameters equals more capability equals better results.
The research says otherwise.
What the Research Actually Found
A systematic evaluation of 31 models ranging from 0.5 billion to 405 billion parameters, tested across 1,485 problems spanning five datasets, reveals a pattern the scaling narrative cannot accommodate (Hakim, arXiv:2604.00025, 11 March 2026).
On the problems where larger models fail, the failure mode is consistent and identifiable: verbose implicit reasoning. The model elaborates without explicit step markers. It sets up context that is not needed. It hedges. It qualifies. It adds nuance that introduces ambiguity. It considers alternatives that lead away from the answer. Context accumulates. The reasoning chain grows longer. And in that growing length, errors compound.
The model is not reasoning more carefully. It is reasoning more loudly. And volume, in this case, correlates with inaccuracy.
The intervention experiments make this precise. Researchers tested seven models (three small, four large) under three conditions: control (unconstrained), brief (forced brevity), and direct (answer-only).
The results:
Under brevity constraints, large models reduced their token generation from a median of 197 tokens to 78 tokens - a 60.4% reduction. This manipulation of output length, coupled with dramatic accuracy improvements, confirms that the verbosity is causal, not correlational.
Accuracy improved by 26 percentage points under brevity constraints. Capability rankings reversed completely on mathematical reasoning and scientific knowledge tasks. Large models achieved 7.7 to 15.9 percentage point advantages over small models - direct inversions of the original gaps.
The model is not incapable. It is unconstrained. And unconstrained, it is measurably less intelligent. The superior latent capabilities are masked by the tendency toward over-elaboration under standard conditions.
Causal intervention experiments confirm: this reflects correctable prompt design rather than fundamental capability limitations. The intelligence was present throughout. It was buried under verbose reasoning that nobody asked for.
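The three conditions described above can be sketched as simple prompt wrappers. This is a minimal illustration, not the paper's setup: the instruction wording, the CONDITIONS table, and the helper names are all assumptions. The last line re-derives the 60.4% length reduction from the reported medians.

```python
# Hypothetical prompt wrappers for the three experimental conditions.
# The instruction wording is an assumption, not the paper's actual text.
CONDITIONS = {
    "control": "",  # unconstrained: the question is sent as-is
    "brief": "Answer briefly, in at most two sentences.",
    "direct": "Give only the final answer, with no explanation.",
}


def build_prompt(question: str, condition: str) -> str:
    """Prefix the question with the instruction for the given condition."""
    instruction = CONDITIONS[condition]
    return f"{instruction}\n\n{question}" if instruction else question


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction in output length, as a percentage."""
    return (before - after) / before * 100


# Median token counts reported for large models: 197 unconstrained, 78 under brevity.
print(round(percent_reduction(197, 78), 1))  # 60.4
```

Running the same question through all three wrappers, on the same model, is what isolates output length as the manipulated variable.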
The Ontological Shift
Here is the part that should concern anyone making AI investment decisions.
The current paradigm says capability is intrinsic to the model. More parameters means more capability. Full stop. This is what vendors sell. This is what procurement teams evaluate. This is what comparison charts rank. This is what enterprise budgets are built around.
The research points to something structurally different.
Capability is relational. It emerges from the interaction between model capacity and constraint environment. A 405-billion parameter model operating without constraints can be measurably less capable than a 7-billion parameter model operating within good constraints. The intelligence is not a fixed property of the system. It is a function of how the system is deployed.
This is not a minor correction to the scaling narrative. This is a different ontology. The location of intelligence shifts from “inside the model” to “in the architecture around the model.” The model is a necessary condition. It is not a sufficient one. And for a significant class of problems (7.7% in this study, likely higher in uncounted real-world tasks), the constraint architecture matters more than the model selection.
The implication for every founder-operator making AI spending decisions: you may be paying for the wrong thing. The model tier (and its associated cost) is the legible purchase. The constraint architecture (which costs nothing additional) is the invisible lever that determines whether that purchase produces value.
The Eloquence Penalty
Name the cost: The Eloquence Penalty.
Every time your AI system over-elaborates, hedges, adds context nobody requested, and expands a three-sentence answer into three paragraphs, you pay the Eloquence Penalty. The cost compounds across four dimensions simultaneously:
Decision quality: Verbose AI output obscures the core recommendation. Decision-makers must extract the signal from noise. The extraction takes time and introduces interpretation error. Short, direct output produces faster, cleaner decisions.
Workflow speed: Every additional token in an AI response adds latency. In agentic workflows where outputs feed into subsequent steps, verbose responses at each stage compound into dramatically slower end-to-end execution.
Token economics: Gartner reports that agentic AI models require 5 to 30 times more tokens per task than standard approaches (Gartner, March 2026). Enterprise AI budgets grew from $1.2 million in 2024 to $7 million in 2026. Here is the paradox that should stop any finance team in its tracks: per-token costs fell more than 99% in two years - a 280-fold reduction in price per token - yet enterprise AI bills rose 320% in the same period. Falling unit prices, exploding total spend. The driver is volume. Unconstrained agentic workflows trigger 10 to 20 model calls per task. Verbose responses inflate every downstream context window. The bill is not a pricing problem. It is a constraint problem.
Compounding capability loss: On the 7.7% of tasks where verbosity actively degrades accuracy, the Eloquence Penalty is not just cost - it is worse outcomes. The model is producing wrong answers at premium prices.
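The token economics point reduces to one identity: total spend equals price per token times tokens consumed. A rough back-of-envelope sketch follows, assuming "rose 320%" means the bill reached 4.2 times its starting level and that the 280-fold price drop applies uniformly. Both readings are assumptions, and the function name is illustrative.

```python
def implied_volume_multiplier(price_reduction_fold: float, bill_increase_pct: float) -> float:
    """How much token volume must have grown, given spend = price x volume.

    price_reduction_fold: how many times cheaper each token became (e.g. 280).
    bill_increase_pct: percentage increase in total spend (e.g. 320 means 4.2x).
    """
    bill_multiplier = 1 + bill_increase_pct / 100
    # If price fell by a factor of F and the bill still grew by a factor of B,
    # volume must have grown by B * F.
    return bill_multiplier * price_reduction_fold


# Figures quoted above: 280-fold price reduction, bills up 320%.
print(round(implied_volume_multiplier(280, 320)))  # 1176
```

Under those assumptions, token volume grew on the order of a thousandfold. That is why the bill reads as a constraint problem rather than a pricing problem.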
The Eloquence Penalty is not theoretical. It is the gap between what your AI could produce and what it actually produces because nobody told it to shut up.
Only 28% of AI infrastructure projects fully deliver returns (Gartner, April 2026). The Eloquence Penalty is not the only cause. But in every organisation running unconstrained AI systems, it is a contributing factor that nobody is measuring - because it requires comparing constrained output to unconstrained output, and almost nobody is running that comparison.
Constraint Intelligence: The Discipline That Does Not Exist Yet
There is a design discipline hiding in plain sight.
Not prompt engineering - which sounds like tweaking, like adjusting a dial. Not “best practices” - which sounds optional, like a suggestion rather than architecture. A discipline. A design layer as important as model selection that determines whether your AI investment produces value or produces verbose waste.
Name it: Constraint Intelligence.
Constraint Intelligence is the systematic design of output environments for AI systems. It encompasses:
- Length constraints: How many tokens should this response contain?
- Format constraints: What structure should the output take?
- Reasoning constraints: Should the model show its work, or produce only conclusions?
- Scope constraints: What is the boundary of this response?
- Role constraints: What persona or authority level should the model adopt?
Each of these is a design decision. Each affects output quality. In combination, they determine whether the model’s latent capability manifests as value or dissipates as eloquence.
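One way to make these five decisions explicit is to hold them in a single structure and render the system prompt from it. The sketch below is illustrative only: the OutputConstraints class, its field names, and the example values are hypothetical, not an established API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputConstraints:
    """Hypothetical container for the five constraint types listed above."""
    max_words: int        # length constraint
    output_format: str    # format constraint
    show_reasoning: bool  # reasoning constraint
    scope: str            # scope constraint
    role: str             # role constraint

    def to_system_prompt(self) -> str:
        """Render the constraints as one instruction per line."""
        reasoning = (
            "Show your reasoning briefly."
            if self.show_reasoning
            else "Give conclusions only; do not show reasoning."
        )
        return "\n".join([
            f"You are {self.role}.",
            f"Stay within this scope: {self.scope}.",
            f"Format: {self.output_format}.",
            f"Respond in at most {self.max_words} words.",
            reasoning,
        ])


# Example values are placeholders, not recommendations.
contract_review = OutputConstraints(
    max_words=100,
    output_format="three bullet points",
    show_reasoning=False,
    scope="termination and liability clauses",
    role="a contracts analyst",
)
print(contract_review.to_system_prompt())
```

Keeping the constraints in one object means a missing decision shows up as a missing field, not as a silent default.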
The model is a commodity. You can subscribe to any of them. The constraint architecture - how you shape what the model produces - is the competitive advantage. And right now, almost nobody is investing in it.
Why? Because parameter count is legible. It fits in a chart. It appears in vendor comparisons. It justifies price tiers. Constraint architecture is illegible. You cannot sell “our model works better when you tell it to be concise.” There is no line item on the enterprise contract for “output environment design.”
This will change. But the founders who build constraint architecture now - while the discipline is unnamed and the advantage unclaimed - will compound from it for years before it becomes standard practice.
The GP Vulnerability
For a founder running a 5 to 25 person operation, the Eloquence Penalty hits differently than in an enterprise.
In a large organisation, verbose AI output gets filtered through layers of human review. Someone reads the 500-word response, extracts the 50 words that matter, and passes those forward. The penalty exists but is partially absorbed by organisational redundancy.
In a GP-scale company, AI output often feeds directly into decisions. The founder reads the AI analysis and acts. There is no middle layer to filter verbosity. No editing team to extract signal. The Eloquence Penalty arrives unmediated at the decision point.
Consider what this looks like in practice. A founder asks an AI system to analyse a contract risk. The unconstrained response produces four paragraphs of hedged commentary, two alternative interpretations, and a conclusion buried in the final line. The founder reads it, draws their own inference, and acts. That inference may be accurate. It may not be. The signal was there - it was just surrounded by noise the model generated without permission. In an enterprise, a lawyer reads the same output, extracts the risk clause, and passes one sentence to the decision-maker. The GP founder is both analyst and decision-maker. The extraction cost lands entirely on them.
This means:
- Your AI costs are higher per unit of useful output (more tokens, same value)
- Your decision quality is more sensitive to output verbosity (no filtering layer)
- Your workflow speed is directly affected (no team to parallelise the extraction)
- Your compounding error risk is higher (fewer checkpoints between AI output and action)
The diagnostic question for every GP-scale founder:
When was the last time you evaluated your AI systems by how they constrain output rather than how much capability they have?
If the answer is never, you are likely paying the Eloquence Penalty across every AI-assisted workflow in your organisation. The cost is invisible because you have no baseline to compare against. You have never experienced your AI systems operating under good constraints.
Countermeasures: Building Constraint Architecture
The fix is not expensive. It requires design discipline, not budget.
Countermeasure 1: The Brevity Instruction
The paper demonstrates that a simple “be brief” instruction improves accuracy by 26 percentage points on tasks where large models fail. This is the minimum viable constraint.
In practice: every system prompt, every workflow instruction, every agent configuration should include explicit length and format constraints. “Respond in under 100 words.” “Provide the conclusion first, reasoning only if asked.” “Answer in bullet points, maximum 5.”
Start by auditing your three most-used AI prompts. Do any of them specify a length limit? If not, you have no constraint architecture - you have a blank canvas. A blank canvas produces whatever the model finds comfortable, and large models find elaboration comfortable. The brevity instruction is not a nicety. It is the first structural decision in building constraint architecture.
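The audit can be roughed out in a few lines. The sketch below flags prompts with no detectable length limit. The regular expression is a heuristic of my own, not an exhaustive detector, and the prompt names are made up for illustration.

```python
import re

# Heuristic patterns suggesting an explicit length limit (an assumption, not exhaustive).
LENGTH_LIMIT = re.compile(
    r"\b(?:under|at most|no more than|fewer than|maximum|max)\s+\d+\s+"
    r"(?:words?|sentences?|bullets?|tokens?|lines?)\b"
    r"|\b(?:maximum|max)\s+\d+\b",
    re.IGNORECASE,
)


def has_length_limit(prompt: str) -> bool:
    """True if the prompt appears to state an explicit length limit."""
    return LENGTH_LIMIT.search(prompt) is not None


def audit(prompts: dict[str, str]) -> list[str]:
    """Return the names of prompts with no detectable length limit."""
    return [name for name, text in prompts.items() if not has_length_limit(text)]


# Hypothetical prompts standing in for your three most-used ones.
my_prompts = {
    "contract_risk": "Analyse this contract for risk.",
    "weekly_summary": "Summarise the week. Respond in under 100 words.",
    "vendor_compare": "Compare these vendors. Answer in bullet points, maximum 5.",
}
print(audit(my_prompts))  # ['contract_risk']
```

Anything the audit returns is a blank canvas in the sense described above.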
Countermeasure 2: Output Format Specification
Beyond length, specify structure. “Respond with: Recommendation (1 sentence), Reasoning (3 bullets maximum), Risk (1 sentence).” This prevents the model from elaborating freely and forces compression of reasoning into the most essential points.
The before and after is instructive. An unconstrained analysis prompt produces paragraphs. The same prompt with a structured format produces: one recommendation, three reasons, one risk. Both contain the same essential information. The second takes thirty seconds to act on. The first takes three minutes and introduces interpretation variance that the structured version eliminates entirely. Format specification is not formatting preference. It is decision architecture.
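A format specification is only useful if responses are checked against it. Here is a minimal sketch of such a check, assuming the model is asked to label its output with "Recommendation:", "- " bullets for reasoning, and "Risk:". That labelling convention and the check_format helper are assumptions for illustration.

```python
def check_format(response: str, max_bullets: int = 3) -> list[str]:
    """Return a list of format violations; an empty list means the response conforms."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    problems = []
    if not lines or not lines[0].startswith("Recommendation:"):
        problems.append("missing Recommendation line")
    if not any(ln.startswith("Risk:") for ln in lines):
        problems.append("missing Risk line")
    bullets = [ln for ln in lines if ln.startswith("- ")]
    if not bullets:
        problems.append("missing Reasoning bullets")
    elif len(bullets) > max_bullets:
        problems.append(f"too many Reasoning bullets: {len(bullets)}")
    return problems


# Illustrative response text, written by hand rather than produced by a model.
structured = """Recommendation: Sign after amending clause 7.
Reasoning:
- Liability cap is below contract value.
- Termination notice is 90 days.
Risk: Counterparty may reject the amendment."""

print(check_format(structured))                      # []
print(check_format("It depends on several factors."))
```

A response that fails the check can be rejected or re-requested before it reaches the decision point.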
Countermeasure 3: Reasoning-on-Demand
The default for most AI systems is to show reasoning. But the research shows that verbose implicit reasoning is the failure mechanism. Design workflows where the default output is the conclusion only. Reasoning is available on request (“explain your reasoning” as a follow-up) rather than included by default.
This is a workflow design decision, not a model decision. Most founders have never made it explicitly because the AI default - show your work - feels thorough. It is not thorough. It is the exact failure mode the research identifies. Thorough reasoning that the reader cannot verify and did not request is indistinguishable from noise at the point of decision. Separate the conclusion from the explanation. Default to the conclusion. Request the explanation when you need it - not as a reflex.
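As a workflow, this separation might look like the sketch below: one call for the conclusion, and a second call for the explanation only when someone asks. The ModelFn type stands in for whatever client you use. It and the ReasoningOnDemand class are hypothetical, and the instruction wording is an assumption.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt string, returns the response text.
ModelFn = Callable[[str], str]


class ReasoningOnDemand:
    """Ask for the conclusion by default; fetch the explanation only on request."""

    def __init__(self, model: ModelFn) -> None:
        self.model = model
        self.calls = 0  # number of model calls made, for cost visibility

    def _ask(self, prompt: str) -> str:
        self.calls += 1
        return self.model(prompt)

    def conclude(self, question: str) -> str:
        """Default path: conclusion only."""
        return self._ask(f"Give only the conclusion, in one sentence.\n\n{question}")

    def explain(self, question: str, conclusion: str) -> str:
        """Optional path: reasoning for a conclusion already given."""
        return self._ask(
            "Explain the reasoning behind this conclusion in at most three bullets.\n\n"
            f"Question: {question}\nConclusion: {conclusion}"
        )


# Stub model for demonstration; replace with a real client call.
def echo_model(prompt: str) -> str:
    return f"[response to: {prompt.splitlines()[0]}]"


workflow = ReasoningOnDemand(echo_model)
answer = workflow.conclude("Should we renew the vendor contract?")
print(answer)
print(workflow.calls)  # 1
```

The explanation costs a second call, but only on the occasions it is actually needed.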
Countermeasure 4: Comparative Testing
Run the same prompts with and without brevity constraints. Compare accuracy. Compare decision quality. Compare workflow speed. The Eloquence Penalty is invisible until you measure it. Measurement makes it undeniable.
The test takes twenty minutes. Take your three most-used AI workflows. Run each prompt twice: once unconstrained, once with explicit length and format constraints. Score the outputs against the same criteria - accuracy, actionability, time to act. The gap will be visible. In most cases, the constrained output wins on all three dimensions. Once you have seen the gap, you cannot unsee it. This is not a theoretical exercise. It is a calibration session that permanently changes how you configure AI systems.
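The twenty-minute test can be scripted. The harness below is a sketch under stated assumptions: model and score are placeholders you supply, the constraint wording is illustrative, and the stub functions exist only so the example runs. Their outputs are not real results.

```python
from typing import Callable

DEFAULT_CONSTRAINT = "Respond in under 100 words. Conclusion first."


def compare(
    prompts: list[str],
    model: Callable[[str], str],
    score: Callable[[str, str], float],
    constraint: str = DEFAULT_CONSTRAINT,
) -> list[dict]:
    """Run each prompt unconstrained and constrained; record score and length for both."""
    rows = []
    for prompt in prompts:
        free = model(prompt)
        constrained = model(f"{constraint}\n\n{prompt}")
        rows.append({
            "prompt": prompt,
            "unconstrained_score": score(prompt, free),
            "constrained_score": score(prompt, constrained),
            "unconstrained_words": len(free.split()),
            "constrained_words": len(constrained.split()),
        })
    return rows


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call: verbose unless given the constraint.
    if prompt.startswith("Respond in under"):
        return "Renew the contract."
    return "There are many considerations here. " * 10 + "Renew the contract."


def fake_score(prompt: str, output: str) -> float:
    # Stand-in scorer; replace with your own accuracy or actionability criteria.
    return 1.0 if "Renew the contract." in output else 0.0


for row in compare(["Should we renew the vendor contract?"], fake_model, fake_score):
    print(row)
```

Swap in your real client and scoring rubric, keep the prompts fixed, and the only thing that varies between the two runs is the constraint.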
Countermeasure 5: Model-Appropriate Constraints
Different models have different verbosity profiles. The research tested 31 models and found that larger models are systematically more verbose. This means larger models need stronger constraints. The $20K/month plan needs tighter output specifications than the $2K/month plan - counterintuitive, but empirically demonstrated.
The practical implication: your constraint architecture should be calibrated per-model, not applied uniformly. A system prompt that produces clean output from a smaller model may produce over-elaborated output from a larger one. When you upgrade to a higher-tier plan, your constraints need upgrading too. The default assumption - larger model, better output, job done - is precisely backwards for the class of tasks where verbosity is the failure mode. Stronger model, stronger constraints. This is the counterintuitive discipline that separates constraint-aware deployment from naive scaling.
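Per-model calibration can be as simple as a lookup table keyed by model size, plus a check that limits never loosen as models get larger. In the sketch below, the sizes echo the parameter counts mentioned earlier. The word limits are placeholders to be tuned by your own comparative testing, not values from the research.

```python
# Parameter count in billions -> word limit. The limits are hypothetical placeholders.
WORD_LIMITS_BY_PARAMS_B = {7: 150, 70: 100, 405: 60}


def constraint_for(params_b: int) -> str:
    """Build the length instruction for a model of the given size."""
    limit = WORD_LIMITS_BY_PARAMS_B[params_b]
    return f"Respond in at most {limit} words. Conclusion first."


def limits_tighten_with_size(limits: dict[int, int]) -> bool:
    """True if larger models never get a looser word limit than smaller ones."""
    ordered = [limits[size] for size in sorted(limits)]
    return all(a >= b for a, b in zip(ordered, ordered[1:]))


print(constraint_for(405))
print(limits_tighten_with_size(WORD_LIMITS_BY_PARAMS_B))  # True
```

Running the monotonicity check whenever the table changes catches the easy mistake of upgrading the model while leaving its constraints at the old setting.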
The Pattern Underneath
I recognise this pattern because I have lived it in a different domain.
In 2008 and again in 2011, I faced paralysis. Every resource was constrained. Every movement required design. Every word carried weight because energy was finite. I could not afford eloquence. I could not afford elaboration. Every action had to produce maximum value from minimum expenditure.
The frameworks that worked under those conditions were not the most elaborate. They were the most constrained. The most precise. The ones that did maximum work with minimum resources. Constraint was not a limitation on capability. It was the architecture that revealed capability. The approaches that seemed “more” - more complex, more thorough, more comprehensive - consumed energy without proportional returns. The approaches that seemed “less” - simpler, shorter, more direct - consistently outproduced them.
The same structural logic applies here. The model that produces forty tokens of precise output is demonstrating more capability than the model that produces four hundred tokens of hedging elaboration. The constraint does not limit the intelligence. It reveals it. Without constraint, intelligence dissipates into verbosity. With constraint, it concentrates into value.
More is not more. Designed constraint is more.
Conclusions
- The Eloquence Penalty is real and measurable. 28.4 percentage points of accuracy loss on 7.7% of tasks, caused by unconstrained verbosity. The fix is free. The cost of not fixing is compounding across every workflow.
- Capability is relational, not intrinsic. Intelligence lives in the interaction between model capacity and constraint environment. The model is necessary but not sufficient. The constraint architecture determines whether capability manifests as value.
- AI is in its megapixel era. Vendors sell the legible metric (parameters). The actual differentiator is the illegible architecture (constraints). This gap between what is sold and what matters is where the founder-operator’s advantage lives.
- Constraint Intelligence is an unnamed discipline with more leverage than model selection for a significant class of tasks. The founders who build it now compound from it before it becomes standard practice.
- The GP is asymmetrically exposed. No filtering layer between AI output and decisions. Higher sensitivity to verbosity. Direct cost impact. The Eloquence Penalty hits founder-operators harder because there is nothing absorbing the hit.
- The $0 fix exists. Twenty-six percentage points of accuracy improvement from a brevity instruction. No vendor upgrade required. No procurement cycle. Just the discipline to constrain what you are already paying for.
One Diagnostic
The next time you receive a response from any AI system you use in your business, ask:
Could this response have been half as long without losing the essential information?
If yes, you are paying the Eloquence Penalty. The model produced double the tokens required. You paid for every one. And somewhere in the unnecessary half, accuracy may have degraded without your knowing.
The megapixel era ends when enough people realise the lens matters more than the pixel count.
It just got easier to realise.