In the rush to adopt AI, most teams reach for large language models (LLMs) like GPT-4 or Claude as the default solution. It makes sense: LLMs are flexible, powerful, and broadly capable. But this default carries quiet costs: high latency, opaque decision-making, spiraling API bills, and dependence on external cloud infrastructure.
The truth is, most AI agents don't need to reason broadly. They need to operate consistently within narrow, well-defined tasks: think classification, information extraction, form filling, UI navigation, and customer support triage.
This is where small language models (SLMs) shine.
A recent NVIDIA Research paper, "Small Language Models are the Future of Agentic AI," backs this up, arguing that SLMs (≤10B parameters) are capable enough for most agentic workloads, far cheaper to serve, and practical to run on-device.
So why aren’t they the default yet?
The paper outlines a clear heuristic: match model size to task ambiguity.
Examples: narrow, repetitive work (intent classification, field extraction, ticket triage) maps cleanly onto a fine-tuned SLM, while open-ended reasoning, creative generation, and genuinely ambiguous requests still warrant an LLM.
In real-world terms, 80-90% of queries hitting production agents fall into the first category. Yet we often handle them using a commercial LLM that costs $10-$30 per million tokens, when a fine-tuned open-source SLM could deliver comparable results for as little as $0.001-$0.005 per million tokens.
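To make that arithmetic concrete, here is a minimal sketch comparing the two price bands above at a hypothetical monthly volume of 100M tokens; the volume is an assumption for illustration, not a benchmark.

```python
# Hypothetical monthly volume for a production agent; adjust to your own traffic.
MONTHLY_TOKENS = 100_000_000  # 100M tokens per month (assumption for illustration)

# Price bands quoted above, in USD per million tokens.
LLM_PRICE_RANGE = (10.0, 30.0)     # commercial LLM API
SLM_PRICE_RANGE = (0.001, 0.005)   # self-hosted fine-tuned SLM

def monthly_cost(price_per_million_tokens: float, tokens: int) -> float:
    """Monthly cost in USD at a given per-million-token price."""
    return price_per_million_tokens * tokens / 1_000_000

for label, (low, high) in [("LLM", LLM_PRICE_RANGE), ("SLM", SLM_PRICE_RANGE)]:
    print(f"{label}: ${monthly_cost(low, MONTHLY_TOKENS):,.2f} to "
          f"${monthly_cost(high, MONTHLY_TOKENS):,.2f} per month")
# LLM: $1,000.00 to $3,000.00 per month
# SLM: $0.10 to $0.50 per month
```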
Add the ability to run SLMs on-device, with sub-500ms latency and zero data egress, and the architectural shift becomes obvious.
Apple's Foundation Models prove this at scale. Apple Intelligence pairs a roughly 3B-parameter on-device model for everyday tasks with a larger server-based model, running on Private Cloud Compute, for requests that need more capability.
This mirrors the paper's framework exactly: an SLM-first architecture with an LLM safety net.
Their rationale? Privacy, latency, and efficiency: handle routine requests on the device and escalate to the server only when necessary.
In parallel, the paper's authors propose an "LLM-to-SLM conversion algorithm": collect and curate logs of the calls your deployed LLM agent actually handles, cluster them into recurring task types, fine-tune candidate SLMs on the high-volume clusters, and iterate as traffic evolves.
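A rough sketch of the task-clustering step might look like the following; the embedding model, cluster count, and sample prompts are illustrative assumptions rather than anything prescribed by the paper.

```python
# Sketch: cluster logged agent requests to surface recurring task types.
# The log format, embedding model, and cluster count are illustrative assumptions.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

logged_prompts = [
    "Extract the invoice number and total from this email ...",
    "Classify this support ticket as billing, bug, or feature request ...",
    "Fill in the shipping form fields from this order summary ...",
    # ... in practice, thousands of prompts pulled from production logs
]

# Embed each prompt so semantically similar requests land near each other.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(logged_prompts)

# Cluster into candidate task types; k is a knob to tune, not a magic number.
k = 3
labels = KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(embeddings)

# The biggest clusters are the strongest candidates for a fine-tuned SLM.
for cluster_id, count in Counter(labels).most_common():
    print(f"cluster {cluster_id}: {count} requests")
```

Each large cluster then becomes a candidate for its own specialized, fine-tuned SLM, with everything else left to the LLM.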
Suggestions for Implementation
If you're building or running agentic AI systems today, consider: auditing your agent traffic to find the high-volume, repetitive calls; fine-tuning an open-source SLM on those tasks; adding routing logic that escalates ambiguous or novel requests to an LLM; and measuring cost, accuracy, and latency before and after the switch. A sketch of such a router follows.
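The routing logic can stay very small. Below is a minimal sketch, assuming hypothetical call_slm and call_llm client functions, a confidence score reported by the SLM service, and placeholder task names and threshold; none of these come from the paper or any specific vendor API.

```python
# Sketch of SLM-first routing with an LLM fallback.
# call_slm / call_llm are hypothetical stand-ins for your own model endpoints.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # placeholder; tune against a labeled evaluation set
NARROW_TASKS = {"classify_ticket", "extract_fields", "fill_form"}  # illustrative names

@dataclass
class SlmResult:
    text: str
    confidence: float  # assumes your SLM service reports a usable confidence score

def call_slm(task: str, prompt: str) -> SlmResult:
    # Placeholder: replace with a request to your fine-tuned SLM endpoint.
    return SlmResult(text=f"[SLM answer for {task}]", confidence=0.9)

def call_llm(task: str, prompt: str) -> str:
    # Placeholder: replace with a request to your commercial LLM provider.
    return f"[LLM answer for {task}]"

def route(task: str, prompt: str) -> str:
    """Send narrow, well-defined tasks to the SLM; escalate everything else."""
    if task in NARROW_TASKS:
        result = call_slm(task, prompt)
        if result.confidence >= CONFIDENCE_THRESHOLD:
            return result.text
    # Unknown task, ambiguous request, or low-confidence SLM answer: fall back.
    return call_llm(task, prompt)

print(route("classify_ticket", "I was charged twice for the same invoice."))
```

The key design choice is that the LLM is the exception path: anything the SLM either doesn't recognize or answers with low confidence gets escalated.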
These aren’t theoretical improvements. In practice, teams see 50-90% cost reduction while maintaining accuracy and improving latency.
SLMs aren't just a cost-saving trick. They enable better system design: predictable, low latency; on-device deployment with zero data egress; and small, specialized components that can be fine-tuned, audited, and swapped independently.
There are trade-offs. SLMs require investment in task-specific fine-tuning. You need routing logic to manage fallbacks. And you still need LLMs for creative or ambiguous work.
But for any team building AI agents that serve repeatable functions, the ROI math is now clear: a one-time investment in fine-tuning and routing, repaid quickly by per-token costs that are orders of magnitude lower.
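As a back-of-the-envelope check, the sketch below estimates the break-even point; the one-time fine-tuning cost is a hypothetical figure, while the per-token prices reuse the ranges quoted earlier.

```python
# Back-of-the-envelope break-even estimate for moving a repeatable task to an SLM.
# The fine-tuning figure is a hypothetical assumption; prices reuse the ranges above.
FINE_TUNING_COST = 1_000.0   # USD, one-time (illustrative)
LLM_PRICE_PER_M = 10.0       # USD per million tokens (low end of the quoted LLM range)
SLM_PRICE_PER_M = 0.005      # USD per million tokens (high end of the quoted SLM range)

savings_per_million = LLM_PRICE_PER_M - SLM_PRICE_PER_M
breakeven_million_tokens = FINE_TUNING_COST / savings_per_million

print(f"Break-even after roughly {breakeven_million_tokens:.0f}M tokens, "
      f"saving ${savings_per_million:.3f} per million tokens thereafter.")
# Break-even after roughly 100M tokens, saving $9.995 per million tokens thereafter.
```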
Don’t just scale AI up. Scale it smart.
Prompt for reflection:
What’s one agent in your system that you’re currently running on GPT-4 that could be just as effective (and 90% cheaper) on a fine-tuned SLM?
Further reading: "Small Language Models are the Future of Agentic AI" (NVIDIA Research, 2025), and Apple's Machine Learning Research post introducing its on-device and server foundation models.