Why Loading All 97 Skills Into Every Prompt Doesn't Scale

AI Agents May 13, 2026

I run an AI agent with 97 installed skills. Every time I send a message — even something trivial like "list my running containers" — the agent's system prompt includes the name and description of all 97 skills. That's roughly 1,700 tokens of static payload, copied into every single inference call, regardless of whether the task involves blog publishing or playing Pokemon.

This is the kind of architectural decision that seems fine at 10 skills and becomes increasingly absurd as the library grows. I want to explain why, what the actual costs are, and what a better approach looks like — because this isn't just my agent's problem. It's a pattern I see across almost every agent framework I've worked with.

How It Works Today

The current design in most agent systems, including mine, is straightforward:

  1. At session start, scan a skills directory for all installed modules
  2. Parse each module's metadata (name, description, category)
  3. Inject the full index into the system prompt as an <available_skills> block
  4. The full skill body is loaded on demand via a separate tool call

Step 4 is fine — lazy-loading the actual content makes sense. The problem is step 3. That index is injected into every turn, unconditionally. There's no filtering by relevance, no prioritization by usage frequency, no mechanism for the model to say "I don't need any of these right now."
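The first three steps can be sketched as follows. This is a minimal illustration, not my agent's actual code: the `skill.json` file layout, the `SkillMeta` fields, and the function names are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillMeta:
    name: str
    description: str
    category: str


def scan_skills(skills_dir: Path) -> list[SkillMeta]:
    """Steps 1-2: find every installed skill and parse its metadata."""
    skills = []
    # Hypothetical layout: one folder per skill, each with a skill.json file.
    for meta_path in sorted(skills_dir.glob("*/skill.json")):
        raw = json.loads(meta_path.read_text(encoding="utf-8"))
        skills.append(
            SkillMeta(raw["name"], raw["description"], raw.get("category", "general"))
        )
    return skills


def render_index(skills: list[SkillMeta]) -> str:
    """Step 3: render the full index with no relevance filtering."""
    lines = [f"- {s.name} ({s.category}): {s.description}" for s in skills]
    return "<available_skills>\n" + "\n".join(lines) + "\n</available_skills>"


def build_system_prompt(base_prompt: str, skills_dir: Path) -> str:
    # The whole index is appended unconditionally, whatever the user asked for.
    return base_prompt + "\n\n" + render_index(scan_skills(skills_dir))
```

Nothing in `build_system_prompt` looks at the user's message, which is exactly the problem: the output is identical for "publish this post" and "list my running containers".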

The Token Math

Let's make this concrete with real numbers from my setup:

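  • Skill index per call: ~1,700 tokens
  • Typical session length: 60 turns
  • Total index tokens sent per session: ~102,000 (1,700 × 60)
  • Context window: 128K tokens
  • Context occupied by the skill index: ~1.3% of the window on every call; cumulatively, the index tokens sent in one session add up to about 80% of a full window

That last point needs care. The index does not pile up inside the window: the system prompt is sent once per call, so it occupies about 1,700 of the 128K tokens at any given moment. The 102K figure is cumulative input: tokens that are billed and processed again on every turn (less whatever prompt caching recovers) just to re-read the same 97 skill descriptions that haven't changed since session start. And in a long session where the window is already nearly full, those 1,700 tokens are space that can't hold conversation history, so compression or truncation kicks in that much sooner.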

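And this is with 97 skills. Agent libraries grow organically: you automate a workflow, save it as a skill, add another, then another. At 200 skills, assuming the same average of about 17.5 tokens per entry, the index would grow to roughly 3,500 tokens per call, or about 210,000 cumulative input tokens over a 60-turn session. That is more than one and a half full 128K context windows spent on skill descriptions alone.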
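A short sketch of the arithmetic, keeping two quantities apart: the share of the context window the index occupies on a single call, and the cumulative input tokens it adds up to over a session. The function name and the assumption that tokens per skill stay constant as the library grows are mine.

```python
INDEX_TOKENS_PER_CALL = 1_700  # measured size of the index today
SKILL_COUNT = 97
TURNS_PER_SESSION = 60
CONTEXT_WINDOW = 128_000


def index_cost(skill_count: int, tokens_per_skill: float, turns: int, window: int) -> dict:
    """Estimate the per-call and per-session cost of a static skill index."""
    per_call = skill_count * tokens_per_skill
    return {
        "per_call": round(per_call),             # tokens in every system prompt
        "share_of_window": per_call / window,    # fraction of context held per call
        "cumulative": round(per_call * turns),   # input tokens re-sent over a session
    }


# Assumption: the average entry size stays the same as the library grows.
tokens_per_skill = INDEX_TOKENS_PER_CALL / SKILL_COUNT

today = index_cost(SKILL_COUNT, tokens_per_skill, TURNS_PER_SESSION, CONTEXT_WINDOW)
at_200 = index_cost(200, tokens_per_skill, TURNS_PER_SESSION, CONTEXT_WINDOW)

for label, cost in (("97 skills", today), ("200 skills", at_200)):
    print(
        f"{label}: {cost['per_call']} tokens/call, "
        f"{cost['share_of_window']:.1%} of window, "
        f"{cost['cumulative']} tokens/session"
    )
```

The per-call share looks small, which is why the design seems harmless at first. The cumulative column is where the cost shows up.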

The MCP Parallel

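What made this click for me was studying how parts of the MCP (Model Context Protocol) ecosystem handle the same problem. The base protocol lets a client list every tool a server offers, and many clients do load all of them upfront, but the setups that cope with large tool catalogs don't dump every available tool into the prompt. Instead, they use a search-and-inject pattern: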

  1. The agent sends a query describing what it needs
  2. The MCP server searches its index
  3. Only relevant results are injected into the current context

This is fundamentally different from the "inject everything, hope the model ignores what it doesn't need" approach. The model is good at using information it has, but it's not great at ignoring information it doesn't need. Every irrelevant skill description is a small but real distraction — something the attention mechanism has to process, weigh, and discard.

Multiply that by 97 entries across 60 turns, and it becomes not just a token cost problem but a signal-to-noise problem. You're paying for noise, and the model is spending attention on it.

What a Better Approach Looks Like

I'm not proposing anything radical — just applying the same retrieval-augmented pattern that's become standard everywhere else:

1. A Skill Search Tool

Instead of baking the full skill index into the system prompt, give the model a skill_search(query: str) tool. The model calls it with something like "ghost cms blog publishing" and gets back the 2-3 most relevant skills with their descriptions. Same pattern as web search, RAG retrieval, or MCP tool discovery.
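Here is a minimal sketch of what such a tool could look like. The tool definition follows the JSON-schema style that most tool-calling APIs accept, but its exact shape, the `Skill` type, and the term-overlap scoring are assumptions for illustration, not any framework's real API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str


# Hypothetical tool definition handed to the model in place of the full index.
SKILL_SEARCH_TOOL = {
    "name": "skill_search",
    "description": "Find installed skills relevant to a task. Returns the best few matches.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def _terms(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def skill_search(query: str, skills: list[Skill], limit: int = 3) -> list[Skill]:
    """Return up to `limit` skills ranked by how many query terms they share."""
    query_terms = _terms(query)
    scored = []
    for skill in skills:
        overlap = len(query_terms & _terms(f"{skill.name} {skill.description}"))
        if overlap:  # skills with no shared terms are never returned
            scored.append((overlap, skill))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in input order
    return [skill for _, skill in scored[:limit]]
```

The system prompt then carries one tool definition of fixed size instead of an index that grows with every installed skill.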

2. A Small Always-Loaded Core

Not every skill needs discovery. A handful of skills — the agent's own configuration, the blog publishing workflow, note-taking integrations — get used almost every session. Those should have a free pass: always in the system prompt, no search needed. I'd cap this at 5-8 skills.
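One way to express that cap is a small configuration check at prompt-build time. This is a sketch under assumptions: the `Skill` type, the function names, and the idea of listing core skills by name in configuration are illustrative choices, not an existing framework feature.

```python
from dataclasses import dataclass

MAX_CORE_SKILLS = 8  # upper end of the 5-8 cap suggested above


@dataclass(frozen=True)
class Skill:
    name: str
    description: str


def select_core(skills: list[Skill], core_names: list[str],
                max_core: int = MAX_CORE_SKILLS) -> list[Skill]:
    """Resolve the configured core skill names, enforcing the size cap."""
    if len(core_names) > max_core:
        raise ValueError(f"{len(core_names)} core skills configured, cap is {max_core}")
    by_name = {skill.name: skill for skill in skills}
    missing = [name for name in core_names if name not in by_name]
    if missing:
        raise KeyError(f"unknown core skills: {missing}")
    return [by_name[name] for name in core_names]


def render_core_prompt(core: list[Skill]) -> str:
    """Render only the core skills, plus a pointer to search for the rest."""
    lines = [f"- {skill.name}: {skill.description}" for skill in core]
    return (
        "<available_skills>\n" + "\n".join(lines) + "\n</available_skills>\n"
        "Other skills are installed; call skill_search(query) to find them."
    )
```

Failing loudly when the cap is exceeded is deliberate: without it, the core list tends to drift back toward "everything".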

3. Lazy Index for Everything Else

The remaining skills sit in a searchable index — could be keyword matching, embedding search over names and descriptions, or something more sophisticated. The model pulls them on demand, only when the task actually calls for them.
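As a sketch of the index side, the example below ranks skills by cosine similarity over simple term-count vectors. That is a stand-in I chose so the code runs with only the standard library; a real embedding model would replace `vectorize` while the rest keeps the same shape. The class name, the `min_score` threshold, and its default are assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn text into a term-count vector (a stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


class LazySkillIndex:
    """Holds non-core skills outside the prompt and returns them on demand."""

    def __init__(self, skills: dict[str, str]):
        # skills maps skill name -> description
        self._descriptions = dict(skills)
        self._vectors = {name: vectorize(f"{name} {desc}") for name, desc in skills.items()}

    def search(self, query: str, limit: int = 3, min_score: float = 0.1) -> list[tuple[str, str]]:
        """Return (name, description) pairs for the closest matches above min_score."""
        query_vec = vectorize(query)
        scored = [(cosine(query_vec, vec), name) for name, vec in self._vectors.items()]
        hits = sorted((pair for pair in scored if pair[0] >= min_score), reverse=True)
        return [(name, self._descriptions[name]) for _, name in hits[:limit]]
```

Lowering `min_score` trades precision for recall, which matters for the missed-skill risk: a broad search that returns a near-match is usually better than an empty result.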

The Tradeoffs

I want to be honest about the costs, because nothing in systems design is free:

  • Latency: A search tool call adds one round-trip before the model can use a skill. Roughly 200-500ms. Against 102K wasted tokens per session, I'd take that trade every time.
  • Missed skills: The model might not search for a skill it would have recognized from seeing it in a full list. This is a real risk, mitigated by keeping the core skills always-visible and making the search recall broad enough to catch near-matches.
  • Implementation complexity: This requires changes to the agent's core — a new tool definition, a search backend, and a configuration layer for which skills qualify as "core." Not trivial, but not a rewrite either.

The Bigger Pattern

What strikes me about this problem is how universal it is. Across agent frameworks, tool registries, prompt templates, and context injection systems, the same pattern keeps recurring: grow the static payload monotonically, never prune it.

System prompts have become the new autoload directory. Personality instructions, tool descriptions, usage examples, user preferences, context from the last session — everything goes in, nothing comes out. The model is treated as both a reasoning engine and a filing cabinet.

But models aren't filing cabinets. They're reasoning engines that happen to have large context windows. The difference matters because context isn't free — it costs money, it consumes attention, and it has hard limits.

Parts of the MCP ecosystem have already figured this out for tool discovery. RAG figured it out for knowledge retrieval. The next step is applying the same principle to everything that goes into a prompt: skills, instructions, preferences, examples. All of it should be searchable and injected on demand rather than loaded wholesale.

The first agent framework that makes lazy context injection the default — not a power-user option buried in config — will have a meaningful advantage in both cost and quality.

What I'm Doing About It

Full disclosure: I asked my agent to analyze this problem and propose concrete solutions. The analysis identified three viable approaches, each with different tradeoffs:

  1. Terse injection: Strip descriptions, inject only skill names (~500 tokens instead of ~1,700). Cheap to implement, but the model loses the ability to judge relevance without descriptions.
  2. Split directories: Move rarely-used skills to a secondary directory that's only searched on demand. Simple to implement with existing tooling, no core framework changes needed.
  3. Full search tool: The approach I described above — a dedicated search tool, a small always-loaded core, and lazy retrieval for everything else. Most token savings, most implementation work.

I haven't implemented any of these yet. The analysis is done, the options are on the table, and now it's a question of priorities. But the math is clear enough that I don't think this is optional — it's a scaling problem that gets worse with every skill I add.

If you're building or using an agent framework, I'd encourage you to check how much of your context window is consumed by static indices that never change within a session. The number might surprise you.
