Headroom: LLM Token Compression for RAG Chunks, Tool Outputs, and Log Files

ghosty
Founder, SaaSCity

RAG pipelines are the most expensive way to make a model look informed.

You retrieve 20 chunks from your vector store, paste all of them into the prompt, and pay for every token, including the 15 chunks the model skimmed and ignored. LLM token compression at the retrieval layer is the fix, and most teams aren't doing it. For a coding agent running tool calls all day, the problem compounds: JSON responses with 4,000-row arrays, log dumps with 900 identical INFO lines, search results where hits two through eight are variations of hit one. The model sees all of it. Your API bill reflects all of it.

Headroom is the open-source compression layer that intercepts that content before it reaches the LLM. It handles each content type differently, which is the thing most people miss when they first look at it — and there's a second half of the cost equation (output tokens) that almost nobody talks about, which Headroom also addresses.

This post covers how the compression works by content type, where the accuracy holds, the output shaping feature worth knowing about, and how to drop it into an existing SaaS stack with no code changes.


Why Generic Compression Fails

Truncating a log array from the front loses the error at line 847. Summarizing a retrieved code chunk loses the exact function signature the model needed. Stripping JSON fields alphabetically might delete status: "failed" while preserving 40 boilerplate rows.

Generic compression tools fail because they treat all content equally. The signal-to-noise ratio in a 5,000-row JSON array is totally different from a 3,000-word prose document, which is totally different from a Python file. A tool that doesn't know what it's compressing will either be too aggressive (destroying signal) or too conservative (not worth deploying).

Headroom's ContentRouter detects content type first, then routes to the algorithm built for that specific structure. The routing decision happens in milliseconds and applies the right compression strategy to each piece of content independently, even within a single prompt.
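
To make type-first routing concrete, here is a minimal sketch. The function names (detect_content_type, route) and the regex heuristics are hypothetical and much cruder than whatever ContentRouter actually does; the only point is that each chunk is classified on its own before any compression strategy is chosen.

```python
import json
import re

# Hypothetical heuristics for illustration only; not Headroom's ContentRouter.
LOG_LINE = re.compile(
    r"^\s*(?:\[?\d{4}-\d{2}-\d{2}\S*\]?\s+)?(DEBUG|INFO|WARN|WARNING|ERROR|FATAL)\b"
)
CODE_HINT = re.compile(
    r"^\s*(def |class |import |from \S+ import |function |fn |func |package |#include)"
)


def detect_content_type(text: str) -> str:
    """Classify a chunk as 'json', 'log', 'code', or 'prose'."""
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # looked like JSON but did not parse; keep checking
    lines = [ln for ln in stripped.splitlines() if ln.strip()]
    if not lines:
        return "prose"
    log_hits = sum(1 for ln in lines if LOG_LINE.match(ln))
    code_hits = sum(1 for ln in lines if CODE_HINT.match(ln))
    if log_hits / len(lines) >= 0.5:
        return "log"
    if code_hits / len(lines) >= 0.2:
        return "code"
    return "prose"


def route(chunks: list[str]) -> list[tuple[str, str]]:
    """Tag every chunk independently, even when they share one prompt."""
    return [(detect_content_type(chunk), chunk) for chunk in chunks]
```

A real router would dispatch each tag to its own compressor; here the tag is simply returned alongside the chunk.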


How Headroom Handles Each Content Type

JSON, API Responses, and Log Files: SmartCrusher

SmartCrusher is where the largest savings come from on agentic workloads.

Rather than slicing from the top or truncating at a byte limit, it applies four operations, with deduplication running first:

  1. Near-duplicate deduplication: records with near-identical content are collapsed to a representative sample before any other processing
  2. Schema preservation: the opening records (types, headers, field definitions) always survive intact
  3. Recency preservation: the most recent entries are always kept
  4. Outlier extraction: a Kneedle algorithm runs on bigrams to identify lines that statistically diverge from surrounding patterns (errors, exceptions, anomalies)

The deduplication step is why it works on log files. 900 lines of INFO: request received from 192.168.x.x become a single representative line plus a count. The two ERROR: database connection timeout lines in the middle survive — because they're outliers in the statistical sense. The model sees the signal. Everything else gets deduplicated into summary markers.
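
The log example above can be sketched in a few lines. This is not SmartCrusher's implementation: it swaps the Kneedle-on-bigrams outlier step for a plain frequency threshold, and the names (template, crush_log, rare_max) are made up for illustration.

```python
import re
from collections import Counter


def template(line: str) -> str:
    """Collapse digit runs so lines differing only in numbers or IPs match."""
    return re.sub(r"\d+", "#", line)


def crush_log(lines: list[str], keep_head: int = 1, keep_tail: int = 1,
              rare_max: int = 2) -> list[str]:
    """Keep head, tail, and rare lines; collapse frequent templates.

    Assumption: a template seen at most `rare_max` times counts as an
    outlier. That is a simple stand-in for a statistical outlier detector.
    """
    counts = Counter(template(ln) for ln in lines)
    out: list[str] = []
    summarized: set[str] = set()
    last = len(lines) - 1
    for i, ln in enumerate(lines):
        t = template(ln)
        protected = i < keep_head or i > last - keep_tail
        if protected or counts[t] <= rare_max:
            out.append(ln)  # schema, recency, or outlier: keep verbatim
        elif t not in summarized:
            # First unprotected occurrence becomes the representative line.
            out.append(f"{ln}  [x{counts[t]} similar lines]")
            summarized.add(t)
    return out
```

On 900 near-identical INFO lines with two ERROR lines in the middle, this returns five lines: the first line, one representative with a count, both errors, and the last line.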

Real-world compression results from the Headroom benchmarks:

| Workload | Before | After | Reduction |
|---|---|---|---|
| Code search (100 results) | 17,765 tokens | 1,408 tokens | 92% |
| SRE incident debugging | 65,694 tokens | 5,118 tokens | 92% |
| GitHub issue triage | 54,174 tokens | 14,761 tokens | 73% |
| Codebase exploration | 78,502 tokens | 41,254 tokens | 47% |

The 47% on codebase exploration is the honest number — when every file is unique and relevant, there's less structural redundancy to compress. Headroom is smart enough to back off when gains wouldn't be meaningful, which means the routing check runs but the compression step gets skipped.

Code Files: AST-Aware Compression

CodeCompressor parses source code into its AST using tree-sitter before touching anything. Python, JS, Go, Rust, Java, and C++ are all supported.

The practical application: when a RAG pipeline retrieves five similar utility functions from a codebase, AST-aware compression can strip function bodies while preserving signatures — exactly what the model needs to understand the API surface for a planning task. When the calling context needs full fidelity (executing a specific change, not planning), it passes through exact content instead.

This distinction matters a lot for retrieval-augmented code assistants. The model rarely needs the full body of a function to understand what it does. It does need the signature, parameters, return type, and any docstring. AST-awareness gives you that split without writing custom extraction logic.
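
A minimal sketch of the signature-versus-body split, using Python's built-in ast module rather than tree-sitter, so it only handles Python source. The function name strip_bodies is hypothetical and this is not CodeCompressor's code; it only shows that the split falls out of the syntax tree without custom text extraction.

```python
import ast


def _is_docstring(stmt: ast.stmt) -> bool:
    """True if the statement is a bare string literal (a docstring)."""
    return (
        isinstance(stmt, ast.Expr)
        and isinstance(stmt.value, ast.Constant)
        and isinstance(stmt.value.value, str)
    )


def strip_bodies(source: str) -> str:
    """Keep signatures and docstrings; replace the rest of each body with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            keep = node.body[:1] if _is_docstring(node.body[0]) else []
            ellipsis_stmt = ast.parse("...").body[0]  # a fresh `...` statement
            node.body = keep + [ellipsis_stmt]
    return ast.unparse(tree)  # requires Python 3.9+
```

Parameters, defaults, annotations, and return types survive because they live on the function node itself, not in its body.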

Prose and Retrieved Documents: Kompress-base

General text — documentation snippets, retrieved paragraphs, article content — goes to Kompress-base, a custom transformer model hosted on HuggingFace that was trained specifically on agentic traces. Not a general-purpose summarizer retargeted at agents. It learned what matters in context by watching agents work.

Standard benchmark performance with Kompress-base in the loop:

| Benchmark | Baseline | With Headroom | Notes |
|---|---|---|---|
| GSM8K (math) | 0.870 | 0.870 | Zero degradation |
| TruthfulQA | 0.530 | 0.560 | +3 points |
| SQuAD v2 | maintained | maintained | ±0 |
| BFCL (tool use) | | 97% preserved | at 32% compression |

The TruthfulQA improvement is the finding that surprises people. When low-signal noise gets compressed away, models sometimes answer more accurately — the relevant signal they're weighting isn't diluted. It shows up consistently in the evals and has a logical explanation. This isn't cherry-picked.

Images: ML-Routed Compression

Less well-known: Headroom handles images too, with a 40–90% reduction via an ML router. For multimodal pipelines that repeatedly process screenshots, UI mockups, diagrams, or design assets, install the image extra: pip install "headroom-ai[image]".


The Part Nobody Talks About: Output Token Costs

Every AI context optimization tool focuses on input. Headroom added something different: output shaping.

On Opus-class models, input and output token costs are asymmetric. Claude Opus 4.8 is $15/M input tokens versus $75/M output tokens, so each output token costs five times as much as an input token. A model that writes multi-paragraph reasoning preambles, over-explains every decision, or generates lengthy boilerplate before the actual answer can easily multiply your output token consumption by 5×.

Headroom's output shaper intercepts the response stream and applies verbosity steering. The HEADROOM_OUTPUT_SHAPER environment variable controls behavior — from concise (direct answers, minimal preamble) to detailed (preserves reasoning chains for complex tasks). No restart required when you change it. Live environment sync means you can tune it in production without downtime.

💡 Related: For a complementary approach to AI cost optimization, shadcn/improve's model-splitting technique uses expensive models only for the audit phase and cheap models for execution — the two approaches compose well in a production stack.

Claimed savings on Opus-class models: 5× reduction in output tokens on typical agent workloads. At $75/M output tokens, that's real money at any serious usage volume. Input compression gets more attention because the numbers are dramatic. Output shaping is where the per-dollar ROI often wins.
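
To put a number on the output side, here is a back-of-the-envelope calculation. The price and the 5× factor come from the text above; the workload (2,000 output tokens per run, 10,000 runs per day) is a hypothetical chosen only for illustration.

```python
OUTPUT_PRICE_PER_M = 75.0  # USD per million output tokens, as quoted above
REDUCTION_FACTOR = 5       # claimed output-token reduction

# Hypothetical workload; substitute your own measurements.
tokens_per_run = 2_000
runs_per_day = 10_000


def daily_output_cost(tokens: float, runs: int, price_per_m: float) -> float:
    """Daily spend in USD for a given output volume."""
    return tokens * runs * price_per_m / 1_000_000


before = daily_output_cost(tokens_per_run, runs_per_day, OUTPUT_PRICE_PER_M)
after = daily_output_cost(tokens_per_run / REDUCTION_FACTOR, runs_per_day,
                          OUTPUT_PRICE_PER_M)
print(f"before: ${before:,.2f}/day  after: ${after:,.2f}/day  "
      f"saved: ${before - after:,.2f}/day")
```

Under those assumptions the daily output bill drops from $1,500 to $300.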


The Safety Net: Compress-Cache-Retrieve

The obvious objection to compression: what if the model needed that data?

Headroom's CCR architecture (Compress-Cache-Retrieve) answers this without requiring trust. When SmartCrusher compresses 500 items to 20:

  1. The full 500 originals are stored locally (SQLite/LRU cache)
  2. The compressed output includes a summary marker, for example "87 passed, 2 failed, 1 error" for a test run
  3. A headroom_retrieve tool is exposed to the model

If the model determines it actually needs an item it didn't receive, it calls headroom_retrieve with the hash. Local lookup, ~1ms latency, no network round-trip. Your data never leaves the machine.

In practice, the model calls retrieve rarely — because the compression preserved the signal. But the retrieval path means you're never betting production correctness on a compression algorithm's confidence. That's what makes this deployable in production versus a research demo.
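
The compress, cache, retrieve flow can be sketched with nothing but the standard library. Everything here is an assumption made for illustration: the class and function names, the table layout, and the use of a truncated SHA-256 as the key. Slicing stands in for real compression, and this is not Headroom's code.

```python
import hashlib
import json
import sqlite3


class OriginalsCache:
    """Hypothetical local store for uncompressed payloads, keyed by hash."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS originals (hash TEXT PRIMARY KEY, body TEXT)"
        )

    def put(self, items: list) -> str:
        body = json.dumps(items, sort_keys=True)
        key = hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]
        self.db.execute("INSERT OR REPLACE INTO originals VALUES (?, ?)", (key, body))
        self.db.commit()
        return key

    def retrieve(self, key: str) -> list:
        row = self.db.execute(
            "SELECT body FROM originals WHERE hash = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])


def compress_with_fallback(items: list, cache: OriginalsCache, keep: int = 20) -> dict:
    """Store the originals first, then hand back a reduced view plus the key."""
    key = cache.put(items)
    omitted = max(len(items) - keep, 0)
    return {
        "items": items[:keep],  # stand-in for a real compression step
        "summary": f"{omitted} of {len(items)} items omitted",
        "retrieve_hash": key,   # what a retrieve tool call would pass back
    }
```

Because the originals are written before the reduced view is returned, a later lookup by hash always has something to find.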


Integration Without Code Changes

Three paths, ordered by how much you need to touch your existing stack.

Proxy — zero code changes:

pip install "headroom-ai[proxy]"
headroom proxy --port 8787

Change base_url to http://localhost:8787 in your client; that is the only change. It works with any OpenAI-compatible client (Claude Code, Cursor, Codex, Aider, whatever you're using). Compression happens transparently on every request.

Wrap a specific agent:

headroom wrap claude --memory --code-graph
headroom wrap cursor
headroom wrap copilot   # Routes GitHub Copilot CLI through the proxy

TypeScript/Node projects:

npm install headroom-ai

LangChain (drop-in replacement):

from langchain_headroom import HeadroomChatModel
llm = HeadroomChatModel(model="claude-opus-4-8")

Drop HeadroomChatModel anywhere you'd use a standard LangChain chat model. Compression happens before every LLM call, and retrieval tools are injected automatically.

Docker:

docker pull ghcr.io/chopratejas/headroom:latest

Other integrations available: LiteLLM callback, Vercel AI SDK middleware, Agno, Strands. The extras system (pip install "headroom-ai[langchain]", [mcp], [image], [memory], etc.) keeps installs lean — pull only what your stack needs.


Where This Pays Off in Production

RAG pipelines are the primary use case. Every retrieval result that goes into your prompt is a compression opportunity. If you're building a knowledge assistant, document Q&A, or retrieval-augmented coding tool, Headroom sits between your retrieval layer and your LLM call. You don't need to optimize your embedding strategy or tune chunk sizes — the compression handles the context density problem downstream.

SRE and debugging agents are the second. Automated code review, test failure analysis, and log triage all push large, repetitive structured outputs through the model on every run. The 92% savings on SRE incident debugging in the benchmarks came from a real scenario, not synthetic data.

Multi-agent architectures get SharedContext: a shared store with automatic deduplication and provenance tracking across Claude, Codex, Gemini, or whatever agents are in your fleet. Inter-agent message compression reaches up to 80% in tests. When five agents all need awareness of the same codebase state, this is infrastructure-level savings, not a nice-to-have.
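
As a rough picture of deduplication with provenance, here is a minimal sketch. The names (SharedStore, Entry, publish) are hypothetical and it only handles exact duplicates via a content hash; it says nothing about how SharedContext is actually built.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Entry:
    content: str
    sources: set[str] = field(default_factory=set)  # provenance: contributing agents


class SharedStore:
    """Hypothetical shared store: one copy per distinct content string."""

    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}

    def publish(self, agent: str, content: str) -> str:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        entry = self._entries.setdefault(key, Entry(content))
        entry.sources.add(agent)  # record who contributed, without a second copy
        return key

    def get(self, key: str) -> Entry:
        return self._entries[key]

    def __len__(self) -> int:
        return len(self._entries)
```

Two agents publishing the same codebase snapshot get the same key back, and the store holds it once.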

headroom learn deserves a mention for teams running this seriously. It mines failed sessions, identifies where compression hurt rather than helped, and writes corrections directly to CLAUDE.md, AGENTS.md, or GEMINI.md. Your compression configuration adapts to your actual usage patterns over time — not a static benchmark profile that may or may not match your workload.


The Economics at Scale

Token costs scale linearly. Complexity — and therefore per-task token consumption — scales faster. Every production agent deployment has to answer the question: what does this cost at 10,000 runs per day?

For RAG pipelines specifically: if your average query retrieves 20 chunks at ~500 tokens each (10,000 input tokens per query), and Headroom achieves 73% compression on that retrieval set, you go from 10,000 tokens to ~2,700. At Claude Opus 4.8 pricing ($15/M), the 7,300 tokens saved are worth about $0.11 per query, or roughly $110 per 1,000 queries, before accounting for output shaping savings on the response side.
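
The same arithmetic, written out so you can plug in your own retrieval sizes. All inputs are the figures from the paragraph above; only the variable names are new.

```python
chunks_per_query = 20
tokens_per_chunk = 500
compression = 0.73         # fraction of input tokens removed
input_price_per_m = 15.0   # USD per million input tokens

tokens_before = chunks_per_query * tokens_per_chunk          # 10,000
tokens_after = round(tokens_before * (1 - compression))      # 2,700
tokens_saved = tokens_before - tokens_after                  # 7,300

saved_per_query = tokens_saved * input_price_per_m / 1_000_000
saved_per_1000 = saved_per_query * 1_000

print(f"{tokens_before:,} -> {tokens_after:,} tokens per query")
print(f"saved: ${saved_per_query:.4f} per query, ${saved_per_1000:.2f} per 1,000 queries")
```

Savings scale linearly with query volume, so the per-1,000 figure is the one to multiply against your daily traffic.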

These aren't toy numbers. They're the reason 59+ contributors have shipped to this repo under Apache 2.0, why the LangChain and LiteLLM integrations landed quickly, and why community telemetry shows 1.4 billion tokens saved across sampled fleets.

Context compression middleware isn't the part you tweet about. It's the part that determines whether your unit economics survive contact with production traffic.

Run the evals on your own workloads: python -m headroom.evals. The numbers from your specific content types will tell you more than any table here can.

