
Context compression

Compresses the messages[] array before forwarding it to the upstream provider. Targets the cache reads that dominate the bill in long agent sessions — the part output compression alone never touches.

Why this exists

On a long Claude Code or autonomous-agent session, the conversation history balloons. File reads, tool results, and prior assistant turns all sit in messages[] and get re-sent on every new turn. Anthropic's prompt cache makes those re-reads cheaper but doesn't make them smaller — your bill still scales with session length.

We sit in the proxy layer and see the full array. Tier 1 (current, live) applies two heuristics that are guaranteed lossless on accuracy: tool-result truncation and cross-turn deduplication. Future tiers will add structured summarization of stale assistant turns and semantic retrieval of relevant past context.

What Tier 1 does

  • Truncates oversized tool results — any tool_result block larger than the level threshold is reduced to its head + tail (50 lines each on standard) with a clear marker showing how much was dropped. Code structure is preserved at the boundary.
  • Deduplicates repeated content — if the same file (by content hash) appears in multiple older turns, only the most recent copy survives verbatim. Earlier copies become a tiny reference: [earlier version, see message N]. (Both heuristics are sketched after this list.)
  • Skips short conversations — if the request is already under the level's threshold, we don't touch it. Compression overhead would only make small requests larger.
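
For intuition, here is a minimal Python sketch of the two heuristics. It works in lines rather than tokens, and the function names and truncation-marker wording are illustrative assumptions rather than the production implementation; the real compressor applies the per-level token thresholds listed under Compression levels below.

import hashlib

def truncate_tool_result(text: str, head: int = 50, tail: int = 50) -> str:
    """Head + tail truncation: keep the first and last N lines, mark the gap."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text                                   # small enough: leave untouched
    dropped = len(lines) - head - tail
    marker = f"[... {dropped} lines truncated by context compression ...]"
    return "\n".join(lines[:head] + [marker] + lines[-tail:])

def deduplicate_file_reads(tool_results: list[tuple[int, str]]) -> dict[int, str]:
    """Cross-turn dedup: only the most recent copy of identical content stays verbatim."""
    latest: dict[str, int] = {}
    for idx, content in tool_results:                 # (message index, content) in order
        latest[hashlib.sha256(content.encode()).hexdigest()] = idx
    rewritten: dict[int, str] = {}
    for idx, content in tool_results:
        newest = latest[hashlib.sha256(content.encode()).hexdigest()]
        rewritten[idx] = content if idx == newest else f"[earlier version, see message {newest}]"
    return rewritten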

What it never touches

  • The system prompt
  • The most recent N turns (the recency window) — 6 by default, varies by level
  • Code blocks (preserved exactly, never paraphrased)
  • Error messages and tracebacks (exact text required for debugging)
  • File paths, commands, URLs (one character matters)
  • In-flight tool_use blocks that haven't received a result yet
  • Any user message — your input is sacred. We only compress old tool-result content the model produced. (These rules amount to the single check sketched below.)
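
Taken together, these rules amount to one conservative eligibility check. The sketch below is illustrative only: the block shape (type fields in the Anthropic Messages format) and the function name are assumptions, and the content-level protections (code blocks, tracebacks, file paths inside a tool result) are applied during truncation rather than shown here.

RECENCY_WINDOW = 6   # standard level; varies by compression level

def block_is_protected(block: dict, age_in_turns: int) -> bool:
    """Return True for any content block Tier 1 must leave byte-for-byte intact."""
    if age_in_turns < RECENCY_WINDOW:
        return True     # inside the recency window: the most recent turns stay verbatim
    if block.get("type") != "tool_result":
        return True     # system prompt, user input, assistant prose, and tool_use calls
                        # (including in-flight ones) are never rewritten in Tier 1
    return False        # old, settled tool results are the only Tier 1 targets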

Safety net

If compression would yield zero or negative savings (rare — e.g. a very small request with overhead from the truncation marker), the original body is forwarded untouched. The compressor never makes a request worse.

Enabling it

Off by default. Enable it per API key from the dashboard or via the API:

# Enable for an existing key
curl -X PATCH https://api.mintoken.dev/v1/auth/keys/<key_id> \
  -H "Authorization: Bearer <supabase_jwt>" \
  -H "Content-Type: application/json" \
  -d '{
    "context_compress": true,
    "context_compress_level": "standard"
  }'

Compression levels

  • light — kicks in at 8k tokens, truncates tool results >4k, recency window of 8. Conservative; biggest tool outputs only.
  • standard (default) — kicks in at 4k tokens, truncates tool results >2k, recency window of 6. Balanced for typical agent workflows.
  • aggressive — kicks in at 2k tokens, truncates tool results >1k, recency window of 4. For very long sessions where you want maximum savings.

What you get back

The response body is exactly the upstream provider's response, unchanged. When compression actually ran and saved tokens, three additional X-Mintoken-Context-* headers report the result:

X-Mintoken-Intensity: full
X-Mintoken-Duration-Ms: 1843
X-Mintoken-Context-Tokens-Saved: 12407
X-Mintoken-Context-Tokens-Before: 28104
X-Mintoken-Context-Tokens-After: 15697
  • X-Mintoken-Context-Tokens-Saved — delta between before and after
  • X-Mintoken-Context-Tokens-Before — token count of messages[] before compression
  • X-Mintoken-Context-Tokens-After — token count after
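
If you want to watch per-request savings programmatically, read the headers off the response. A minimal sketch follows; the /v1/messages endpoint path, the x-api-key auth header, and the request body are assumptions standing in for whatever your integration already sends, and only the X-Mintoken-* header names come from the list above.

import requests

# Placeholders: endpoint, auth header, and body are assumptions, not documented values.
resp = requests.post(
    "https://api.mintoken.dev/v1/messages",
    headers={"x-api-key": "<mintoken_api_key>", "content-type": "application/json"},
    json={
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "..."}],
    },
)

saved = resp.headers.get("X-Mintoken-Context-Tokens-Saved")
if saved is not None:    # the context headers only appear when compression ran and saved tokens
    before = resp.headers["X-Mintoken-Context-Tokens-Before"]
    after = resp.headers["X-Mintoken-Context-Tokens-After"]
    print(f"context compression: {before} -> {after} tokens ({saved} saved)")
else:
    print("below threshold, or compression would not have helped")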

The same numbers are recorded in your usage records and surfaced on the dashboard so you can see savings over time, by API key, and by model.

Anthropic prompt cache: read this

Cache invalidation cost

Rewriting messages[] changes the cache prefix, which invalidates any existing Anthropic prompt cache entry. The first request after the array changes is a cache write (1.25× base price), not a read. Break-even is two subsequent requests with the new prefix. For sessions with many turns, this is fine. For one-off exchanges, leave context compression off.

Tier 1 only modifies the older portion of the array (outside the recency window), so the most recent turns — which are most likely to drive the next cache prefix — stay stable.
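
To sanity-check this trade-off for your own traffic, a toy cost model is enough. The sketch below assumes Anthropic's published multipliers (cache writes at 1.25x the base input price, cache reads at 0.1x), treats the prefix as rewritten once and then read unchanged on later turns, and ignores the incremental writes a growing conversation adds; treat it as an estimate, not billing math.

# Toy model: cost of (re)writing a prompt-cache prefix once, then reading it N times.
# Figures are in "base input price" units per token.

def prefix_cost(prefix_tokens: int, cached_reads: int,
                write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    return prefix_tokens * (write_mult + cached_reads * read_mult)

# Example using the header numbers above: 28,104 tokens before vs 15,697 after,
# each followed by ten cached turns.
print(f"uncompressed prefix: {prefix_cost(28_104, 10):,.0f} token-units")
print(f"compressed prefix:   {prefix_cost(15_697, 10):,.0f} token-units")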

When to enable it

  • You run agents that read files repeatedly across turns (Claude Code, Cursor agent mode, AutoGPT-style loops, LangGraph)
  • Your sessions regularly exceed 20+ turns or 30k tokens of context
  • Your monthly bill is dominated by cache reads

When NOT to enable it

  • Single-turn or stateless requests (chat completion endpoints treated as one-shot calls)
  • Short conversations under ~4k tokens (the heuristic skips them anyway, but no point opting in)
  • Workflows that explicitly depend on byte-perfect history of every tool output (rare — but in this case, leave it off)

What's next

  • Tier 2: structured rolling summary of middle turns using a 4-field template (intent / changes / decisions / open items). Beats Anthropic's auto-compaction in published benchmarks. ETA: 4-6 weeks.
  • Tier 3: semantic retrieval — RAG for conversation history, evicting low-relevance turns entirely. Future.