← Index
Nº 089
Workbench — Active

Local LLM context caching finally holds

04.29.26 · 3 min read · Workbench
#ollama#local-llm#modelfile

For three days I've been fighting context window drift in my local Ollama setup. The model would cache correctly for the first few turns, then suddenly drop the entire context and start fresh. No error. Just silence and reset.

The working configuration

I suspected the quantization flags first. I’d been using Q4_K_M for speed, but the context extension to 32k tokens wasn’t designed for that quantization level. The modelfile documentation is vague on this point — it mentions compatibility but doesn’t specify which quantization levels support extended context reliably.

The breakthrough came when I stopped treating the modelfile as configuration and started treating it as a build artifact. Instead of editing and restarting, I began versioning each attempt.

FROM qwen2.5-coder:14b

PARAMETER temperature 0.7
PARAMETER num_ctx 32768
PARAMETER num_predict -1
PARAMETER repeat_penalty 1.1

SYSTEM """You are a helpful coding assistant. Prioritize correctness over speed."""

The key was removing explicit quantization flags and letting Ollama pull the appropriate layer automatically. When I forced Q4_K_M, the context extension couldn’t bind correctly to the KV cache. Without the constraint, it defaulted to a configuration that supported both the extended context and the model’s attention mechanism.

What didn’t work

I tried three failed approaches before landing on this one. First, I increased num_gpu layers, thinking the issue was offloading. Then I adjusted repeat_penalty to aggressive levels, which only made outputs incoherent. Finally, I attempted a custom template rewrite, which broke the chat format entirely.

The temptation with local LLMs is to optimize for speed first. But context integrity is the foundation everything else builds on. A fast model that forgets what you told it three turns ago is worse than a slow model that remembers.

I’m still testing edge cases — very long code blocks (500+ lines), multi-file context, and sustained conversations over 50 turns. So far, the cache holds. I’ll update this entry if anything changes.1

Next steps

The next experiment is running this same configuration on the qwq reasoning model to see if the context extension plays nicely with chain-of-thought outputs. If it does, I might have a single local setup that handles both rapid coding queries and deep reasoning tasks without switching models.

Footnotes

  1. Update 05.01.26: Cache still holding after 72 hours of mixed use. No drift observed across 200+ turns.