Show HN: Librarian – Cut token costs by up to 85% for LangGraph and OpenClaw

(uselibrarian.dev)

8 points | by Pinkert 5 hours ago

2 comments

Pinkert 4 hours ago
One architectural tradeoff we are actively working on is the latency of the "Select" step in shorter conversations.
Currently, the open-source version of Librarian uses a general-purpose model to read the summary index and route the relevant messages. It works great for accuracy and drastically cuts token costs, but it does introduce a latency penalty for shorter conversations because it requires an initial LLM inference step before your actual agent can respond.
To solve this, we are training a heavily quantized, fine-tuned model optimized specifically for this context-selection task. The goal is to push the selection latency below 1 second so the entire pipeline feels completely transparent. (We have a waitlist for the hosted version on the site.)
If anyone here has experience fine-tuning smaller models (like Llama 3 or Mistral) strictly for high-speed classification/routing over context indexes, I'd love to hear what pitfalls we should watch out for.
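For anyone unfamiliar with the "Select" step mentioned above, here is a rough sketch of the idea in Python. It is not Librarian's actual API: the `Message` type, `select_relevant`, `build_context`, and the `keyword_overlap` scorer are all hypothetical names, and the scorer is only a stand-in for the LLM call that reads the summary index.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    id: int
    text: str      # full message body
    summary: str   # short summary kept in the index


def keyword_overlap(summary: str, query: str) -> int:
    # Stand-in scorer (assumption): counts shared words.
    # The real step would ask a model to judge relevance instead.
    return len(set(summary.lower().split()) & set(query.lower().split()))


def select_relevant(index: List[Message], query: str,
                    score: Callable[[str, str], int], top_k: int = 2) -> List[int]:
    # Rank messages by how well their summary matches the query,
    # then return the ids of the best top_k in chronological order.
    ranked = sorted(index, key=lambda m: score(m.summary, query), reverse=True)
    return sorted(m.id for m in ranked[:top_k])


def build_context(history: List[Message], query: str, top_k: int = 2) -> List[str]:
    # Only the selected messages are sent to the agent, which is where
    # the token savings (and the extra selection latency) come from.
    keep = set(select_relevant(history, query, keyword_overlap, top_k))
    return [m.text for m in history if m.id in keep]


history = [
    Message(0, "We use Postgres 15 in prod.", "database is postgres"),
    Message(1, "Lunch is at noon.", "lunch schedule"),
    Message(2, "The migration failed with a lock timeout.", "migration error lock timeout"),
    Message(3, "I like blue.", "favorite color"),
]
context = build_context(history, "database migration error")
```

The latency cost described above comes from the scoring call: with a model in place of `keyword_overlap`, it has to finish before the agent sees any context.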
findjashua an hour ago
won't this essentially disable the prompt caching you get from a standard append-only chat history?
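To make the concern concrete, here is a small sketch in Python, assuming the cache only reuses the longest identical prefix between consecutive prompts (how provider-side prompt caching generally works). The message lists and `shared_prefix_len` are made up for illustration, and whether Librarian's selected context actually changes this much between turns is an assumption, not something stated in the thread.

```python
from typing import List


def shared_prefix_len(prev: List[str], curr: List[str]) -> int:
    # Count how many leading messages are identical in both prompts.
    # Only this leading run could be served from a prefix cache.
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


# Append-only history: each turn extends the previous prompt.
append_turn1 = ["sys", "m1", "m2", "m3"]
append_turn2 = append_turn1 + ["m4"]

# Selected context (hypothetical): a different subset is chosen each turn.
selected_turn1 = ["sys", "m1", "m3"]
selected_turn2 = ["sys", "m2", "m4"]

append_reuse = shared_prefix_len(append_turn1, append_turn2)
selected_reuse = shared_prefix_len(selected_turn1, selected_turn2)
```

In the append-only case the whole previous prompt is reusable. In the selected case only the system prompt is, so the open question is whether the tokens saved outweigh the cache discount lost.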