2 comments

  • Pinkert 4 hours ago

    One architectural tradeoff we are actively working on is the latency of the "Select" step for shorter conversations.

    Currently, the open-source version of Librarian uses a general-purpose model to read the summary index and route the relevant messages into the agent's context. It works great for accuracy and drastically cuts token costs, but it introduces a latency penalty for shorter conversations, because it adds an initial LLM inference step before your actual agent can respond.
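    To make the shape of that Select step concrete, here is a minimal sketch. The data layout and the `select_relevant` helper are illustrative, not Librarian's actual code, and the keyword-overlap scorer stands in for the LLM routing call so the example runs on its own.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    message_id: int
    summary: str  # short summary of one message in the full history

def select_relevant(index: list[IndexEntry], query: str, top_k: int = 2) -> list[int]:
    """Return IDs of the messages most relevant to the current query.

    In the pipeline described above this routing decision is made by an
    LLM reading the summary index; a simple keyword-overlap score stands
    in here so the sketch is self-contained.
    """
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(entry.summary.lower().split())), entry.message_id)
        for entry in index
    ]
    scored.sort(reverse=True)
    return [mid for score, mid in scored[:top_k] if score > 0]

index = [
    IndexEntry(1, "user asks about database schema migrations"),
    IndexEntry(2, "discussion of frontend css layout bug"),
    IndexEntry(3, "follow-up on schema migrations rollback plan"),
]
print(select_relevant(index, "how do we rollback the schema migrations"))
# -> [3, 1]: only the migration-related messages get pulled into context
```

    The latency penalty mentioned above comes from doing this selection pass (as a real model call) before the main agent ever sees the prompt.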

    To solve this, we are training a heavily quantized model fine-tuned solely for this context-selection task. The goal is to push selection latency below 1 second so the entire pipeline feels completely transparent. (We have a waitlist up for this hosted version on the site.)

    If anyone here has experience fine-tuning smaller models (like Llama 3 or Mistral) strictly for high-speed classification/routing over context indexes, I'd love to hear what pitfalls we should watch out for.

  • findjashua an hour ago

    won't this essentially disable the prompt caching you get from a standard append-only chat history?
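    For readers unfamiliar with the concern: provider-side prompt caching typically reuses work only for the longest prompt prefix that exactly matches a previous request. A small sketch of that prefix-matching behavior (illustrative only, not any provider's actual API):

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the shared leading run of two token/message sequences,
    which is roughly what prefix-based prompt caches can reuse."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

history = ["sys", "msg1", "msg2", "msg3", "msg4"]

# Append-only history: each turn's prompt is the previous prompt plus
# new messages, so the entire previous prompt is a reusable prefix.
turn_a = history[:4]
turn_b = history[:5]
print(common_prefix_len(turn_a, turn_b))  # 4

# Per-turn selection: if the selected subset changes between turns,
# the prompts diverge right after the system prompt and cached
# prefixes from earlier turns no longer apply.
sel_a = ["sys", "msg1", "msg3"]
sel_b = ["sys", "msg2", "msg3", "msg4"]
print(common_prefix_len(sel_a, sel_b))  # 1
```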