One architectural tradeoff we are actively working on right now is the latency of the "Select" step for shorter conversations.
Currently, the open-source version of Librarian uses a general-purpose model to read the summary index and route the relevant messages. It works great for accuracy and drastically cuts token costs, but it does introduce a latency penalty for shorter conversations because it requires an initial LLM inference step before your actual agent can respond.
To solve this, we are currently training a heavily quantized, fine-tuned model specifically optimized only for this context-selection task. The goal is to push the selection latency below 1 second so the entire pipeline feels completely transparent. (We have a waitlist up for this hosted version on the site).
If anyone here has experience fine-tuning smaller models (like Llama 3 or Mistral) strictly for high-speed classification/routing over context indexes, I'd love to hear what pitfalls we should watch out for.
One architectural tradeoff we are actively working on right now is the latency of the "Select" step for shorter conversations.
Currently, the open-source version of Librarian uses a general-purpose model to read the summary index and route the relevant messages. It works great for accuracy and drastically cuts token costs, but it does introduce a latency penalty for shorter conversations because it requires an initial LLM inference step before your actual agent can respond.
To solve this, we are currently training a heavily quantized, fine-tuned model specifically optimized only for this context-selection task. The goal is to push the selection latency below 1 second so the entire pipeline feels completely transparent. (We have a waitlist up for this hosted version on the site).
If anyone here has experience fine-tuning smaller models (like Llama 3 or Mistral) strictly for high-speed classification/routing over context indexes, I'd love to hear what pitfalls we should watch out for.
won't this essentially disable prompt caching, that you get from a standard append-only chat history?