The Anthropic API was already supported by llama.cpp (the project Ollama ripped off, and which Ollama typically lags behind in features by 3-6 months), and it works perfectly fine with Claude Code by setting a simple environment variable.
And they reference that announcement and related information in the second line.
Which announcement are you looking at? I see no references to llama-cpp in either Ollama's blog post or this project's github page.
As others have said, this has been possible for months already with llama.cpp's support for the Anthropic Messages API. You just need to set ANTHROPIC_BASE_URL. The specific llama-server settings/flags were a pain to figure out and required some hunting, so I collected them in this guide to using CC with local models:
https://github.com/pchalasani/claude-code-tools/blob/main/do...
One tricky thing that took me a whole day to figure out is that using Claude Code in this setup was causing total network failures due to telemetry pings, so I had to set this env var to 1: CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC
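For anyone following along, here's a minimal sketch of that setup; the model path, port, and context size are placeholders rather than the specific flags from the guide:

    # serve a local GGUF model with llama.cpp's server (Anthropic Messages API support built in)
    llama-server -m ./qwen3-coder-30b-q4.gguf --port 8080 -c 32768 --jinja

    # point Claude Code at it and stop the telemetry pings that broke networking
    export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
    export ANTHROPIC_AUTH_TOKEN=dummy
    export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
    claude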
There are already various proxies to translate between OpenAI-style models (local or otherwise) and an Anthropic endpoint that Claude Code can talk to. Is the advantage here just one less piece of infrastructure to worry about?
Sidetracking here, but has anyone got one that _actually_ works?
In particular, I'd like to call Claude models (in OpenAI schema, hosted by a reseller) through some proxy that presents the Anthropic format to my Claude Code, but nothing seems to fully line things up (double-translated tool names, for example).
The reseller is abacus.ai. I've tried BerriAI/litellm, musistudio/claude-code-router, ziozzang/claude2openai-proxy, 1rgs/claude-code-proxy, and fuergaosi233/claude-code-proxy.
What probably needs to exist is something like `llsed`.
The invocation would be something like a sed call that takes a JSON config, where the JSON describes pre and post rewrite hooks. So if one call needs to become two, you can call multiple in the pre or post, or rearrange things accordingly. This sounds like the proper separation of concerns here... probably.
The pre/post hooks should probably be JSON-RPC handlers that get lazy-loaded.
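To make the idea concrete, here's a purely hypothetical sketch of what such an invocation and config could look like; none of these flags or config keys come from the actual day50-dev/llsed repo, which didn't exist yet at this point:

    # hypothetical: llsed sits between Claude Code and the upstream API,
    # applying sed-style rewrites to requests (pre) and responses (post)
    cat > rewrite.json <<'EOF'
    {
      "pre":  [{"handler": "rename_tool", "from": "str_replace_editor", "to": "edit_file"}],
      "post": [{"handler": "rename_tool", "from": "edit_file", "to": "str_replace_editor"}]
    }
    EOF
    llsed --config rewrite.json --listen 127.0.0.1:4000 --upstream https://api.upstream-reseller.example
    export ANTHROPIC_BASE_URL=http://127.0.0.1:4000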
Writing that now. Let's do this: https://github.com/day50-dev/llsed
Some unsolicited advice: Streaming support is tricky. I'd strip the streaming out when you proxy until everything else is solid.
Cool. Sounds good. Thanks. I'll do it.
This will be a bit challenging, I'm sure, but I agree: litellm and friends do too many things and take too long to get simple asks out of.
I've been pitching this suite I'm building as "GNU coreutils for the LLM era"
It's not sticking and nobody is hyped by it.
I don't know if I should keep going, or if this is my same old pattern cropping up again: things I really, really like but that are just kinda me.
I've been hacking on this one for a few months now and it works for me: https://github.com/elidickinson/claude-code-mux. I've been optimizing for routing to different models within one session, so it may be overkill for your case.
But I'm surprised litellm (and its wrappers) don't work for you and I wonder if there's something wrong with your provider or model. Which model were you using?
What hardware are you running the 30b model on? I guess it needs at least 24GB VRAM for decent inference speeds.
The general rule is that you need at least as much VRAM as the quantized model file size. Quantized 30B models are usually around 19GB, so most likely a GPU with 24GB of VRAM.
But this also means tiny context windows. You can't fit gpt-oss:20b + more than a tiny file + instructions into 24GB
Gpt-oss is natively 4-bit, so you kinda can
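Rough back-of-the-envelope numbers (assumed bytes-per-param, not measurements):

    # VRAM for weights ≈ params * bytes-per-param; 4-bit quant is ~0.5 bytes/param
    echo "30 * 0.5" | bc   # ≈ 15 GB of weights for a 30B model at 4-bit
    # add a few GB for KV cache (grows with context length) and runtime overhead,
    # which is roughly how "30B ≈ 19 GB, wants a 24 GB card" falls out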
I'd like to know this, too. I'm just getting my feet wet with Ollama and local models using just the CPU, and it's obviously terribly slow (even with 24 cores and 128GB of DRAM). It's hard to gauge how much GPU money I'd need to plonk down to get acceptable performance for coding workflows.
I tried to build a similar local stack recently to save on API costs. In practice I found the hardware savings are a bit of a mirage for coding workflows. The local models hallucinate just enough that you end up spending more in lost time debugging than you would have paid for Sonnet or Opus to get it right the first time.
I was trying to get Claude Code to work with llama.cpp but could never figure out anything functional; it always insisted on a phone-home login for first-time setup. In Cline I'm getting better results with glm-4.7-flash than with qwen3-coder:30b.
Creating ~/.claude.json with {"hasCompletedOnboarding": true} is the key; then ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN work as expected.
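Concretely, something like this (the base URL and token value are placeholders; mind an existing config file):

    # skip the first-run login/onboarding prompt (don't clobber an existing config)
    [ -f ~/.claude.json ] || echo '{"hasCompletedOnboarding": true}' > ~/.claude.json

    # then point Claude Code at the local server
    export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
    export ANTHROPIC_AUTH_TOKEN=dummy
    claude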
Curious what llama-server flags you used. On my M1 Max 64GB MacBook I tried it in Claude Code (which has a 25K system message) and I get 3 tps.
But with Qwen3-30B-A3B I get 20 tps in CC.
This is cool. Not sure it's the first Claude Code-style coding agent that runs against Ollama models, though. Goose, OpenCode and others have been able to do that for a while, no?
Does this UI work with Open Code?
hey, thanks for sharing. I had to go to the Twitter feed to find the GitHub link:
https://github.com/21st-dev/1code
Thanks for catching that. I've now changed the URL at the top to the GitHub link, from https://twitter.com/serafimcloud/status/2014266928853110862.