For most users that wanted to run LLM locally, ollama solved the UX problem.
One command, and you are running the models even with the rocm drivers without knowing.
If llama provides such UX, they failed terrible at communicating that. Starting with the name. Llama.cpp: that's a cpp library! Ollama is the wrapper. That's the mental model. I don't want to build my own program! I just want to have fun :-P
Having read above article, I just gave llama.cpp a shot. It is as easy as the author says now, though definitely not documented quite as well. My quickstart:
Go to localhost:8000 for the Web UI. On Linux it accelerates correctly on my AMD GPU, which Ollama failed to do, though of course everyone's mileage seems to vary on this.
Was hoping it was so easy :) But I probably need to look into it some more.
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
llama_model_load_from_file_impl: failed to load model
Edit: @below, I used `nix-shell -p llama-cpp` so not brew related. Could indeed be an older version indeed! I'll check.
As it has been discussed in a few recent threads on HN, whenever a new model is released, running it successfully may need changes in the inference backends, such as llama.cpp.
There are 2 main reasons. One is the tokenizer, where new tokenizer definitions may be mishandled by the older tokenizer parsers.
The second reason is that each model may implement differently the tool invocations, e.g. by using different delimiter tokens and different text layouts for describing the parameters of a tool invocation.
Therefore running the Gemma-4 models encountered various problems during the first days after their release, especially for the dense 31B model.
Solving these problems required both a new version of llama.cpp (also for other inference backends) and updates in the model chat template and tokenizer configuration files.
So anyone who wants to use Gemma-4 should update to the latest version of llama.cpp and to the latest models from Huggingface, because the latest updates have been a couple of days ago.
I just hit that error a few minutes ago. I build my llama.cpp from source because I use CUDA on Linux. So I made the mistake of trying to run Gemma4 on an older version I had and I got the same error. It’s possible brew installs an older version which doens’t support Gemma4 yet.
And that's exactly why llama.cpp is not usable by casual users. They follow the "move fast and break things" model. With ollama, you just have to make sure you're getting/building the latest version.
Its not possible to run the latest model architectures without 'moving fast'. The only thing broken here is that they are trying to use an old version with a new model.
I'm a bit unsure what that has to do with someone running an outdated version of the program while trying to use a model that is supported in the latest release.
LlamaBarn is the MacOS app, not the HTTP API server, which is "llama-server".
On non-Apple PCs, "llama-server" is what you use, and you can connect to it either with a browser or with an application compatible with the OpenAI API.
Perhaps using "llama-server" as the name of the project would have been less confusing for newbies than "llama.cpp".
I confess that when I first heard about "llama.cpp" I also thought that it is just a library and that I have to write my own program in order to implement a complete LLM inference backend.
This is correct, and I avoided it for this reason, did not have the bandwidth to get into any cpp rabbit hole so just used whatever seemed to abstract it away.
I don't care about the GUI so much. Ollama lets me download, adjust and run a whole bunch of models and they are reasonably fast. Last time I compared it with Llama.cpp, finding out how to download and install models was a pain in Llama.cpp and it was also _much_ slower than Ollama.
If you today visit a models page on huggingface, the site will show you the exact oneliner you need to run to it on llama.cpp.
I didn't measure it, but both download and inference felt faster than ollama. One thing that was definitely better was memory usage, which may be important if you want to run small models on SCB.
agree. We can easily compare it with docker. Of course people can use runc directly, but most people select not to and use `docker run` instead.
And you can blame docker in a similar manner. LXC existed for at least 5 years before docker. But docker was just much more convenient to use for an average user.
UX is a huge factor for adoption of technology. If a project fails at creating the right interface, there is nothing wrong with creating a wrapper.
Notwithstanding the fact that there's about zero difference between `ollama run model-name` and `llama-cpp -hf model-name`, and that running things in the terminal is already a gigantic UX blocker (Ollama's popularity comes from the fact that it has a GUI), why are you putting the blame back on an open source project that owes you approximately zero communication ?
> Notwithstanding the fact that there's about zero difference between `ollama run model-name` and `llama-cpp -hf model-name`
There is a TON of difference. Ollama downloads the model from its own model library server, sticks it somewhere in your home folder with a hashed name and a proprietary configuration that doesn't use the in built metadata specified by the model creator. So you can't share it with any other tool, you can't change parameters like temp on the fly, and you are stuck with whatever quants they offer.
This was my issue with current client ecosystem. I get a .guff file. I should be able to open my AI Client of choice and File -> Open and select a .guff. Same as opening a .txt file. Alternatively, I have cloned a HF model, all AI Clients should automatically check for the HF cache folder.
The current offering have interfaces to HuggingFace or some model repo. They get you the model based on what they think your hardware can handle and save it to %user%/App Data/Local/%app name%/... (on windows). When I evaluated running locally I ended up with 3 different folders containing copies of the same model in different directory structures.
It seems like HuggingFace uses %user%/.cache/.. however, some of the apps still get the HF models and save them to their own directories.
Those features are 'fine' for a casual user who sticks with one program. It seems designed from the start to lock you into their wrapper. In the end they are all using llama cpp, comfy ui, openvino etc to abstract away the backed. Again this is fine but hiding the files from the user seems strange to me. If you're leaning on HF then why now use their own .cache?
In the end I get the latest llama.cpp releases for CUDA and SYCL and run llama-server. My best UX has been with LM Studio and AI Playground. I want to try Local AI and vLLM next. I just want control over the damn files.
That's one of my major annoyances with the current state of local model infrastructure: All the cruft around what should be a simple matter of downloading and using a file. All these cache directories and file renaming and config files that point to all of these things. The special, bespoke downloading cli tools. It's just kind of awkward from the point of view of someone who is used to just using simple CLI tools that do one thing. Imagine if sqlite3 required all of these paths and hashes and downloaders and configs rather than letting you just run:
Check out Koboldcpp. The dev has a specific philosophy about things (minimal or no dependencies, no installers, no logs, don't do anything to user's system they didn't ask for explicitly) that I find particularly agreeable. It's a single exec and includes the kitchen sink so there is no excuse not to try it.
Thanks for writing this, I hope people here will actually read this and not assume this is some unfounded hit piece. I was involved a little bit in llama.cpp and knew most of what you wrote and it’s just disgusting how ollama founders behaved!
For people looking for alternatives, I would also recommend llama-file, it’s a one file executable for any OS that includes your chosen model: https://github.com/mozilla-ai/llamafile?tab=readme-ov-file
It’s truly open source, backed by Mozilla, openly uses llama.cpp and was created by wizard Justine Tunney of CosmopolitanC fame.
No mention of the fact that Ollama is about 1000x easier to use. Llama.cpp is a great project, but it's also one of the least user friendly pieces of software I've used. I don't think anyone in the project cares about normal users.
I started with Ollama, and it was great. But I moved to llama.cpp to have more up-to-date fixes. I still use Ollama to pull and list my models because it's so easy. I then built my own set of scripts to populate a separate cache directory of hardlinks so llama-swap can load the gguf's into llama.cpp.
Exactly. The blog post states that the alternatives listed are similarly intuitive. They are not. If you just need a chat app, then sure, there’s plenty of options. But if you want an OpenAI compatible API with model management, accessibility breaks down fast.
I’m open to suggestions, but the alternatives outlined in the blog post ain’t it.
The reported alternatives seem pretty User-Friendly to me:
> LM Studio gives you a GUI if that’s what you want. It uses llama.cpp under the hood, exposes all the knobs, and supports any GGUF model without lock-in.
> Jan(https://www.jan.ai/) is another open-source desktop app with a clean chat interface and local-first design.
> Msty(https://msty.ai/) offers a polished GUI with multi-model support and built-in RAG. koboldcpp is another option with a web UI and extensive configuration options.
API wise: LM Studio has REST, OpenAI and more API Compatibilities.
All of those options were either too slow, or didnt work for me (Mac with Intel). I could have spent hours googling, but I downloaded Ollama and it just worked.
LMStudio is listed as an alternative. It offers a chat UI, a model server supporting OpenAI, Anthropic and LMStudio API interfaces. It supports loading the models on demand or picking what models you want loaded. And you can tweak every parameter.
And it uses llama.cpp which is the whole point of the blog post.
Thanks for pointing that out. From the description in the blog post it sounded like it was GUI only without an API, and I didn't bother looking into it because of that. But it look pretty nice, so I'll give it a try.
As other posters report, now llama-server implements an OpenAI compatible API and you can also connect to it with any Web browser.
I have not tried yet the OpenAI API, but it should have eliminated the last Ollama advantage.
I do not believe that the Ollama "curated" models are significantly easier to use for a newbie than downloading the models directly from Huggingface.
On Huggingface you have much more details about models, which can allow you to navigate through the jungle of countless model variants, to find what should be more suitable for yourself.
The fact criticized in TFA, that the Ollama "curated" list can be misleading about the characteristics of the models, is a very serious criticism from my point of view, which is enough for me to not use such "curated" models.
I am not aware of any alternative for choosing and downloading the right model for local inference that is superior to using directly the Huggingface site.
I believe that choosing a model is the most intimidating part for a newbie who wants to run inference locally.
If a good choice is made, downloading the model, installing llama.cpp and running llama-server are trivial actions, which require minimal skills.
And why do I use ggml-org/gemma-4-E4B-it-GGUF instead of one of the 162 other models that can be found under the ggml-org namespace? And how do I even know that this is the namespace to look at?
That's what I meant by model management. I'm too tired to scroll through a bazillion models that all have very cryptic names and abbreviations just to find the one that works well on my system with my software stack.
I want a simple interface that a tool like me can scroll through easily, click on, and then have a model that works well enough. If I put in that much brain power to get my LLM working, I might as well do the work myself instead of using an LLM in the first place.
>No mention of the fact that Ollama is about 1000x easier to use
I remember changing the context size from the default unusable 2k to something bigger the model actually supports required creating a new model file in Ollama if you wanted the change to persist (another alternative: set an env var before running ollama; although, if you go that low-level route, why not just launch llama.cpp). How was that easier? Did they change this?
I remember people complaining model X is "dumb" simply because Ollama capped the context size to a ridiculously small number by default.
IMHO trying to model Ollama after Docker actually makes it harder for casual users. And power users will have it easier with llama.cpp directly
Just in case you haven't seen it yet, llama.cpp now has a router mode that lets you hot-swap models. I've switched over from llama-swap and have been happy with it.
Not like it mattered much to me but llama-cpp is way lighter and 10x smaller in size.
Resumable downloads seem to work better in llama-cpp.
I love the inbuilt GUI.
I used ollama first and honestly, llama-cpp has been a much better experience.
Maybe given enough time, I would have seen the benefit of ollama but the inability to turn off updates even after users requested it extensively made me uninstall it. Postman PTSD is real.
> No mention of the fact that Ollama is about 1000x easier to use.
The point of the article is not to expound on how user-friendly "Ollama" is. It's about exposing the deception and shameful moral low ground they took.
I spend like 2 hours trying to get vulkan acceleration working with ollama, no luck (half models are not supported and crash it). With llama.cpp podman container starts and works in 5 minutes.
Koboldcpp is a single executable with a GUI launcher and a built in webui. It also supports tts, stt, image gen, embeddings, music creation, and a bunch of other stuff out of the box, and can download and browse HF models from within the GUI. That's pretty easy to use.
1. MIT-style licenses are "do what you want" as long as you provide a single line of attribution. Including building big closed source business around it.
2. MIT-style licenses are "do what you want" under the law, but they carry moral, GPL-like obligations to think about the "community."
To my knowledge Georgi Gerganov, the creator of llama.cpp, has only complained about attribution when it was missing. As an open-source developer, he selected a permissive license and has not complained about other issues, only the lack of credit. It seems he treats the MIT license as the first kind.
The article has other good points not related to licensing that are good to know. Like performance issues and simplicity that makes me consider llama.cpp.
The second interpretation is nonsense of course. If you want GPL-like obligations, use the GPL.
A license is what it says in the license, nothing extra. It's a legal document not a moral guideline.
I do think it's a very good idea to always use the GPL (even though commercially minded types always get their panties in a bunch about the GPL) for any user-facing software, to force everybody to 'play fair and share'. The only reason to use MIT imho is for a library implementing some sort of standard where you want that standard used by as many people as possible.
I don't understand people who use MIT for their project and then complain some commercial firm takes their contributions and runs with it. If that's not what you want don't use that license.
Apart from license terms and moral obligations being a bad mix, companies don't have morals. Don't get me wrong, I think they should have! But they don't.
People have morals. Groups of people (a company, a country , a mob) not so much. Sadly.
MIT license lets you do what you want with the code. That's the deal.
The blob storage thing is the real problem though. Nobody talks about it until they try to move their models somewhere else.
Well, yeah, which is why it's silly when people use MIT licenses and then complain that those, for example, with the motto "Build > ask. Disrupt or die.", only take and don't contribute anything back, instead of using a license that demands it.
Georgi could have switched to GPL whenever he wanted. He didn't. That's the answer.
The loudest voices here aren't the ones writing the code. Meanwhile both projects kept shipping and users got more options. Hard to see the harm.
Do they still not let you change the default model folder? You had to go through this whole song and dance to manually register a model via a pointless dockerfile wannabe that then seemed to copy the original model into their hash storage (again, unable to change where that storage lived).
At the time I dropped it for LMStudio, which to be fair was not fully open source either, but at least exposed the model folder and integrated with HF rather than a proprietary model garden for no good reason.
This also annoyed me a lot. I was running it before upgrading the SSD storage and I wanted to compare with LM Studio. Figured it would be good to have both interfaces use the same models downloaded from HF.
Had to go down the same rabbit hole of finding where things are, how they're sorted/separated/etc. It was unnecessarily painful
> the file gets copied into Ollama’s hashed blob storage, you still can’t share the GGUF with other tool
This is the reason I had stopped using it. I think they might be doing it for deduplication however it makes it impossible to use the same model with other tools. Every other tool can just point to the same existing gguf and can go. Whether its their intention or not, it's making it difficult to try out other tools. Model files are quite large as you know and storage and download can become issues. (They are for me)
That doesn't make sense. Why would llama.cpp need to move any faster than ollama? For that matter, why not have a llama.cpp package and llama.cpp-git in the AUR?
The claim was that llama.cpp moves too fast to be in Arch's normal repos. But Arch does package ollama. Therefore, either 1. ollama somehow avoids the need to move fast, or 2. it moves at an acceptable pace when packaged.
Edit: Or perhaps put differently: If ollama includes a copy of llama.cpp and has a non-AUR package, why can't there be a non-AUR package that's just llama.cpp without ollama?
Sometimes Arch has the software you want at the version you want, other times it doesn't but other distros do. That's why there's half a billion distros instead of just one.
The name "llama.cpp" doesn't seem very friendly anymore nowadays... Back then, "llama" probably referred to those models from Facebook, and now those Llama series models clearly can't represent the strongest open-source models anymore...
Doesn't the "llama" in "ollama" present exactly the same issue?
Edit: or maybe that was your point. I guess that for historical reasons this is a kind of generic name for local deployments now (see https://www.reddit.com/r/LocalLLaMA) just like people will call anything ChatGPT.
I always avoided Ollama because it smelled like a project that was trying so desperately to own the entire workflow. I guess I dodged a bigger bullet than I knew.
the article buries what's actaully the most practical gotcha: ollama's hashed blob storage means if you've been pulling models for months, switching tools requires re-downloading everything because you can't just point another runtime at those files, and most users won't discover this until they're already invested enough that it genuinely hurts to leave.
I stopped using Ollama a couple of months ago. Not out of frustration, but because llama.cpp has improved a lot recently with router mode, hot-swapping, a modern and simple web UI, MCP support and lots of other improvements.
The attribution and lock-in arguments are the loud parts of this story, but the quieter production reason to move is concurrency. llama.cpp's server takes parallel N with cont-batching enabled by default, which interleaves tokens from multiple requests inside a single batch and keeps the GPU busy. Ollama defaults its parallel slots low and the interaction is less transparent, so the first time three people share a single model instance you feel it before any of the ethics become relevant. For a 70B Q4_K_M on a workstation, the real ceiling is KV cache fragmentation, and you have to size the context window around the parallel count rather than around one user. What is the highest parallel value anyone here has kept stable on a 70B Q4_K_M before the cache eviction pattern starts hurting quality?
It feels like a bit of history is missing... If ollama was founded 3 years before llama.cpp was released, what engine did they use then? When did they transition?
I don't think that is the case. Llama.cpp appeared within weeks after meta released llama to select researchers (which then made it out to the public). 3 years before that nobody knew of the name llama. I'm sure that llama.cpp existed first
So, on a mac, what good alternative to ollama supports mlx for acceleration? My main use case is that I have an old m1 max macbook pro with 64 gb ram that I use as a model server.
I noticed the performance issues too. I started using Jan recently and tried running the same model via llama.cpp vs local ollama, and the llama.cpp one was noticeably faster.
jan.ai would be the ideal route to take here then. its open source, has a simple chat interface, it uses llama.cpp, it lets you search for models and downloads them, and it supports .gguf so youre not locked in if you want to use the models with another program later on
This is a bit like saying stop using Ubuntu, use Debian instead.
Both llama.cpp and ollama are great and focused on different things and yet complement each other (both can be true at the same time!)
Ollama has great ux and also supports inference via mlx, which has better performance on apple silicon than llama.cpp
I'm using llama.cpp, ollama, lm studio, mlx etc etc depending on what is most convenient for me at the time to get done what I want to get done (e.g. a specific model config to run, mcp, just try a prompt quickly, …)
> This is a bit like saying stop using Ubuntu, use Debian instead.
Not really, because Ubuntu has always acknowledged Debian and explicitly documented the dependency:
> Debian is the rock on which Ubuntu is built.
> Ubuntu builds on the Debian architecture and infrastructure and collaborates widely with Debian developers, but there are important differences. Ubuntu has a distinctive user interface, a separate developer community (though many developers participate in both projects) and a different release process.
> Both llama.cpp and ollama are great and focused on different things and yet complement each other
According to the article, ollama is not great (that’s an understatement), focused on making money for the company, stealing clout and nothing else, and hardly complements llama.cpp at all since not long after the initial launch. All of these are backed by evidence.
You may disagree, but then you need to refute OP’s points, not try to handwave them away with a BS analogy that’s nothing like the original.
They might not use the word, but the behavior they describe is evil:
"
This isn’t a matter of open-source etiquette, the MIT license has exactly one major requirement: include the copyright notice. Ollama didn’t.
The community noticed. GitHub issue #3185 was opened in early 2024 requesting license compliance. It went over 400 days without a response from maintainers. When issue #3697 was opened in April 2024 specifically requesting llama.cpp acknowledgment, community PR #3700 followed within hours. Ollama’s co-founder Michael Chiang eventually added a single line to the bottom of the README: “llama.cpp project founded by Georgi Gerganov.”
"
> This creates a recurring pattern on r/LocalLLaMA: new model launches, people try it through Ollama, it’s broken or slow or has botched chat templates, and the model gets blamed instead of the runtime.
Seems like maybe, at least some of the time, you’re being underwhelmed my ollama not the model.
The better performance point alone seems worth switching away
I follow the llama.cpp runtime improvements and it’s also true for this project. They may rush a bit less but you also have to wait for a few days after a model release to get a working runtime with most features.
With Ollama, the initial one-time setup is a little easier, and the CLI is useful, but is it worth dysfunctional templates, worse performance, and the other issues? Not to me.
Jinja templates are very common, and Jinja is not always losslessly convertible to the Go template syntax expected by Ollama. This means that some models simply cannot work correctly with Ollama. Sometimes the effects of this incompatibility are subtle and unpredictable.
Does it have a model registry with an API and hot swapping or you still have to use sometime like llama swap as suggested in the article ? Or is it CLI?
> Red Hat’s ramalama is worth a look too, a container-native model runner that explicitly credits its upstream dependencies front and center. Exactly what Ollama should have done from the start.
% ramalama run qwen3.5-9b
Error: Manifest for qwen3.5-9b:latest was not found in the Ollama registry
I think the biggest advantage for me with ollama is the ability to "hotswap" models with different utility instead of restarting the server with different models combined with the simple "ollama pull model". In other words, it has been quite convenient.
Due to this post I had to search a bit and it seems that llama.cpp recently got router support[1], so I need to have a look at this.
My main use for this is a discord bot where I have different models for different features like replying to messages with images/video or pure text, and non reply generation of sentiment and image descriptions. These all perform best with different models and it has been very convenient for the server to just swap in and out models on request.
It's a joke... but also not really? I mean VLC is "just" an interface to play videos. Videos are content files one "interact" with, mostly play/pause and few other functions like seeking. Because there are different video formats VLC relies on codecs to decode the videos, so basically delegating the "hard" part to codecs.
Now... what's the difference here? A model is a codec, the interactions are sending text/image/etc to it, output is text/image/etc out. It's not even radically bigger in size as videos can be huge, like models.
I'm confused as why this isn't a solved problem, especially (and yes I'm being a big sarcastic here, can't help myself) in a time where "AI" supposedly made all smart wise developers who rely on it 10x or even 1000x more productive.
What problem is it that you are confused isn't solved?
I think the codec analogy is neat but isn't the codec here llama.cpp, and the models are content files? Then the equivalent of VLC are things like LMStudio etc. which use llama.cpp to let you run models locally?
I'd guess one reason we haven't solved the "codec" layer is that there doesn't seem to be a standard that open model trainers have converged on yet?
This is partly why we're building LlamaBarn. It's a lightweight macOS menu bar app that runs llama-server under the hood, with models stored as standard GGUFs in your Hugging Face cache — the same location llama-server uses by default. No separate model store, no lock-in.
It is a bit off-topic, but would it possible to provide a light mode for this blog? I used to work during the day time, and my pupils had to contract to read, making it a very poor reading experience.
I like Ollama Cloud service (I'm paid pro user), because it let me test several open source LLMs very fast - I dont need to download anything locally, just change the model name in the API. If I like the model then I can download it and run locally with sensitive data. I also like their CLI, because it is simple to use.
The fact that they are trying to make money is normal - they are a company. They need to pay the bills.
I agree that they should improve communication, but I assume it is still small company with a lot of different requests, and some things might be overlooked.
Overall I like the software and services they provide.
I'm sorry, on a mac, Ollama just works. It lets me use a model and test it quickly. This is like saying stop using google drive, upload everything to s3 instead!
When i'm using Ollama - I honeslty don't care about performance, I'm looking to try out a model and then if it seems good, place it onto a most dedicated stack specifically for it.
Ollama is a bit easier to use, you’re right. But the point of the article is the way they just disregarded the license of llama.cpp, moved away from open source while still claiming to be open source and pivoted to cloud offerings when the whole point was to run local models all while without contributing anything back to the big open source projects it owns its existence to. Maybe you don’t care about performance (weird given performance is the main blocker for local LLMs) but you should care about the ethics of companies making the product you use?
And anyway this thread has lots of alternatives that are even easier to use and don’t shit on the open source community making things happen.
I'm making more of a pragmatic point. While ethics of companies are important, i'm still using OpenAI, Anthropic, Microsoft, Apple etc, so I definitely accept a trade-off between morality and ease-of-use.
Currently i've found Ollama to have the best intuitive experience for trying new models. Once i've tried those models and decide on something to use for a project, I can deploy them, and not need to use a UI again.
I'll be trying out the other options in this thread, but my point is that ease of use is going to triumph over the other points the original post made, and some of the alternatives mentioned in the original post miss why Ollama is so popular.
Keep in mind that as the post says, the model you’re trying via ollama may not be the model you asked for! And the performance may be subpar and not reflect the model true performance. Otherwise, I agree they offer an easy and polished product and that explains why they are so popular, besides their personal connections having resulted in their OpenAI partnership.
Has anybody figured some of the best flags to compile llama.cpp for rocm? I'm using the framework desktop and the Vulkan backend, because it was easier to compile out of the box, but I feel there's large peformance gains on the table by swtiching to rocm. Not sure if installing with brew on ubuntu would be easier.
I was pretty big on ollama, it seemed like a great default solution. I had alpha that it was a trash organization but I didn't listen because I just liked having a reliable inference backend that didn't require me to install torch. I switched to llama.cpp for everything maybe 6 months ago because of how fucking frustrating every one of my interactions with ollama (the organization) were. I wanna publicly apologize to everyone who's concerns I brushed off. Ollama is a vampire on the culture and their demise cannot come soon enough.
FWIW llama.cpp does almost everything ollama does better than ollama with the exception of model management, but like, be real, you can just ask it to write an API of your preferred shape and qwen will handle it without issue.
Oh I was completely wrong about the model management stuff btw, llama-server has fully fledged model management baked in now, you just have to make an *.ini with your model configs (most models can do this themselves, I pointed qwen3.6 at the relevant part of the docs and it wrote me an ini with all of my model configs in about 2 minutes) and you can swap between models via api or a dropdown menu in the UI.
> Ollama is a Y Combinator-backed (W21) startup, founded by engineers who previously built a Docker GUI that was acquired by Docker Inc. The playbook is familiar: wrap an existing open-source project in a user-friendly interface, build a user base, raise money, then figure out monetization.
The progression follows the pattern cleanly:
1. Launch on open source, build on llama.cpp, gain community trust
2. Minimize attribution, make the product look self-sufficient to investors
3. Create lock-in, proprietary model registry format, hashed filenames that don’t work with other tools
4. Launch closed-source components, the GUI app
5. Add cloud services, the monetization vector
The CLI is great locally, but the architecture fights you in production. Putting a stateful daemon that manages its own blob storage inside a container is a classic anti-pattern. I ended up moving to a proper stateless binary like llama-server for k8s.
Have you ever tried going to the model registry and seeing that the model was recently updated? What updated? What changed? Should I re-download this 20GB file?
I guess if you're not frustrated with things like this then sure, no need to stop using it.
llama.cpp was already public by March 10, 2023. Ollama-the-company may have existed earlier through YC Winter 2021, but that is not the same thing as having a public local-LLM runtime before llama.cpp. In fact, Ollama’s own v0.0.1 repo says: “Run large language models with llama.cpp” and describes itself as a “Fast inference server written in Go, powered by llama.cpp.” Ollama’s own public blog timeline then starts on August 1, 2023 with “Run Llama 2 uncensored locally,” followed by August 24, 2023 with “Run Code Llama locally.” So the public record does not really support any “they were doing local inference before llama.cpp” narrative.
And that is why the attribution issue matters. If your public product is, from day one, a packaging / UX / distribution layer on top of upstream work, then conspicuous credit is not optional. It is part of the bargain. “We made this easier for normal users” is a perfectly legitimate contribution. But presenting that contribution in a way that minimizes the upstream engine is exactly what annoys people.
The founders’ pre-LLM background also points in the same direction. Before Ollama, Jeffrey Morgan and Michael Chiang were known for Kitematic, a Docker usability tool acquired by Docker on March 13, 2015. So the pattern that fits the evidence is not “they pioneered local inference before everyone else.” It is “they had prior experience productizing infrastructure, then applied that playbook to the local-LLM wave once llama.cpp already existed.”
So my issue is not that Ollama is a wrapper. Wrappers can be useful. My issue is that they seem to have taken the social upside of open-source dependence without showing the level of visible credit, humility, and ecosystem citizenship that should come with it. The product may have solved a real UX problem, but the timeline makes it hard to treat them as if they were the originators of the underlying runtime story.
They seem very good at packaging other people’s work, and not quite good enough at sounding appropriately grateful for that fact.
vLLM isn't suitable for people running LLMs side-by-side with regular applications on their PC. It is very good at hosting LLMs for production on dedicated servers. For the prod usecase ollama/llamacpp are practically useless (but that's ok - it's not the projects goal to be).
I'm a llama.cpp user, but apart from the MIT licensing issue, I personally don't see what's the problem here is? Sure Ollama could have advertised better that llama.cpp was it's original backend, but were they obligated to? It's no different to Docker or VMWare that hitch a ride on kernel primitives etc.
Ah man the VC death trap. It's ok. I don't mean it like that but this is classic. It's unavoidable. They gotta make money. They took money, they gotta make money. It's not easy. Everyone has principles, developers more than anyone. They are developers, they are people like you and me. They didn't even start as ollama. They started as a kubernetes infra project in YC and pivoted. Listen don't be hard on these guys. It's hard enough. Trust me I did it. And not as well them.
This is the game. We shouldn't delude ourselves into thinking there are alternative ways to become profitable around open source, there aren't. You effectively end up in this trap and there's no escape and then you have to compromise on everything to build the company, return the money, make a profit. You took people's money, now you have to make good, there's no choice. And anyone who thinks differently is deluded. Open source only goes one way. To the enterprise. Everything else is burning money and wasting time. Look at Docker. Textbook example of the enormous struggle to capture the value of a project that had so much potential, defined an industry and ultimately failed. Even the reboot failed. Sorry. It did.
This stuff is messy. Give them some credit. They give you an epic open source project. Be grateful for that. And now if you want to move on, move on. They don't need a hard time. They're already having a hard time. These guys are probably sweating bullets trying to make it work while their investors breathe down their necks waiting for the payoff. Let them breathe.
> This stuff is messy. Give them some credit. They give you an epic open source project.
It seems to me the epic open source project was given to us by Georgi Gerganov. These people just tried to milk it for some money, and made everything a little worse in the process.
Especially when the solid core now ships with a web ui and API compatibility with OpenAI and Antropic. In my test of ai clients, Ollama was the only one I deleted.
With such concurrency in the market, it is unforgivable to manage a product that way. The concurrency will kill you.
Clients get disappointed, alternatives have better services, and more are popping out monthly. If they continue that way, nothing good will happen, unfortunately :(
Yeah my thoughts exactly. Definitely slop. I have no objection to using AI to help writing. I just don't want to read the same sloppy cliches again and again and again. The short sentences. The Bigger Picture. Here's the rub. It's not just A, it's B.
It's like those cliche titles - for fun and profit, the unreasonable effectiveness of, all you need is, etc. etc. but throughout the prose. Stop it guys!
Sure. Short sentences like "It shouldn’t be.", "I’ve moved on.", "Ollama didn’t.", etc.
Not-this-but-that like "The local LLM ecosystem doesn’t need Ollama. It needs llama.cpp."
Weird signposting: "Benchmarks tell the story."
Heres-the-rub conclusion: "The Bigger Picture"
Starting every title with "The ...".
It's definitely largely human-written, but there are enough slop-isms to make it annoying to read. And of course it's totally possible for a human to write an an AI style, but that doesn't make it any less annoying.
Another scummy YCombinator project, one of many lately. Looks like no-one is left at the wheel, at least as long as the valuations (and hence money) keep coming in.
I find the style of writing incredibly annoying (it doesn't make the point, full of hyperbole) and the website has the standard slopsite black background and glowing CSS.
That's because it was fully written by an LLM, as usual lately with all the articles on the front page of HN.
No wonder I get downvoted to hell every time I mention this... People here can't even tell anymore. They just find this horrible slop completely normal. HN is just another dead website filled with slop articles, time to move on to some smaller reddit communities...
For most users that wanted to run LLM locally, ollama solved the UX problem.
One command, and you are running the models even with the rocm drivers without knowing.
If llama provides such UX, they failed terrible at communicating that. Starting with the name. Llama.cpp: that's a cpp library! Ollama is the wrapper. That's the mental model. I don't want to build my own program! I just want to have fun :-P
Llama.cpp now has a gui installed by default. It previously lacked this. Times have changed.
Having read above article, I just gave llama.cpp a shot. It is as easy as the author says now, though definitely not documented quite as well. My quickstart:
brew install llama.cpp
llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000
Go to localhost:8000 for the Web UI. On Linux it accelerates correctly on my AMD GPU, which Ollama failed to do, though of course everyone's mileage seems to vary on this.
Was hoping it was so easy :) But I probably need to look into it some more.
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4' llama_model_load_from_file_impl: failed to load model
Edit: @below, I used `nix-shell -p llama-cpp` so not brew related. Could indeed be an older version indeed! I'll check.
As it has been discussed in a few recent threads on HN, whenever a new model is released, running it successfully may need changes in the inference backends, such as llama.cpp.
There are 2 main reasons. One is the tokenizer, where new tokenizer definitions may be mishandled by the older tokenizer parsers.
The second reason is that each model may implement differently the tool invocations, e.g. by using different delimiter tokens and different text layouts for describing the parameters of a tool invocation.
Therefore running the Gemma-4 models encountered various problems during the first days after their release, especially for the dense 31B model.
Solving these problems required both a new version of llama.cpp (also for other inference backends) and updates in the model chat template and tokenizer configuration files.
So anyone who wants to use Gemma-4 should update to the latest version of llama.cpp and to the latest models from Huggingface, because the latest updates have been a couple of days ago.
I just hit that error a few minutes ago. I build my llama.cpp from source because I use CUDA on Linux. So I made the mistake of trying to run Gemma4 on an older version I had and I got the same error. It’s possible brew installs an older version which doens’t support Gemma4 yet.
Ah it was indeed just that!
I'm now on:
$ llama --version version: 8770 (82764d8) built with GNU 15.2.0 for Linux x86_64
(From Nix unstable)
And this works as advertised, nice chat interface, but no openai API I guess, so no opencode...
check on same port, there is an OpenAI API https://github.com/ggml-org/llama.cpp/tree/master/tools/serv...
Good stuff, thanx!
And that's exactly why llama.cpp is not usable by casual users. They follow the "move fast and break things" model. With ollama, you just have to make sure you're getting/building the latest version.
Its not possible to run the latest model architectures without 'moving fast'. The only thing broken here is that they are trying to use an old version with a new model.
and Ollama suffered the same fate when wanting to try new models
What fate?
the impedance mismatch between when models are released and the capability of Ollama and other servers capability for use.
I'm a bit unsure what that has to do with someone running an outdated version of the program while trying to use a model that is supported in the latest release.
While that might be true, for as long as its name is “.cpp”, people are going to think it’s a C++ library and avoid it.
This is the first I'm learning that it isn't just a C++ library.
In fact the first line of the wikipedia article is:
> llama.cpp is an open source software library
It would make sense to just make the GUI a separate project, they could call it llama.gui.
It would make even more sense to rename it to ollama, get a copyright for the name, and see how thieves complain they've been robbed :>
it is called llama-barn https://github.com/ggml-org/LlamaBarn
LlamaBarn is the MacOS app, not the HTTP API server, which is "llama-server".
On non-Apple PCs, "llama-server" is what you use, and you can connect to it either with a browser or with an application compatible with the OpenAI API.
Perhaps using "llama-server" as the name of the project would have been less confusing for newbies than "llama.cpp".
I confess that when I first heard about "llama.cpp" I also thought that it is just a library and that I have to write my own program in order to implement a complete LLM inference backend.
This is correct, and I avoided it for this reason, did not have the bandwidth to get into any cpp rabbit hole so just used whatever seemed to abstract it away.
Frankly I think the cli UX and documentation is still much better for ollama.
It makes a bunch of decisions for you so you don't have to think much to get a model up and running.
I don't care about the GUI so much. Ollama lets me download, adjust and run a whole bunch of models and they are reasonably fast. Last time I compared it with Llama.cpp, finding out how to download and install models was a pain in Llama.cpp and it was also _much_ slower than Ollama.
That is not true.
If you today visit a models page on huggingface, the site will show you the exact oneliner you need to run to it on llama.cpp.
I didn't measure it, but both download and inference felt faster than ollama. One thing that was definitely better was memory usage, which may be important if you want to run small models on SCB.
"LM Studio… Jan… Msty… koboldcpp…"
Plenty of alternatives listed. Can anyone with experience suggest the likely successor to Ollama? I have a Mac Mini but don't mind a C/L tool.
I think, as was pointed out, Ollama won because of how easy it is to set up, pull down new models. I would expect similar for a replacement.
If you don't want to have to think about it, LM Studio is probably the best choice.
How about kobold.cpp then? Or LMStudio (I know it's not open source, but at least they give proper credit to llama.cpp)?
Re curation: they should strive to not integrate broken support for models and avoid uploading broken GGUFs.
agree. We can easily compare it with docker. Of course people can use runc directly, but most people select not to and use `docker run` instead.
And you can blame docker in a similar manner. LXC existed for at least 5 years before docker. But docker was just much more convenient to use for an average user.
UX is a huge factor for adoption of technology. If a project fails at creating the right interface, there is nothing wrong with creating a wrapper.
> For most users that wanted to run LLM locally, ollama solved the UX problem
This does not absolve them from the license violation
>solved the UX problem.
>One command
Notwithstanding the fact that there's about zero difference between `ollama run model-name` and `llama-cpp -hf model-name`, and that running things in the terminal is already a gigantic UX blocker (Ollama's popularity comes from the fact that it has a GUI), why are you putting the blame back on an open source project that owes you approximately zero communication ?
> Ollama's popularity comes from the fact that it has a GUI
It's not the GUI, it's the curated model hosting platform. Way easier to use than HF for casual users.
LM Studio also offers curation, while giving credit to llama.cpp and also easy search across all of Huggingface's GGUF's
It also made easy for casual users to think that they were running deepseek.
> Notwithstanding the fact that there's about zero difference between `ollama run model-name` and `llama-cpp -hf model-name`
There is a TON of difference. Ollama downloads the model from its own model library server, sticks it somewhere in your home folder with a hashed name and a proprietary configuration that doesn't use the in built metadata specified by the model creator. So you can't share it with any other tool, you can't change parameters like temp on the fly, and you are stuck with whatever quants they offer.
This was my issue with current client ecosystem. I get a .guff file. I should be able to open my AI Client of choice and File -> Open and select a .guff. Same as opening a .txt file. Alternatively, I have cloned a HF model, all AI Clients should automatically check for the HF cache folder.
The current offering have interfaces to HuggingFace or some model repo. They get you the model based on what they think your hardware can handle and save it to %user%/App Data/Local/%app name%/... (on windows). When I evaluated running locally I ended up with 3 different folders containing copies of the same model in different directory structures.
It seems like HuggingFace uses %user%/.cache/.. however, some of the apps still get the HF models and save them to their own directories.
Those features are 'fine' for a casual user who sticks with one program. It seems designed from the start to lock you into their wrapper. In the end they are all using llama cpp, comfy ui, openvino etc to abstract away the backed. Again this is fine but hiding the files from the user seems strange to me. If you're leaning on HF then why now use their own .cache?
In the end I get the latest llama.cpp releases for CUDA and SYCL and run llama-server. My best UX has been with LM Studio and AI Playground. I want to try Local AI and vLLM next. I just want control over the damn files.
That's one of my major annoyances with the current state of local model infrastructure: All the cruft around what should be a simple matter of downloading and using a file. All these cache directories and file renaming and config files that point to all of these things. The special, bespoke downloading cli tools. It's just kind of awkward from the point of view of someone who is used to just using simple CLI tools that do one thing. Imagine if sqlite3 required all of these paths and hashes and downloaders and configs rather than letting you just run:
Check out Koboldcpp. The dev has a specific philosophy about things (minimal or no dependencies, no installers, no logs, don't do anything to user's system they didn't ask for explicitly) that I find particularly agreeable. It's a single exec and includes the kitchen sink so there is no excuse not to try it.
But if you’re just a GUI wrapper then at least attribute the library you created the GUI for
Whip that llama! Oh wait, that's a different program.
LOL
https://www.youtube.com/watch?v=HaF-nRS_CWM
but if ollama is much slower, that's cutting on your fun and you'll be having better fun with a faster GUI
You’ve completely missed the point.
I got tired of repeating the same points and having to dig up sources every time, so here's the timeline (as I know it) in one place with sources.
Thanks for writing this, I hope people here will actually read this and not assume this is some unfounded hit piece. I was involved a little bit in llama.cpp and knew most of what you wrote and it’s just disgusting how ollama founders behaved! For people looking for alternatives, I would also recommend llama-file, it’s a one file executable for any OS that includes your chosen model: https://github.com/mozilla-ai/llamafile?tab=readme-ov-file
It’s truly open source, backed by Mozilla, openly uses llama.cpp and was created by wizard Justine Tunney of CosmopolitanC fame.
I also thought llamafile deserves a mention. Once you have all model params and tunings done bakes 'em into a single portable binary!
Thank you; it's an educating read for me, as someone who doesn't dwell in this space, but cares about FOSS in its true spirit.
Really nice. I wasn't aware of any of this.
Thanks, did not know any of this.
> Ollama eventually added ollama run hf.co/{repo}:{quant} to pull directly from Hugging Face, which partially addresses the availability problem.
uh actually, _we_ did (generates a Docker-style manifest on the fly)
Great writing, thanks for the summary and timeline.
No mention of the fact that Ollama is about 1000x easier to use. Llama.cpp is a great project, but it's also one of the least user friendly pieces of software I've used. I don't think anyone in the project cares about normal users.
I started with Ollama, and it was great. But I moved to llama.cpp to have more up-to-date fixes. I still use Ollama to pull and list my models because it's so easy. I then built my own set of scripts to populate a separate cache directory of hardlinks so llama-swap can load the gguf's into llama.cpp.
Exactly. The blog post states that the alternatives listed are similarly intuitive. They are not. If you just need a chat app, then sure, there’s plenty of options. But if you want an OpenAI compatible API with model management, accessibility breaks down fast.
I’m open to suggestions, but the alternatives outlined in the blog post ain’t it.
The reported alternatives seem pretty User-Friendly to me:
> LM Studio gives you a GUI if that’s what you want. It uses llama.cpp under the hood, exposes all the knobs, and supports any GGUF model without lock-in.
> Jan(https://www.jan.ai/) is another open-source desktop app with a clean chat interface and local-first design.
> Msty(https://msty.ai/) offers a polished GUI with multi-model support and built-in RAG. koboldcpp is another option with a web UI and extensive configuration options.
API wise: LM Studio has REST, OpenAI and more API Compatibilities.
All of those options were either too slow, or didnt work for me (Mac with Intel). I could have spent hours googling, but I downloaded Ollama and it just worked.
So no, they are not alternatives to ollama
LM Studio is basically Ollama except they give attribution. It offers all of the same features including the ability to host a server.
What do you mean?
LMStudio is listed as an alternative. It offers a chat UI, a model server supporting OpenAI, Anthropic and LMStudio API interfaces. It supports loading the models on demand or picking what models you want loaded. And you can tweak every parameter.
And it uses llama.cpp which is the whole point of the blog post.
Thanks for pointing that out. From the description in the blog post it sounded like it was GUI only without an API, and I didn't bother looking into it because of that. But it look pretty nice, so I'll give it a try.
What you say was true in the past.
As other posters report, now llama-server implements an OpenAI compatible API and you can also connect to it with any Web browser.
I have not tried yet the OpenAI API, but it should have eliminated the last Ollama advantage.
I do not believe that the Ollama "curated" models are significantly easier to use for a newbie than downloading the models directly from Huggingface.
On Huggingface you have much more details about models, which can allow you to navigate through the jungle of countless model variants, to find what should be more suitable for yourself.
The fact criticized in TFA, that the Ollama "curated" list can be misleading about the characteristics of the models, is a very serious criticism from my point of view, which is enough for me to not use such "curated" models.
I am not aware of any alternative for choosing and downloading the right model for local inference that is superior to using directly the Huggingface site.
I believe that choosing a model is the most intimidating part for a newbie who wants to run inference locally.
If a good choice is made, downloading the model, installing llama.cpp and running llama-server are trivial actions, which require minimal skills.
> On Huggingface you have much more details about models...
For a (brand new!) newbie, it's very, very likely to be information overload.
They're still at the start of their journey, so simple tends to be better for 90% of users. ;)
like someone said above: brew install llama.cpp
llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000 (with MCP support and web chat interface)
and you have OpenAI API on the same 8000 port. (https://github.com/ggml-org/llama.cpp/tree/master/tools/serv... lists the endpoints)
And why do I use ggml-org/gemma-4-E4B-it-GGUF instead of one of the 162 other models that can be found under the ggml-org namespace? And how do I even know that this is the namespace to look at?
That's what I meant by model management. I'm too tired to scroll through a bazillion models that all have very cryptic names and abbreviations just to find the one that works well on my system with my software stack.
I want a simple interface that a tool like me can scroll through easily, click on, and then have a model that works well enough. If I put in that much brain power to get my LLM working, I might as well do the work myself instead of using an LLM in the first place.
1. Go to HF
2. Choose the model they recommend
3. Run the one-liner the site gives you
Bonus: faster access to latest models and better memory usage
>No mention of the fact that Ollama is about 1000x easier to use
I remember changing the context size from the default unusable 2k to something bigger the model actually supports required creating a new model file in Ollama if you wanted the change to persist (another alternative: set an env var before running ollama; although, if you go that low-level route, why not just launch llama.cpp). How was that easier? Did they change this?
I remember people complaining model X is "dumb" simply because Ollama capped the context size to a ridiculously small number by default.
IMHO trying to model Ollama after Docker actually makes it harder for casual users. And power users will have it easier with llama.cpp directly
> so llama-swap can load
Just in case you haven't seen it yet, llama.cpp now has a router mode that lets you hot-swap models. I've switched over from llama-swap and have been happy with it.
It's a lot better than it used to be and honestly way more powerful. If you haven't looked lately, you should https://github.com/ggml-org/llama.cpp/tree/master/tools/serv...
> No mention of the fact that Ollama is about 1000x easier to use.
Easier than what?
I came across LM Studio (mentioned in the post) about 3 years ago before I even knew what Ollama as. It was far better even then.
Not like it mattered much to me but llama-cpp is way lighter and 10x smaller in size.
Resumable downloads seem to work better in llama-cpp.
I love the inbuilt GUI.
I used ollama first and honestly, llama-cpp has been a much better experience.
Maybe given enough time, I would have seen the benefit of ollama but the inability to turn off updates even after users requested it extensively made me uninstall it. Postman PTSD is real.
> No mention of the fact that Ollama is about 1000x easier to use.
The point of the article is not to expound on how user-friendly "Ollama" is. It's about exposing the deception and shameful moral low ground they took.
Least friendly you’ve used makes me think you’ve been spoiled. :)
Agreed ollama is a good intro but once you move beyond it starts to be a pain.
I spend like 2 hours trying to get vulkan acceleration working with ollama, no luck (half models are not supported and crash it). With llama.cpp podman container starts and works in 5 minutes.
Koboldcpp is a single executable with a GUI launcher and a built in webui. It also supports tts, stt, image gen, embeddings, music creation, and a bunch of other stuff out of the box, and can download and browse HF models from within the GUI. That's pretty easy to use.
Two Views of MIT-Style Licenses:
1. MIT-style licenses are "do what you want" as long as you provide a single line of attribution. Including building big closed source business around it.
2. MIT-style licenses are "do what you want" under the law, but they carry moral, GPL-like obligations to think about the "community."
To my knowledge Georgi Gerganov, the creator of llama.cpp, has only complained about attribution when it was missing. As an open-source developer, he selected a permissive license and has not complained about other issues, only the lack of credit. It seems he treats the MIT license as the first kind.
The article has other good points not related to licensing that are good to know. Like performance issues and simplicity that makes me consider llama.cpp.
The second interpretation is nonsense of course. If you want GPL-like obligations, use the GPL.
A license is what it says in the license, nothing extra. It's a legal document not a moral guideline.
I do think it's a very good idea to always use the GPL (even though commercially minded types always get their panties in a bunch about the GPL) for any user-facing software, to force everybody to 'play fair and share'. The only reason to use MIT imho is for a library implementing some sort of standard where you want that standard used by as many people as possible.
I don't understand people who use MIT for their project and then complain some commercial firm takes their contributions and runs with it. If that's not what you want don't use that license.
Apart from license terms and moral obligations being a bad mix, companies don't have morals. Don't get me wrong, I think they should have! But they don't.
People have morals. Groups of people (a company, a country , a mob) not so much. Sadly.
MIT license lets you do what you want with the code. That's the deal. The blob storage thing is the real problem though. Nobody talks about it until they try to move their models somewhere else.
Well, yeah, which is why it's silly when people use MIT licenses and then complain that those, for example, with the motto "Build > ask. Disrupt or die.", only take and don't contribute anything back, instead of using a license that demands it.
exactly
Georgi could have switched to GPL whenever he wanted. He didn't. That's the answer. The loudest voices here aren't the ones writing the code. Meanwhile both projects kept shipping and users got more options. Hard to see the harm.
Do they still not let you change the default model folder? You had to go through this whole song and dance to manually register a model via a pointless dockerfile wannabe that then seemed to copy the original model into their hash storage (again, unable to change where that storage lived).
At the time I dropped it for LMStudio, which to be fair was not fully open source either, but at least exposed the model folder and integrated with HF rather than a proprietary model garden for no good reason.
> Do they still not let you change the default model folder?
Actually they do. It's environment variable OLLAMA_MODELS in the server configuration file.
Yet people claim it has great UI... And still you can't define that in their GUI...
Why would you want to change a server configuration inside a client GUI? The server runs separately.
This also annoyed me a lot. I was running it before upgrading the SSD storage and I wanted to compare with LM Studio. Figured it would be good to have both interfaces use the same models downloaded from HF.
Had to go down the same rabbit hole of finding where things are, how they're sorted/separated/etc. It was unnecessarily painful
> the file gets copied into Ollama’s hashed blob storage, you still can’t share the GGUF with other tool
This is the reason I had stopped using it. I think they might be doing it for deduplication however it makes it impossible to use the same model with other tools. Every other tool can just point to the same existing gguf and can go. Whether its their intention or not, it's making it difficult to try out other tools. Model files are quite large as you know and storage and download can become issues. (They are for me)
Hmm..
Maybe some day.llama.cpp moves too quickly to be added as a stable package. Instead, you can get it directly from AUR: https://aur.archlinux.org/packages?O=0&K=llama.cpp
There are packages for Vulkan, ROCm and CUDA. They all work.
That doesn't make sense. Why would llama.cpp need to move any faster than ollama? For that matter, why not have a llama.cpp package and llama.cpp-git in the AUR?
what are you talking about? llama.cpp doesn't need to respect ollamas speed at all. It does not depend on it, it's the opposite of that.
The claim was that llama.cpp moves too fast to be in Arch's normal repos. But Arch does package ollama. Therefore, either 1. ollama somehow avoids the need to move fast, or 2. it moves at an acceptable pace when packaged.
Edit: Or perhaps put differently: If ollama includes a copy of llama.cpp and has a non-AUR package, why can't there be a non-AUR package that's just llama.cpp without ollama?
Then again...
Sometimes Arch has the software you want at the version you want, other times it doesn't but other distros do. That's why there's half a billion distros instead of just one.yay -S llama.cpp
I just installed llama.cpp on CachyOS after reading this article. It’s much faster and better than Ollama.
The name "llama.cpp" doesn't seem very friendly anymore nowadays... Back then, "llama" probably referred to those models from Facebook, and now those Llama series models clearly can't represent the strongest open-source models anymore...
Doesn't the "llama" in "ollama" present exactly the same issue?
Edit: or maybe that was your point. I guess that for historical reasons this is a kind of generic name for local deployments now (see https://www.reddit.com/r/LocalLLaMA) just like people will call anything ChatGPT.
https://en.wikipedia.org/wiki/List_of_generic_and_genericize...
I always avoided Ollama because it smelled like a project that was trying so desperately to own the entire workflow. I guess I dodged a bigger bullet than I knew.
the article buries what's actaully the most practical gotcha: ollama's hashed blob storage means if you've been pulling models for months, switching tools requires re-downloading everything because you can't just point another runtime at those files, and most users won't discover this until they're already invested enough that it genuinely hurts to leave.
It's as if Ollama is trying to create a walled garden, but the garden is outside of their property, so all it achieves is walling themselves in.
I stopped using Ollama a couple of months ago. Not out of frustration, but because llama.cpp has improved a lot recently with router mode, hot-swapping, a modern and simple web UI, MCP support and lots of other improvements.
The attribution and lock-in arguments are the loud parts of this story, but the quieter production reason to move is concurrency. llama.cpp's server takes parallel N with cont-batching enabled by default, which interleaves tokens from multiple requests inside a single batch and keeps the GPU busy. Ollama defaults its parallel slots low and the interaction is less transparent, so the first time three people share a single model instance you feel it before any of the ethics become relevant. For a 70B Q4_K_M on a workstation, the real ceiling is KV cache fragmentation, and you have to size the context window around the parallel count rather than around one user. What is the highest parallel value anyone here has kept stable on a 70B Q4_K_M before the cache eviction pattern starts hurting quality?
It feels like a bit of history is missing... If ollama was founded 3 years before llama.cpp was released, what engine did they use then? When did they transition?
I don't think that is the case. Llama.cpp appeared within weeks after meta released llama to select researchers (which then made it out to the public). 3 years before that nobody knew of the name llama. I'm sure that llama.cpp existed first
> within weeks
One week, really, if we consider the "public" availability.
Llama announced: February 24, 2023
Weights leaked: March 3, 2023
Llama.cpp: March 10, 2023
(Ollama 0.0.1: Jul 8, 2023)
They spent several years in stealth mode but the initial release was llama.cpp.
Ollama v0.0.1 "Fast inference server written in Go, powered by llama.cpp" https://github.com/ollama/ollama/tree/v0.0.1
They spent several years in stealth mode
doing what?
trying to build themselves what llama.cpp ended up doing for them?
I asked myself the same question. Some other commenter mentioned above they started with some Kubernetes infrastructure thing and they pivoted later.
Oh hey I'm also working on a thing to solve the devx of llama.cpp: https://github.com/nobodywho-ooo/nobodywho
In contrast to Ollama, this is a self-contained library, not a server.
I wrote some quick notes on this blogpost, just to jot down how we think about good open-source citizenship: https://www.nobodywho.ai/posts/notes-on-friends-dont-let-fri...
So, on a mac, what good alternative to ollama supports mlx for acceleration? My main use case is that I have an old m1 max macbook pro with 64 gb ram that I use as a model server.
LM Studio is a popular option that bundles the MLX backend
I read good things about https://omlx.ai but I don't really know enough of the ecosystem to know if there are better options.
If someone has opinions please let us know!
I noticed the performance issues too. I started using Jan recently and tried running the same model via llama.cpp vs local ollama, and the llama.cpp one was noticeably faster.
Just tried llama.cpp
NO, it is not simpler or even as simple as Ollama.
There are multiple options-- llama server and cli, its not obivous which model to use.
With ollama, its one file. And you get the models from their site, you can browse an easy list.
I dont have the time to go thru 20billlion hugging face models and decide which is the one for me.
Thanks, but I'm sticking with Ollama
jan.ai would be the ideal route to take here then. its open source, has a simple chat interface, it uses llama.cpp, it lets you search for models and downloads them, and it supports .gguf so youre not locked in if you want to use the models with another program later on
The performance issues are crazy. Thanks for sharing this
This is a bit like saying stop using Ubuntu, use Debian instead.
Both llama.cpp and ollama are great and focused on different things and yet complement each other (both can be true at the same time!)
Ollama has great ux and also supports inference via mlx, which has better performance on apple silicon than llama.cpp
I'm using llama.cpp, ollama, lm studio, mlx etc etc depending on what is most convenient for me at the time to get done what I want to get done (e.g. a specific model config to run, mcp, just try a prompt quickly, …)
> This is a bit like saying stop using Ubuntu, use Debian instead.
Not really, because Ubuntu has always acknowledged Debian and explicitly documented the dependency:
> Debian is the rock on which Ubuntu is built.
> Ubuntu builds on the Debian architecture and infrastructure and collaborates widely with Debian developers, but there are important differences. Ubuntu has a distinctive user interface, a separate developer community (though many developers participate in both projects) and a different release process.
Source: https://ubuntu.com/community/docs/governance/debian
Ollama never has for llama.cpp. That's all that's being asked for, a credit.
OK. That says absolutely nothing about actual UX or anything that matters to most actual users (as opposed to argumentative HN ideologues).
> Both llama.cpp and ollama are great and focused on different things and yet complement each other
According to the article, ollama is not great (that’s an understatement), focused on making money for the company, stealing clout and nothing else, and hardly complements llama.cpp at all since not long after the initial launch. All of these are backed by evidence.
You may disagree, but then you need to refute OP’s points, not try to handwave them away with a BS analogy that’s nothing like the original.
I guess read the article before commenting?
There isn't much you can do with Ollama models besides saying good morning.
The author points out that the Ollama people are evil.
So it is more like saying "Stop using SCO Unix, use Linux instead".
Where do they use the term "evil"?
In the gaps between the tops of the lines and the bottoms of the other lines ;)
They might not use the word, but the behavior they describe is evil:
" This isn’t a matter of open-source etiquette, the MIT license has exactly one major requirement: include the copyright notice. Ollama didn’t.
The community noticed. GitHub issue #3185 was opened in early 2024 requesting license compliance. It went over 400 days without a response from maintainers. When issue #3697 was opened in April 2024 specifically requesting llama.cpp acknowledgment, community PR #3700 followed within hours. Ollama’s co-founder Michael Chiang eventually added a single line to the bottom of the README: “llama.cpp project founded by Georgi Gerganov.” "
I prefer Ollama over the suggested alternatives.
I will switch once we have good user experience on simple features.
A new model is released on HF or the Ollama registry? One `ollama pull` and it's available. It's underwhelming? `ollama rm`.
> This creates a recurring pattern on r/LocalLLaMA: new model launches, people try it through Ollama, it’s broken or slow or has botched chat templates, and the model gets blamed instead of the runtime.
Seems like maybe, at least some of the time, you’re being underwhelmed my ollama not the model.
The better performance point alone seems worth switching away
I follow the llama.cpp runtime improvements and it’s also true for this project. They may rush a bit less but you also have to wait for a few days after a model release to get a working runtime with most features.
Model authors are welcome to add support to llama.cpp before release like IBM did for granite 4 https://github.com/ggml-org/llama.cpp/pull/13550
`wget https://huggingface.co/[USER]/[REPO]/resolve/main/[FILE_NAME...`
`rm [FILE_NAME]`
With Ollama, the initial one-time setup is a little easier, and the CLI is useful, but is it worth dysfunctional templates, worse performance, and the other issues? Not to me.
Jinja templates are very common, and Jinja is not always losslessly convertible to the Go template syntax expected by Ollama. This means that some models simply cannot work correctly with Ollama. Sometimes the effects of this incompatibility are subtle and unpredictable.
you can pull directly from huggingface with llama.cpp, and it also has a decent web chat included
Does it have a model registry with an API and hot swapping or you still have to use sometime like llama swap as suggested in the article ? Or is it CLI?
You can have multiple models served now with loading/unloading with just the server binary.
https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
It only lacks the automatic FIFO loading/unloading then. Maybe it will be there in a few weeks.
You have no idea what you are downloading with such a pull. At least LMstudio gives you access to all the different versions of the same model.
https://ollama.com/library/gemma4/tags
I see quite a few versions, and I can also use hugging face models.
how about the others:
- vLLM https://vllm.ai/ ?
- oMLX https://github.com/jundot/omlx ?
> Red Hat’s ramalama is worth a look too, a container-native model runner that explicitly credits its upstream dependencies front and center. Exactly what Ollama should have done from the start.
I've now given ramalama a look:
--
-- --I think the biggest advantage for me with ollama is the ability to "hotswap" models with different utility instead of restarting the server with different models combined with the simple "ollama pull model". In other words, it has been quite convenient.
Due to this post I had to search a bit and it seems that llama.cpp recently got router support[1], so I need to have a look at this.
My main use for this is a discord bot where I have different models for different features like replying to messages with images/video or pure text, and non reply generation of sentiment and image descriptions. These all perform best with different models and it has been very convenient for the server to just swap in and out models on request.
[1] https://huggingface.co/blog/ggml-org/model-management-in-lla...
> the ability to "hotswap" models with different utility instead of restarting the server
The article mentions llama-swap does this
Llama.cpp added the ability load/switch models on demand with the max-models and models preset flags.
You can do that with llama-server
Llama-server which is part of llamacpp does this for a few months now
Not sure why VLC doesn't do that.
It's a joke... but also not really? I mean VLC is "just" an interface to play videos. Videos are content files one "interact" with, mostly play/pause and few other functions like seeking. Because there are different video formats VLC relies on codecs to decode the videos, so basically delegating the "hard" part to codecs.
Now... what's the difference here? A model is a codec, the interactions are sending text/image/etc to it, output is text/image/etc out. It's not even radically bigger in size as videos can be huge, like models.
I'm confused as why this isn't a solved problem, especially (and yes I'm being a big sarcastic here, can't help myself) in a time where "AI" supposedly made all smart wise developers who rely on it 10x or even 1000x more productive.
Weird.
What problem is it that you are confused isn't solved?
I think the codec analogy is neat but isn't the codec here llama.cpp, and the models are content files? Then the equivalent of VLC are things like LMStudio etc. which use llama.cpp to let you run models locally?
I'd guess one reason we haven't solved the "codec" layer is that there doesn't seem to be a standard that open model trainers have converged on yet?
llama.cpp is the ffmpeg/libavcodec equivalent in this story.
This is partly why we're building LlamaBarn. It's a lightweight macOS menu bar app that runs llama-server under the hood, with models stored as standard GGUFs in your Hugging Face cache — the same location llama-server uses by default. No separate model store, no lock-in.
https://github.com/ggml-org/LlamaBarn
It is a bit off-topic, but would it possible to provide a light mode for this blog? I used to work during the day time, and my pupils had to contract to read, making it a very poor reading experience.
I like Ollama Cloud service (I'm paid pro user), because it let me test several open source LLMs very fast - I dont need to download anything locally, just change the model name in the API. If I like the model then I can download it and run locally with sensitive data. I also like their CLI, because it is simple to use.
The fact that they are trying to make money is normal - they are a company. They need to pay the bills.
I agree that they should improve communication, but I assume it is still small company with a lot of different requests, and some things might be overlooked.
Overall I like the software and services they provide.
I'm sorry, on a mac, Ollama just works. It lets me use a model and test it quickly. This is like saying stop using google drive, upload everything to s3 instead!
When i'm using Ollama - I honeslty don't care about performance, I'm looking to try out a model and then if it seems good, place it onto a most dedicated stack specifically for it.
Give LM Studio a shot! It gives you the same experience without all of the problems of Ollama.
Ollama is a bit easier to use, you’re right. But the point of the article is the way they just disregarded the license of llama.cpp, moved away from open source while still claiming to be open source and pivoted to cloud offerings when the whole point was to run local models all while without contributing anything back to the big open source projects it owns its existence to. Maybe you don’t care about performance (weird given performance is the main blocker for local LLMs) but you should care about the ethics of companies making the product you use?
And anyway this thread has lots of alternatives that are even easier to use and don’t shit on the open source community making things happen.
I'm making more of a pragmatic point. While ethics of companies are important, i'm still using OpenAI, Anthropic, Microsoft, Apple etc, so I definitely accept a trade-off between morality and ease-of-use.
Currently i've found Ollama to have the best intuitive experience for trying new models. Once i've tried those models and decide on something to use for a project, I can deploy them, and not need to use a UI again.
I'll be trying out the other options in this thread, but my point is that ease of use is going to triumph over the other points the original post made, and some of the alternatives mentioned in the original post miss why Ollama is so popular.
Keep in mind that as the post says, the model you’re trying via ollama may not be the model you asked for! And the performance may be subpar and not reflect the model true performance. Otherwise, I agree they offer an easy and polished product and that explains why they are so popular, besides their personal connections having resulted in their OpenAI partnership.
Good points - i'll definitely look into switching - just wanted to reflect on what's causing it to stay so popular.
LM Studio is 1000x easier to use than ollama btw
Has anybody figured some of the best flags to compile llama.cpp for rocm? I'm using the framework desktop and the Vulkan backend, because it was easier to compile out of the box, but I feel there's large peformance gains on the table by swtiching to rocm. Not sure if installing with brew on ubuntu would be easier.
Not huge gains with rocm but you do get a little faster preprocessing speed
I was pretty big on ollama, it seemed like a great default solution. I had alpha that it was a trash organization but I didn't listen because I just liked having a reliable inference backend that didn't require me to install torch. I switched to llama.cpp for everything maybe 6 months ago because of how fucking frustrating every one of my interactions with ollama (the organization) were. I wanna publicly apologize to everyone who's concerns I brushed off. Ollama is a vampire on the culture and their demise cannot come soon enough.
FWIW llama.cpp does almost everything ollama does better than ollama with the exception of model management, but like, be real, you can just ask it to write an API of your preferred shape and qwen will handle it without issue.
Oh I was completely wrong about the model management stuff btw, llama-server has fully fledged model management baked in now, you just have to make an *.ini with your model configs (most models can do this themselves, I pointed qwen3.6 at the relevant part of the docs and it wrote me an ini with all of my model configs in about 2 minutes) and you can swap between models via api or a dropdown menu in the UI.
I switched to using LlamaBarn to manage local models on macOS.
https://github.com/ggml-org/llamabarn
> Ollama is a Y Combinator-backed (W21) startup, founded by engineers who previously built a Docker GUI that was acquired by Docker Inc. The playbook is familiar: wrap an existing open-source project in a user-friendly interface, build a user base, raise money, then figure out monetization.
I've been experimenting with running Gemma with MLX directly within my own harness: https://github.com/cjroth/mlx-harness
The CLI is great locally, but the architecture fights you in production. Putting a stateful daemon that manages its own blob storage inside a container is a classic anti-pattern. I ended up moving to a proper stateless binary like llama-server for k8s.
ollama is pretty intuitive to use still - dont see why will stop.
LM Studio is equally as simple, has all the same features, and none of the performance or lock-in problems of ollama.
If you only needed a single reason, how about kneecapping your performance by choosing ollama?
Have you ever tried going to the model registry and seeing that the model was recently updated? What updated? What changed? Should I re-download this 20GB file?
I guess if you're not frustrated with things like this then sure, no need to stop using it.
I am running ollama as back end and open webui as front end. It handled downloading and swapping between models.
What is the llama-cpp alternative?
The timeline here is pretty important.
llama.cpp was already public by March 10, 2023. Ollama-the-company may have existed earlier through YC Winter 2021, but that is not the same thing as having a public local-LLM runtime before llama.cpp. In fact, Ollama’s own v0.0.1 repo says: “Run large language models with llama.cpp” and describes itself as a “Fast inference server written in Go, powered by llama.cpp.” Ollama’s own public blog timeline then starts on August 1, 2023 with “Run Llama 2 uncensored locally,” followed by August 24, 2023 with “Run Code Llama locally.” So the public record does not really support any “they were doing local inference before llama.cpp” narrative.
And that is why the attribution issue matters. If your public product is, from day one, a packaging / UX / distribution layer on top of upstream work, then conspicuous credit is not optional. It is part of the bargain. “We made this easier for normal users” is a perfectly legitimate contribution. But presenting that contribution in a way that minimizes the upstream engine is exactly what annoys people.
The founders’ pre-LLM background also points in the same direction. Before Ollama, Jeffrey Morgan and Michael Chiang were known for Kitematic, a Docker usability tool acquired by Docker on March 13, 2015. So the pattern that fits the evidence is not “they pioneered local inference before everyone else.” It is “they had prior experience productizing infrastructure, then applied that playbook to the local-LLM wave once llama.cpp already existed.”
So my issue is not that Ollama is a wrapper. Wrappers can be useful. My issue is that they seem to have taken the social upside of open-source dependence without showing the level of visible credit, humility, and ecosystem citizenship that should come with it. The product may have solved a real UX problem, but the timeline makes it hard to treat them as if they were the originators of the underlying runtime story.
They seem very good at packaging other people’s work, and not quite good enough at sounding appropriately grateful for that fact.
There is also lemonade-server from AMD. Although I am not sure if that is any better.
I did not know! Shady :(
I was using LM Studio since I've moved to MacOS so that's fine I guess
i had no idea about all this. especially the performance and bugs. thanks for informing me!
On a practical note if fumbles connection handling as to be unusable to download anything.
I see no mention of vLLM in the article.
vLLM isn't suitable for people running LLMs side-by-side with regular applications on their PC. It is very good at hosting LLMs for production on dedicated servers. For the prod usecase ollama/llamacpp are practically useless (but that's ok - it's not the projects goal to be).
Alas people want convenience and don’t care about this sort of stuff.
Thank you, I needed to read this.
I'm a llama.cpp user, but apart from the MIT licensing issue, I personally don't see what's the problem here is? Sure Ollama could have advertised better that llama.cpp was it's original backend, but were they obligated to? It's no different to Docker or VMWare that hitch a ride on kernel primitives etc.
I am trying to run models that are on the edge of what my hardware can support. I guess many people are.
So given, as the author states, Ollama runs the LLMs inefficiently, what is the tool that runs them most efficiently on limited hardware ?
Ah man the VC death trap. It's ok. I don't mean it like that but this is classic. It's unavoidable. They gotta make money. They took money, they gotta make money. It's not easy. Everyone has principles, developers more than anyone. They are developers, they are people like you and me. They didn't even start as ollama. They started as a kubernetes infra project in YC and pivoted. Listen don't be hard on these guys. It's hard enough. Trust me I did it. And not as well them.
This is the game. We shouldn't delude ourselves into thinking there are alternative ways to become profitable around open source, there aren't. You effectively end up in this trap and there's no escape and then you have to compromise on everything to build the company, return the money, make a profit. You took people's money, now you have to make good, there's no choice. And anyone who thinks differently is deluded. Open source only goes one way. To the enterprise. Everything else is burning money and wasting time. Look at Docker. Textbook example of the enormous struggle to capture the value of a project that had so much potential, defined an industry and ultimately failed. Even the reboot failed. Sorry. It did.
This stuff is messy. Give them some credit. They give you an epic open source project. Be grateful for that. And now if you want to move on, move on. They don't need a hard time. They're already having a hard time. These guys are probably sweating bullets trying to make it work while their investors breathe down their necks waiting for the payoff. Let them breathe.
Good luck to you ollama guys!
> This stuff is messy. Give them some credit. They give you an epic open source project.
It seems to me the epic open source project was given to us by Georgi Gerganov. These people just tried to milk it for some money, and made everything a little worse in the process.
100%.
UX is where the money is, it is in the wrapper, not the core.
Unfortunately, the core is the most valuable and labor intensive part of it.
With agentic coding, the gap between solid core and shitty wrapper is going to be wider and wider.
Especially when the solid core now ships with a web ui and API compatibility with OpenAI and Antropic. In my test of ai clients, Ollama was the only one I deleted.
i use goose by block
seems pretty unrelated to the post?
also you might be the only person in the wild I've seen admit to this
With such concurrency in the market, it is unforgivable to manage a product that way. The concurrency will kill you.
Clients get disappointed, alternatives have better services, and more are popping out monthly. If they continue that way, nothing good will happen, unfortunately :(
The state of LLM as a service is just depressing
It is a parasitic stack that redirects investment into service wrappers while leaving core infrastructure underfunded
We have to suffer with limits and quotas as if we are living in the Soviet Union
The missing attribution pattern is nasty.
the article nails it!
[flagged]
Way too much text - feels LLM written.
At the top could have been a link to equivalent llamacpp workflows to ollamas.
I wish the op had gone back and written this as a human, I agree with not using Ollama but don't like reading slop.
Can you share some excerpts from that article that feel LLM-written to you?
Yeah my thoughts exactly. Definitely slop. I have no objection to using AI to help writing. I just don't want to read the same sloppy cliches again and again and again. The short sentences. The Bigger Picture. Here's the rub. It's not just A, it's B.
It's like those cliche titles - for fun and profit, the unreasonable effectiveness of, all you need is, etc. etc. but throughout the prose. Stop it guys!
Can you share some excerpts from that article that feel LLM-written to you?
Sure. Short sentences like "It shouldn’t be.", "I’ve moved on.", "Ollama didn’t.", etc.
Not-this-but-that like "The local LLM ecosystem doesn’t need Ollama. It needs llama.cpp."
Weird signposting: "Benchmarks tell the story."
Heres-the-rub conclusion: "The Bigger Picture"
Starting every title with "The ...".
It's definitely largely human-written, but there are enough slop-isms to make it annoying to read. And of course it's totally possible for a human to write an an AI style, but that doesn't make it any less annoying.
I guess I write like an LLM :P
Probably a side effect of using them so much
amen
Another scummy YCombinator project, one of many lately. Looks like no-one is left at the wheel, at least as long as the valuations (and hence money) keep coming in.
I find the style of writing incredibly annoying (it doesn't make the point, full of hyperbole) and the website has the standard slopsite black background and glowing CSS.
That's because it was fully written by an LLM, as usual lately with all the articles on the front page of HN.
No wonder I get downvoted to hell every time I mention this... People here can't even tell anymore. They just find this horrible slop completely normal. HN is just another dead website filled with slop articles, time to move on to some smaller reddit communities...
As Claude would say: You're absolutely right!
Please, please, please stop getting Claude to write blogposts.