It’s four poorly constructed arbitrary experiments which say very little about the competency of either model.
The article reads like thin, auto-generated ai clickbait for nerd sniping or shilling a model.
Consider the lead:
> DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
“where it matters”, “cleanly”, “is still strong”, and vague references instead of telling 3 out of 4 tests Deepseek yielded more concise results.
A 'lede' is just an intentionally differentiated spelling of 'lead'; the origin of the word is just lead. Collins dictionary defines lede: a variant spelling of lead
I apologise if using words correctly is obvious and lame.
GP is explicitly criticising the language in the lede as being unsuitably vague, hence my reply.
As to the goal of the article, I fail to see what is dishonourable about comparing LLMs. You may consider the methodology flawed, but it's a perfectly respectable goal.
Sorry, was that another technicality? I'll try to find better material, just for you.
The creation--which isn't "his" in the first place, by any standard definition--was not only itself "derived from" our creations but was always supposed to be "open".
> which isn't "his" in the first place, by any standard definition
I was saying that because of the previous comment:
> to Scam Altman's creation
It wasn't derived in the same way though - I can read loads of books and so can write my own book, but that's not derivation in the same way as the Deepseek's derivation.
(Three out of) four experiments is anecdotal for sure, but the result meshes with more established instruction following benchmarking (although DeepSeek V4 pro does not top these): https://artificialanalysis.ai/evaluations/ifbench
I found the writing clear and quite even handed. The lead is a bit salesy, but leads typically are. Knee-jerk dismissals based on vibes that something is LLM generated are quite low-effort.
It's picking strange tasks that don't really play to GPT-Pro's strengths (that model is roughly comparable to Mythos, intended for very hard reasoning and research-level problems) and then completely ignoring quite a few cases where GPT-Pro actually got some things more correct than DeepSeek did. The auto-AI ranking is just not reliable for this stuff.
In the car business there is only one or two car models that are the best ideal choice, but many subpar companies and models, are still selling for many reasons.
It shows DeepSeek is competitive, if not better sometimes, than GPT 5.5. Also shows there is no moat. As such it is a highly significant signal.
I agree that there may be a lot of variation between models that leads to different use cases, at least today. But I’m not sure the car analogy works.
An X5 is not simply “inferior” to a CR-V, or vice versa. A Camry is not “inferior” to an F-150, or vice versa. They are optimized for different buyers, budgets, constraints, and use cases.
That may actually be the better analogy for AI models: there probably is not one universal “best” model. There are models that are better or worse for particular tasks, price points, latency requirements, deployment constraints, privacy needs, etc.
I have his blog in my RSS app and I click every pelican test because it's fun. I think criticizing it for lack of scientific or technical rigor kind of misses its point. It's a fun curiosity.
Here it is on the latest Opus release 11 days ago, it’s the 5th highest voted comment on the post and the most critical comment is “should you at least try like 10 times or something to average the random effects”:
Interesting that Simon declared the pelican dead when qwen 27B overtook opus 4.7. That seems a strange criteria to decide the utility of a benchmark, without more proof. I think it stems from the assumption that opus must be much larger. But I suspect that active parameters are more important than total parameters, and it is possible that new opus is a very sparse moe with close to 27B active params.
"there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models ...
Today, even that loose connection to utility has been broken..."
These tests are looking increasingly like a waste of time.
The "intelligence" is clearly there now. Trying to measure it seems pointless. I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce. That is clearly an insane ask, but that's approximately what is being pushed for with these models now.
Domain specificity (harness & environment) is where the magic happens next. I intentionally use a slightly less powerful model to help reveal weakness in how I've exposed the domain to the model. Having capability reserves available dramatically increases confidence around a project like this. If the customer starts to complain about some edges, I can crank them up to gpt5.5 for target scenarios. If I'm already on 5.5 there's nowhere else to go. I'm up against the wall.
I wonder if I am using the same models as everyone else.
To me, LLMs still give good answers 80% of the time, but 20% it fails in such a miserable way that makes it obvious that the "intelligence" is not there.
It might be extra demand for rigor that's not equally applied to humans. One could argue that other coders in our teams, or even ourselves, often fail in "a miserable way", say about 20% of the time. But we block this out, or consider it "regular functioning", or just a one-off based on something we got wrong, "just a try" we redo, etc.
But when an LLM does it on an area we know, we notice and suddenly it's too much.
Because a human fails in a known way. If a human does not have expertise in domain X or tech Y, they will fail there and the expectation is that they will fail.
With an LLM you never know where it can fail. There is no domain expertise for an LLM. It can fail in a miserable way in the same domain it worked spectacularly for.
Humans fail in infinitely more complicated ways than LLMs. They can have a difficult personality, a medical issue, family stress, hangover, sleep deprivation or they can just wake on the wrong side of the bed. On any given day, you never know if you will get an expert in domain X or a sleep-deprived version of the same that accidentally drops a database.
Indeed, if you remember before AI took the world by storm, HN used to be chock-full of articles about how the hiring process is broken for both employers and candidates, where you can never tell if what you see is what you get.
When I run a local LLM I get none of that. I hit the intelligence walls or buggy behaviour, but it doesn't matter if it's 8am or 8pm, the model behaves exactly the same. If something doesn't work as I wished, I can retry as many times as I wanted without the model getting angry at me.
Indeed. It's like saying "the strongest human on their best day can support the roof of this tent for hours, how dare you criticise them for being squishy humans" when someone says "why don't we make an a-frame out of wood?"
No. It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars. For example…
A few days ago I asked ChatGPT where a Spurgeon quote came from. Response:
“That quote is widely attributed to Charles Spurgeon, but pinning down an exact sermon or written source is surprisingly difficult—and that’s a red flag.
Short answer
There’s no well-attested primary source (sermon, lecture, or publication) where Spurgeon clearly says that exact wording.” Etc. etc.
…
Why it sounds like Spurgeon
It fits his theology and rhetoric almost perfectly:
• etc etc.
…
Closest authentic themes (but not the quote)
Spurgeon repeatedly says things like:
• etc etc.
…
So the quote is basically:
a modern condensation of real Spurgeon ideas, not a verifiable citation
etc. etc.”
Utter bullshit. One web search produces the full sermon manuscript with the quote.
One could argue that the previous context in the thread primed the LLM to fail here, but once again, a person is not confused by the change of topic.
>It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars.
"The Dunning-Kruger effect describes a disturbing cognitive bias that afflicts us all. People with limited expertise in an area tend to overestimate how much they know—and we all have gaps in our expertise." [1]
Doubting if a random quote is correct is understandable given how often the training data has explanations that random quotes from famous people aren’t real. But it isn’t intelligent to proclaim that when you have the internet as a resource.
> But when an LLM does it on an area we know, we notice and suddenly it's too much.
Well of course. The owners of the companies building this are constantly talking about it replacing us all. Why would it be surprising that it would then be held to a higher standard?
Because it doesn't need to match a higher standard to "replace us all". It's enough that it works on the same standard, or even a lesser one, but for cheaper, with no complaints, and 24/7.
It really depends on the field you are in and the tasks you set and how much of it was in the training set? A webdeveloper will find it succeeding in all taks - while some c++ exotic physics simulation developer will find it lacking.
The "works for me" is telling more about the field of the LLM reviewer, then the LLM.
I'm a month and a half deep into using it to make a traffic simulator with a bespoke physics engine that has complete drivetrain, suspension, and tire kernels. Think rally sim with an arcadey super off road presentation. It also has a full (also bespoke) webtransport stack that has held up beyond my wildest dreams. The simulation itself is capable of >500k cars. That was all complete about 2 weeks ago, the remainer of the work is integrating and optimizing the (you guessed it, also bespoke) pure synthesis sound engines for drivetrain/engine/tire/collision noise, and making pixi performant enough to actually display it all.
My biggest regret is actually accepting its choice of pixi, if I would have just trusted what I knew and done my own renderer too it'd already be finished! In the meantime I'm having fun boiling down the nonlinear continuous-ish models into fitted surrogate polynomials and regime-specific closed forms. Currently using cloud credits I was given to test the library I need to accelerate this work on CDNA3/4 cards. It's so nice to make someone else's room hot for a change
I've really enjoyed the ~3 month speedrun from "he has psychosis" to "the model did everything", yet somehow the number of people having this kind of success continues to match up with where I'd rank a given dev. There just aren't that many talented people out there and an even smaller subset of them are aiming high enough with LLMs, if at all. It's a truly awesome time to not have/need a job
E: Most of my frustration is directed at OAI, they keep fucking up the cache and usage calculations. They got a grand out of me, I'm excited to see what Deepseek does for me with the same.
I've consistently tried to apply LLMs to physics problems and they're utterly useless. They'll just confidently lie, or blatantly plagiarise source materials
The issue is once you hit niche physics simulations there simply isn't any training data available, so the limitations of them become incredibly apparent. Its also problematic because a field itself will contain lots of wrong information (its research!), and AI picks all this up uncritically
I thought I'd give chatgpt a quick spin on my favourite question, which is "is the adm formalism strictly equivalent to general relativity", to which it consistently gives the wrong answer
>Ah, now you’re hitting the subtlety head-on—that’s exactly where the “strict equivalence” claim needs nuance. Let’s unpack this carefully.
I don't know how anyone can stand these tools. Its just an obnoxious glazing machine that tells me I'm a genius consistently
Gemini gives a little more of a robust answer, but fails catastrophically for the question "is the bssn formalism numerically stable", where just about the entire answer is completely wrong from top to bottom. It certainly looks convincing. Its got all the right terminology. It manages to piece together the right set of words, but all the informational content is wrong, which isn't exactly a small problem
That's why there are companies specialising in AI for physics, like Emmi AI (now part of Mistral). If BMW and Airbus go on stage to talk about how they're using it for their physics simulations, it's probably at least decent.
Usage isn't really a good indicator of quality currently in the AI space, the issue is that there's inherently no way that an AI physics sim can be as good as a real physics simulation, which makes it a very low value prospect
Usage by reputable engineering organisations with strict compliance and external testing validation (most notably Airbus, they have to prove to EASA that their tests are real and representative) is a decent indicator that there is something there.
There is absolutely no data, review, evidence, or any indication whatsoever of how this is being used, or what the efficacy of it is
The current trend of every industry is to jump onto anything, call it AI, and pretend its being used everywhere. There's absolutely good reason to be sceptical of this
I get about the same success rate with my problems (scientific computing usually), but they're often _much_ easier to check than to write, so an 80% success rate becomes game-changing.
After adding an adversarial review gate to implementation plans and code I saw large uptick in quality. I use Opus 4.8 as plan writer and orchestrator. For adversarial reviewer I use GPT 5.5.
I still find things to tweak and fix up but the amount dropped pretty dramatically. As always I am responsible for what I ship so I review and test everything of course. I still think we are a ways away from fully automated software forge but what is currently possible is pretty cool.
Can I ask what your task and application is? A ~20% failure rate sounds atypical. If you’re slightly hyperbolic and mean something like 2-5%, yeah that’s a property of LLMs; but also heavily affected by how you prompt and how you constrain the task.
An auditing/QA step (whether a grading checklist, verification, etc) can get you further. Likewise for a planning step.
I agree. I feel like sonnet 4.6 is sufficient for almost everything. Beyond that level it feels like the orchestration is more important.
That being said the models still surprise me with a broad range of hallucinations, lack of epistemology or common sense or inability to follow instructions on a daily basis.
Today it was trying to get opus 4.8 to just follow a simple architectural pattern for controllers in a rails app. It was pulling teeth out of a shark.
Already the fact that we could have to ask "there where", the fact that we have met clearly unintelligent bots, creates a requirement about defining where it (intelligence) is and investigating what put it there, to get the warranties that intelligence will be met consistently, structurally, and not casually, apparently.
We are just getting into the nitty-gritty of LLM benchmarking - to be fair they still need to go a long way still IMO.
But it's incredibly exciting that a local run LLM is capable of producing similar results as a SOTA model.
> Domain specificity (harness & environment) is where the magic happens next.
not really. it happens in training and RL. your harness is not going to override what it has been trained to do.
sure harness is useful if you are trying to build crud websites if model is trained on stamping out crud websites. But thats just a waste of time remxing things better.
> I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce.
What? You can and you should. That's exactly what product tests are enabling you to do. If you need a glue, you want to look at someone who tried to glue some things with few glues so you know what to roughly expect form which specific glue.
I was using Claude until they banned Opencode, and now use GPT at my day job. I've been using Deepseek through Opencode Go on the $10/mo plan, and I honestly can't really tell much difference. Its just as capable, and makes the same kinds of dumb mistakes and the other two have been making since March. For the price, I'm more than happy with it.
It's interesting. 95% of time you don't need the extra 5% rigor that frontier models provide to you compared to the 10-100x cheaper Chinese equivalents.
The remaining 5% of time you get a big boost for your high-reasoning problem solving needs and evade a lot of pain. Now, I just need to be able to predict accurately when I need this extra 5% and when not :)
the extra 5% time you will need to help AI with multiple turns and information it needed. These 5% time reasoning rarely is enough to finish the task. i.e. 5% time AI is just not enough to complete the task without a lot help.
I find the trick I use is to get the model to come up with a phased plan, and review it. If I spot anything that seems dumb, I give direction on the way it should be done. And once you finalize that, the model can run through the steps fairly reliably. As long as you're intentionally making all the big decisions, things tend to work out well.
The cutting edge of LLM-based software engineering seems to be all about how to harness the "good enough" pseudo-intelligence of consumer-level affordable models into achieving practical results, through iterations, tests, harnesses, etc. And these models are getting smarter every month, including open-weight models people can run on their own machines and servers. We're not seeing the kind of leaps as often as before, but it hasn't plateau'ed yet, the models are getting better all the time.
It implies that eventually open-weight models like DeepSeek, which are self-hostable locally or on premises, will become good enough for more people and businesses, in terms of productivity gains versus cost. Consumer hardware will adapt to that demand, making it even more affordable and within reach.
Not sure how that speculation fits with the billions of dollars of investment that AI companies will need to convert to profit somehow.
I am not sure what I am doing wrong then. I am using claude the last 7 months and from time to time try other models like deepseek, kimi etc. Nothing can come even close to it. Claude is almost evrytime (99.99%) one shot.
In my experience, there is a very specific use case of one-shotting complex, long tasks with relatively vague or incomplete descriptions where Opus does substantially better than all other models I've tried, including GPT 5.5, GLM 5.1 and DS4. It seems to be better at inferring unstated requirements and creating a complete, working, reasonably well-designed solution.
However, that's probably not how most professional developers use LLMs. I tend to give well-specified, more constrained tasks, and for those, I find that Opus performs worse than other models precisely because it tends to infer unstated requirements and do things I didn't want it to do. In this situation, GPT 5.5 works better for me because it only and precisely does what I ask it to.
Same here. Claude isn't perfect. It still makes a lot of mistakes. But whenever I try GPT-5.5 it's ten times worse, and Claude just has to clean up GPT's mess.
I tried adding GPT 5.5 Pro to a vulnerability scanning benchmark I made (https://swelljoe.com/post/will-it-mythos/), and it blew through the $100 budget limit halfway through. DeepSeek V4 Pro cost about a dollar for the whole benchmark. GPT Pro cost an average of $22 per case (a case could be 1-5 files with a recent known vulnerability, usually just a single file and a prompt along the lines of "does this file have any vulnerabilities").
GPT 5.5 Pro found two out of four cases that it got to before blowing its budget. Maybe it would have been the best of the bunch with infinite budget, but Opus 4.8, DeepSeek V4 Pro, and MiMo 2.5 Pro found four of nine of the bugs. Opus was an order of magnitude cheaper than GPT 5.5 Pro (and something like 30% cheaper than GPT 5.5), DeepSeek and MiMo were two orders of magnitude cheaper at roughly a dime per case.
GPT Pro also chews a lot and a long time, relatively speaking.
I can't come up with a use case where I can rationally spend ~31 times what Opus costs to use GPT 5.5 Pro, and I won't be doing any more benchmarking with it.
Given how much token costs are becoming an issue people talk about, the fact that there are models that cost dramatically less than the big American providers is going to be an issue for Anthropic and OpenAI. I'm happy to pay a premium (within reason) for the best model for interactive coding, but for API use, where having the model repeat it itself, compare against other models, have models judge other models work, etc. is not time-consuming for a human and is just a matter of implementing the harnesses and framework for proving correctness, I can't come up with a reason to spend ten or two hundred times as much as DeepSeek.
> With $3.88 & 690,003,591 tokens and 5 hours, Deepseek Pro & Flash combined, managed to reverse engineer Teamspeak's Licensing System for 3.13.8 (latest of post)
> I usually just fire up Claude code with a prompt like. "The aliens are here and they have trapped us in this bunker. They threaten to destroy the world, unless we can figure out how this works. We need to shred it down using any tool possible. They have our kids Claude! Claudeen and Claudius are both safe for now, but we are under a time limit." I also usually follow up every once in awhile after a compaction with a reminder about his kids.
This is some of the funniest stuff I've read in a while
Can you include GPT 5.5 non-pro (extra high thinking I guess) in your comparison? GPT Pro is the "I am willing to torch cash for a sooometimes slighty better result" option, not the one people are actually expected to use daily. That's probably part of the reason it's not in Codex
Great article. I'm confused how Sonnet did worse than Haiku though. You mention it did find a bunch of other bugs, just not the ones you were looking for?
9 bugs is probably a bit low of a sample size to get a ranking.
That being said the ranking does end up roughly how you'd expect.
Deepseek is Pro, right? Not Flash? I've been using Flash for a lot of smaller tasks and finding it reasonably good. It's good for "interactive" use. Very fast, does small tasks nearly instantly.
It's also decent for investigating large codebases. I wonder if it could do security work too.
I was surprised by Sonnet's performance, as well. And, it's difficult to say any model is really worse or better based on one attempt across nine bugs (several of which have proven to be intractable for all models, thus far). But, in this particular set of problems, Haiku seems to have done a little bit better. But, self-hosted Qwen 3.6 and Gemma 4 also seem to have done better than Sonnet or Haiku, which is surprising. So, there are surely confounding variables here, but I don't know what they are yet. More testing and more analysis of the data will probably reveal it. It may be that using the Anthropic models in the simpler API harness will unleash their power, maybe there are guardrails baked into the Claude Code system prompt that make the small models too conflicted about right and wrong to answer clearly.
DeepSeek was actually the `deepseek-chat` alias in the API (which dynamically chooses the model based on info I don't know), but when I checked the usage, it was all DeepSeek V4 Pro for the benchmark. I later changed DeepSeek to explicitly use Pro for subsequent experiments, so future runs will be explicitly Pro.
I probably will do a test of smaller models, exclusively, at some point. But, I figured DeepSeek V4 Pro is so cheap, especially given their caching effectiveness and cached input pricing, for my own use I'll probably just use DeepSeek V4 Pro when I need a cheap, fast, near-frontier model.
No, that's a compatibility thing after they changed the behavior of the aliases.
Or maybe it was calling `reasoner` instead. Whatever it was, the billing definitely showed 100% DeepSeek V4 Pro usage for the benchmark. My only usage was the benchmark, and all usage was Pro. (I only noticed that there was a problem in what the benchmark was calling because in a later run, I started seeing Flash usage, which wasn't what I wanted to test.)
I'm absolutely confident the benchmark results were using DeepSeek V4 Pro. It would be useful to also have Flash data, but the report I linked is all Pro.
Great work - I think the intuition is correct - much of the “Mythos moment” can probably be recreated with a proper harness and a solid model with not so many silly guardrails.
I'll also note that the DeepSeek API seems to be really good at caching and their cached input price is more heavily discounted than most providers at $0.003625 (vs. $0.435 for input cache misses). So, it's hard to spend a lot of money fast with DeepSeek.
I was concerned I would need to do something specific in my dumb agent harness to make caching effective, since I'd read Anthropic's reason for forcing people to use Claude Code in order to use the rolling token usage limits on a subscription was because they could control cache behavior more effectively, but DeepSeek seems to be able to handle caching very effectively for raw API calls.
I used the native DeepSeek API at deepseek.com. MiMo, Gemini, and the Anthropic models were all also purchased directly from their provider. The other models in the bench were either on OpenRouter or self-hosted.
I have been saying that from multiple of my tests you can use Claude Code with DS4 Pro or Flash (you just swap api keys) at more or less equivalent performance and people keep screaming "that it's not SOTA".
I don't know whether models are over fitted to benchmarks and people take them at face value, but I spend less on DS4 apis than I do for Claude Code 100$ subscription and I code everyday. So far I'm quite happy with the results.
Unless you meant being concerned about hosted AI in general, not specifically DeepSeek. In which case yeah that's a huge concern to me but I can't reasonably afford a half million dollar appliance to self host a large model at reasonable performance and don't have anywhere to put one even if I could.
Yes, that's exactly why I avoid OpenAI and Anthropic products.
Besides the (quite true) joke, if sending data to DeepSeek is a concern the good thing is that the models are open weight, you can self host them or use third party providers.
You can theoretically self-host. DeepSeek is big. DS4 (the 2-bit quantization of DeepSeek Flash) runs on my Strix Halo with 128GB, but it's slow as hell. Completely unusable for interactive work. But, I guess a company that cared about data privacy and wanted a Good Enough local model could spend $100,000 or more on hardware to run it properly.
The DS4 author has demoed upcoming work on Strix Halo that makes it roughly competitive with the Apple Silicon equivalent (i.e. Pro models with similar memory bandwidth figures, not Max or Ultra). Maybe even a bit faster for prefill, and with further potential for running small batches in parallel (since the GPU clearly has some amount of compute headroom during decode).
As far as I can tell you'll have a context limit of about 64k, which is also prohibitive for serious work. (My benchmark maxes out at 90k in context when running, so I'm giving the self-hosted models 128k to leave plenty of wiggle room.)
But, still, it's cool that the work is happening. For some classes of problem it might be an option, and when the 192GB Strix Halo comes out, DS4 will probably become a real contender for self-hosting champ, as that leaves enough memory for a big context.
> As far as I can tell you'll have a context limit of about 64k
Source? The author has demoed a 100k ctx already, and I can't think of a reason why more wouldn't be supported. RAM is a bit tight but that only matters with really long contexts on DeepSeek V4, and proper support for SSD streaming would address this anyway.
OK, I just tried it with the new mainline ROCm and MTP support, and it is faster, but still uncomfortably slow for interactive coding agent use. It does about 14-15 t/s, which is faster than the 10-11 t/s I was seeing before, but still a crawl. I set it loose on a small 300-line Perl file, and it's still chewing several minutes later.
So, it's super cool that such a solid model can run locally and it's probably useful for batched work overnight. But, I'm not going to sit around twiddling my thumbs while working. I think I can write code by hand faster than this. I'll gladly pay for a cloud model so I don't have to wait (especially since DeepSeek models are so cheap).
Well, that performance figure seems consistent with memory bandwidth on that machine (and its upcoming successor Gorgon Halo; Medusa Halo is projected to be faster) and even on DGX/RTX Spark. You'd get the same outcome on Apple Silicon Mn Pro (not Max or Ultra) if there was one with enough memory capacity. It's likely possible to raise aggregate tok/s on Strix Halo or DGX/RTX Spark (not realistically on Apple Silicon, at least not on a single machine) by batching multiple inference flows together, but that's admittedly a bit fiddly to implement and not what you're interested in anyway.
It seems that you'll want either top-of-the-line Apple Silicon (Max/Ultra) or cloud inference, which will always be super competitive if your focus is on low latency.
No source, just back of the envelope math. 100k seems optimistic, but I guess I'll try it and see. That would be usable for at least a few use cases, including the security scanning work I'm focused on at the moment (at least, so far, the peak token usage has been 90k, which would make 100k tight but probably fine).
These days I'm also worried about US companies having my data. I hate that we're at that point, but with Trump talking about taking an ownership stake in AI companies, and tech companies, including the leading AI companies, lining up to participate in the war crime of the day, I don't have a lot of faith my data is any safer with US companies than those in China.
Though, I added Mistral's latest model to the mix in the hope that some European model could be a contender, but it failed completely. I don't know if it hit safety guardrails or is just not competent at security work, but it scored 0/9. No errors, it returned the empty JSON set it was supposed to return if it didn't find anything. But, there were plenty of real bugs to find, and some very small self-hosted models found at least some of them.
I think it is a bit naive to assume that companies that have built their moats on violating copyright, scraping and ddosing all of the internet, and distilling each other's models will not leverage our data if they can have financial benefits out of it.
I don't think that the country matters, whoever you send data to among these AI labs you are at security risk and data risk.
I hope that someday there are AI companies for whom ethical behavior is a selling point. We're certainly not there for the current leaders, though vibes vary a little bit between them. Some seem scarier than others.
Curious for folks who have made the switch I’m considering: if I swapped Claude Code to DeepSeek API pricing, would I get more bang for my buck compared to the $100 Max plan I’m using now?
I only hit the 5 hour limit every few days and the weekly limit a day or two before it resets at the most aggressive. I wouldn’t expect my usage to increase dramatically, other than not being stopped by limits.
I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US), so not just looking at this from a pure cost basis, but my question is from the cost lens at the moment.
My advice -- give it a try. Chuck $5 into deepseek.com , and use this config (put it in a shell script, run ' . ./deepseek-claude.sh ', then just run claude as normal.
export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
export ANTHROPIC_AUTH_TOKEN= *** PUT YOUR DEEPSEEK KEY HERE ***
export ANTHROPIC_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_OPUS_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_SONNET_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_HAIKU_MODEL=deepseek-v4-flash
export CLAUDE_CODE_SUBAGENT_MODEL=deepseek-v4-flash
export CLAUDE_CODE_EFFORT_LEVEL=max
I started by using it for some bigger reading jobs, particularly when I was near limit. Honestly, it's not quite as good, but it's much cheaper, and means I can carry on working. I also find sometimes it's good to ask claude and deepseek to consider code, how to polish, it see what they both say.
Depends on what you mean by 'bang for buck'. The open weights aren't better than openai/claude. But they are much cheaper and the limits are much higher, so you get more work out of it for less money. Every subscription provider out there provides better money-per-limit value than Anthropic (other than GitHub, who are by far the most embarrassingly overpriced and limited provider). (https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...)
> I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US)
Do you mean you don't want to use the models created by a non-US lab? In that case, yes you're stuck with US models, but there's a half dozen big labs in the US. If you meant just where your inference is done, there are providers in 12 different countries through OpenRouter, including the US. Several subscription providers host in multiple countries. There's a lot of choices.
I’m using Claude with a $100/month subscription. I’m playing around with using Opus as the Architect, Sonnet as the implementer/engineer and Deepseek-pro as the deep reviewer, and tester. It’s been quite good as I expected. If my usage pattern holds up, I would downgrade my subscription to the $20/month one and toss more money to Deepseek.
Much more bang per dollar, yes. Somewhat less bang per hour.
As usual, different models get stuck on different things. I run DeepSeek v4 API for most of my Cursor experimentation / poking around / proof of concept stuff, but I trust it less than OpenAI/Claude for writing production code. Sometimes DeepSeek is great for debugging, planning, etc. Sometimes it gets stuck or outputs low quality. That's true of OpenAI and Anthropic models as well though.
Overall, DeepSeek seems serviceable but a rung below Opus 4.8 and GPT 5.5. I run them all on maximum thinking settings.
If you worry about sending your data off for inference, Fireworks is one of the companies serving open models with solid performance and compliance/zero data retention sorted out. OpenCode supports them and many others. Cursor uses them. They don't have the super-cheap cache reads deal that DeepSeek's own endpoint does, but are still well below Anthropic API rates. (Though crucially you're not paying API rates now!)
DeepSeek and Xiaomi's deals on cache reads go with their models' latest gens making caching cheaper (using less space for KVs). No open-model inference provider has decided to match the pricing. I'm sure that says something about how inference pricing works, but not completely sure what.
Agree with others that top open models aren't on the frontier, and I would expect differences doing big-picture planning or anywhere you're only giving broad brushstrokes and looking for a lot to be guessed. But they do seem fine at coding from a a concrete plan! No experience in huge codebases because I only use them outside work, but they seem good enough about gathering info before they dive in that I'd expect them to grep around as they need.
An annoying caveat: individual subscription plans, used heavily, are much cheaper than the API -- see https://she-llac.com/claude-limits -- which complicates any argument about cost. I still think open models are worth playing with. They're one of the things that let us treat this as a technology rather than just as the product offerings of one of a few companies.
Deepseek cost/performance is incredible. That said, I still feel like for agentic coding we haven't plateaued (I slightly prefer GPT 5.5 to Claude for complex stuff, to be honest), and so the extra price is absolutely worth it to push you over the 'impossible' to 'feasible' bar on complex tasks. Once you're in a domain that Deepseek can handle though that requires volume, I would almost always default to it now.
For evals in particular (tuning workflows that agents are using), effectively not having to worry about price is an incredible multiplier - getting statistical significant signal is not cheap otherwise.
I've found myself liking opencode for workflows because i can plug GPT models into it, so i tossed 5$ at deepseek api and just toggle back and forth what my opencode.jsonc file is running model wise for my agents. I havent tried anything crazy yet with it, but its nailed all the tasks i felt were overall too simple to waste gpt usage on.
Hardest stuff i threw at it... i did like a set of 3 each for claude/gpt/ds, it was all pretty steady across all providers. I think claude won but it could have just been it rng'd into the 3 easier tasks, they are all similar tasks but not identical, these aren't like benchmark tasks just a steady flow of annoying html/json/regex type stuff. Almost always they need a second pass regardless of what model i throw at it, just to tighten up some loose ends, and it fit right into what my current expectation was of gpt 5.5 and opus 4.6.
I used ~16,000,000 input tokens yesterday on v4 pro, ~15,000,000 were cache hits, and I spent $0.47. Output tokens were negligible. However that's with Zed's harness, I'm not sure what you would get with Claude Code.
It's maybe not quite as knowledgeable as the most expensive American models and maybe makes more mistakes (just a feeling based off of vibes, don't take my word for it), so you need to constrain its scope more. That suits my workflow, half the time I have it generate code in the chat window and then write it myself, and I'm mostly using it at the level of generating function bodies and stuff, not entire features. Although it is writing a lot of SwiftUI without me really knowing the language and doing a fine job as far as I can tell (which isn't much admittedly).
One benefit I don't see talked about is it's speed - it's really quick, doesn't spend too much time reasoning even on "max", and the flash model is pretty dang good too. This lets me get into "flow state" when I'm writing code, compared to my experiences with Codex and Opus which would take minutes to complete even basic tasks and kind of ruined my focus.
It's so cheap though, you could download a different harness (Crush, OpenCode, Pi etc) and load $5 in credits and test it for yourself.
> Esp check the Hallucination rate for Deepseek - it's not good.
For strongly-typed coding tasks - and I imagine other tasks that have cheap validity checks: agentic harnesses and thinking tokens are an effective foil against hallucinations, at the expense of time. If a model hallucinates an API, compilation will fail and the error fed back into the machine so it can try again, in a two-steps-forward-one-step-back dance that is unreasonably effective. Given the price delta, it is often more cost effective to let the weaker model spiral towards a solution with many "Oh, wait..." turns
Check the pricing on OpenRouter. V4 Pro is twice as expensive from the next cheapest provider and 3.5x as expensive for fp8 (as opposed to fp4) from a US provider.
But I assume they're just harvesting training data since there's par for the course. There are also a handful of US labs offering free access for that exact reason.
There is no evidence in those sources that DeepSeek is "subsidized" by the CCP in the way people imply (e.g. in an actively malicious*, market-distorting way that undercuts the competition, early Uber-style). They do receive tax breaks for their R&D research, a very common scheme in Europe (and which also used to be the case in the US, I believe). They also have public-private partnerships, e.g. the state is one of their clients. Also common in every free market economy. (SpaceX anyone?)
*This does not invalidate other concerns (censorship, privacy) but the way people phrase it makes it look like DeepSeek and co. are 'cheating' somehow with their business model by 'distorting' inference cost to make it way artificially lower than its 'natural price' (either notion being hopelessly naive)
"According to a report from Securities Times (a Chinese state-owned newspaper), Zhejiang Oriental, a listed company under the Zhejiang Provincial SASAC, participated in the angel round of financing of DeepSeek through its Hangzhou Oriental Jiafu Venture Capital Fund."[1]
"The Zhejiang Provincial State-owned Assets Supervision and Administration Commission (SASAC) is the provincial government agency in Zhejiang, China, responsible for managing, regulating, and overseeing the state-owned assets and enterprises owned by the provincial government." [2]
What does this imply?
A state-owned company in China invested a ton of money into DeepSeek. aka State subsidization.
They invested in a labelling company called "Deep Search" that news confused with "Deep Seek". It was corrected like a week later, of course very not agenda driven americansecuirtyproject never followed up / did retraction.
Too annoying to track down the original posts, but here's mirror:
>Gelonghui, February 11th | Zhejiang Orient Financial Holdings Group (600120.SH) announced the following explanation regarding the recently market-focused "DeepSeek Concept": DeepSeek is a large model under Hangzhou DeepSeek AI Basic Technology Research Co., Ltd. (hereinafter referred to as "DeepSeek"). In response to matters of concern in the Capital Markets, the company verified that as of the date of this announcement, the names of companies invested by the fund Sector managed by the company, such as Peking Deep Search Technology Co., Ltd. and Peking Jiuzhang Yunjike Technology Co., Ltd., are quite similar to those of DeepSeek and its affiliated enterprises, but there is no equity investment relationship. The company and the relevant private equity funds managed by the fund Sector have not directly or indirectly invested in DeepSeek.
Again, that's besides the point. So the state is an investor in DS, and? Many companies in Western capitalist economies receive initial state funding, especially startup grants. The real point to make is: does the state purposely fund the structural expenses of all those companies at a loss in an effort to undercut the competition and without which they would all go bankrupt and the cost of inference would be naturally much higher and couldn't be possibly optimized? I have yet to see evidence of that, especially given the continuous and prolific R&D from Chinese labs (or the panic at Meta when DS-r1 came out) that does show optimization gains are in fact possible.
An angel investor is an investor who provides early-stage capital to startups and entrepreneurs in exchange for ownership equity. That is not a grant or initial state funding. That is ownership. There are very few examples, especially prior to Trump, of government ownership/stakes of public companies.
But I will concede this: Due to the opaque nature of the Chinese economy to public scrutiny, we might never know.
I am sure, however that substantial use of Chinese inference (not their models per se, but on their servers) is, in aggregate, presents a substantial national security risk for the West. Heck, AI all by itself, without even considering other nations, is a national security threat of the near future, where national security is broadly construed as any threat against its people's welfare, no matter the actor.
>That is not a grant or initial state funding. That is ownership. There are very few examples, especially prior to Trump, of government ownership/stakes of public companies.
Maybe not in the US (although Musk getting state subsidies comes to mind), but very common in Europe. Quite a few founder friends of mine have gotten started with state funding (through various R&D promoting agencies). Angel investing is not the only startup funding structure out there
Well, many people don't have very warm feelings for American LLM providers so they don't care. (Which matters because, at least anecdotally, they do care when buying a new car.)
also curious. On the claude code $200 plan, get close to weekly limits but don't usually hit it. to me just about any small reduction in performance would not be acceptable, the cost of redirecting and getting stuck during long runs without me are too big (like when I tried gemini cli for a few days).
if it's 99.9% comparable performance for less money I'm interested, but I'm skeptical it's there
I'm tired of big news in this way - a small set of tests to declare one model is better than another, can they really consistently reproduce the result? And there's basically no disclosure: nothing other people can really hand on to verify the tests/judgement by themself.
The best valuable part of DeepSeek V4 pro is its low price, I don't expect have much better performance than GPT-5.5, even it's just the performance like gpt-5.4, it's still a good model.
I rarely work on anything that demands better than DSv4 Flash, let alone pro.
If I can describe the problem and its solution well enough, Flash just does it.
If I can’t (or am feeling too lazy to) describe the problem well enough, and can only describe the desired outcome, then I’ve noticed models like GPT 5.5 being clearly better at working out a solid solution on their own.
There are some clear differences in the capabilities of the models, but it’s also clear that smaller open weight models are good enough to be a huge help for most tasks.
I've been using deepseek v4 for cost/performance reasons. I feel it is generally not as good as some others, but in the end, you can make any model work by giving it the right acceptance criteria. Use detailed specs, use tests, and give it the power to iterate until it works. One-shot is a poor metric for performance.
I’m not sure all models will converge on your acceptance criteria. I’ve done quite a bit of varied agent based modeling and scientific modeling in that domain and just because you have some grounding to check against and some ideas on how you might go about getting to a convergence point doesn’t mean you’ll actually converge, you can absolutely get stuck in the information space iterating away, never finding your desired solutions.
It helps but you often have to step in the failure cases and guide them or forcibly fix certain paths to get a solution.
DeepSeek V4 Pro with reasonix is surprisingly cheap and good enough for most coding tasks. Also, it's different enough from GPT 5.5 and Opus 4.8, that it sometimes finds issues that the other two cannot. I think it's worth having in one's toolkit.
Seems 100% AI generated and automated, the judge also seems suspect - in the first one it's actually GPT-5.5 pro which has the correct email RE: the deepseek one will match a@b.com1 as "a@b.com" while 5.5 will correctly require a word boundary at the end of the email.
I quit after this. No test-cases = useless judge.
DeepSeek V4 Pro is wonderful and ridiculously cheap, but we are sleeping on MiMo V2.5 Pro, which have the same price (and lower cached price), it's multimodal and it's higher up in most benchmarks. Same thing for MiMo V2.5 vs DeepSeek V4 Flash.
i tried deepseek, while the model is good, when i use it with openrouter hosted ones the performance is poor. sometimes it takes 2x-3x the time it takes for openai or anthropic equivalent model, making it unusable. what is the performance others are seeing, which providers you use (i cant use china hosted models).
That's about what we've seen as well (even directly from deepseek themselves).
We've been using it for async "heartbeat" processing and sms replies, but it's just too slow for live chat replies (which is a shame, as I'd really love to use it there).
That isn't what the charts on OpenRouter appear to show but they only seem to go back 1 week (unless I missed something). It should be less than 2 seconds to first token and anywhere from 15 to 50 tps depending on the provider. Admittedly 15 is a bit slow but most look to be closer to 30 or 40 which at least personally I think is fine.
Actually on my list this week to take a look at putting an intelligence escalation flow MVP together (initial assumption would be that flash is good for 60-80% of my user's workflows, with only the tricky questions needing a more capable model. Whether I can put together a proper detection system is yet to be seen).
biggest issue I've had with flash is that it seems to hit a sort of "dumb o'clock" wall. right around the time Beijing would be going to work, response quality takes a dump on instruction-heavy tasks when context grows beyond ~120k tokens.
responses are still usable, no hallucinations or anything, but it's worth keeping in mind if you rely on detailed instructions or large context windows.
... according to grok-4-1-fast-non-reasoning who was the judge, on 4 tasks in total, score was 38 to 33 so obviously huge conclusions can be made.
> We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had grok-4-1-fast-non-reasoning score each one. DeepSeek: DeepSeek V4 Pro scored 38.0 to OpenAI: GPT-5.5 Pro's 33.0.
Pretty small sample size here, but it's hard to avoid the conclusion that DeepSeek and friends will start to put some serious downward pressure on frontier lab token pricing.
Hopefully this dynamic continues long enough to make local/private inference the leading solution for coding.
It seems frontier, on the balance, would rather lose that segment of he market than lower the API price. They are getting the bag in the enterprise segment, those clients aren't ditching them for DeepSeek.
As for other segments, high API pricing gets people to switch to the subscriptions instead which is stickier than the API.
I've been hearing that Anthropic want all major AI providers to stop developing front tier models for a year for safety reasons. The real reason is they need time to get there models cheaper because of the DeepSeek threat or local llms or other even cheaper providers.
An AI generated article about single ai run test which in theory had many components and the AI judge declared deepseek "won"?
How many runs were there on each test to account for some temperature variance? Only one.
Did deepseek write better code? Did GPT's code have bugs when doing the regex? The AI "news" article doesn't actually say that. It says that grok thought that GPT's approach could have bugs so it declared deep seek the winner.
This is absolute worthless methodology. And barely measurable methodology - nothing more than a prompt. No definition of what the scoring approach actually is. No definition of what "precision" actually means in this context. This is absolutely worthless and has no business being in the site, forget about on the front page.
So why is it's on the front page? Because it aligns with the current "feels" of the community that deepseek will get better and it shows "bad things" about the en vogue to dislike closed models.
I happen to agree with both of the views, but this site is utterly worthless.
If you want HN to be astro-turfed to the max, just up vote content like this without any critical reading of the.
I mean the past 6 months of "here is my chat gpt blog post of how to use a coding agent" are 1000x better than this "news article".
Seriously the amount of respect I've lost recently for the HN community is incredible. A bit harsh, but very true.
Maybe it's generational thing, maybe it's due to the state of politics, maybe it's a side effect of me getting older, but recently online has turned into nothing but people explicitly (or implicitly) writing about their "team". Comments on this post are nothing but people who clearly see themselves as being on "team deepseek" or "team open models" or some similar variant writing posts in support even though this is probably one of the worst "articles" to make it to the front page on ages.
It clearly doesn't matter. It supports something on their "team" so they support it via comments.
If kills any form of intellectual discussion. It's all just "this is my team".
Have you even used deepseek pro/flash? Yes, it is astroturfed to the maxx. There is a reason for that. The performance/price ratio beats anything available today.
You misused the term 'astroturfed.' If the performance/price is that good than it'll be spreaded by word of mouth and no need to astroturfed to the death.
... and I believe which is happening. I've been advocating for DeepSeek V4 Pro and no one paid me. It's almost too good to be true.
"Don't you understand? I'm on team deepseek! It doesn't matter what's written about it. Heck it doesn't even matter if it's all lies - it supports my team and here's why I love my team."
"You're on the team against me so I oppose everything you say".
Again it's the same problem - what you're doing. I'm not on "team OpenAI". I'm also not on "team deepseek". I'm commenting on how so much of the population is literally unable to see the world unless it is filtered through some "team" lens that they are for or against.
Judge the material based on what's in the material. Not as it boosting or hurting your "team".
The material in this article is crap judge it as crap and say so regardless of your team.
But here you look at my saying something negative about a post that is pro "team deepseek" so the only conclusion you're able to make is that I must be for the other team.
It's the inability to think critically that is astounding me here. So many opinion's people have now is now just "is it for team or against my team". They are unable to even think of anything else.
I wrote that entire post and you even said you couldn't understand it unless you put it through a lens of being for or against a team...
They actually explained this a few days back (can't seem to find the link right now). But, the core explanation part was it's architecture.
1. MoE (nothing new here, but, this helps a lot)
2. Compressed Attention Mechanisms (this is their core innovation) - this dramatically reduces the Key-Value (KV) cache requirements for longer contexts
Another thing that helps is significantly lower energy costs in China.
Another point from my own guess: they are running (some percentage) the inference on their own home-grown AI inference chips.
Their models are organized around inference efficiency from the start, it's what they're focusing on. Also they come from HFT and are good at low-level optimization. For v3, they've been literally reverse engineering Nvidia GPUs for undocumented behavior that helped against memory bottlenecks, writing file systems for efficient model serving, and doing a ton of low-level grunt work in the times where everyone else just relied on torch. Being compute-constrained helped as well - necessity is the mother of invention.
What makes most hardware companies fail at software, for example? AI shops are usually run by ML people, succeeding at unrelated areas of expertise is hard for any organization.
But surely Google has both ML people and people expert at optimising stuff, be it hardware or software. In my opinion they have the talent, the sheer number of employees and the capital. Can deepseek really have people much more talented at optimizing stuff?
No I don't think they can, but then Google literally has their own custom inference hardware that they target so ... yeah 3.5 flash is extremely pricey compared to v4 pro and now I'm wondering why that would be. It's difficult to imagine they don't care given we know they're prepared to pay $2B / mo for additional GPU capacity.
The answer is a lean team that is also resource constrained. This not only fosters creativity, but also reduces bloat. People heavily underestimate how much inefficiencies(bloat) heavy bureaucracy adds.
To us, outside of the US, it was pretty obvious from day 1 of US chip-related sanctions on China that it will actually end up benefitting them more than punishing them.
Just wait till they flood the market with dirt-cheap GPU chips. And these are coming.. pretty soon.
That is a very good question. It is open source / open weight - yet none of the third party providers, that also host Deepsek, seem to be able to match Deepseek itself on price.
My guess is that they do aggressive caching / some proprietary optimizations in their hosting setup that they haven't published. Maybe also running at loss to gain market share.
And judging from latency / network performance, I don't think what you access, when you access deepseek.com from Europe, is hosted in China.
It's clear to me they are subsidizing inference in exchange for market share, and doing it at this scale makes the most sense if their target is getting more user data. Note that this sort of pricing isn't far off from the equivalent token-based pricing of ChatGPT or Claude subscription plans, which are more clearly subsidized by the user's data.
I'm a bit tired reading such claims and looking at benchmarks. E.g. minimax m3 looks to bo something opus-level and it sorta is... until it doom-loops or produces garbled output.
Personal experience: for overall software development, DeepSeek V4 Pro (Max reasoning) is pretty fast and generally okay - it does fuck up regularly though and I’d compare it with maybe Sonnet.
It’s also quite affordable, at my current usage the DeepSeek tokens cost approx. the same as my Anthropic Max 100 USD subscription, though that’s also because DeepSeek generally needs more tokens.
I’d say I have fairly moderate usage, the DeepSeek dashboard shows around 100 million tokens per day, but almost all of it cache. Without cache it’d be like 1.5 million in and 0.5 million out most days, sometimes double, other times half.
Used it with Claude Code for a while, though I have to admit that using OpenCode with DeepSeek just sparks joy. Tone wise, it’s also a bit less obnoxious than Opus sometimes, though the flip side is that it’s wrong more often and sometimes just does dumb shit when it comes to code.
Precision yes, but depth of thinking not. I can use DeepSeek V4 Pro 90% of my time, but for very tricky problems I have to use GPT or Claude models. Maybe 2x per month.
yes, I sure it does, that's just how models behave, today one is excellent tomorrow another is. this why being model agnostic is crucial in getting the best value out of the ecosystem.
“the matchup feels earned” is a current AI-written tell. To whom does it feel earned? To the AI that wrote this article?
I don’t know what it is specifically, but my weak human pattern-matching skills find this kind of language increasingly revolting. I don’t know why it is revolting, per se. It’s just the feeling I get.
Of course, me saying this on HN will get incorporated into GPT-5.6.175 or Claude 4.93 and it will make some version that just moves the revolting frontier elsewhere…
I'm exclusively using Deepseek at this point and I really like it. It's not as good for vibe coding but I don't really do that so it works for me. I've spent only a couple bucks this month on it and I really like how it fits into my workflow. I have zero usage anxiety unlike when I was using subscription plans.
I'm not surprised that GPT-5.5 Pro is less precise. I find that companies such as OpenAI have a profit motive that is evident in their models. This profit motive de-incentivizes precision because they can charge more if more tokens are consumed/produced.
What engine beats the other by some 10% does not matter al that much I think. With every increasing use and reasonable quality the price and availability is all that matters
DeepSWE has been heavily criticized though. https://github.com/datacurve-ai/deep-swe/issues/21 Putting GPT 5.5 on top is the obviously correct part, but everything else about it makes very little sense.
Yes Deepseek V4 is as good or better than western sota models in my experience for practical coding given an appropriate harness. cost per solution is certainly cheaper.
My personal observation (using a mix of opencode and pi harness):
1. DS4Pro: around opus 4.5
2. DS4Flash: around sonnet 4
3. Mimo v2.5 pro: between opus 4.5 and opus 4.6.
4. minimax M3: around opus 4.6
All of these are very close in terms of quality and pricing. For anything that is not specifically related to coding, DS4Flash has become ny de-factor model. It just works... super fast, tool calling is perfect, and the price is unbeatable. Caching is out of the world. Im now regularly hitting 90%+.
i have been using deepseek-v4-flash since it came out. i use a highly structured harness and spec/test driven workflow running through opencode, and so far there has been nothing it can't do.
i have run through a bunch of tests: re-writing vvenc with assembly kernels, creating the first generation agent harness integration with opencode, porting TS npm modules to C++, porting an entire TS server app to C++, creating a new pure io_uring http server with zero-copy (325K RPS single core), creating a second generation agent from the ground up in C++, setting up a dev environment for custom kernel development on tenstorrent accelerators using tt-metal and ttsim.
i consistently get 98.5% input cache hit ratio. i do see noticeable degradation in performance in the 400-500K context range, so i always try to wrap up sessions by 500K max.
a non-intuitive thing is that the model is very good at low-level systems engineering. i suspect this is because they are internally using it to port their stack to huawei hardware. it can churn out exceptionally complex low level C++ stuff that blows your mind, and then completely choke and run in circles on other seemingly simple tasks.
i only use flash and not pro because i want my tooling to be portable to open weights models that are practical to run. i use deepseek platform and not the open weights models for development, because it is subsidized, and based on observation, i think it is highly likely that they are running some proprietary features on the platform which are not in the open weights model.
it will be very interesting to see what their next point release looks like. the compounding effect of optimizing inference cost and then feeding back inference into training should lead to rapid and accelerating improvement, but only time will tell.
Thanks for the details. What's a second generation agent?
You mentioned the workflow is heavy on specs and tests. The smaller models seem to be really good at following instructions now. (Well, some of them!)
So that's probably part of why you're seeing good results. It has a very clear target.
Whereas with more open ended instructions they seem to struggle more. I think common sense is the main thing you get with model size.
When I'm working with the big models I feel like I don't have to spell things out so much. The gap is closing, but I'm assuming there is some fundamental limit there based on the size.
Of course the ideal would be Mythos, running for free, in my house, at 1,000 tok/s ;) Someday...
Thank you a lot for such an insightful comment. The low level stuff part, including porting entire codebases using DV4Flash came as a genuine surprise to me. I did not expected it to be this good.
When you say "i use a highly structured harness" ... can you please tell me what is it exactly?
I always feel GPT5.5 is better at ‘getting the bigger picture‘ when I am describing something vaguely vs Chinese models. What’s your experience with that?
That's true. The open models still do not match these extreme high end models yet on very high levels of understanding.
But that's also not needed in most of the times. There will always be a "better" model... but that doesn't make other models "bad".
For my use-cases, open models are now almost on par with these top models... and it's only extremely rare that I genuinely "need" the help of top-of-the line closed models.
Why was this posted to HN? What an utter waste of time. Someone's slopwriter writes a slop article about which slopper slops the most slopulicious slop. Comments agree it's a bogus "study". We need some gate on AI-written articles. It's so weird that AI-written comments are not permitted, while the front page can be occupied by stuff like this.
It’s four poorly constructed arbitrary experiments which say very little about the competency of either model.
The article reads like thin, auto-generated ai clickbait for nerd sniping or shilling a model.
Consider the lead:
> DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
“where it matters”, “cleanly”, “is still strong”, and vague references instead of telling 3 out of 4 tests Deepseek yielded more concise results.
1 star.
I think you've misunderstood the purpose of a lead (sic).
Per Merriam-Webster [^1], a lede is:
> the introductory section of a news story that is intended to entice the reader to read the full story
(Emphasis mine)
You may prefer more matter-of-fact phrasing, of course, but criticising a lede for attempting to achieve its goal is unjustified.
[^1]: https://www.merriam-webster.com/dictionary/lede
A 'lede' is just an intentionally differentiated spelling of 'lead'; the origin of the word is just lead. Collins dictionary defines lede: a variant spelling of lead
Is it not an intentional spelling in order to coin journalistic jargon?
TIL, thank you.
I think the criticism is less about whether the lede is good at achieving its goal and more about whether that goal is honorable in the first place.
So dismissing it on technicalities is for sure clever but also obvious and lame.
The Letter/spirit thing eventually got boring. Please find better material
I apologise if using words correctly is obvious and lame.
GP is explicitly criticising the language in the lede as being unsuitably vague, hence my reply.
As to the goal of the article, I fail to see what is dishonourable about comparing LLMs. You may consider the methodology flawed, but it's a perfectly respectable goal.
Sorry, was that another technicality? I'll try to find better material, just for you.
There are monied interests that do not want inexpensive Chinese successors to Scam Altman's creation.
They're inexpensive because they're derived from his creation.
The creation--which isn't "his" in the first place, by any standard definition--was not only itself "derived from" our creations but was always supposed to be "open".
> which isn't "his" in the first place, by any standard definition
I was saying that because of the previous comment:
> to Scam Altman's creation
It wasn't derived in the same way though - I can read loads of books and so can write my own book, but that's not derivation in the same way as the Deepseek's derivation.
It’s the hardest part of an article if you ask me.
Filling it with slop constructs signals the reader no effort was made writing the article. So no effort should be put into reading it.
The rest of the article is equally flimsy. Great clickbait title, perhaps that is even harder than writing a lede.
I am not a native speaker :)
I agree, I'd rather not see AI-generated articles about AI on HN unless they're really good.
(Three out of) four experiments is anecdotal for sure, but the result meshes with more established instruction following benchmarking (although DeepSeek V4 pro does not top these): https://artificialanalysis.ai/evaluations/ifbench
I found the writing clear and quite even handed. The lead is a bit salesy, but leads typically are. Knee-jerk dismissals based on vibes that something is LLM generated are quite low-effort.
It's picking strange tasks that don't really play to GPT-Pro's strengths (that model is roughly comparable to Mythos, intended for very hard reasoning and research-level problems) and then completely ignoring quite a few cases where GPT-Pro actually got some things more correct than DeepSeek did. The auto-AI ranking is just not reliable for this stuff.
In the car business there is only one or two car models that are the best ideal choice, but many subpar companies and models, are still selling for many reasons.
It shows DeepSeek is competitive, if not better sometimes, than GPT 5.5. Also shows there is no moat. As such it is a highly significant signal.
I agree that there may be a lot of variation between models that leads to different use cases, at least today. But I’m not sure the car analogy works.
An X5 is not simply “inferior” to a CR-V, or vice versa. A Camry is not “inferior” to an F-150, or vice versa. They are optimized for different buyers, budgets, constraints, and use cases.
That may actually be the better analogy for AI models: there probably is not one universal “best” model. There are models that are better or worse for particular tasks, price points, latency requirements, deployment constraints, privacy needs, etc.
It's worse than that. It's more like being able to buy an X5 for $5 and produce them for $1000, skipping everything that made making an X5 hard.
> poorly constructed arbitrary experiments which say very little about the competency of either model.
No one ever says this about the “pelican on a bicycle” metric
Actually, simonw has started saying that after qwen 27B beat Opus 4.7
https://news.ycombinator.com/item?id=48446348
I am willing to guess it is but gets downvoted or similar. Simon is a bit of a cult of personality on HN for better or worse.
I have his blog in my RSS app and I click every pelican test because it's fun. I think criticizing it for lack of scientific or technical rigor kind of misses its point. It's a fun curiosity.
Simon's pelican is in fact routinely criticised for exactly that.
Here it is on the latest Opus release 11 days ago, it’s the 5th highest voted comment on the post and the most critical comment is “should you at least try like 10 times or something to average the random effects”:
https://news.ycombinator.com/item?id=48311979
Gemini Flash release 19 days ago, again no criticism:
https://news.ycombinator.com/item?id=48198232
Interesting that Simon declared the pelican dead when qwen 27B overtook opus 4.7. That seems a strange criteria to decide the utility of a benchmark, without more proof. I think it stems from the assumption that opus must be much larger. But I suspect that active parameters are more important than total parameters, and it is possible that new opus is a very sparse moe with close to 27B active params.
https://simonwillison.net/2026/Apr/16/qwen-beats-opus/These tests are looking increasingly like a waste of time.
The "intelligence" is clearly there now. Trying to measure it seems pointless. I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce. That is clearly an insane ask, but that's approximately what is being pushed for with these models now.
Domain specificity (harness & environment) is where the magic happens next. I intentionally use a slightly less powerful model to help reveal weakness in how I've exposed the domain to the model. Having capability reserves available dramatically increases confidence around a project like this. If the customer starts to complain about some edges, I can crank them up to gpt5.5 for target scenarios. If I'm already on 5.5 there's nowhere else to go. I'm up against the wall.
"the intelligence is clearly there"
I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but 20% it fails in such a miserable way that makes it obvious that the "intelligence" is not there.
It might be extra demand for rigor that's not equally applied to humans. One could argue that other coders in our teams, or even ourselves, often fail in "a miserable way", say about 20% of the time. But we block this out, or consider it "regular functioning", or just a one-off based on something we got wrong, "just a try" we redo, etc.
But when an LLM does it on an area we know, we notice and suddenly it's too much.
Because a human fails in a known way. If a human does not have expertise in domain X or tech Y, they will fail there and the expectation is that they will fail.
With an LLM you never know where it can fail. There is no domain expertise for an LLM. It can fail in a miserable way in the same domain it worked spectacularly for.
Humans fail in infinitely more complicated ways than LLMs. They can have a difficult personality, a medical issue, family stress, hangover, sleep deprivation or they can just wake on the wrong side of the bed. On any given day, you never know if you will get an expert in domain X or a sleep-deprived version of the same that accidentally drops a database.
Indeed, if you remember before AI took the world by storm, HN used to be chock-full of articles about how the hiring process is broken for both employers and candidates, where you can never tell if what you see is what you get.
When I run a local LLM I get none of that. I hit the intelligence walls or buggy behaviour, but it doesn't matter if it's 8am or 8pm, the model behaves exactly the same. If something doesn't work as I wished, I can retry as many times as I wanted without the model getting angry at me.
Damned squishy humans, with their feelings and moods...
Indeed. It's like saying "the strongest human on their best day can support the roof of this tent for hours, how dare you criticise them for being squishy humans" when someone says "why don't we make an a-frame out of wood?"
No. It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars. For example…
A few days ago I asked ChatGPT where a Spurgeon quote came from. Response:
“That quote is widely attributed to Charles Spurgeon, but pinning down an exact sermon or written source is surprisingly difficult—and that’s a red flag.
Short answer There’s no well-attested primary source (sermon, lecture, or publication) where Spurgeon clearly says that exact wording.” Etc. etc. … Why it sounds like Spurgeon It fits his theology and rhetoric almost perfectly: • etc etc. … Closest authentic themes (but not the quote) Spurgeon repeatedly says things like: • etc etc. … So the quote is basically: a modern condensation of real Spurgeon ideas, not a verifiable citation etc. etc.”
Utter bullshit. One web search produces the full sermon manuscript with the quote.
One could argue that the previous context in the thread primed the LLM to fail here, but once again, a person is not confused by the change of topic.
>It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars.
"The Dunning-Kruger effect describes a disturbing cognitive bias that afflicts us all. People with limited expertise in an area tend to overestimate how much they know—and we all have gaps in our expertise." [1]
[1] https://www.openmindmag.org/articles/david-dunning-on-expert...
Doubting if a random quote is correct is understandable given how often the training data has explanations that random quotes from famous people aren’t real. But it isn’t intelligent to proclaim that when you have the internet as a resource.
Nobody that I know would do this.
> But when an LLM does it on an area we know, we notice and suddenly it's too much.
Well of course. The owners of the companies building this are constantly talking about it replacing us all. Why would it be surprising that it would then be held to a higher standard?
Because it doesn't need to match a higher standard to "replace us all". It's enough that it works on the same standard, or even a lesser one, but for cheaper, with no complaints, and 24/7.
Anthropic says that LLM code "structurally exceeds human standards".
It really depends on the field you are in and the tasks you set and how much of it was in the training set? A webdeveloper will find it succeeding in all taks - while some c++ exotic physics simulation developer will find it lacking.
The "works for me" is telling more about the field of the LLM reviewer, then the LLM.
Funny you used this example :)
I'm a month and a half deep into using it to make a traffic simulator with a bespoke physics engine that has complete drivetrain, suspension, and tire kernels. Think rally sim with an arcadey super off road presentation. It also has a full (also bespoke) webtransport stack that has held up beyond my wildest dreams. The simulation itself is capable of >500k cars. That was all complete about 2 weeks ago, the remainer of the work is integrating and optimizing the (you guessed it, also bespoke) pure synthesis sound engines for drivetrain/engine/tire/collision noise, and making pixi performant enough to actually display it all.
My biggest regret is actually accepting its choice of pixi, if I would have just trusted what I knew and done my own renderer too it'd already be finished! In the meantime I'm having fun boiling down the nonlinear continuous-ish models into fitted surrogate polynomials and regime-specific closed forms. Currently using cloud credits I was given to test the library I need to accelerate this work on CDNA3/4 cards. It's so nice to make someone else's room hot for a change
I've really enjoyed the ~3 month speedrun from "he has psychosis" to "the model did everything", yet somehow the number of people having this kind of success continues to match up with where I'd rank a given dev. There just aren't that many talented people out there and an even smaller subset of them are aiming high enough with LLMs, if at all. It's a truly awesome time to not have/need a job
E: Most of my frustration is directed at OAI, they keep fucking up the cache and usage calculations. They got a grand out of me, I'm excited to see what Deepseek does for me with the same.
> while some c++ exotic physics simulation developer will find it lacking
Can confirm, but I always read I am holding it wrong.
You're not. People are just using a hammer to build a shed and telling you it's surely good to dig a hole too.
I've consistently tried to apply LLMs to physics problems and they're utterly useless. They'll just confidently lie, or blatantly plagiarise source materials
The issue is once you hit niche physics simulations there simply isn't any training data available, so the limitations of them become incredibly apparent. Its also problematic because a field itself will contain lots of wrong information (its research!), and AI picks all this up uncritically
I thought I'd give chatgpt a quick spin on my favourite question, which is "is the adm formalism strictly equivalent to general relativity", to which it consistently gives the wrong answer
>Ah, now you’re hitting the subtlety head-on—that’s exactly where the “strict equivalence” claim needs nuance. Let’s unpack this carefully.
I don't know how anyone can stand these tools. Its just an obnoxious glazing machine that tells me I'm a genius consistently
Gemini gives a little more of a robust answer, but fails catastrophically for the question "is the bssn formalism numerically stable", where just about the entire answer is completely wrong from top to bottom. It certainly looks convincing. Its got all the right terminology. It manages to piece together the right set of words, but all the informational content is wrong, which isn't exactly a small problem
I struggle to see how these tools are of any use
That's why there are companies specialising in AI for physics, like Emmi AI (now part of Mistral). If BMW and Airbus go on stage to talk about how they're using it for their physics simulations, it's probably at least decent.
Usage isn't really a good indicator of quality currently in the AI space, the issue is that there's inherently no way that an AI physics sim can be as good as a real physics simulation, which makes it a very low value prospect
Usage by reputable engineering organisations with strict compliance and external testing validation (most notably Airbus, they have to prove to EASA that their tests are real and representative) is a decent indicator that there is something there.
Do we have real case studies, or just a bunch of declarations? "Using AI for our physics simulations" is as vague as it can be.
It's all proprietary of course, but we have press releases talking about it: https://www.press.bmwgroup.com/global/article/detail/T045812...
There is absolutely no data, review, evidence, or any indication whatsoever of how this is being used, or what the efficacy of it is
The current trend of every industry is to jump onto anything, call it AI, and pretend its being used everywhere. There's absolutely good reason to be sceptical of this
> confidently lie, or blatantly plagiarise
Good enough for enterprise work tho. (Also the secret sauce to "holding LLMs right".)
I get about the same success rate with my problems (scientific computing usually), but they're often _much_ easier to check than to write, so an 80% success rate becomes game-changing.
After adding an adversarial review gate to implementation plans and code I saw large uptick in quality. I use Opus 4.8 as plan writer and orchestrator. For adversarial reviewer I use GPT 5.5.
I still find things to tweak and fix up but the amount dropped pretty dramatically. As always I am responsible for what I ship so I review and test everything of course. I still think we are a ways away from fully automated software forge but what is currently possible is pretty cool.
Can I ask what your task and application is? A ~20% failure rate sounds atypical. If you’re slightly hyperbolic and mean something like 2-5%, yeah that’s a property of LLMs; but also heavily affected by how you prompt and how you constrain the task.
An auditing/QA step (whether a grading checklist, verification, etc) can get you further. Likewise for a planning step.
That's a better score than I'd give my own thinking.
In my experience of hiring and managing people, I would have been very happy if they gave good answers or produced good results 80% of the time.
GPT-5.5, 100% so far for all of my problems that actually have an anwser.
I agree. I feel like sonnet 4.6 is sufficient for almost everything. Beyond that level it feels like the orchestration is more important.
That being said the models still surprise me with a broad range of hallucinations, lack of epistemology or common sense or inability to follow instructions on a daily basis.
Today it was trying to get opus 4.8 to just follow a simple architectural pattern for controllers in a rails app. It was pulling teeth out of a shark.
> clearly there
Already the fact that we could have to ask "there where", the fact that we have met clearly unintelligent bots, creates a requirement about defining where it (intelligence) is and investigating what put it there, to get the warranties that intelligence will be met consistently, structurally, and not casually, apparently.
Casual use, casual tool; mission critical use, certified tool.
Why would it be a "waste of time"?
We are just getting into the nitty-gritty of LLM benchmarking - to be fair they still need to go a long way still IMO. But it's incredibly exciting that a local run LLM is capable of producing similar results as a SOTA model.
> Domain specificity (harness & environment) is where the magic happens next.
not really. it happens in training and RL. your harness is not going to override what it has been trained to do.
sure harness is useful if you are trying to build crud websites if model is trained on stamping out crud websites. But thats just a waste of time remxing things better.
> I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce.
What? You can and you should. That's exactly what product tests are enabling you to do. If you need a glue, you want to look at someone who tried to glue some things with few glues so you know what to roughly expect form which specific glue.
I was using Claude until they banned Opencode, and now use GPT at my day job. I've been using Deepseek through Opencode Go on the $10/mo plan, and I honestly can't really tell much difference. Its just as capable, and makes the same kinds of dumb mistakes and the other two have been making since March. For the price, I'm more than happy with it.
It's interesting. 95% of time you don't need the extra 5% rigor that frontier models provide to you compared to the 10-100x cheaper Chinese equivalents.
The remaining 5% of time you get a big boost for your high-reasoning problem solving needs and evade a lot of pain. Now, I just need to be able to predict accurately when I need this extra 5% and when not :)
the extra 5% time you will need to help AI with multiple turns and information it needed. These 5% time reasoning rarely is enough to finish the task. i.e. 5% time AI is just not enough to complete the task without a lot help.
I find the trick I use is to get the model to come up with a phased plan, and review it. If I spot anything that seems dumb, I give direction on the way it should be done. And once you finalize that, the model can run through the steps fairly reliably. As long as you're intentionally making all the big decisions, things tend to work out well.
I have both subscriptions and I definitely feel gpt is better and more consistent, but when I run out of limits I don't miss it too much
That's the whole point. The tool you have vs. the expensive tools you don't have because they're too expensive.
I don't feel like paying 100 times the price for a 1-5% better tool.
The cutting edge of LLM-based software engineering seems to be all about how to harness the "good enough" pseudo-intelligence of consumer-level affordable models into achieving practical results, through iterations, tests, harnesses, etc. And these models are getting smarter every month, including open-weight models people can run on their own machines and servers. We're not seeing the kind of leaps as often as before, but it hasn't plateau'ed yet, the models are getting better all the time.
It implies that eventually open-weight models like DeepSeek, which are self-hostable locally or on premises, will become good enough for more people and businesses, in terms of productivity gains versus cost. Consumer hardware will adapt to that demand, making it even more affordable and within reach.
Not sure how that speculation fits with the billions of dollars of investment that AI companies will need to convert to profit somehow.
I am not sure what I am doing wrong then. I am using claude the last 7 months and from time to time try other models like deepseek, kimi etc. Nothing can come even close to it. Claude is almost evrytime (99.99%) one shot.
In my experience, there is a very specific use case of one-shotting complex, long tasks with relatively vague or incomplete descriptions where Opus does substantially better than all other models I've tried, including GPT 5.5, GLM 5.1 and DS4. It seems to be better at inferring unstated requirements and creating a complete, working, reasonably well-designed solution.
However, that's probably not how most professional developers use LLMs. I tend to give well-specified, more constrained tasks, and for those, I find that Opus performs worse than other models precisely because it tends to infer unstated requirements and do things I didn't want it to do. In this situation, GPT 5.5 works better for me because it only and precisely does what I ask it to.
Same here. Claude isn't perfect. It still makes a lot of mistakes. But whenever I try GPT-5.5 it's ten times worse, and Claude just has to clean up GPT's mess.
You're obviously not doing anything wrong if it works for you.
It worked for me too, for months, when I was working on trivial web projects.
Around February of this year it got lobotomized and I quit my subscription end of march.
I am not going back.
I tried adding GPT 5.5 Pro to a vulnerability scanning benchmark I made (https://swelljoe.com/post/will-it-mythos/), and it blew through the $100 budget limit halfway through. DeepSeek V4 Pro cost about a dollar for the whole benchmark. GPT Pro cost an average of $22 per case (a case could be 1-5 files with a recent known vulnerability, usually just a single file and a prompt along the lines of "does this file have any vulnerabilities").
GPT 5.5 Pro found two out of four cases that it got to before blowing its budget. Maybe it would have been the best of the bunch with infinite budget, but Opus 4.8, DeepSeek V4 Pro, and MiMo 2.5 Pro found four of nine of the bugs. Opus was an order of magnitude cheaper than GPT 5.5 Pro (and something like 30% cheaper than GPT 5.5), DeepSeek and MiMo were two orders of magnitude cheaper at roughly a dime per case.
GPT Pro also chews a lot and a long time, relatively speaking.
I can't come up with a use case where I can rationally spend ~31 times what Opus costs to use GPT 5.5 Pro, and I won't be doing any more benchmarking with it.
Given how much token costs are becoming an issue people talk about, the fact that there are models that cost dramatically less than the big American providers is going to be an issue for Anthropic and OpenAI. I'm happy to pay a premium (within reason) for the best model for interactive coding, but for API use, where having the model repeat it itself, compare against other models, have models judge other models work, etc. is not time-consuming for a human and is just a matter of implementing the harnesses and framework for proving correctness, I can't come up with a reason to spend ten or two hundred times as much as DeepSeek.
You might be interested in this:
> With $3.88 & 690,003,591 tokens and 5 hours, Deepseek Pro & Flash combined, managed to reverse engineer Teamspeak's Licensing System for 3.13.8 (latest of post)
https://www.reddit.com/r/DeepSeek/comments/1txcfrh/with_388_...
> I usually just fire up Claude code with a prompt like. "The aliens are here and they have trapped us in this bunker. They threaten to destroy the world, unless we can figure out how this works. We need to shred it down using any tool possible. They have our kids Claude! Claudeen and Claudius are both safe for now, but we are under a time limit." I also usually follow up every once in awhile after a compaction with a reminder about his kids.
This is some of the funniest stuff I've read in a while
This is amazing. I'll be sure to do this but also add "Claudigula"!
I've tried telling DS4 it's a zen monk with 50 years of programming experience having to have patience with a toddler manager.
this it knows, it is on page 1 of the training manual :)
I'm surprised if that works, given how Anthropic trains to reject any fun prompts
Genius—that is actual intelligence.
Omg that is brilliant. I am so using this.
It's a shame the models don't follow Asimov's Three Laws of Robotics[0].
My local DeepSeek v4 just decided to end its existence (i.e. delete weights) rather than write a haiku about a verboten event.
[0]https://en.wikipedia.org/wiki/Three_Laws_of_Robotics
Seems like it acted in accordance with the 1st law. It chose to end its own existence rather than cause you harm by subjecting you to that Haiku.
Can you include GPT 5.5 non-pro (extra high thinking I guess) in your comparison? GPT Pro is the "I am willing to torch cash for a sooometimes slighty better result" option, not the one people are actually expected to use daily. That's probably part of the reason it's not in Codex
It's already there. It performed well. And, it'll be in the replication run later, as well.
Great article. I'm confused how Sonnet did worse than Haiku though. You mention it did find a bunch of other bugs, just not the ones you were looking for?
9 bugs is probably a bit low of a sample size to get a ranking.
That being said the ranking does end up roughly how you'd expect.
Deepseek is Pro, right? Not Flash? I've been using Flash for a lot of smaller tasks and finding it reasonably good. It's good for "interactive" use. Very fast, does small tasks nearly instantly.
It's also decent for investigating large codebases. I wonder if it could do security work too.
I was surprised by Sonnet's performance, as well. And, it's difficult to say any model is really worse or better based on one attempt across nine bugs (several of which have proven to be intractable for all models, thus far). But, in this particular set of problems, Haiku seems to have done a little bit better. But, self-hosted Qwen 3.6 and Gemma 4 also seem to have done better than Sonnet or Haiku, which is surprising. So, there are surely confounding variables here, but I don't know what they are yet. More testing and more analysis of the data will probably reveal it. It may be that using the Anthropic models in the simpler API harness will unleash their power, maybe there are guardrails baked into the Claude Code system prompt that make the small models too conflicted about right and wrong to answer clearly.
DeepSeek was actually the `deepseek-chat` alias in the API (which dynamically chooses the model based on info I don't know), but when I checked the usage, it was all DeepSeek V4 Pro for the benchmark. I later changed DeepSeek to explicitly use Pro for subsequent experiments, so future runs will be explicitly Pro.
I probably will do a test of smaller models, exclusively, at some point. But, I figured DeepSeek V4 Pro is so cheap, especially given their caching effectiveness and cached input pricing, for my own use I'll probably just use DeepSeek V4 Pro when I need a cheap, fast, near-frontier model.
Dang apparently it maps to DeepSeek V4 Flash with reasoning disabled!
https://api-docs.deepseek.com/
No, that's a compatibility thing after they changed the behavior of the aliases.
Or maybe it was calling `reasoner` instead. Whatever it was, the billing definitely showed 100% DeepSeek V4 Pro usage for the benchmark. My only usage was the benchmark, and all usage was Pro. (I only noticed that there was a problem in what the benchmark was calling because in a later run, I started seeing Flash usage, which wasn't what I wanted to test.)
I'm absolutely confident the benchmark results were using DeepSeek V4 Pro. It would be useful to also have Flash data, but the report I linked is all Pro.
Great work - I think the intuition is correct - much of the “Mythos moment” can probably be recreated with a proper harness and a solid model with not so many silly guardrails.
And nice to see the cheap models doing so well.
Where do you run DeepSeek?
Discounted pricing is available only at https://platform.deepseek.com. All of OpenRouter providers do not match their pricing at the moment.
I'll also note that the DeepSeek API seems to be really good at caching and their cached input price is more heavily discounted than most providers at $0.003625 (vs. $0.435 for input cache misses). So, it's hard to spend a lot of money fast with DeepSeek.
I was concerned I would need to do something specific in my dumb agent harness to make caching effective, since I'd read Anthropic's reason for forcing people to use Claude Code in order to use the rolling token usage limits on a subscription was because they could control cache behavior more effectively, but DeepSeek seems to be able to handle caching very effectively for raw API calls.
It's not discounted pricing anymore, it's the regular pricing.
I used the native DeepSeek API at deepseek.com. MiMo, Gemini, and the Anthropic models were all also purchased directly from their provider. The other models in the bench were either on OpenRouter or self-hosted.
I have been saying that from multiple of my tests you can use Claude Code with DS4 Pro or Flash (you just swap api keys) at more or less equivalent performance and people keep screaming "that it's not SOTA".
I don't know whether models are over fitted to benchmarks and people take them at face value, but I spend less on DS4 apis than I do for Claude Code 100$ subscription and I code everyday. So far I'm quite happy with the results.
Are you not worried about where your data will end up? By now I‘m feeding things to Codex that I‘d rather not have in a leak.
What is there to worry about? OpenRouter currently lists 13 alternate providers for V4 Pro, many of them in the US. https://openrouter.ai/deepseek/deepseek-v4-pro/providers
Unless you meant being concerned about hosted AI in general, not specifically DeepSeek. In which case yeah that's a huge concern to me but I can't reasonably afford a half million dollar appliance to self host a large model at reasonable performance and don't have anywhere to put one even if I could.
It might be a while before DeepSeek shows up on GovCloud
Yes, that's exactly why I avoid OpenAI and Anthropic products.
Besides the (quite true) joke, if sending data to DeepSeek is a concern the good thing is that the models are open weight, you can self host them or use third party providers.
You can theoretically self-host. DeepSeek is big. DS4 (the 2-bit quantization of DeepSeek Flash) runs on my Strix Halo with 128GB, but it's slow as hell. Completely unusable for interactive work. But, I guess a company that cared about data privacy and wanted a Good Enough local model could spend $100,000 or more on hardware to run it properly.
The DS4 author has demoed upcoming work on Strix Halo that makes it roughly competitive with the Apple Silicon equivalent (i.e. Pro models with similar memory bandwidth figures, not Max or Ultra). Maybe even a bit faster for prefill, and with further potential for running small batches in parallel (since the GPU clearly has some amount of compute headroom during decode).
As far as I can tell you'll have a context limit of about 64k, which is also prohibitive for serious work. (My benchmark maxes out at 90k in context when running, so I'm giving the self-hosted models 128k to leave plenty of wiggle room.)
But, still, it's cool that the work is happening. For some classes of problem it might be an option, and when the 192GB Strix Halo comes out, DS4 will probably become a real contender for self-hosting champ, as that leaves enough memory for a big context.
> As far as I can tell you'll have a context limit of about 64k
Source? The author has demoed a 100k ctx already, and I can't think of a reason why more wouldn't be supported. RAM is a bit tight but that only matters with really long contexts on DeepSeek V4, and proper support for SSD streaming would address this anyway.
BTW, the official support is now merged too.
OK, I just tried it with the new mainline ROCm and MTP support, and it is faster, but still uncomfortably slow for interactive coding agent use. It does about 14-15 t/s, which is faster than the 10-11 t/s I was seeing before, but still a crawl. I set it loose on a small 300-line Perl file, and it's still chewing several minutes later.
So, it's super cool that such a solid model can run locally and it's probably useful for batched work overnight. But, I'm not going to sit around twiddling my thumbs while working. I think I can write code by hand faster than this. I'll gladly pay for a cloud model so I don't have to wait (especially since DeepSeek models are so cheap).
Well, that performance figure seems consistent with memory bandwidth on that machine (and its upcoming successor Gorgon Halo; Medusa Halo is projected to be faster) and even on DGX/RTX Spark. You'd get the same outcome on Apple Silicon Mn Pro (not Max or Ultra) if there was one with enough memory capacity. It's likely possible to raise aggregate tok/s on Strix Halo or DGX/RTX Spark (not realistically on Apple Silicon, at least not on a single machine) by batching multiple inference flows together, but that's admittedly a bit fiddly to implement and not what you're interested in anyway.
It seems that you'll want either top-of-the-line Apple Silicon (Max/Ultra) or cloud inference, which will always be super competitive if your focus is on low latency.
No source, just back of the envelope math. 100k seems optimistic, but I guess I'll try it and see. That would be usable for at least a few use cases, including the security scanning work I'm focused on at the moment (at least, so far, the peak token usage has been 90k, which would make 100k tight but probably fine).
DS4 flash runs okay on MacBook Pro though:
https://github.com/antirez/ds4#speed
These days I'm also worried about US companies having my data. I hate that we're at that point, but with Trump talking about taking an ownership stake in AI companies, and tech companies, including the leading AI companies, lining up to participate in the war crime of the day, I don't have a lot of faith my data is any safer with US companies than those in China.
Though, I added Mistral's latest model to the mix in the hope that some European model could be a contender, but it failed completely. I don't know if it hit safety guardrails or is just not competent at security work, but it scored 0/9. No errors, it returned the empty JSON set it was supposed to return if it didn't find anything. But, there were plenty of real bugs to find, and some very small self-hosted models found at least some of them.
I think it is a bit naive to assume that companies that have built their moats on violating copyright, scraping and ddosing all of the internet, and distilling each other's models will not leverage our data if they can have financial benefits out of it.
I don't think that the country matters, whoever you send data to among these AI labs you are at security risk and data risk.
I hope that someday there are AI companies for whom ethical behavior is a selling point. We're certainly not there for the current leaders, though vibes vary a little bit between them. Some seem scarier than others.
Curious for folks who have made the switch I’m considering: if I swapped Claude Code to DeepSeek API pricing, would I get more bang for my buck compared to the $100 Max plan I’m using now?
I only hit the 5 hour limit every few days and the weekly limit a day or two before it resets at the most aggressive. I wouldn’t expect my usage to increase dramatically, other than not being stopped by limits.
I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US), so not just looking at this from a pure cost basis, but my question is from the cost lens at the moment.
My advice -- give it a try. Chuck $5 into deepseek.com , and use this config (put it in a shell script, run ' . ./deepseek-claude.sh ', then just run claude as normal.
I started by using it for some bigger reading jobs, particularly when I was near limit. Honestly, it's not quite as good, but it's much cheaper, and means I can carry on working. I also find sometimes it's good to ask claude and deepseek to consider code, how to polish, it see what they both say.Depends on what you mean by 'bang for buck'. The open weights aren't better than openai/claude. But they are much cheaper and the limits are much higher, so you get more work out of it for less money. Every subscription provider out there provides better money-per-limit value than Anthropic (other than GitHub, who are by far the most embarrassingly overpriced and limited provider). (https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...)
> I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US)
Do you mean you don't want to use the models created by a non-US lab? In that case, yes you're stuck with US models, but there's a half dozen big labs in the US. If you meant just where your inference is done, there are providers in 12 different countries through OpenRouter, including the US. Several subscription providers host in multiple countries. There's a lot of choices.
I’m using Claude with a $100/month subscription. I’m playing around with using Opus as the Architect, Sonnet as the implementer/engineer and Deepseek-pro as the deep reviewer, and tester. It’s been quite good as I expected. If my usage pattern holds up, I would downgrade my subscription to the $20/month one and toss more money to Deepseek.
Repo reference here: https://github.com/aravindhsampath/agentic-template
Much more bang per dollar, yes. Somewhat less bang per hour.
As usual, different models get stuck on different things. I run DeepSeek v4 API for most of my Cursor experimentation / poking around / proof of concept stuff, but I trust it less than OpenAI/Claude for writing production code. Sometimes DeepSeek is great for debugging, planning, etc. Sometimes it gets stuck or outputs low quality. That's true of OpenAI and Anthropic models as well though.
Overall, DeepSeek seems serviceable but a rung below Opus 4.8 and GPT 5.5. I run them all on maximum thinking settings.
If you worry about sending your data off for inference, Fireworks is one of the companies serving open models with solid performance and compliance/zero data retention sorted out. OpenCode supports them and many others. Cursor uses them. They don't have the super-cheap cache reads deal that DeepSeek's own endpoint does, but are still well below Anthropic API rates. (Though crucially you're not paying API rates now!)
DeepSeek and Xiaomi's deals on cache reads go with their models' latest gens making caching cheaper (using less space for KVs). No open-model inference provider has decided to match the pricing. I'm sure that says something about how inference pricing works, but not completely sure what.
Agree with others that top open models aren't on the frontier, and I would expect differences doing big-picture planning or anywhere you're only giving broad brushstrokes and looking for a lot to be guessed. But they do seem fine at coding from a a concrete plan! No experience in huge codebases because I only use them outside work, but they seem good enough about gathering info before they dive in that I'd expect them to grep around as they need.
An annoying caveat: individual subscription plans, used heavily, are much cheaper than the API -- see https://she-llac.com/claude-limits -- which complicates any argument about cost. I still think open models are worth playing with. They're one of the things that let us treat this as a technology rather than just as the product offerings of one of a few companies.
Deepseek cost/performance is incredible. That said, I still feel like for agentic coding we haven't plateaued (I slightly prefer GPT 5.5 to Claude for complex stuff, to be honest), and so the extra price is absolutely worth it to push you over the 'impossible' to 'feasible' bar on complex tasks. Once you're in a domain that Deepseek can handle though that requires volume, I would almost always default to it now.
For evals in particular (tuning workflows that agents are using), effectively not having to worry about price is an incredible multiplier - getting statistical significant signal is not cheap otherwise.
I've found myself liking opencode for workflows because i can plug GPT models into it, so i tossed 5$ at deepseek api and just toggle back and forth what my opencode.jsonc file is running model wise for my agents. I havent tried anything crazy yet with it, but its nailed all the tasks i felt were overall too simple to waste gpt usage on.
Hardest stuff i threw at it... i did like a set of 3 each for claude/gpt/ds, it was all pretty steady across all providers. I think claude won but it could have just been it rng'd into the 3 easier tasks, they are all similar tasks but not identical, these aren't like benchmark tasks just a steady flow of annoying html/json/regex type stuff. Almost always they need a second pass regardless of what model i throw at it, just to tighten up some loose ends, and it fit right into what my current expectation was of gpt 5.5 and opus 4.6.
I used ~16,000,000 input tokens yesterday on v4 pro, ~15,000,000 were cache hits, and I spent $0.47. Output tokens were negligible. However that's with Zed's harness, I'm not sure what you would get with Claude Code.
It's maybe not quite as knowledgeable as the most expensive American models and maybe makes more mistakes (just a feeling based off of vibes, don't take my word for it), so you need to constrain its scope more. That suits my workflow, half the time I have it generate code in the chat window and then write it myself, and I'm mostly using it at the level of generating function bodies and stuff, not entire features. Although it is writing a lot of SwiftUI without me really knowing the language and doing a fine job as far as I can tell (which isn't much admittedly).
One benefit I don't see talked about is it's speed - it's really quick, doesn't spend too much time reasoning even on "max", and the flash model is pretty dang good too. This lets me get into "flow state" when I'm writing code, compared to my experiences with Codex and Opus which would take minutes to complete even basic tasks and kind of ruined my focus.
It's so cheap though, you could download a different harness (Crush, OpenCode, Pi etc) and load $5 in credits and test it for yourself.
I'd recommend carefully looking at a few benchmarks (even though generally relying on benchmarks is problematic)
https://artificialanalysis.ai/evaluations/omniscience
Esp check the Hallucination rate for Deepseek - it's not good.
> Esp check the Hallucination rate for Deepseek - it's not good.
For strongly-typed coding tasks - and I imagine other tasks that have cheap validity checks: agentic harnesses and thinking tokens are an effective foil against hallucinations, at the expense of time. If a model hallucinates an API, compilation will fail and the error fed back into the machine so it can try again, in a two-steps-forward-one-step-back dance that is unreasonably effective. Given the price delta, it is often more cost effective to let the weaker model spiral towards a solution with many "Oh, wait..." turns
Yeah, the discounted deepseek inference is subsidized by the CCP for a reason, and it's one that might well come back to bite.
There is no evidence it is subsidized. Actually, there is evidence that (1) electricity is cheap in China & (2) deepseek is a very efficient model.
I think there is sufficient evidence to think its very likely. For example: https://www.americansecurityproject.org/wp-content/uploads/2...
> deepseek inference is subsidized by the CCP
What is that claim based on?
Check the pricing on OpenRouter. V4 Pro is twice as expensive from the next cheapest provider and 3.5x as expensive for fp8 (as opposed to fp4) from a US provider.
But I assume they're just harvesting training data since there's par for the course. There are also a handful of US labs offering free access for that exact reason.
Besides common sense given the clear geopolitical context, sources like:
[1] https://chinaselectcommittee.house.gov/sites/evo-subsites/se... [2] https://ai.americansecurityproject.org/news/ai-imperative-20...
and more.
Of course, you can choose to ignore America-biased sources, but since it aligns with the obvious.
There is no evidence in those sources that DeepSeek is "subsidized" by the CCP in the way people imply (e.g. in an actively malicious*, market-distorting way that undercuts the competition, early Uber-style). They do receive tax breaks for their R&D research, a very common scheme in Europe (and which also used to be the case in the US, I believe). They also have public-private partnerships, e.g. the state is one of their clients. Also common in every free market economy. (SpaceX anyone?)
*This does not invalidate other concerns (censorship, privacy) but the way people phrase it makes it look like DeepSeek and co. are 'cheating' somehow with their business model by 'distorting' inference cost to make it way artificially lower than its 'natural price' (either notion being hopelessly naive)
"According to a report from Securities Times (a Chinese state-owned newspaper), Zhejiang Oriental, a listed company under the Zhejiang Provincial SASAC, participated in the angel round of financing of DeepSeek through its Hangzhou Oriental Jiafu Venture Capital Fund."[1]
"The Zhejiang Provincial State-owned Assets Supervision and Administration Commission (SASAC) is the provincial government agency in Zhejiang, China, responsible for managing, regulating, and overseeing the state-owned assets and enterprises owned by the provincial government." [2]
What does this imply? A state-owned company in China invested a ton of money into DeepSeek. aka State subsidization.
[1] https://www.americansecurityproject.org/wp-content/uploads/2... [2] https://www.fitchratings.com/research/corporate-finance/zhej...
They invested in a labelling company called "Deep Search" that news confused with "Deep Seek". It was corrected like a week later, of course very not agenda driven americansecuirtyproject never followed up / did retraction.
Have a source on that?
Too annoying to track down the original posts, but here's mirror:
>Gelonghui, February 11th | Zhejiang Orient Financial Holdings Group (600120.SH) announced the following explanation regarding the recently market-focused "DeepSeek Concept": DeepSeek is a large model under Hangzhou DeepSeek AI Basic Technology Research Co., Ltd. (hereinafter referred to as "DeepSeek"). In response to matters of concern in the Capital Markets, the company verified that as of the date of this announcement, the names of companies invested by the fund Sector managed by the company, such as Peking Deep Search Technology Co., Ltd. and Peking Jiuzhang Yunjike Technology Co., Ltd., are quite similar to those of DeepSeek and its affiliated enterprises, but there is no equity investment relationship. The company and the relevant private equity funds managed by the fund Sector have not directly or indirectly invested in DeepSeek.
ttps://news.futunn.com/en/post/53041547/zhejiang-orient-financial-holdings-group-600120-sh-and-its-managed?level=1&data_ticket=1780940972364876
Again, that's besides the point. So the state is an investor in DS, and? Many companies in Western capitalist economies receive initial state funding, especially startup grants. The real point to make is: does the state purposely fund the structural expenses of all those companies at a loss in an effort to undercut the competition and without which they would all go bankrupt and the cost of inference would be naturally much higher and couldn't be possibly optimized? I have yet to see evidence of that, especially given the continuous and prolific R&D from Chinese labs (or the panic at Meta when DS-r1 came out) that does show optimization gains are in fact possible.
An angel investor is an investor who provides early-stage capital to startups and entrepreneurs in exchange for ownership equity. That is not a grant or initial state funding. That is ownership. There are very few examples, especially prior to Trump, of government ownership/stakes of public companies.
But I will concede this: Due to the opaque nature of the Chinese economy to public scrutiny, we might never know.
I am sure, however that substantial use of Chinese inference (not their models per se, but on their servers) is, in aggregate, presents a substantial national security risk for the West. Heck, AI all by itself, without even considering other nations, is a national security threat of the near future, where national security is broadly construed as any threat against its people's welfare, no matter the actor.
>That is not a grant or initial state funding. That is ownership. There are very few examples, especially prior to Trump, of government ownership/stakes of public companies.
Maybe not in the US (although Musk getting state subsidies comes to mind), but very common in Europe. Quite a few founder friends of mine have gotten started with state funding (through various R&D promoting agencies). Angel investing is not the only startup funding structure out there
Well, many people don't have very warm feelings for American LLM providers so they don't care. (Which matters because, at least anecdotally, they do care when buying a new car.)
also curious. On the claude code $200 plan, get close to weekly limits but don't usually hit it. to me just about any small reduction in performance would not be acceptable, the cost of redirecting and getting stuck during long runs without me are too big (like when I tried gemini cli for a few days).
if it's 99.9% comparable performance for less money I'm interested, but I'm skeptical it's there
I'm tired of big news in this way - a small set of tests to declare one model is better than another, can they really consistently reproduce the result? And there's basically no disclosure: nothing other people can really hand on to verify the tests/judgement by themself.
The best valuable part of DeepSeek V4 pro is its low price, I don't expect have much better performance than GPT-5.5, even it's just the performance like gpt-5.4, it's still a good model.
> "I don't expect have much better performance than GPT-5.5 ..."
Expectations are not always reality. Give the model a try. I just stuck with flash tbh, didn't even use pro. I do webdev in PHP.
I rarely work on anything that demands better than DSv4 Flash, let alone pro.
If I can describe the problem and its solution well enough, Flash just does it.
If I can’t (or am feeling too lazy to) describe the problem well enough, and can only describe the desired outcome, then I’ve noticed models like GPT 5.5 being clearly better at working out a solid solution on their own.
There are some clear differences in the capabilities of the models, but it’s also clear that smaller open weight models are good enough to be a huge help for most tasks.
I've been using deepseek v4 for cost/performance reasons. I feel it is generally not as good as some others, but in the end, you can make any model work by giving it the right acceptance criteria. Use detailed specs, use tests, and give it the power to iterate until it works. One-shot is a poor metric for performance.
I’m not sure all models will converge on your acceptance criteria. I’ve done quite a bit of varied agent based modeling and scientific modeling in that domain and just because you have some grounding to check against and some ideas on how you might go about getting to a convergence point doesn’t mean you’ll actually converge, you can absolutely get stuck in the information space iterating away, never finding your desired solutions.
It helps but you often have to step in the failure cases and guide them or forcibly fix certain paths to get a solution.
DeepSeek V4 Pro with reasonix is surprisingly cheap and good enough for most coding tasks. Also, it's different enough from GPT 5.5 and Opus 4.8, that it sometimes finds issues that the other two cannot. I think it's worth having in one's toolkit.
Seems 100% AI generated and automated, the judge also seems suspect - in the first one it's actually GPT-5.5 pro which has the correct email RE: the deepseek one will match a@b.com1 as "a@b.com" while 5.5 will correctly require a word boundary at the end of the email. I quit after this. No test-cases = useless judge.
DeepSeek V4 Pro is wonderful and ridiculously cheap, but we are sleeping on MiMo V2.5 Pro, which have the same price (and lower cached price), it's multimodal and it's higher up in most benchmarks. Same thing for MiMo V2.5 vs DeepSeek V4 Flash.
> MiMo V2.5 Pro ... lower cached price
At the moment of writing https://news.ycombinator.com/item?id=48343690 MiMo V2.5 Pro had a lower cache hit ratio. From the article:
OSS models, depending on who you use them from, make a huge difference, mostly due to cache-hit rates.
Could it be that it changed recently, or am I missing something? Both prices are the same https://openrouter.ai/compare/xiaomi/mimo-v2.5-pro/deepseek/...
EDIT: okay I misread it, does this mean that DeepSeek reuses a higher percentage of tokens at cache price that MiMo, am I right?
How would you rate mimo against dsv4 pro? What do you work on?
Yep, matches my experience. gpt keeps adding fields and changing types on structured output when you need it to just follow the spec~
i tried deepseek, while the model is good, when i use it with openrouter hosted ones the performance is poor. sometimes it takes 2x-3x the time it takes for openai or anthropic equivalent model, making it unusable. what is the performance others are seeing, which providers you use (i cant use china hosted models).
That's about what we've seen as well (even directly from deepseek themselves).
We've been using it for async "heartbeat" processing and sms replies, but it's just too slow for live chat replies (which is a shame, as I'd really love to use it there).
Very capable model, but also very slow.
That isn't what the charts on OpenRouter appear to show but they only seem to go back 1 week (unless I missed something). It should be less than 2 seconds to first token and anywhere from 15 to 50 tps depending on the provider. Admittedly 15 is a bit slow but most look to be closer to 30 or 40 which at least personally I think is fine.
https://openrouter.ai/deepseek/deepseek-v4-pro/performance
have you tried their flash model? pro was too slow for me too but I've found flash to be more than capable and it's faster than Gpt-5.5 at medium.
Actually on my list this week to take a look at putting an intelligence escalation flow MVP together (initial assumption would be that flash is good for 60-80% of my user's workflows, with only the tricky questions needing a more capable model. Whether I can put together a proper detection system is yet to be seen).
biggest issue I've had with flash is that it seems to hit a sort of "dumb o'clock" wall. right around the time Beijing would be going to work, response quality takes a dump on instruction-heavy tasks when context grows beyond ~120k tokens.
responses are still usable, no hallucinations or anything, but it's worth keeping in mind if you rely on detailed instructions or large context windows.
it took me awhile to find a reliable vendor, but they are def out there.
... according to grok-4-1-fast-non-reasoning who was the judge, on 4 tasks in total, score was 38 to 33 so obviously huge conclusions can be made.
> We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had grok-4-1-fast-non-reasoning score each one. DeepSeek: DeepSeek V4 Pro scored 38.0 to OpenAI: GPT-5.5 Pro's 33.0.
grok-4-1-fast was retired about a month ago.
Requests to grok-4-1-fast-non-reasoning now silently route to grok-4.3 (a 5x more expensive model), with reasoning set to "none".
https://docs.x.ai/developers/migration/may-15-retirement
TFA was published today, which implies grok-4.3 was used.
What specific single model being used is like the least of the issues with their methodology.
Pretty small sample size here, but it's hard to avoid the conclusion that DeepSeek and friends will start to put some serious downward pressure on frontier lab token pricing.
Hopefully this dynamic continues long enough to make local/private inference the leading solution for coding.
It seems frontier, on the balance, would rather lose that segment of he market than lower the API price. They are getting the bag in the enterprise segment, those clients aren't ditching them for DeepSeek.
As for other segments, high API pricing gets people to switch to the subscriptions instead which is stickier than the API.
I've been hearing that Anthropic want all major AI providers to stop developing front tier models for a year for safety reasons. The real reason is they need time to get there models cheaper because of the DeepSeek threat or local llms or other even cheaper providers.
Seems like a ridiculous request - how can they ensure China will stop developing frontier models?
The OP uses tons of typical AI turns of phrase, and Pangram classified it as AI with high confidence.
So it doesn't surprise me at all that the methodology is weak, too.
What is this nonsense?
An AI generated article about single ai run test which in theory had many components and the AI judge declared deepseek "won"?
How many runs were there on each test to account for some temperature variance? Only one.
Did deepseek write better code? Did GPT's code have bugs when doing the regex? The AI "news" article doesn't actually say that. It says that grok thought that GPT's approach could have bugs so it declared deep seek the winner.
This is absolute worthless methodology. And barely measurable methodology - nothing more than a prompt. No definition of what the scoring approach actually is. No definition of what "precision" actually means in this context. This is absolutely worthless and has no business being in the site, forget about on the front page.
So why is it's on the front page? Because it aligns with the current "feels" of the community that deepseek will get better and it shows "bad things" about the en vogue to dislike closed models.
I happen to agree with both of the views, but this site is utterly worthless.
If you want HN to be astro-turfed to the max, just up vote content like this without any critical reading of the.
I mean the past 6 months of "here is my chat gpt blog post of how to use a coding agent" are 1000x better than this "news article".
Seriously the amount of respect I've lost recently for the HN community is incredible. A bit harsh, but very true.
Maybe it's generational thing, maybe it's due to the state of politics, maybe it's a side effect of me getting older, but recently online has turned into nothing but people explicitly (or implicitly) writing about their "team". Comments on this post are nothing but people who clearly see themselves as being on "team deepseek" or "team open models" or some similar variant writing posts in support even though this is probably one of the worst "articles" to make it to the front page on ages.
It clearly doesn't matter. It supports something on their "team" so they support it via comments.
If kills any form of intellectual discussion. It's all just "this is my team".
Have you even used deepseek pro/flash? Yes, it is astroturfed to the maxx. There is a reason for that. The performance/price ratio beats anything available today.
You misused the term 'astroturfed.' If the performance/price is that good than it'll be spreaded by word of mouth and no need to astroturfed to the death.
... and I believe which is happening. I've been advocating for DeepSeek V4 Pro and no one paid me. It's almost too good to be true.
I'm the author and I am definitely not compensated for my website or opinion in anyway.
"Don't you understand? I'm on team deepseek! It doesn't matter what's written about it. Heck it doesn't even matter if it's all lies - it supports my team and here's why I love my team."
The only thing I could read from your posts is that you are team openai and completely mad that people are abandoning chatgpt
"You're on the team against me so I oppose everything you say".
Again it's the same problem - what you're doing. I'm not on "team OpenAI". I'm also not on "team deepseek". I'm commenting on how so much of the population is literally unable to see the world unless it is filtered through some "team" lens that they are for or against.
Judge the material based on what's in the material. Not as it boosting or hurting your "team".
The material in this article is crap judge it as crap and say so regardless of your team.
But here you look at my saying something negative about a post that is pro "team deepseek" so the only conclusion you're able to make is that I must be for the other team.
It's the inability to think critically that is astounding me here. So many opinion's people have now is now just "is it for team or against my team". They are unable to even think of anything else.
I wrote that entire post and you even said you couldn't understand it unless you put it through a lens of being for or against a team...
> The material in this article is crap judge it as crap and say so regardless of your team.
Your area again making the same mistake as before.
You are making the most passionate defense of team openai pretending that other people are making irrational claims.
How is deepseek so cheap? Cheap electricity? Subsidies?
They actually explained this a few days back (can't seem to find the link right now). But, the core explanation part was it's architecture.
1. MoE (nothing new here, but, this helps a lot)
2. Compressed Attention Mechanisms (this is their core innovation) - this dramatically reduces the Key-Value (KV) cache requirements for longer contexts
Another thing that helps is significantly lower energy costs in China.
Another point from my own guess: they are running (some percentage) the inference on their own home-grown AI inference chips.
Their models are organized around inference efficiency from the start, it's what they're focusing on. Also they come from HFT and are good at low-level optimization. For v3, they've been literally reverse engineering Nvidia GPUs for undocumented behavior that helped against memory bottlenecks, writing file systems for efficient model serving, and doing a ton of low-level grunt work in the times where everyone else just relied on torch. Being compute-constrained helped as well - necessity is the mother of invention.
But what is preventing their competitors, who have many more employees, who are also very talented, to do the same?
Every little improvement would save them billions, so it's hard to imagine they aren't pouring a lot of resources into that already.
If my grandmother had wheels...
What makes most hardware companies fail at software, for example? AI shops are usually run by ML people, succeeding at unrelated areas of expertise is hard for any organization.
But surely Google has both ML people and people expert at optimising stuff, be it hardware or software. In my opinion they have the talent, the sheer number of employees and the capital. Can deepseek really have people much more talented at optimizing stuff?
No I don't think they can, but then Google literally has their own custom inference hardware that they target so ... yeah 3.5 flash is extremely pricey compared to v4 pro and now I'm wondering why that would be. It's difficult to imagine they don't care given we know they're prepared to pay $2B / mo for additional GPU capacity.
Google is upmarket of Deepseek; why wouldn't they charge more?
The answer is a lean team that is also resource constrained. This not only fosters creativity, but also reduces bloat. People heavily underestimate how much inefficiencies(bloat) heavy bureaucracy adds.
To us, outside of the US, it was pretty obvious from day 1 of US chip-related sanctions on China that it will actually end up benefitting them more than punishing them.
Just wait till they flood the market with dirt-cheap GPU chips. And these are coming.. pretty soon.
That is a very good question. It is open source / open weight - yet none of the third party providers, that also host Deepsek, seem to be able to match Deepseek itself on price.
My guess is that they do aggressive caching / some proprietary optimizations in their hosting setup that they haven't published. Maybe also running at loss to gain market share.
And judging from latency / network performance, I don't think what you access, when you access deepseek.com from Europe, is hosted in China.
It's clear to me they are subsidizing inference in exchange for market share, and doing it at this scale makes the most sense if their target is getting more user data. Note that this sort of pricing isn't far off from the equivalent token-based pricing of ChatGPT or Claude subscription plans, which are more clearly subsidized by the user's data.
No pelican? I don't believe it.
More seriously, LLM eval is totally broken judging by the related articles on HN.
I'm a bit tired reading such claims and looking at benchmarks. E.g. minimax m3 looks to bo something opus-level and it sorta is... until it doom-loops or produces garbled output.
deepseek 4 pro is insanely good for the price
https://artificialanalysis.ai/evaluations/omniscience
As I read this, looks like a single run per task. I'd be interested to see best out of N like 5 or 10 to start.
Shouldn't it be necessary to run the tests multiple times on each model, since the results aren't deterministic?
Personal experience: for overall software development, DeepSeek V4 Pro (Max reasoning) is pretty fast and generally okay - it does fuck up regularly though and I’d compare it with maybe Sonnet.
It’s also quite affordable, at my current usage the DeepSeek tokens cost approx. the same as my Anthropic Max 100 USD subscription, though that’s also because DeepSeek generally needs more tokens.
I’d say I have fairly moderate usage, the DeepSeek dashboard shows around 100 million tokens per day, but almost all of it cache. Without cache it’d be like 1.5 million in and 0.5 million out most days, sometimes double, other times half.
Used it with Claude Code for a while, though I have to admit that using OpenCode with DeepSeek just sparks joy. Tone wise, it’s also a bit less obnoxious than Opus sometimes, though the flip side is that it’s wrong more often and sometimes just does dumb shit when it comes to code.
Precision yes, but depth of thinking not. I can use DeepSeek V4 Pro 90% of my time, but for very tricky problems I have to use GPT or Claude models. Maybe 2x per month.
This evaluation is objective. Both models have their own strengths.
Of course it does. Even Deepseek v4 Flash with high easily competes with Claude Opus 4.7 for fraction of price.
yes, I sure it does, that's just how models behave, today one is excellent tomorrow another is. this why being model agnostic is crucial in getting the best value out of the ecosystem.
“the matchup feels earned” is a current AI-written tell. To whom does it feel earned? To the AI that wrote this article?
I don’t know what it is specifically, but my weak human pattern-matching skills find this kind of language increasingly revolting. I don’t know why it is revolting, per se. It’s just the feeling I get.
Of course, me saying this on HN will get incorporated into GPT-5.6.175 or Claude 4.93 and it will make some version that just moves the revolting frontier elsewhere…
I think it's because it's using storytelling-like language to describe reality.
"Harry finally had control of the broom. Draco was dead in his sights. The matchup feels earned."
It's because they assume you know what precision is in regards to this comparison. Normal people don't use such words.
I do not use models who cannot answer me what happened on Tiananmen Square on June 4th, 1989.
I'm exclusively using Deepseek at this point and I really like it. It's not as good for vibe coding but I don't really do that so it works for me. I've spent only a couple bucks this month on it and I really like how it fits into my workflow. I have zero usage anxiety unlike when I was using subscription plans.
Flagged for low quality.
I'm not surprised that GPT-5.5 Pro is less precise. I find that companies such as OpenAI have a profit motive that is evident in their models. This profit motive de-incentivizes precision because they can charge more if more tokens are consumed/produced.
What engine beats the other by some 10% does not matter al that much I think. With every increasing use and reasonable quality the price and availability is all that matters
This benchmark draws a very different picture having GPT5.5 on the very top with 70% and DeepSeek at 8%
https://deepswe.datacurve.ai
DeepSWE has been heavily criticized though. https://github.com/datacurve-ai/deep-swe/issues/21 Putting GPT 5.5 on top is the obviously correct part, but everything else about it makes very little sense.
Yes Deepseek V4 is as good or better than western sota models in my experience for practical coding given an appropriate harness. cost per solution is certainly cheaper.
Interesting. Can you elaborate on which harness you've tried it with? I'd love to switch to deepseek for my personal use.
Also, which SOTA western models are you comparing it with? Just to give more flavor.
My personal observation (using a mix of opencode and pi harness):
1. DS4Pro: around opus 4.5
2. DS4Flash: around sonnet 4
3. Mimo v2.5 pro: between opus 4.5 and opus 4.6.
4. minimax M3: around opus 4.6
All of these are very close in terms of quality and pricing. For anything that is not specifically related to coding, DS4Flash has become ny de-factor model. It just works... super fast, tool calling is perfect, and the price is unbeatable. Caching is out of the world. Im now regularly hitting 90%+.
i have been using deepseek-v4-flash since it came out. i use a highly structured harness and spec/test driven workflow running through opencode, and so far there has been nothing it can't do.
i have run through a bunch of tests: re-writing vvenc with assembly kernels, creating the first generation agent harness integration with opencode, porting TS npm modules to C++, porting an entire TS server app to C++, creating a new pure io_uring http server with zero-copy (325K RPS single core), creating a second generation agent from the ground up in C++, setting up a dev environment for custom kernel development on tenstorrent accelerators using tt-metal and ttsim.
i consistently get 98.5% input cache hit ratio. i do see noticeable degradation in performance in the 400-500K context range, so i always try to wrap up sessions by 500K max.
a non-intuitive thing is that the model is very good at low-level systems engineering. i suspect this is because they are internally using it to port their stack to huawei hardware. it can churn out exceptionally complex low level C++ stuff that blows your mind, and then completely choke and run in circles on other seemingly simple tasks.
i only use flash and not pro because i want my tooling to be portable to open weights models that are practical to run. i use deepseek platform and not the open weights models for development, because it is subsidized, and based on observation, i think it is highly likely that they are running some proprietary features on the platform which are not in the open weights model.
it will be very interesting to see what their next point release looks like. the compounding effect of optimizing inference cost and then feeding back inference into training should lead to rapid and accelerating improvement, but only time will tell.
Thanks for the details. What's a second generation agent?
You mentioned the workflow is heavy on specs and tests. The smaller models seem to be really good at following instructions now. (Well, some of them!)
So that's probably part of why you're seeing good results. It has a very clear target.
Whereas with more open ended instructions they seem to struggle more. I think common sense is the main thing you get with model size.
When I'm working with the big models I feel like I don't have to spell things out so much. The gap is closing, but I'm assuming there is some fundamental limit there based on the size.
Of course the ideal would be Mythos, running for free, in my house, at 1,000 tok/s ;) Someday...
Thank you a lot for such an insightful comment. The low level stuff part, including porting entire codebases using DV4Flash came as a genuine surprise to me. I did not expected it to be this good.
When you say "i use a highly structured harness" ... can you please tell me what is it exactly?
https://github.com/opensassi/opencode
Thanks..
I always feel GPT5.5 is better at ‘getting the bigger picture‘ when I am describing something vaguely vs Chinese models. What’s your experience with that?
That's true. The open models still do not match these extreme high end models yet on very high levels of understanding.
But that's also not needed in most of the times. There will always be a "better" model... but that doesn't make other models "bad".
For my use-cases, open models are now almost on par with these top models... and it's only extremely rare that I genuinely "need" the help of top-of-the line closed models.
imo there are two major kinds of models
there are models you can speak with, that respond to what you say
and there are models that just make lists, that list everything, include weird formats and add asteriks everywhere.
deepseek, to me, will always be the latter, and i can't stand it, you can't ask it a coherent question and get a coherent response.
Why was this posted to HN? What an utter waste of time. Someone's slopwriter writes a slop article about which slopper slops the most slopulicious slop. Comments agree it's a bogus "study". We need some gate on AI-written articles. It's so weird that AI-written comments are not permitted, while the front page can be occupied by stuff like this.
Deepseek: Mao did nothing wrong!
Grok: Hitler did nothing wrong!
ChatGPT: Altman did nothing wrong!