SolidStart - Hacker News

Morromist 4 days ago ago

I was in the market for a laptop this month. Many new laptops now advertise AI features like this "HP OmniBook 5 Next Gen AI PC" which advertises:

"SNAPDRAGON X PLUS PROCESSOR - Achieve more everyday with responsive performance for seamless multitasking with AI tools that enhance productivity and connectivity while providing long battery life"

I don't want this garbage on my laptop, especially when its running of its battery! Running AI on your laptop is like playing Starcraft Remastered on the Xbox or Factorio on your steamdeck. I hear you can play DOOM on a pregnancy test too. Sure, you can, but its just going to be a tedious inferior experiance.

Really, this is just a fine example of how overhyped AI is right now.

[-]

Legend2440 4 days ago ago

Laptop manufacturers are too desperate to cash on the AI craze. There's nothing special about an 'AI PC'. It's just a regular PC with Windows Copilot... which is a standard Windows feature anyway.

>I don't want this garbage on my laptop, especially when its running of its battery!

The one bit of good news is it's not going to impact your battery life because it doesn't do any on-device processing. It's just calling an LLM in the cloud.

[-]

14113 3 days ago ago

That's not quite correct. Snapdragon chips that are advertised as being good for "AI" also come with the Hexagon DSP, which is now used for (or targeted at) AI applications. It's essentially a separate vector processor with large vector sizes.

marcus_holmes 4 days ago ago

Doesn't this lead to a lot of tension between the hardware makers and Microsoft?

MS wants everyone to run Copilot on their shiny new data centre, so they can collect the data on the way.

Laptop manufacturers are making laptops that can run an LLM locally, but there's no point in that unless there's a local LLM to run (and Windows won't have that because Copilot). Are they going to be pre-installing Llama on new laptops?

Are we going to see a new power user / normal user split? Where power users buy laptops with LLMs installed, that can run them, and normal folks buy something that can call Copilot?

Any ideas?

[-]

zdragnar 4 days ago ago

It isn't just copilot that these laptops come with; manufacturers are already putting their own AI chat apps as well.

For example, the LG gram I recently got came with just such an app named Chat, though the "ai button" on the keyboard (really just right alt or control, I forget which) defaults to copilot.

If there's any tension at all, it's just who gets to be the default app for the "ai button" on the keyboard that I assume almost nobody actually uses.

[-]

marcus_holmes 4 days ago ago

Interesting. Yeah, that'll be the argument

autoexec 4 days ago ago

> MS wants everyone to run Copilot on their shiny new data centre, so they can collect the data on the way.

MS doesn't care where your data is, they're happy to go digging through your C drive to collect/mine whatever they want, assuming you can avoid all the dark patterns they use to push you to save everything on OneDrive anyway and they'll record all your interactions with any other AI using Recall

[-]

marcus_holmes 4 days ago ago

I had assumed that they needed the usage to justify the investment in the data centre, but you could be right and they don't care.

astrange 2 days ago ago

MS doesn't want your data in the first place. Nobody cares about or wants your data. You are not special.

eterm 3 days ago ago

It's just marketing. The laptop makers will market it as if your laptop power makes a difference knowing full well that it's offloaded to the cloud.

For a slightly more charitable perspective, agentic AI means that there is still a bunch of stuff happening on the local machine, it's just not the inference itself.

wmf 3 days ago ago

Copilot is a local LLM (well SLM). https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica

zamadatix 4 days ago ago

> It's just a regular PC with Windows Copilot... which is a standard Windows feature anyway.

"AI PC" branded devices get "Copilot+" and additional crap that comes with that due to the NPU. Despite desktops having GPUs with up to 50x more TOPs than the requirement, they don't get all that for some reason https://www.thurrott.com/mobile/copilot-pc/323616/microsoft-...

[-]

robocat 3 days ago ago

Is Microsoft trying to help NPU chip makers?

When is Wintel going to finally happen?

Microsoft has roughly $102 billion in cash (+ short-term investments). Intel’s market value is approximately $176 billion.

I've never really understood why Microsoft helped Intel's bottom line over decades.

With Azure, Microsoft has even more reason to buy Intel.

eleventyseven 3 days ago ago

There's nothing special with what Intel has lowered the bar as an AI PC so vendors can market it. Ollama can run a 4b model plenty fine on Tiger Lake with 8gb classic RAM.

But unified memory IS truly what makes an AI ready PC. The Apple Silicon proves that. People are willing to pay the premium, and I suspect unified memory will still be around and bringing us benefits even if no one cares about LLMs in 5 years.

bitwize 4 days ago ago

AI PCs also have NPUs which I guess provide accelerated matmuls, albeit less accelerated than a good discrete GPU.

autoexec 4 days ago ago

Even collecting and sending all that data to the cloud is going to drain battery life. I'd really rather my devices only do what I ask them to than have AI running the background all the time trying to be helpful or just silently collecting data.

[-]

Legend2440 4 days ago ago

Copilot is just ChatGPT as an app.

If you don't use it, it will have no impact on your device. And it's not sending your data to the cloud except for anything you paste into it.

[-]

dijit 3 days ago ago

So, the new AI features like recall don’t exist?

Windows is going more and more into AI and embedding it into the core of the OS as much as it can. It’s not “an app”, even if that was true now it wouldn't be true for very long. The strategy is well communicated.

sandworm101 4 days ago ago

>> I'd really rather my devices only do what I ask them to

Linux hears your cry. You have a choice. Make it.

[-]

benbristow 3 days ago ago

Unfortunately still loads of hurdles for most people.

AAA Games with anti-cheat that don't support Linux.

Video editing (DaVinci Resolve exists but is a pain to get up and running on many distros, KDenLive/OpenShot don't really cut it for most)

Adobe Suite (Photoshop/Lightroom specifically, and Premiere for Video Editing) - would like to see Affinity support Linux but hasn't happened so far. GIMP and DarkTable aren't really substitutions unless you pour a lot of time into them.

Tried moving to Linux on my laptop this past month, made it a month before a reinstall of Windows 11. Had issues with WiFi chip (managed to fix but had to edit config files deep in the system, not ideal), Fedora with LUKS encryption after a kernel update the keyboard wouldn't work to input the encryption key, no Windows Hello-like support (face ID). Had the most success with EndeavourOS but running Arch is a chore for most.

It's getting there, best it's ever been, but there's still hurdles.

[-]

cultofmetatron 3 days ago ago

> AAA Games with anti-cheat that don't support Linux.

I really don't understand people that want to play games so badly that they are willing to install a literal rootkit on their devices. I can understand if you're a pro gamer but it feels stupid to do it otherwise.

[-]

3 days ago ago

[deleted]

benbristow 3 days ago ago

Most of the time they're not really informed that they are. I know Valorant does (Riot Games), one I've avoided in the past because of it.

But a lot of the time it's peer-pressure for wanting to play with friends who couldn't care less.

cmxch 3 days ago ago

Riot Vanguard is a popular rootkit.

grayhatter 3 days ago ago

According to my friends, Arc Raders works well on linux. So it's very much, just a small selection of AAA games, so they can run anti-cheat, that probably doesn't even work. Can you name a triple a you want to play, that proton says is incompatible?

Gimp isn't a solution, sure but it works for what I need. Darktable does way more than I've ever wanted, so I can forgive it for the one time it crashed. Inkscape and blender both exceed my needs as well.

And Adobe is so user hostile, that I feel I need to call you a mean name to prove how I feel.... dummy!

Yes, I already feel bad, and I'm sorry. But trolling aside, listing applications that treat users like shit, aren't reasons to stay on the platform that also treats you like shit.

I get it, sometimes, being treated like shit is worth it because it's easier now that you're used to being disrespected. But an aversion to the effort it'd take for you to climb the learning curve of something different, isn't valid reason to help the disrespectful trash companies making the world worse, recruit more people for them to treat like trash.

Just because you use it, doesn't make it worth recommending.

[-]

benbristow 3 days ago ago

I don't really PC game anymore, use my Xbox or a few older games my laptop's iGPU can handle, not at the moment anyway. Battlefield 6 is a big one recently that if I had a gaming PC set-up I'd probably want to play.

I know Adobe are... c-words, but their software is industry standard for a reason.

[-]

grayhatter 3 days ago ago

> Battlefield 6 is a big one recently that if I had a gaming PC set-up I'd probably want to play.

We definitely play very different games, I wouldn't touch it if you paid me. So I'm sure we both have a bit of sample bias in our expected rates of linux compatibility. Especially since EA is another company like Adobe. Also, the internet seems to think they have a cheating problem. I wonder how bad it really is, and if it's worth the cost of the anti-cheat.

They're industry standard because they were first. Not necessarily because they were better. They do have a feature set that's near impossible to beat, not even I can pretend like they don't. I'm just saying, respect and fairness is more important to me, than content aware fill ever will be.

Also, doesn't the Adobe suite work on Linux?

[-]

benbristow 3 days ago ago

I think older versions do, like CS6 through WINE.

Photoshop CC 2024 apparently works somewhat, but no GPU support and the removal tool doesn't work apparently.

https://appdb.winehq.org/objectManager.php?sClass=version&iI...

Basically, no.

sixothree 3 days ago ago

Part of me is starting to think Valve is going to be the best thing to happen to Linux (in this regard) since Ubuntu.

neves 3 days ago ago

I have a Snapdragon laptop and it is the best I've ever had. But the NPU is really almost useless.

This is a nice companion to the article: https://www.pcworld.com/article/2965927/the-great-npu-failur...

[-]

dijit 3 days ago ago

Agreed, I have the ARM based T14s for work.

The thing is nowhere near the performance as a macbook, but its silent and the battery lasts ages, which is a far cry from the same laptop with an Intel CPU, which is what many are running.

Company removes a lot of the AI bloat though.

dpedu 3 days ago ago

> Running AI on your laptop is like playing Starcraft Remastered on the Xbox

A great analogy because there is Starcraft for a console - Nintendo 64 - and it is quite awkward. Split-screen multiplayer included.

layer8 3 days ago ago

It’s true that the AI marketing is largely nonsense, but the NPUs also don’t hurt, and you don’t have to make use of them.

pluralmonad 3 days ago ago

Factorio runs really well on the deck though...

But yeah, fresh install of OS is a must for any new computer.

jwr 3 days ago ago

The author seems unaware of how well recent Apple laptops run LLMs. This is puzzling and puts into question the validity of anything in this article.

[-]

gcanyon 3 days ago ago

If Apple offered a reasonably-priced laptop with more than 24gb of memory (I'm writing this on a maxed-out Air) I'd agree. I've been buying Apple laptops for a long time, and buying the maximum memory every time. I just checked, and I see that now you can get 32gb. But to get 64gb I think you have to spend $3700 for the MBMax, and 128gb starts at $4500, almost 3x the 32gb Air's price.

And as far as I understand it, an Air with an M3 is perfectly capable of running larger models (albeit slower) if it had the memory.

[-]

mft_ 3 days ago ago

You’re not wrong that Apple’s memory prices are unpleasant, but also consider the competition - in this context (running LLMs locally) laptops with large amounts of fast memory that can be purposed for the GPU. This limits you to Apple or one specific AMD processor at present.

An HP Zbook with an AMD 395+ and 128Gb of memory apparently lists for $4049 [0]

An ASUS ROG Flow z13 with the same spec sells for $2799 [1] - so cheaper than Apple, but still a high price for a laptop.

[0] https://hothardware.com/reviews/hp-zbook-ultra-g1a-128gb-rev...

[1] https://www.hidevolution.com/asus-rog-flow-z13-gz302ea-xs99-...

[-]

gcanyon 2 days ago ago

Yeah, I'm by no means saying that Apple is uniquely bad here -- it's just an issue I've been frustrated by since the first M1 chip, long before local LLMs made it a serious issue. More memory is always a good idea, and too much is never enough.

subscribed 2 days ago ago

You can get any low spec laptop that has no soldered DIMMs and just replace them with the maximum supported capacity.

You don't necessarily need to go the maxed up SKU.

[-]

eddyzh 2 days ago ago

Would that be unified memory? Where the gpu and cpu can share the memory? Which is key for performance.

[-]

subscribed a day ago ago

Right, no, it wouldn't, I appreciate that in this particular context my comment was entirely wrong.

Thanks for helping me see it!

mft_ a day ago ago

No, it wouldn’t. You’d be limited to using the CPU and the lower bandwidth system memory.

2 days ago ago

[deleted]

dehugger 2 days ago ago

The framework desktop will get you the 395+ and 128gb of ram for 2k USD.

jdprgm 3 days ago ago

The trick here is buying used. Especially for something like the m1 series there is tremendous value to be had on high memory models where the memory hasn't changed significantly over generations compared the cpus and even m1's are quite competent for many workloads. Got a m1 max 64gb ram recently for I think $1400.

jwr 3 days ago ago

I think pricing is just one dimension of this discussion — but let's dive into it. I agree it's a lot of money. But what are you comparing this pricing to?

From what I understand, getting a non-Apple solution to the problem of running LLMs in 64GB of VRAM or more has a price tag that is at least double of what you mentioned, and likely has another digit in front if you want to get to 128GB?

fnord77 3 days ago ago

it's astonishing how apple gouges on the memory and ssd upgrade prices (I'm on an M1 w/ 64Gb/4Tb).

That said they have some elasticity when it comes to the DRAM shortage.

[-]

aurareturn 2 days ago ago

They gouge you on RAM and SSD but provide a far better overall machine for the price than Windows laptops.

fancyfredbot 3 days ago ago

I think the author is aware of Apple silicon. The article mentions the fact Apple has unified memory and that this is advantageous for running LLMs.

[-]

dangus 3 days ago ago

Then idk why they say that most laptops are bad at running LLMs, Apple has a huge marketshare in the laptop market and even their cheapest laptops are capable in that realm. And their PC competitors are more likely to be generously specced out in terms of included memory.

> However, for the average laptop that’s over a year old, the number of useful AI models you can run locally on your PC is close to zero.

This straight up isn’t true.

[-]

literalAardvark 3 days ago ago

Apple has a 10-18% market share for laptops. That's significant but it certainly isn't "most".

Most laptops can run at best a 7-14b model, even if you buy one with a high spec graphics chip. These are not useful models unless you're writing spam.

Most desktops have a decent amount of system memory but that can't be used for running LLMs at a useful speed, especially since the stuff you could run in 32-64GB RAM would need lots of interaction and hand holding.

And that's for the easy part, inference. Training is much more expensive.

[-]

nunodonato 3 days ago ago

my laptop is 4 years old. I only have 6Gb VRam. I run, mostly, 4b and 8b models. They are extremely useful in a variety of situations. Just because you can't replicate what you do in chatgpt doesn't mean they don't have their use cases. It seems to me you know very little about what these models can do. Not to speak of trained models for specific use cases, or even smaller models like functiongemma or TTS/ASR models. (btw, I've trained models using my 6Gb VRAM too)

[-]

reactordev 3 days ago ago

I’ll chime in and say I run LM Studio on my 2021 MacBook Pro M1 with no issues.

I have 16GB ram. I use unsloth quantized models like qwen3 and gpt-oss. I have some MCP servers like Context7 and Fetch that make sure the models have up to date information. I use continue.dev in VSCode or OpenCode Agent with LM Studio and write C++ code against Vulkan.

It’s more than capable. Is it fast? Not necessarily. Does it get stuck? Sometimes. Does it keep getting better? With every model release on huggingface.

Total monthly cost: $0

literalAardvark 3 days ago ago

A few examples of useful tasks would be appreciated. I do suffer from a sad lack of imagination.

[-]

nunodonato 3 days ago ago

I suggest taking a look at /r/localLLaMa and see all sorts of cool things people do with small models.

seanmcdirmid 3 days ago ago

A Max cpu can run 30b models quantized, and definitely has the RAM to fit them in memory. The normal and pro CPUs will be compute/bandwidth limited. Of course, the Ultra CPU is even better than the Max, but they don't come in laptops yet.

andai 3 days ago ago

So I'm hearing a lot of people running LLMs on Apple hardware. But is there actually anything useful you can run? Does it run at a usable speed? And is it worth the cost? Because the last time I checked the answer to all three questions appeared to be no.

Though maybe it depends on what you're doing? (Although if you're doing something simple like embeddings, then you don't need the Apple hardware in the first place.)

[-]

anonzzzies 3 days ago ago

I was sitting in an airplane next to a guy on a MacBook pro something who was coding in cursor with a local llm. We got talking and he said there are obviously differences but for his style of 'English coding' (he described basically what code to write/files to change but in english, but more sloppy than code obviously otherwise he would just code) it works really well. And indeed that's what he could demo. The model (which was the OSS gpt i believe) did pretty well in his nextjs project and fast too.

[-]

andai 3 days ago ago

Thanks. I call this method Power Coding (like Power Armor), where you're still doing everything except for typing out the syntax.

I found that for this method the smaller the model, the better it works, because smaller models can generally handle it, and you benefit more from iteration speed than anything else.

I don't have hardware to run even tiny LLMs at anything approaching interactive speeds, so I use APIs. The one I ended up with was Grok 4 Fast, because it's weirdly fast.

ArtificialAnalysis has a section "end to end" time, and it was the best there for a long time, tho many other models are catching up now.

jwr 3 days ago ago

The speed is fine, the models are not.

I found only one great application of local LLMs: spam filtering. I wrote a "despammer" tool that accesses my mail server using IMAP, reads new messages, and uses an LLM to determine if they are spam or not. 95.6% correct classification rate on my (very difficult) test corpus, in practical usage it's nearly perfect. gpt-oss-20b is currently the best model for this.

For all other purposes models with <80B parameters are just too stupid to do anything useful for me. I write in Clojure and there is no boilerplate: the code reflects real business problems, so I need an LLM that is capable of understanding things. Claude Code, especially with Opus, does pretty well on simpler problems, all local models are just plain dumb and a waste of time compared to that, so I don't see the appeal yet.

That said, my next laptop will be a MacBook pro with M5 Max and 128GB of RAM, because the small LLMs are slowly getting better.

sueders101 3 days ago ago

I've tried out gpt-oss:20b on a MacBook Air (via Ollama) with 24GB of RAM. In my experience it's output is comparable to what you'd get out of older models and the openAI benchmarks seem accurate https://openai.com/index/introducing-gpt-oss/ . Definitely a usable speed. Not instant, but ~5 tokens per second of output if I had to guess.

fhsm 3 days ago ago

This paper shows a use case running on Apple silicon that’s theoretically valuable:

https://pmc.ncbi.nlm.nih.gov/articles/PMC12067846/

Who cares if result is right / wrong etc as it will all be different in a year … just interesting to see a test of desktop class hardware go ok.

seanmcdirmid 3 days ago ago

I have an MBP Max M3 with 64GB of RAM, and I can run a lot at useful speed (LLMs run fine, diffusion image models run OK although not as fast as they would on a 3090). My laptop isn't typical though, it isn't a standard MBP with a normal or pro processor.

jki275 3 days ago ago

I can definitely write code with a local model like Devstral small or a quantized granite, or a quantized deep-seek on an M1 Max w/ 64gb of ram.

DANmode 3 days ago ago

Of course it depends what you’re doing.

Do you work offline often?

Essential.

fancyfredbot 3 days ago ago

Most laptops have 16GB of RAM or less. A little more than a year ago I think the base model Mac laptop had 8GB of RAM which really isn't fantastic for running LLMs.

layer8 3 days ago ago

By “PC”, they mean non-Apple devices.

Also, macOS only has around 10% desktop market share globally.

[-]

dangus 3 days ago ago

It's actually closer to 20% globally. Apple now outsells Lenovo:

https://www.mactech.com/2025/03/18/the-mac-now-has-14-8-of-t...

[-]

layer8 2 days ago ago

I meant market share in terms of installed base: https://gs.statcounter.com/os-market-share/desktop/worldwide...

[-]

dangus 2 days ago ago

macOS and OS X are split on this graph, and “Unknown” could be anything? This might actually show Apple install base close to 20%.

DANmode 3 days ago ago

> Apple has a huge marketshare in the laptop market

Hello, from outside of California!

[-]

dangus 3 days ago ago

Global Mac marketshare is actually higher than the US: https://www.mactech.com/2025/03/18/the-mac-now-has-14-8-of-t...

[-]

DANmode 3 days ago ago

Less than 1 in 5 doesn’t feel like huge market share,

but it’s more than I have!

[-]

3 days ago ago

[deleted]

dangus 2 days ago ago

Apple outsells Lenovo, if that puts it in a different perspective.

whazor 3 days ago ago

But economically, it is still much better to buy a lower spec't laptop and to pay a monthly subscription for AI.

However, I agree with the article that people will run big LLMs on their laptop N years down the line. Especially if hardware outgrows best-in-class LLM model requirements. If a phone could run a 512GB LLM model fast, you would want it.

[-]

m4rtink 3 days ago ago

Are you sure the subscription will still be affordable after the venture capital flood ends and the dumping stops?

[-]

nl 3 days ago ago

100% yes.

The amount of compute in the world is doubling over 2 years because of the ongoing investment in AI (!!)

In some scenario where new investment stops flowing and some AI companies go bankrupt all that compute will be looking for a market.

Inference providers are already profitable so with cheaper hardware it will mean even cheaper AI systems.

[-]

AyyEye 3 days ago ago

You should probably disclose that you're a CTO at an AI startup, I had to click your bio to see that.

> The amount of compute in the world is doubling over 2 years because of the ongoing investment in AI (!!)

All going into the hands of a small group of people that will soon need to pay the piper.

That said, VC backed tech companies almost universally pull the rug once the money stops coming in. And historically those didn't have the trillions of dollars in future obligations that the current compute hardware oligopoly has. I can't see any universe where they don't start charging more, especially now that they've begun to make computers unaffordable for normal people.

And even past the bottom dollar cost, AI provides so many fun, new, unique ways for them to rug pull users. Maybe they start forcing users to smaller/quantized models. Maybe they start giving even the paying users ads. Maybe they start inserting propaganda/ads directly into the training data to make it more subtle. Maybe they just switch out models randomly or based on instantaneous hardware demand, giving users something even more unstable than LLMs already are. Maybe they'll charge based on semantic context (I see you're asking for help with your 2015 Ford Focus. Please subscribe to our 'Mechanic+' plan for $5/month or $25 for 24 hours). Maybe they charge more for API access. Maybe they'll charge to not train on your interactions.

I'll pass, thanks.

[-]

nl 3 days ago ago

I'm not longer CTO at an AI startup. Updated, but don't actually see how that is relevant.

> All going into the hands of a small group of people that will soon need to pay the piper.

It's not very small! On the inference side there are many competitive providers as well as the option of hiring GPU servers yourself.

> And historically those didn't have the trillions of dollars in future obligations that the current compute hardware oligopoly has. I can't see any universe where they don't start charging more, especially now that they've begun to make computers unaffordable for normal people.

I can't say how strongly I disagree with this - it's just not how competition works, or how the current market is structured.

Take gpt-oss-120B as an example. It's not frontier level quality but it's not far off and certainly gives a strong redline that open source models will never get less intelligent than.

There is a competitive market in hosting providers, and you can see the pricing here: https://artificialanalysis.ai/models/gpt-oss-120b/providers?...

In what world is there a way in which all the providers (who are want revenue!) raise prices above the premium price Cerebas is charging for their very high speed inference?

There's already Google, profitable serving at the low-end at around half the price of Cerebas (but then you have to deal with Google billing!)

The fact that Azure/Amazon are all pricing exactly the same as 8(!) other providers as well as the same price https://www.voltagepark.com/blog/how-to-deploy-gpt-oss-on-a-... gives for running your own server shows how the economics work on NVidia hardware. There's no subsidy going on there.

This is on hardware that is already deployed. That isn't suddenly going to get more expensive unless demand increases... in which case the new hardware coming online over the next 24 months is a good investment, not a bad one!

jeremyjh 3 days ago ago

Datacenters full of GPU hosts aren't like dark fiber - they require massive ongoing expense, so the unit economics have to work really well. It is entirely possible that some overbuilt capacity will be left idle until it is obsolete.

[-]

nl 3 days ago ago

The ongoing costs are mostly power, and aren't that massive compared to the investment.

No one is leaving an H100 cluster not running because the power costs too much - this is why remnants markets like Vast.ai exist.

[-]

jeremyjh 2 days ago ago

They absolutely will leave them idle if the market is so saturated that no one will pay enough for tokens to cover power and other operational costs. Demand is elastic but will not stretch forever. The build out assumes new applications with ROI will be found, and I'm sure they will be, but those will just drive more investment. A massive over build is inevitable.

[-]

nl 2 days ago ago

Of course!

But the operational costs are much lower than some people in this thread seem to think.

You can find a safe margin for the price by looking at aggregators.

https://gpus.io/gpus/h100 is showing $1.83/hour lowest price, around $2.85 average.

That easily pays running costs - a H100 server with cooling etc is around $0.10/hour to keep running

And a massive overbuild pushes prices down not up!

oa335 3 days ago ago

> Inference providers are already profitable.

That surprises me, do you remember where you learned that?

[-]

nl 3 days ago ago

Lots of sources, and you can do the math yourself.

Here's a few good ones:

https://github.com/deepseek-ai/open-infra-index/blob/main/20... (suggests Deepseek is making 80% raw margin on inference)

https://www.snellman.net/blog/archive/2025-06-02-llms-are-ch...

https://martinalderson.com/posts/are-openai-and-anthropic-re... (there's a HN discussion of this where it was pointed out this overestimates the costs)

https://www.tensoreconomics.com/p/llm-inference-economics-fr... (long, but the TL;DR is that serving Lllama 3.3 70B costs around $0.28/million tokens input, $0.95 output at high utilization. These are close to what we see in the market: https://artificialanalysis.ai/models/llama-3-3-instruct-70b/... )

blibble 3 days ago ago

> The amount of compute in the world is doubling over 2 years because of the ongoing investment in AI (!!)

which is funded by the dumping

when the bubble pops: these DCs are turned off and left to rot, and your capacity drops by a factor of 8192

[-]

nl 3 days ago ago

> which is funded by the dumping

What dumping do you mean?

Are you implying NVidia is selling H200s below cost?

If not then you might be interested to see that Deepseek has released there inference costs here: https://github.com/deepseek-ai/open-infra-index/blob/main/20...

If they are losing money it's because they have a free app they are subsidizing, not because the API is underpriced.

TeMPOraL 3 days ago ago

Doesn't matter now. GP can revisit the math and buy some hardware once the subscription prices actually grow too high.

solatic 2 days ago ago

You have to remember that companies are kind of fungible in the sense that founders can close old companies and start new ones to walk away from bankruptcies in the old companies. When there's a bust and a lot of companies close up shop, because data centers were overbuilt, there's going to be a lot of GPUs being sold at firesale prices - imagine chips sold at $300k today being sold for $3k tomorrow to recoup a penny on the dollar. There's going to be a business model for someone buying those chips at $3k, then offering subscription prices at little more than the cost of electricity to keep the dumped GPUs running somewhere.

[-]

m4rtink 2 days ago ago

I do wonder how usable the hardware will be once the creditors are trying to sell it - as far as I can tell is seems the current trend is more and more custom no-matter-the cost super expensive power-inefficient hardware.

The situation might be a lot different than people selling ex-crypto mining GPUs to gamers. There might be a lot of effective scrap that is no longer usable when it is no longer part of a some companies technological fever dream.

anonzzzies 3 days ago ago

They will go down. Or the company will be gone.

seanmcdirmid 3 days ago ago

Running an LLM locally means you never have to worry about how many tokens you've used, and also it allows for a lot of low latency interactions on smaller models that can run quickly.

I don't see why consumer hardware won't evolve to run more LLMs locally. It is a nice goal to strive for, which consumer hardware makers have been missing for a decade now. It is definitely achievable, especially if you just care about inference.

[-]

KellyCriterion 3 days ago ago

isnt this what all these NPUs are created for?

[-]

seanmcdirmid 2 days ago ago

I haven’t seen an NPU that can compete with a GPU yet. Maybe for really small models, I’m still not sure where they are going with those.

ignoramous 3 days ago ago

> economically, it is still much better to buy a lower spec't laptop and to pay a monthly subscription for AI

Uber is economical, too; but folks prefer to own cars, sometimes multiple.

And how there's market for all kinds of vanity cars, fast sportscars, expensive supercars... I imagine PCs & Laptops will have such a market, too: In probably less than a decade, may be a £20k laptop running a 671b+ LLM locally will be the norm among pros.

[-]

subjectsigma 3 days ago ago

> Uber is economical, too

One time I took an Uber to work because my car broke down and was in the shop and the Uber driver (somewhat pointedly) made a comment that I must be really rich to commute to work via Uber because Ubers are so expensive

[-]

prmoustache 3 days ago ago

Most people don't realise the amount of money they spend per year on cars.

joshred 3 days ago ago

Paying $30-$70/day to commute is economical?

[-]

zmmmmm 3 days ago ago

if you calculate depreciation and running costs on a new car in most places - I think it probably would be.

[-]

adrianN 3 days ago ago

If Uber were cheaper than the depreciation and running costs of a car, what would be left for the driver (and Uber)?

[-]

zmmmmm 3 days ago ago

a big part of the whole "hack" of Uber in the first place is that people are using their personal vehicles. So the depreciation and many of the running costs are sunk costs already. Once you paid those already it becomes a super good deal to make money from the "free" asset you already own.

robotresearcher 3 days ago ago

My private car provides less than one commute per day, on average.

An Uber car can provide several.

__turbobrew__ 3 days ago ago

While your car in sitting in the parking lot, the uber driver is utilizing their car throughout the day.

FuckButtons 2 days ago ago

If you’re using uber to and from work, presumably you would buy a car that’s worth more than the 10 year old Prius your uber driver has 200k miles on.

cjbgkagh 3 days ago ago

The depreciation would be amortized to cover more than one person. I only travel once or twice per week, it cost me less to use an Uber than to own a car.

ignoramous 3 days ago ago

> Paying $30-$70/day to commute is economical?

When LLM use approaches this number, running one locally would be, yes. What you and other commentator seem to miss is, "Uber" is a stand-in for Cloud-based LLMs: Someone else builds and owns those servers, runs the LLMs, pays the electricity bills... while its users find it "economical" to rent it.

(btw, taxis are considered economical in parts of the world where owning cars is a luxury)

NooneAtAll3 2 days ago ago

any "it's cheaper to rent than to own" arguments can be (and must be) completely disregarded due to experience of the last decade

so stop it

azuanrb 3 days ago ago

You still need ridiculously high spec hardware, and at Apple’s prices, that isn’t cheap. Even if you can afford it (most won't), the local models you can run are still limited and they still underperform. It’s much cheaper to pay for a cloud solution and get significantly better result. In my opinion, the article is right. We need a better way to run LLMs locally.

[-]

onion2k 3 days ago ago

You still need ridiculously high spec hardware, and at Apple’s prices, that isn’t cheap.

You can easily run models like Mistral and Stable Diffusion in Ollama and Draw Things, and you can run newer models like Devstral (the MLX version) and Z Image Turbo with a little effort using LM Studio and Comfyui. It isn't as fast as using a good nVidia GPU or a cloud GPU but it's certainly good enough to play around with and learn more about it. I've written a bunch of apps that give me a browser UI talking to an API that's provided by an app running a model locally and it works perfectly well. I did that on an 8GB M1 for 18 months and then upgraded to a 24GB M4 Pro recently. I still have the M1 on my network for doing AI things in the background.

[-]

liuliu 3 days ago ago

You can run newer models like Z Image Turbo or FLUX.2 [dev] using Draw Things with no effort too.

jki275 3 days ago ago

I bought my M1 Max w/ 64gb of ram used. It's not that expensive.

Yes, the models it can run do not perform like chatgpt or claude 4.5, but they're still very useful.

[-]

mirror_neuron 2 days ago ago

I’m curious to hear more about how you get useful performance out of your local setup. How would you characterize the difference in “intelligence” of local models on your hardware vs. something like chatgpt? I imagine speed is also a factor. Curious to hear about your experiences in as much detail as you’re willing to share!

almosthere 3 days ago ago

749 for an M4 air at Amazon right now

[-]

tossandthrow 3 days ago ago

Try running anything interesting on these 8gb of ram.

You need 96gb or 128gb to do non trivial things. That is not yet 749 usd

[-]

badc0ffee 3 days ago ago

Fair enough, but they start at 16GB nowadays.

kylec 3 days ago ago

The M4 starts with 16GB, though that can also be tight for local LLMs. You can get one with 24GB for $1149 right now though, which is good value.

[-]

almosthere 2 days ago ago

899 at B&H started today 12/24

jki275 3 days ago ago

64gb is fine.

[-]

kibwen 3 days ago ago

This subthread is about the Macbook Air, which tops out at 32 GB, and can't be upgraded further.

While browsing the Apple website, it looks like the cheapest Macbook with 64 GB of RAM is the Macbook Pro M4 Max with 40-core GPU, which starts at $3,899, a.k.a. more than five times more expensive than the price quoted above.

seanmcdirmid 3 days ago ago

if you are going for 64GB, you need at least a Max CPU or you will be bandwidth/GPU limited.

whitehexagon 3 days ago ago

I was pleasantly surprised at the speed and power of my second hand M1 Pro 32GB running Asahi & Qwen3:32B. It does all I need, and I dont mind the reading pace output, although I'd be tempted by M2 Ultra if the secondhand market hadn't also exploded with the recent RAM market manipulations.

Anyway, I'm on a mission to have no subscriptions in the New Year. Plus it feels wrong to be contributing towards my own irrelevance (GAI).

dangus 3 days ago ago

Yeah, any Mac system specced with a decent amount of RAM since the M1 will run LLMs locally very well. And that’s exactly how the built-in Apple Intelligence service works: when enabled, it downloads a smallish local model. Since all Macs since the M1 have very fast memory available to the integrated GPU, they’re very good at AI.

The article kinda sucks at explaining how NPUs aren’t really even needed, they just have potential to make things more efficient in the future rather than depending on the power consumption involved with running your GPU.

terafo 3 days ago ago

This article specifically talks about PC laptops and discusses changes in them.

cmxch 3 days ago ago

Only if you want to take all the proprietary baggage and telemetry that comes with Apple platforms by default.

A Lenovo T15g with a 16gb 3080 mobile doesn’t do too badly and will run more than just Windows.

[-]

pimeys 3 days ago ago

I just got a Framework desktop with 128 GB of shared RAM just before the memory prices rocketed, and I can comfortably run many even bigger oss models locally. You can dedicate 112GB to the GPU and it runs Linux perfectly.

selinkocalar 3 days ago ago

The M-series chips really changed the game here

reactordev 3 days ago ago

This article is to sell more laptops.

seunosewa 3 days ago ago

"How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly."

What's he talking about? It's trivial to calculate that.

[-]

RobotToaster 3 days ago ago

Isn't the ability to run it more dependant on (V)RAM? With TOPS just dictating the speed at which it runs?

[-]

zozbot234 3 days ago ago

Strictly speaking, you don't need that much VRAM or even plain old RAM - just enough to store your context and model activations. It's just that as you run with less and less (V)RAM you'll start to bottleneck on things like SSD transfer bandwidth and your inference speed goes down to a crawl. But even that may or may not be an issue depending on your exact requirements: perhaps you don't need your answer instantly and can wait while it gets computed in the background. Or maybe you're running with the latest PCIe 5 storage which overall gives you comparable bandwidth to something like DDR3/DDR4 memory.

NitpickLawyer 3 days ago ago

A good rule of thumb is that PP (Prompt Processing) is compute bound while TG (Token Generation) is (V)RAM speed bound.

fny 3 days ago ago

It's also been done before...[0]

[0]: https://www.edge-ai-vision.com/2024/05/2024-edge-ai-and-visi...

cramcgrab 3 days ago ago

It’s trivial to ask an AI to answer that. Well, I guess we know it’s not an AI generated article!

swyx 3 days ago ago

> state-of-the-art models

> hundreds of millions of parameters

lol

lmao, even

mattas 3 days ago ago

See: "3D TVs are driving the biggest change in TVs in decades"

[-]

eleventyseven 3 days ago ago

A lazy easy cheap shot. But do you deny these aspects from the article are not coming? Or won't be still here in 5 years?

- Addition of more—and faster—memory.

- Consolidation of memory.

- Combination of chips on the same silicon.

All of these are also happening for non AI reasons. The move to SoC that really started with the M1 wasn't because of AI, but unified memory being the default is something we will see in 5 years. Unlike 3D TV.

[-]

technion 3 days ago ago

We just had a series of articles and sysadmin outcry that major vendors were bringing 8gb laptops back to standard models because of the ram prices. In the short term, we're seeing a reduction.

[-]

frank_nitti 3 days ago ago

In terms of demand, anecdotally-speaking I can certainly see this influencing some decisions when other circumstances permit. Many people I know are both excited for new and better games, and equally exited about running LLM/SD/etc models locally with Comfy, LM studio and the like

estimator7292 3 days ago ago

Memory is absolutely not coming in the near future. Nobody can afford it.

MisterTea 3 days ago ago

> The move to SoC that really started with the M1

No it did not. There were numerous SoC that came before it and was inevitable in this space.

[-]

robotresearcher 3 days ago ago

Which widely available laptops with comparable but prior to M1 SOCs are you thinking of?

blibble 3 days ago ago

> Addition of more—and faster—memory.

probably not after scam altman bought up half the world's supply for his shit company

heavyset_go 2 days ago ago

The move to SoC happened long before the M1, it was the state of things in the ARM space for over a decade, and most x86 laptops have been SoCs for quite some time.

ToucanLoucan 3 days ago ago

In order:

- People wanting more memory is not a novel feature. I am excited to find out how many people immediately want to disable the AI nonsense to free up memory for things they actually want to do.

- Same answer.

- I think the drive towards SOCs has been happening already. Apple's M-series utterly demolishes every PC chip apart from the absolute bleeding-edge available, includes dedicated memory and processors for ML tasks, and it's mature technology. Been there for years. To the extent PC makers are chasing this, I would say it's far more in response to that than anything to do with AI.

3 days ago ago

[deleted]

3 days ago ago

[deleted]

m4rtink 3 days ago ago

Blockchain is making money obsolete.

j45 3 days ago ago

This article is just saying more laptops will have power efficient GPUs in it. A bit better than 3D TVs.

They might not use Apple silicon often. Other options are encouraging.

[-]

3 days ago ago

[deleted]

NedF 3 days ago ago

[dead]

tengbretson 3 days ago ago

Outside of Apple laptops (and arguably the Ryzen AI MAX 390), an "AI ready" laptop is simply marketing speak for "is capable of making HTTP requests."

tracerbulletx 3 days ago ago

This mostly just shows you how far behind the M1 (which came out 5 years ago) all the non Apple laptops are.

[-]

properbrew 3 days ago ago

Was never really into Apple hardware (mainly the price), however I recently got an M1 Mac Mini and an iPhone for app development, and the inference speed for as you say, a 5 year old chip is actually crazy.

If they made the M series fully open for Linux (I know Asahi is working away) I probably would never buy another non-M series processor again.

[-]

dpedu 3 days ago ago

I got an M1 Mac Mini somewhat recently as well, to replace my ~2012 Mac Mini that I use as a media center PC. And frankly, it's overkill. Used ones can be had for $200-$300 USD, lower side with cosmetic damage. An absolute steal, IMO.

[-]

bnolsen 3 days ago ago

Work gave me an m1 pro with 32gb on it. A year ago I put together one of those minisforum board+laptop apu with 64gb ram and 2tb nvme for not much money at the time, likely 500usd. For the performance sensitive software I was working on the 7935hs ran with about 50x more throughout using compilers with llvm backend.

jeffbee 3 days ago ago

You can still get an M1 Macbook Air at retail for $599 ($300 for refurbs), which is a Chromebook price for a laptop that is better in pretty much every respect than any Chromebook.

[-]

nyarlathotep_ 3 days ago ago

https://slickdeals.net/f/19004236-select-micro-center-stores...

MicroCenter has(had? OOS near me) M4 Minis for $400!

A remarkable bargain, even more so considering the recent hardware price hikes.

heavyset_go 2 days ago ago

If you're going for refurbs, you can get a device with an AMD 7000/8000/9000 APU, at the same or lower price point, and the iGPU itself will perform better than an M1 for prompt processing and generation, even with SODIMM memory.

aappleby 4 days ago ago

I predict we will see compute-in-flash before we see cheap laptops with 128+ gigs of ram.

[-]

14113 3 days ago ago

There was a company that did compute-in-dram, which was recently acquired by Qualcomm: https://www.emergentmind.com/topics/upmem-pim-system

znpy 4 days ago ago

You could get 128gb ram laptops from the time ddr4 came around: workstation class laptops with 4 ram slots would happily take 128gb of memory.

The fact that nowadays there are little to no laptops with 4 ran slots is entirely artificial.

[-]

mhitza 3 days ago ago

I was mussing this summer if I should get a refurbed Thinkpad P16 with 96GB of RAM to run VMs purely in memory. Now that 96GB of ram cost as much as a second P16.

[-]

znpy 3 days ago ago

I feel you, so much. I was thinking of getting a second 64gb node for my homelab and i thought i’d save those money… now the ram alone cost as much as the node, and I’m crying.

Lesson learned: you should always listen to that voice inside your head that say: “but i need it…” lol

[-]

pluralmonad 3 days ago ago

I rebuilt a workstation after a failed motherboard a year ago. I was not very excited about being forced to replace it on a days notice and cheaped out on the RAM (only got 32GB). This is like the third or fourth time I've taught myself the lesson to not pinch pennies when buying equipment/infrastructure assets. It's the second time the lesson was about RAM, so clearly I'm a slow learner.

zamadatix 4 days ago ago

I can't tell if this is optimism for compute-in-flash or pessimism with how RAM has been going lately!

ajb 3 days ago ago

The thing that is supposed to happen next is high-bandwidth flash. In theory, it could allow laptops to run the larger models without being extortionately costly, by loading directly from flash into the GPU (not by executing in flash) But I haven't seen figures of the actual bandwidth yet, and no doubt to start with it will be expensive. The underlying technology of flash has much higher read latency than dram, so it's not really clear (to me, at least) if they can deliver the speeds needed to remove the need to cache in VRAM just by increasing parallelism.

p1esk 4 days ago ago

We’ve had “compute in flash” for a few years now: https://mythic.ai/product/

wkat4242 4 days ago ago

Yeah especially since what is happening in the memory market

[-]

noosphr 4 days ago ago

Feast and famine.

In three years we will be swimming in more ram than we know what to do with.

[-]

fallat 4 days ago ago

Kind of feel that's already the case today... 4GB I find is still plenty for even business workloads.

[-]

autoexec 4 days ago ago

Video games have driven the need for hardware more than office work. Sadly games are already being scaled back and more time is being spent on optimization instead of content since consumers can't be expected to have the kind of RAM available they normally would and everyone will be forced to make do with whatever RAM they have for a long time.

znpy 4 days ago ago

That might not be the case. The kind of memory that will flood the second-hand market could not be the kind of memory we can stuff in laptops or even desktop systems.

[-]

3 days ago ago

[deleted]

aitchnyu 4 days ago ago

Memristors are (IME) missing from the news. They promised to act as both persistent storage and fast RAM.

[-]

ACCount37 3 days ago ago

If only memristors weren't vaporware that has "shown promise" for 3 decades now and went nowhere.

3 days ago ago

[deleted]

112233 3 days ago ago

By "we" do you mean consumers? No, "we" will get neither. This is unexpected, irresistable opportunity to create a new class, by controlling the technology that people are required and are desiring to use (large genAI) with a comprehensive moat — financial, legislative and technological. Why make affordable devices that enable at least partial autonomy? Of course the focus will be on better remote operation (networking, on-device secure computation, advancing narrative that equates local computation with extremism and sociopathy).

[-]

cmxch 3 days ago ago

Push Washington to grill the foundries and their customers. Repeat until prices drop.

Groxx 3 days ago ago

re NPUs: they've been a marketing thing for years now, but I really have no idea how many of them are actually used when you run [whatever]. particularly after a year or two of software updates.

anyone have numbers? are they just an added expense that is supported for first party stuff for 6 months before they need a bigger model, or do they have staying power? clearly they are capable of being used to save power, but does anything do that in practice, in consumer hardware?

socketcluster 4 days ago ago

I feel like there's no point to get a graphics card nowadays. Clearly, graphics cards are optimized for graphics; they just happened to be good for AI but based on the increased significance of AI, I'd be surprised if we don't get more specialized chips and specialized machines just for LLMs. One for LLMs, a different one for stable diffusion.

With graphics processing, you need a lot of bandwidth to get stuff in and out of the graphics card for rendering on a high-resolution screen, lots of pixels, lots of refreshes, lots of bandwidth... With LLMs, a relatively small amount of text goes in and a relatively small amount of text comes out over a reasonably long amount of time. The amount of internal processing is huge relative to the size of input and output. I think NVIDIA and a few other companies already started going down that route.

But probably graphics cards will still be useful for stable diffusion; especially AI-generated videos as the inputs and output bandwidth is much higher.

[-]

ACCount37 3 days ago ago

Nah, that's just plain wrong.

First, GPGPU is powerful and flexible. You can make an "AI-specific accelerator", but it wouldn't be much simpler or much more power-efficient - while being a lot less flexible. And since you need to run traditional graphics and AI workloads both in consumer hardware? It makes sense to run both on the same hardware.

And bandwidth? GPUs are notorious for not being bandwidth starved. 4K@60FPS seems like a lot of data to push in or out, but it's nothing compared to how fast modern PCIe 5.0 x16 goes. AI accelerators are more of the same.

[-]

djsjajah 3 days ago ago

GPUs might not be bandwidth starved most of the time, but they absolutely are when generating text from an llm. It’s the whole reason why low precision floating point numbers are being pushed by nvidia.

[-]

ACCount37 3 days ago ago

That's memory bandwidth, not I/O. Unless your LLM doesn't fit into VRAM.

Legend2440 4 days ago ago

LLMs are enormously bandwidth hungry. You have to shuffle your 800GB neural network in and out of memory for every token, which can take more time/energy than actually doing the matrix multiplies. GPUs are almost not high bandwidth enough.

[-]

socketcluster 4 days ago ago

But even so, for a single user, the output rate for a very fast LLM would be like 100 tokens per second. With graphics, we're talking like 2 million pixels, 60 times a second; 120 million pixels per second for a standard high res screen. Big difference between 100 tokens vs 120 million pixels.

24 bit pixels gives 16 million possible colors... For tokens, it's probably enough to represent every word of the entire vocabulary of every major national language on earth combined.

> You have to shuffle your 800GB neural network in and out of memory

Do you really though? That seems more like a constraint imposed by graphics cards. A specialized AI chip would just keep the weights and all parameters in memory/hardware right where they are and update them in-situ. It seems a lot more efficient.

I think that it's because graphics cards have such high bandwidth that people decided to use this approach but it seems suboptimal.

But if we want to be optimal; then ideally, only the inputs and outputs would need to move in and out of the chip. This shuffling should be seen as an inefficiency; a tradeoff to get a certain kind of flexibility in the software stack... But you waste a huge amount of CPU cycles moving data between RAM, CPU cache and Graphics card memory.

[-]

djsjajah 3 days ago ago

> Do you really though?

Yes.

It stays in on the hbm but it need to get shuffled to the place where it can actually do the computation. It’s a lot like a normal cpu. The cpu can’t do anything with data in the system memory, it has to be loaded into a cpu register. For every token that is generated, a dense llm has to read every parameter in the model.

visarga 3 days ago ago

If we did that it would be much more expensive, keeping all weights in SRAM is done by Groq for example.

Zambyte 4 days ago ago

This doesn't seem right. Where is it shuffling to and from? My drives aren't fast enough to load the model every token that fast, and I don't have enough system memory to unload models to.

[-]

Legend2440 4 days ago ago

From VRAM to the tensor cores and back. On a modern GPU you can have 1-2tb moving around inside the GPU every second.

This is why they use high bandwidth memory for VRAM.

[-]

Zambyte 3 days ago ago

This makes sense now, thanks!

zamadatix 4 days ago ago

If you're using a MoE model like DeepSeek V3 the full model is 671 GB but only 37 GB are active per token, so it's more like running a 37 GB model from the memory bandwidth perspective. If you do a quant of that it could e.g. be more like 18 GB.

smallerize 4 days ago ago

You're probably not using an 800GB model.

p1esk 4 days ago ago

It is right. The shuffling is from CPU memory to GPU memory, and from GPU memory to GPU. If you don’t have enough memory you can’t run the model.

[-]

Zambyte 3 days ago ago

How can I observe it being loaded into CPU memory? When I run a 20gb model with ollama, htop reports 3gb of total RAM usage.

[-]

zamadatix 3 days ago ago

Think of it like loading a moving truck where:

- The house is the disk

- You are the RAM

- The truck is the VRAM

There won't be a single time you can observe yourself carrying the weight of everything being moved out of the house because that's not what's happening. Instead you can observe yourself taking many tiny loads until everything is finally moved, at which point you yourself should not be loaded as a result of carrying things from the house anymore (but you may be loaded for whatever else you're doing).

Viewing active memory bandwidth can be more complicated than it'd seem to set up, so the easier way is to just view your VRAM usage as you load in the model freshly into the card. The "nvtop" utility can do this for most any GPU on Linux, as well as other stats you might care about as you watch LLMs run.

[-]

Zambyte 3 days ago ago

My confusion was on the shuffling process happening per token. If this was happening per token, it would be effectively the same as loading the model from disk every token.

[-]

a day ago ago

[deleted]

p1esk 3 days ago ago

The model might get loaded on every token - from GPU memory to GPU. This depends on how much of it is cached on GPU. Inputs to every layer must be loaded as well. Also, if your model doesn’t fit in GPU memory but fits in CPU memory, and you’re doing GPU offloading, then you’re also shuffling between CPU and GPU memory.

p1esk 3 days ago ago

Depends on map_location arg in torch.load: might be loaded straight to GPU memory

zamadatix 4 days ago ago

> Clearly, graphics cards are optimized for graphics; they just happened to be good for AI

I feel like the reverse has been true since after the Pascal era.

autoexec 4 days ago ago

I don't doubt that there will be specialized chips that make AI easier, but they'll be more expensive than the graphics cards sold to consumers which means that a lot of companies will just go with graphics cards, either because the extra speed of specialized chips won't be worth the cost, or will they'll be flat out too expensive and priced for the small number of massive spenders who'll shell out insane amounts of money for any/every advantage (whatever they think that means) they can get over everyone else.

spullara 4 days ago ago

I'm running GPT-OSS 120B on a MacBook Pro M3 Max w/128 GB. It is pretty good, not great, but better than nothing when the wifi on the plane basically doesn't work.

juancn 3 days ago ago

The price of RAM is going to throw a wrench at that

seanmcdirmid 4 days ago ago

I’ve been running LLMs on my laptop (M3 Max 64GB) for a year now and I think they are ready, especially with how good mid sized models are getting. I’m pretty sure unified memory and energy efficient GPUs will be more than just a thing on Apple laptops in the next few years.

[-]

noman-land 3 days ago ago

You doing code completion and agentic stuff successfully with local models? Got any tips? I've been out of the game for [checks watch] a few months and am behind on the latest. Is Cline the move?

[-]

seanmcdirmid 3 days ago ago

I haven't bothered doing code completion locally yet, though its something I want to try with the QWEN model. I'm mostly using it to generate/fix code CLI style.

[-]

noman-land 2 days ago ago

I had some pretty decent but very non-state-of-the-art success with it even cobbled together with LM Studio and VSCode plugins. I'm excited to keep trying it over the next months and years.

allovertheworld 4 days ago ago

Only because of Apples unified memory architecture. The groundwork is there, we just need memory to be cheaper so we can fit 512+GB now ;)

[-]

seanmcdirmid 4 days ago ago

Memory prices will rise short term and generally fall long term, even with the current supply hiccup the answer is to just build out more capacity (which will happen if there is healthy competition). I meant, I expect the other mobile chip providers to adopt unified architecture and beefy GPU cores on chip and lots of bandwidth to connect it to memory (at the max or ultra level, at least), I think AMD is already doing UM at least?

[-]

spwa4 3 days ago ago

> Memory prices will rise short term and generally fall long term, even with the current supply hiccup the answer is to just build out more capacity (which will happen if there is healthy competition)

Don't worry! Sam Altman is on it. Making sure there never is healthy competition that is.

https://www.mooreslawisdead.com/post/sam-altman-s-dirty-dram...

[-]

seanmcdirmid 3 days ago ago

We’ve been through multiple cycles of scarcity/surplus DRAM cycles in the last couple of decades. Why do we think it will be different now?

[-]

re-thc 3 days ago ago

> Why do we think it will be different now?

Margins. AI usage can pay a lot more. Even if they sell less than can still be more profitable.

In the past there wasn’t a high margin usage. Servers didn’t charge such a high premium.

[-]

seanmcdirmid 3 days ago ago

Do you not think that some DRAM producer isn't going to see the high margins as a signal to create more capacity to get ahead of the other DRAM producers? This is how it always has worked before, but somehow it is different this time?

[-]

re-thc 3 days ago ago

> Do you not think that some DRAM producer isn't going to see the high margins as a signal to create more capacity to get ahead of the other DRAM producers?

They took the bite during COVID and failed, so there's still fear from over supply.

[-]

seanmcdirmid 3 days ago ago

It only works if they collude on keeping supply steady. If anyone gets greedy for a bigger share of the AI pie, then it implodes quickly. Not all DRAM is made in South Korea so some nationalism will muddy the waters as well.

zozbot234 3 days ago ago

High margins are exactly what should create a strong incentive to build more capacity. But that dynamic has been tamped down so far because we're all scared of a possible AI bubble that might pop at any moment.

zmmmmm 3 days ago ago

There's not in the end all that much point having more memory than you can compute on in a reasonable time. So I think probably the useful amount tops out in the 128GB range where you can still run a 70b model and get a useful token rate out of it.

wkat4242 4 days ago ago

This article is so dumb. It totally ignores the memory price explosion that will make large fast memory laptops unfeasible for years and states stuff like this:

> How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly. It’s not possible to run these models on today’s consumer hardware, so real-world tests just can’t be done.

We know exactly the performance needed for a given responsiveness. TOPS is just a measurement independent from the type of hardware it runs on..

The less TOPS the slower the model runs so the user experience suffers. Memory bandwidth and latency plays a huge role too. And context, increase context and the LLM becomes much slower.

We don't need to wait for consumer hardware until we know much much is needed. We can calculate that for given situations.

It also pretends small models are not useful at all.

I think the massive cloud investments will put pressure away from local AI unfortunately. That trend makes local memory expensive and all those cloud billions have to be made back so all the vendors are pushing for their cloud subscriptions. I'm sure some functions will be local but the brunt of it will be cloud, sadly.

[-]

dcreater 3 days ago ago

Horrible article. Low effort, low knowledge. Had no idea the bar was so low for an IEEE publication

vegabook 4 days ago ago

also, state of the art models have hundreds of _billions_ of parameters.

[-]

omneity 4 days ago ago

It tells you about their ambitions..

layer8 3 days ago ago

The article is from mid-November (and probably was written even earlier), where the RAM price explosion wasn’t as striking yet.

kristianp 3 days ago ago

"Local AI" could be many different things. NPUs are too puny to run many recent models, such as image generation and llms. The article seems to gloss over many important details like this, for example the creative agency, what AI work are they doing?

> marketing firm Aigency Amsterdam, told me earlier this year that although she prefers macOS, her agency doesn’t use Mac computers for AI work.

openquery 3 days ago ago

For 99% of people I don't see the usecase (except for privacy but that ship sailed a decade ago for the aforementioned 99%). If the argument is inference offline - the modern computing experience is basically all done through the browser anyway so I don't buy it.

GPUs for video games where you need low latency makes sense. Nvidia GeForce Now works but not for any serious gaming. But when it comes to LLMs at least, the 100ms latency between you and the Gemini API or whichever provider you use is negligible compared to the inference time.

What am I missing?

[-]

mginszt 3 days ago ago

I'm sure giants like Microsoft would like to add more AI capabilities, and I'm also sure they would like to avoid running them on their own servers.

Another thing is that I wouldn’t expect LLMs to be free forever. One day, CEOs will decide that everyone has become accustomed to them - and that will be the first day of a subscription-based model and the last day of AI companies reporting financial losses.

bfrog 4 days ago ago

I suppose it depends on the model, code was useless. As a lossy copy of an interactive Wikipedia it could be ok not good or great just ok.

Maybe for creative suggestions and editing it’d be ok.

TrackerFF 3 days ago ago

With the wild ram prices, which btw are probably going to last out 2026, I expect 8 GB ram to be the new standard going on forward.

32 GB ram will be for enthusiasts with deep pockets, and professionals. Anything over that, exclusively professionals.

The conspiracy theorist inside me is telling me that big AI companies like OpenAI would rather see that people are using their puny laptops as terminals / shells only, to reach sky-based models, than to let them have beefy laptops and local models.

[-]

cmxch 3 days ago ago

Not if a few investigations into the foundries and their datacenter deals stops that.

andy99 3 days ago ago

  The conspiracy theorist inside me is telling me that big AI companies...

I don’t believe in conspiracies but I do believe in incentives sometimes lining up. Now that there is a RAM heavy cloud application, cloud providers are suddenly in direct competition with consumers for scarce resources, with the winner being able to control where people run their models.

fwipsy 4 days ago ago

Seems like wishful thinking.

> How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly.

Why not extrapolate from open-source AIs which are available? The most powerful open-source AI (which I know of) is Kimi K2 and >600gb. Running this at acceptable speed requires 600+gb GPU/NPU memory. Even $2000-3000 AI-focused PCs like the DGX spark or Strix Halo typically top out at 128gb. Frontier models will only run on something that costs many times a typical consumer PC, and only going to get worse with RAM pricing.

In 2010 the typical consumer PC had 2-4gb of RAM. Now the typical PC has 12-16gb. This suggests RAM size doubling perhaps every 5 years at best. If that's the case, we're 25-30 years away from the typical PC having enough RAM to run Kimi K2.

But the typical user will never need that much RAM for basic web browsing, etc. The typical computer RAM size is not going to keep growing indefinitely.

What about cheaper models? It may be possible to run a "good enough" model on consumer hardware eventually. But I suspect that for at least 10-15 years, typical consumers (HN readers may not be typical!) will prefer capability, cheapness, and especially reliability (not making mistakes) over being able to run the model locally. (Yes AI datacenters are being subsidized by investors; but they will remain cheaper, even if that ends, due to economies of scale.)

The economics dictate that AI PCs are going to remain a niche product, similar to gaming PCs. Useful AI capability is just too expensive to add to every PC by default. It's like saying flying is so important, everyone should own an airplane. For at least a decade, likely two, it's just not cost-effective.

[-]

sipjca 4 days ago ago

> It may be possible to run a "good enough" model on consumer hardware eventually

10-15 years?!!!! What is the definition of good enough? Qwen3 8B or A30B are quite capable models which run on a lot of hardware even today. SOTA is not just getting bigger, it's also getting more intelligence and running it more efficiently. There have been massive gains in intelligence at the smaller model sizes. It is just highly task dependent. Arguably some of these models are "good enough" already, and the level of intelligence and instruction following is much better from even 1 year ago. Sure not Opus 4.5 level, but still much could be done without that level of intelligence.

[-]

fwipsy 3 days ago ago

"Good enough" has to mean users won't be frequently frustrated if they transition to it from a frontier model.

> it is highly task dependent... much could be done without that level of intelligence

This is an enthusiast's glass-half-full perspective, but casual end users are gonna have a glass-half-empty perspective. Quen3-8B is impressive, but how many people use it as a daily driver? Most casual users will toss it as soon as it screws up once or twice.

The phrase you quoted in particular was imprecise (sorry) but my argument as a whole still stands. Replace "consumer hardware" with "typical PCs" - think $500 bestseller laptops from Walmart. AI PCs will remain niche luxury products, like gaming PCs. But gaming PCs benefit from being part of gaming culture and because cloud gaming adds input latency. Neither of these affects AI much.

[-]

sipjca 3 days ago ago

How many consumers (not business) are genuinely using frontier models? You think OpenAI and Anthropic will forever serve the most intelligent models to free users? Heck they don’t already

Efficiency gains exist and likely will continue, as well as hardware generally accelerating, as software and hardware starts to become co-optimized. This will take time no doubt but 10-15 years is hilariously long in this world. The iPhone has barely been out that long

And to be clear I think the other arguments are valid I just think the timeline is out of whack

epicureanideal 4 days ago ago

You may be correct, but I wonder if we'll see Mac Mini sized external AI boxes that do have the 1TB of RAM and other hardware for running local models.

Maybe 100% of computer users wouldn't have one, but maybe 10-20% of power users would, including programmers who want to keep their personal code out of the training set, and so on.

I would not be surprised though if some consumer application made it desirable for each individual, or each family, to have local AI compute.

It's interesting to note that everyone owns their own computer, even though a personal computer sits idle half the day, and many personal computers hardly ever run at 80% of their CPU capacity. So the inefficiency of owning a personal AI server may not be as much of a barrier as it would seem.

[-]

saltcured 3 days ago ago

But will it ever lead to a Mac Mini-priced external AI box? Or will this always be a premium "pro" tier that seems to rival used car prices?

seanmcdirmid 4 days ago ago

> but I wonder if we'll see Mac Mini sized external AI boxes that do have the 1TB of RAM

Isn't that the Mac Studio already? Ok, it seems to max at 512 GB.

marcus_holmes 4 days ago ago

> In 2010 the typical consumer PC had 2-4gb of RAM. Now the typical PC has 12-16gb. This suggests RAM size doubling perhaps every 5 years at best. If that's the case, we're 25-30 years away from the typical PC having enough RAM to run Kimi K2.

Part of the reason that RAM isn't growing faster is that there's no need for that much RAM at the moment. Technically you can put multiple TB of RAM in your machine, but no-one does that because it's a complete waste of money [0]. Unless you're working in a specialist field 16Gb of RAM is enough, and adding more doesn't make anything noticeably faster.

But given a decent use-case, like running an LLM locally, and you'd find demand for lots more RAM, and that would drive supply, and new technology developments, and in ten years it'll be normal to have 128TB of RAM in a baseline laptop.

Of course, that does require that there is a decent use-case for running an LLM locally, and your point that that is not necessarily true is well-made. I guess we'll find out.

[0] apart from a friend of mine working on crypto who had a desktop Linux box with 4TB of RAM in it.

chnmig 3 days ago ago

The power and resource consumption of local large models are problems that laptops have to solve, and new versions of models are constantly being released, which means that laptop configurations will soon become outdated.

meisel 3 days ago ago

I think only a small percentage of users care that much about running LLMs locally to pay for extra hardware for it, put up with slower and lower-quality responses, etc. . It’ll never be as good as non-local offerings, and is more hassle.

superkuh 3 days ago ago

The problem with this is that NPU have terrible, terrible support in the various software ecosystems because they are unique to their particular soc or whatever. No consistency even within particular companies.

gamblor956 3 days ago ago

The "AI laptop" boom is already fading. It turns out that LLMs, local or otherwise, just aren't very useful.

Like Big Data, LLMs are useful in a small niche of areas, like poorly summarizing meeting notes, or grammar check at a middle-school level.

On LLMs for coding tasks: I asked a programmer why they loved Claude and he showed me the output. Twenty years ago, that kind of code would have gotten someone PIP'd. Today it's considered better than most junior programmers...which is a sign of how far programming standards have fallen, and explains why most programs and apps are such buggy pieces of sh$t these days.

j45 4 days ago ago

This must be referring mostly to windows, or non-Apple laptops

0xbadcafebee 3 days ago ago

Wirth's Law in action. Eventually it's going to take an entire datacenter to read the news.

4 days ago ago

[deleted]

ge96 3 days ago ago

Wonder if this relates to/overlaps those Coral Accelerator devices.

xrd 3 days ago ago

The takeaway from these comments are that you can really run local models if you use m-series devices from apple.

But, can you do that if you install Linux on that hardware?

I hate to admit apple hardware is incredible. But, I can't say the same about macos anymore.

Can I run Linux and reap the benefits of m-series chips with local inference?

Or, are there any alternatives where I can use llms on Linux on a laptop?

bad_haircut72 3 days ago ago

My recent shower thought was the idea that Moores law hasnt slowed at all, we just went multi-core. Its crazy that the intel folks were so interested in optimizing for single thread CPU design they completely misunderstood where the best effort would be spent - if I had been around back then (speaking as an Elixir dev) I would have been way more interested in having 500 theead CPUs than getting down to nanometer scale dies. Thats what you get when everyone on the team is a bunch of C programmers

[-]

astrange 2 days ago ago

Intel designed a super high threaded CPU like that, Knightsbridge. It was useless. Single threaded programs are good.

ip26 3 days ago ago

Before LLMs, the use of parallelism on your typical laptop was limited to application level parallelism, e.g. one thread for Outlook and one for each tab in Chrome.

tehjoker 3 days ago ago

I mean, having a more powerful laptop is great, but at the same time, these guys are calling for a >10x increase in RAM and a far more powerful NPU. How will this affect pricing? How will it affect power management? It made it seem like most of the laptop will be dedicated to gen AI services, which I'm still not entirely convinced are quite THAT useful. I still want a cheap laptop that lasts all day and I also want to be able to tap that device's full power for heavy compute jobs!

esses 4 days ago ago

I spent a good 30 seconds trying to figure out what DDS was an acronym for in this context.

[-]

lucb1e 3 days ago ago

Care to share the answer?

[-]

esses 2 days ago ago

Turns out the first word of the article was odds.

suprjami 3 days ago ago

Extremely cringe article.

The biggest thing to affect laptops in "decades" is solid state storage. No longer do you need to worry about killing your entire device simply by putting it down on a solid surface.

There are also plenty of other things like modern dense lithium ion batteries with 12+ hour runtimes, super power friendly CPUs of all architectures, the ultra-thin body and metal body popularised by Apple, LCD panels without ghosting, external power bricks instead of literally a PC power supply in a briefcase.

But yeah sure, the infinite slop plagiarism machine is coming. Gotta get some clicks!

zkmon 3 days ago ago

You don't understand the needs of a common laptop user. Define the usecases that require reaching out to laptop instead of using the phone that is nearby. Those usecases don't need LLM for a common laptop user.

darkreader 3 days ago ago

[dead]

gguncth 4 days ago ago

I have no desire to run an LLM on my laptop when I can run one on a computer the size of six football fields.

[-]

theshrike79 3 days ago ago

The point is that when you run it on your own hardware you can feed the model your health data, bank statements and private journals and can be 5000% sure they’re not going anywhere

[-]

dboreham 3 days ago ago

Regular people don't understand nor care about any of that. They'll happily take the Faustian bargain.

[-]

theshrike79 3 days ago ago

It only needs one highly public breach and there's going to be a full-on business for someone selling a local-only AI processor for homes.

Combine it with a media player like an Apple TV or Nvidia Shield and people might buy it.

sandworm101 4 days ago ago

I've been playing around with my own home-built AI server for a couple months now. It is so much better than using a cloud provider. It is the difference between drag racing in your own car, and renting one from a dealership. You are going to learn far more doing things yourself. Your tools will be much more consistent and you will walk away with a far greater understanding of every process.

A basic last-generation PC with something like a 3060ti (12GB) is more than enough to get started. My current rig pulls less than 500w with two cards (3060+5060). And, given the current temperature outside, the rig helps heat my home. So I am not contributing to global warming, water consumption, or any other datacenter-related environmental evil.

[-]

DamonHD a day ago ago

Unless you normally use electric resistance heating (or some kind of fossil fuel with higher gCO2/kWh) then you don't get necessarily a free pass on the global warming thing!

Our whole home is heated with <500W on average: at this moment the heat pump is drawing 501W (H4 boundary) at close to freezing outside, and its demand is intermittent.

HelloUsername 3 days ago ago

> I am not contributing to global warming

lol

Local AI is driving the biggest change in laptops in decades