6 comments

  • swq115 4 hours ago ago

    Interesting approach. The mmap streaming idea is clever, but I'd love to see real-world benchmarks beyond TinyLlama — especially for the 140B claim. Running that on a Mac Mini with 16GB would be the real proof point.

    For context, I run a Mac Mini M4 as a homelab server and the memory pressure from even 7B models is noticeable. Curious how this handles sustained inference without thermal throttling.

  • pcf 7 hours ago ago

    Hi @fatihturker – exciting project if it works!

    I have a MacBook Pro M1 Max w/64 GB RAM, and a Mac Studio M3 Ultra w/96 GB RAM. What do you think is possible to run on these? Just curious before I really try it out.

  • deflator 9 hours ago ago

    Fascinating. I don't understand the technical terms, but running a big coding agent locally is a dream of mine, so I thank you for your efforts!

  • ryanholtdev 13 hours ago ago

    Running a Mac Mini M4 as a home server for a bunch of automation scripts right now. The mmap-based layer streaming is the part I'm most curious about -- how does latency look when you're streaming layers from disk mid-inference? I'd expect throughput to degrade sharply once you exceed unified memory, but maybe the Top-K sparsity masks enough of the weight accesses that it's not as bad as sequential streaming would be. What's the actual tokens/sec at 140B scale on the base Mac Mini config?