
  • mechramc 8 hours ago

    Hi HN, it's hard to communicate how frustratingly opaque Apple's hardware stack can be. Everyone targets the Mac's GPU for local models, but there is a dedicated accelerator, the Apple Neural Engine (ANE), sitting completely dark for LLM workloads. CoreML treats it as a black-box scheduler, stripping away any direct control or ability to train.

    There are a few real caveats here, but imo the fundamental constraint on using the ANE hasn't been compute; it's been the complete lack of a native orchestration layer. Building on incredible foundational reverse engineering by maderix (who mapped the private ANEClient/ANECompiler APIs and discovered the ~19 TFLOPS fp16 ceiling), I wanted to see if we could bridge the gap from a raw hardware exploit to a stable runtime. I just open-sourced Orion: an end-to-end system that bypasses CoreML entirely to run and train LLMs directly on the ANE.

    To be concrete about what this took to build: my day-to-day is in enterprise systems orchestration, not writing low-level Objective-C kernels. I approached the whole build as an exercise in architectural delegation, using Claude to rapidly generate the execution syntax while I managed the system state, debugged the hardware limits, and held the structural vision.

    What we ran into was a wall of undocumented silicon behavior, what I'll call the hardware impedance mismatch. We cataloged 17 programming constraints in total, 11 of which were completely undocumented.

    For example:

    • The concat operation causes an immediate, silent compiler failure.

    • BLOBFILE weights require a bizarre 64-byte offset from the chunk header, or you get silent numerical corruption.

    • The ANE maintains internal state that hard-caps you at ~119 compilations per process.
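    To make the offset constraint concrete, here is a short C sketch. The struct layout and names are hypothetical (the real BLOBFILE format is undocumented); the only rule taken from the catalog is the 64-byte gap between the chunk header and the fp16 payload:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative chunk layout -- the real BLOBFILE header is undocumented,
 * so these fields are placeholders. */
typedef struct {
    uint32_t magic;
    uint32_t payload_bytes;
} chunk_header_t;

/* Empirical rule: the fp16 weight payload starts 64 bytes past the end
 * of the chunk header, not immediately after it. Skip this offset and
 * you read plausible-looking garbage (silent numerical corruption). */
enum { ANE_WEIGHT_OFFSET = 64 };

/* Returns a pointer to the fp16 weights inside a mapped chunk. */
const uint16_t *chunk_weights(const uint8_t *chunk) {
    return (const uint16_t *)(chunk + sizeof(chunk_header_t)
                                    + ANE_WEIGHT_OFFSET);
}
```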

    Previous attempts at ANE training (like ANEgpt) hit a wall of NaN divergence after a single step. We solved this by wiring up a deferred compilation pipeline and implementing strict activation clamping to stop the fp16 overflow cascade (clamping activations to [-65504, 65504]). To bypass the 119-compilation limit, I used an exec() process restart loop after every training step.
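    A minimal sketch of those two mitigations, in plain C rather than the repo's Objective-C, with all names hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FP16_MAX 65504.0f  /* largest finite fp16 value */

/* Clamp an fp32 activation into fp16's finite range before it is
 * downcast for the ANE. Values beyond +/-65504 become +/-inf in fp16
 * and cascade into NaNs over subsequent steps; NaN inputs pass
 * through unchanged, so callers must not feed NaN. */
float clamp_fp16(float x) {
    if (x > FP16_MAX)  return FP16_MAX;
    if (x < -FP16_MAX) return -FP16_MAX;
    return x;
}

/* Sketch of the per-step process restart that sidesteps the ~119
 * compilations-per-process cap: after checkpointing a training step,
 * re-exec the binary so the ANE driver state starts fresh.
 * (Checkpointing is elided; argv is reused as-is.) */
void restart_after_step(char **argv) {
    execv(argv[0], argv);  /* only returns on failure */
    perror("execv");
    exit(1);
}
```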

    The leverage here is real. Orion currently hits 170+ tokens/s for GPT-2 124M decode, and more importantly, achieves mechanically stable multi-step training on a 110M-parameter transformer (loss dropping from 12.3 to 6.2 over 1,000 steps with zero NaNs).

    It's not entirely clean yet. The ANE bakes weights in at compile time, so every training update incurs a ~4.2s recompilation penalty. But imo, extracting raw, zero-idle-power throughput directly from Apple's silicon isn't just a benchmark iteration; it's a layer change for local, always-on AI. The repo (Objective-C runtime, 5-pass graph compiler, no Python orchestration) is up. I'd love to know what the systems engineers here think about the constraint catalog or potential weight-patching workarounds.
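    On the weight-patching question, the direction that seems worth probing is overwriting the fp16 payload of an already-compiled blob in place instead of recompiling. This is purely speculative: it assumes the payload offset and length are discoverable and that the compiled graph does not checksum its weights, neither of which is verified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Speculative sketch: splice fresh fp16 weights into a compiled blob
 * at a known payload offset, skipping the ~4.2s recompile. Names and
 * signature are hypothetical, not part of the Orion repo. */
void patch_weights(uint8_t *blob, size_t payload_off,
                   const uint16_t *new_w, size_t count) {
    memcpy(blob + payload_off, new_w, count * sizeof *new_w);
}
```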