AMD's EPYC 9355P: Inside a 32 Core Zen 5 Server Chip

(chipsandcheese.com)

160 points | by rbanffy 2 days ago

43 comments

  • haunter 2 days ago

    >768 GB of DDR5-5200. The 12 memory controllers on the IO die provide a 768-bit memory bus, so the setup provides just under 500 GB/s of theoretical bandwidth

    I know it's a server but I'd be so ready to use all of that as RAM disk. Crazy amount at a crazy high speed. Even 1% would be enough just to play around with something.
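
    A quick back-of-the-envelope check of that "just under 500 GB/s" figure (a sketch only, assuming twelve 64-bit channels at 5200 MT/s):

        # Theoretical peak bandwidth of a 768-bit DDR5-5200 setup
        bus_width_bits = 12 * 64           # 12 channels x 64 bits each = 768-bit bus
        transfers_per_sec = 5200e6         # DDR5-5200 = 5200 MT/s
        peak_gb_s = bus_width_bits / 8 * transfers_per_sec / 1e9
        print(f"{peak_gb_s:.1f} GB/s")     # 499.2 GB/s, i.e. just under 500 GB/s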

    • ksec a day ago

      I have been waiting for Netflix, using FreeBSD, to serve video at 1600Gb/s. They announced their 800Gbps record in 2021, and back then they were limited by CPU and memory bandwidth. With 500GB/s that limit is pretty much gone.

      • NaomiLehman 15 hours ago

        damn, that's a lot of gigabytes for a movie

    • mtoner23 a day ago

      For our build servers for devs we use roughly this setup as a RAM disk. It's amazing. Build times are lightning fast (compared to HDD/SSD).

      • privatelypublic a day ago

        I'm interested in... why? What are you building where loading data from disk is so lopsided vs CPU load from compiling, or network load/latency? (One 200ms round trip of "is this the current git repo?" is a heck of a lot of NVMe latency... and it's going to be closer to 2s than 200ms.)

        • finaard a day ago

          I'm running the same setup - our larger builders have two 32-core EPYCs with 2TB RAM. We were already doing that type of setup almost two decades ago at a different company, and at this one for over a decade now - back then it was the only option for speed.

          Nowadays NVMe drives might indeed be able to get close - but we'd probably still need to span multiple SSDs (reducing the cost savings), and the developers there are incredibly sensitive to build times. If a 5 minute build suddenly takes 30 seconds more we have some unhappy developers.

          Another reason is that it'd eat SSDs like candy. Current enterprise SSDs have something like a 10000 TBW rating, which we'd exceed in the first month. So we'd either get cheap consumer SSDs and replace them every few days, or enterprise SSDs and replace them every few months - or stick with the RAM setup, which over the life of the build system will be cheaper than constantly buying SSDs.
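
          To give a feel for the arithmetic (the numbers below are made-up placeholders, not the actual figures, just to show the shape of the estimate):

              # Back-of-the-envelope SSD wear for one heavily loaded builder.
              # Every input here is a placeholder, not a measured value.
              gb_written_per_build = 300    # scratch, objects, packages that get thrown away
              builds_per_day = 1200         # CI keeps the machine busy around the clock
              tb_per_day = gb_written_per_build * builds_per_day / 1000

              drive_rating_tbw = 10_000     # typical enterprise endurance rating
              print(f"{tb_per_day:.0f} TB/day, rating gone in ~{drive_rating_tbw / tb_per_day:.0f} days")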

          • trogdor 20 hours ago

            > Current enterprise SSDs have something like a 10000 TBW rating, which we'd exceed in the first month

            Wow. What’s your use case?

            • finaard 19 hours ago

              Same as the one earlier in the thread: Build servers, nicely loaded. A build generates a ridiculous amount of writes for stuff that just gets thrown out after the build.

              We actually did try with SSDs about 15 years ago, and had a lot of dead SSDs in a very short time. After that we went for estimating data written instead - it's cheaper. While SSD durability has increased a lot since then, everything else got faster as well - so we'd have SSDs last a bit longer now (back then it was a weekly thing), but still nowhere near where it'd be a sensible thing to do.

          • rbanffy 18 hours ago

            > If a 5 minute build suddenly takes 30 seconds more we have some unhappy developers

            They sound incredibly spoiled. Where should I send my CV?

            • finaard 17 hours ago

              You don't really want that. I'm keeping my sanity there only because my small company runs their CI and testing as a contractor.

              They are indeed quite spoiled - and that's not necessarily a good thing. Part of the issue is that our CI was good and fast enough that at some point a lot of the new hires never bothered to figure out how to build the code - so for quite a few the workflow is "commit to a branch, push it, wait for CI, repeat". And as they often just work on a single problem, the "wait" is time lost for them, which leads to unhappiness if we are too slow.

        • motorest a day ago

          > I'm interested in... why? What are you building where loading data from disk is so lopsided vs CPU load from compiling (...)

          This has been the basic pattern for ages, particularly with large C++ projects. With multi-CPU and multi-core systems, C++ builds tend to become IO-bound, especially during linking.

          Creating RAM disks is one of the most basic, low-effort strategies for improving build times, and I think it was the main driver for a few commercial RAM drive apps.
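
          On Linux the zero-effort version of this nowadays is just tmpfs; a minimal sketch (the paths and the make invocation are hypothetical):

              # Stage a checkout on tmpfs (/dev/shm) and build there, so object files
              # and temporaries never touch the disk. Paths below are hypothetical.
              import shutil, subprocess, tempfile

              src = "/home/dev/myproject"                 # hypothetical checkout
              ramdir = tempfile.mkdtemp(dir="/dev/shm")   # /dev/shm is tmpfs on most distros
              tree = shutil.copytree(src, f"{ramdir}/src")

              subprocess.run(["make", "-j32"], cwd=tree, check=True)
              shutil.copy(f"{tree}/out/app", "/home/dev/artifacts/")  # keep only what matters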

          • john01dav 20 hours ago

            Why do we need commercial ram drive apps when Linux has tmpfs, or is this a historical thing?

            • p_l 20 hours ago

              Historical, but also there were a bunch of physical RAM drives - RAMsan, for example, sold DRAM-based (battery-backed) appliances connected by Fibre Channel - they were used for all kinds of tasks, but often as very fast scratch space for databases. Some VAXen had a "RAM disk" card that was IIRC used as an NFS cache on some Unix variants. Etc. etc.

              • rbanffy 17 hours ago

                Still odd. The OS should be able to manage the memory and balance performance more efficiently than that. There's no reason to preallocate memory in hardware.

                • p_l 16 hours ago

                  It was often used to supplement available memory in cheaper or otherwise more flexible ways. For example, many hardware solutions allowed connecting more RAM than the main bus could otherwise address, or at a lower cost than main memory (for example due to differences in the interfaces required, added battery backup, etc.).

                  The RAMsan line, for example, started in 2000 with a 64GB DRAM-based SSD with up to 15 1Gbit FC interfaces, providing a shared SAN SSD for multiple hosts (very well utilized by some of the beefier clustered SQL databases like Oracle RAC) - but the company itself had been providing specialized high-speed DRAM-based SSDs since 1978.

        • bob1029 a day ago

          > One 200ms round trip of "is this the current git repo?" is a heck of a lot of NVMe latency... and it's going to be closer to 2s than 200ms

          I don't know where you're buying your NVMe drives, but mine usually respond within a hundred microseconds.

        • mikepurvis a day ago

          For the ROS ecosystem you're often building dozens or hundreds of small CMake packages, and those configure steps are very IO-bound - it's a ton of "does this file exist", "what's in this file", "compile this tiny test program", etc.

          I assume the same would be true for any project that is configure-heavy.
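
          A toy way to see it (illustrative only; the paths are fake, and the effect is far more dramatic on a cold cache or a networked filesystem):

              # A configure step is thousands of tiny "does this exist?" probes,
              # each one a filesystem metadata round trip.
              import os, time

              probes = [f"/usr/include/fake_header_{i}.h" for i in range(50_000)]
              t0 = time.perf_counter()
              hits = sum(os.path.exists(p) for p in probes)
              dt = time.perf_counter() - t0
              print(f"{len(probes)} probes, {hits} hits, {dt * 1e6 / len(probes):.1f} us each")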

    • skhameneh a day ago

      12 memory channels per CPU and DDR5-6400 may be supported (for reference, I found incorrect specs when I was looking at Epyc CPU retail listings some weeks ago), see https://www.amd.com/en/products/processors/server/epyc/9005-...

    • summarity 20 hours ago

      > Crazy amount at a crazy high speed

      That's 300GB/s slower than my old Mac Studio (M1 Ultra). Memory speeds in 2025 remain thoroughly unimpressive outside of high-end GPUs and fully integrated systems.

      • AnthonyMouse 17 hours ago

        The server systems have that much memory bandwidth per socket. Also, that generation supports DDR5-6400 but they were using DDR5-5200. Using the faster stuff gets you 614GB/s per socket, i.e. a dual socket system with DDR5-6400 is >1200GB/s. And in those systems that's just for the CPU; a GPU/accelerator gets its own.

        The M1 Ultra doesn't have 800GB/s because it's "integrated", it simply has 16 channels of DDR5-6400, which it could have whether it was soldered or not. And none of the more recent Apple chips have any more than that.

        It's the GPUs that use integrated memory, i.e. GDDR or HBM. That actually gets you somewhere -- the RTX 5090 has 1.8TB/s with GDDR7, the MI300X has 5.3TB/s with HBM3. But that stuff is also more expensive which limits how much of it you get, e.g. the MI300X has 192GB of HBM3, whereas normal servers support 6TB per socket.

        And it's the same problem with Apple even though there's no great reason for it to be. The 2019 Intel Xeon Mac Pro supported 1.5TB of RAM -- still in slots -- but the newer ones barely reach a third of that at the top end.

        • wtallis 4 hours ago

          > The M1 Ultra doesn't have 800GB/s because it's "integrated", it simply has 16 channels of DDR5-6400, which it could have whether it was soldered or not.

          The M1 Ultra has LPDDR5, not DDR5. And the M1 Ultra was running its memory at 6400MT/s about two and a half years before any EPYC or Xeon parts supported that speed - due in part to the fact that the memory on an M1 Ultra is soldered down. And as far as I can tell, neither Intel nor AMD has shipped a CPU socket supporting 16 channels of DRAM; they're having enough trouble with 12 channels per socket, which often means you need the full width of a 19-inch rack for DIMM slots.

          • AnthonyMouse 27 minutes ago

            LPDDR5 is "low power DDR5". The difference between that and ordinary DDR5 isn't that it's faster, it's that it runs at a lower voltage to save power in battery-operated devices. DDR5-6400 DIMMs were available for desktop systems around the same time as Apple. Servers are more conservative about timings for reliability reasons, the same as they use ECC memory and Apple doesn't. Moreover, while Apple was soldering their memory, Dell was shipping systems using CAMM with LPDDR5 that isn't soldered, and there are now systems from multiple vendors with CAMM2 and LPDDR5X.

            Existing servers typically have 12 channels per socket, but they also have two DIMMs per channel, so you could double the number of channels per socket without taking up any more space for slots. You could also use CAMM which takes up less space.

            They don't currently use more than 12 channels per socket even though they could because that's enough to not be a constraint for most common workloads, more channels increase costs, and people with workloads that need more can get systems with more sockets. Apple only uses more because they're using the same memory for the GPU and that is often constrained by memory bandwidth.

      • matja 15 hours ago

        Do you have a benchmark that shows the M1 Ultra CPU to memory throughput?

    • tehlike a day ago

      I have 1TB ram on my home server. It's 2666 though...

      • WarOnPrivacy a day ago

        > I have 1TB ram on my home server. It's 2666 though...

        this kit? https://www.newegg.com/nemix-ram-1tb/p/1X5-003Z-01930

        • tehlike 14 hours ago

          No, 16 × 64 GB Samsung LRDIMM sticks off of eBay. $35 per stick IIRC.

        • prodipto81 a day ago

          3 just !!!!

        • mulmen a day ago

          Wow. I tried to tap but the Newegg app has an unskippable 5 second ad for something I didn’t read. What a shame. My fault for having their app installed I guess.

      • saltcured 13 hours ago

        Man, here I am in 2025 and my home server is a surplus Thinkpad P70 with just 64 GB RAM...

    • elorant a day ago

      Even better, you could use it for inference - with that much RAM you could load any model.

    • bigiain a day ago

      Indeed. I wonder what a system like that would cost (at consumer available prices)?

      • magicalhippo a day ago

        From what I can find here in Norway the CPU would be $3800, mobo around $2000, and one stick of 64 GB 6400 MHz registered ECC runs about $530, so about $6400 for the full 768 GB. Couldn't find any kits for those.

        So those components alone would be just over $12k.

        That's just from regular consumer shops, and includes 25% VAT. Without the VAT it's about $9800.
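
        (For reference, the totals from the prices above:)

            cpu, mobo, ram = 3800, 2000, 6400    # CPU, motherboard, 12 x 64 GB DDR5
            total = cpu + mobo + ram
            print(total, round(total / 1.25))    # ~12200 incl. 25% VAT, ~9760 excl.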

        The problem for consumers is that just about all the shops that sell such parts, and that you might get a deal from, are geared towards companies and aren't interested in dealing with consumers due to consumer protection laws.

        • mlrtime a day ago

          The best deal on these high-end servers for consumers is to find a large local server reseller - meaning a company that buys used datacenter equipment in bulk and then resells it. It's not always used or old equipment.

          • magicalhippo 18 hours ago

            True, though at least here that'll be older stuff, and it seems to be almost exclusively Intel parts.

            I found a used server with 768 GB DDR4 and dual Intel Gold 6248 CPUs for $4200 including 25% VAT.

            That's a complete 2U server; the CPUs are a bit weak, but not too bad all in all.

  • ashvardanian 2 days ago

    Those are extremely uniform latencies. It seems like on these CPUs most of the benefit from NUMA-aware thread pools will come from reduced contention - synchronizing small subsets of cores - rather than from actual memory affinity.

    • afr0ck a day ago

      NUMA is only useful if you have multiple sockets, because then you have several I/O dies and you want your workload 1) to be closer to the I/O device and 2) to avoid crossing the socket interconnect. Within the same socket, all CPUs share the same I/O die, hence the uniform latency.

    • PunchyHamster 2 days ago

      Well, all of the memory hangs off the IO die. I remember AMD docs outright recommending that the processor be configured to hide NUMA nodes from the workload, as trying to optimize for them might not even do anything for a lot of workloads.

      • phire 2 days ago

        That AMD slide (in the conclusion) claims their switching fabric has some kind of bypass mode to improve latency when utilisation is low.

        So they have been really optimising that IO die for latency.

        NUMA is already workload-sensitive - you need to benchmark your exact workload to know if it's worth enabling or not - and this change is probably going to make it even less worthwhile. Sounds like you will need a workload that really pushes total memory bandwidth to make NUMA worthwhile.

  • flumpcakes 2 days ago

    The first picture has a typo on its left-hand side.

    It says 16 cores per die with up to 16 Zen 5 dies per chip. For Zen 5 it's 8 cores per die and 16 dies per chip, giving a total of 128 cores.

    For Zen 5c it's 16 cores per die and 12 dies per chip, giving a total of 192 cores.

    Weirdly, it's correct on the right side of the image.

  • iberator a day ago

    Is it true that EPYC doesn't use the program counter - as in, the next instruction address is in the second operand for some operations?

    • nine_k a day ago

      EPYC runs x64 code. In it, jump instructions work exactly as you describe.