Pointer Tagging in C++: The Art of Packing Bits into a Pointer

(vectrx.substack.com)

66 points | by signa11 2 days ago

53 comments

  • jandrewrogers 2 days ago

    As someone that uses pointer tagging, I must point out that this article is insufficiently defensive.

    I've done my own exploration of what I can get away with across 64-bit x86 and ARM in this regard. It has been a while but the maximum number of bits that are reliably taggable across all environments and use cases that I have been able to determine is six. Can you get away with more? Probably yes, but there are identifiable environments where it will explode if you do so. That may not apply to your use case.

    Reliable pointer tagging is not trivial.

    • dzaima a day ago

      At least on Linux, you're guaranteed to get the top 16 bits free on at least x86[1] and ARM[2]. Maybe other OSes are less nice, but generally an OS doesn't really have a reason to force down the full 57-or-whatever-bit range before the program actually requests 256 terabytes of virtual address space.

      It is generally kinda sad though that there's not a way to request from mmap or equivalent that the result is in a specific range of memory (in (0; 1<<48) here). Would be useful for JIT-compiling code that needs to call into precompiled functions too.

      [1]: https://www.kernel.org/doc/html/v5.8/x86/x86_64/5level-pagin...

      [2]: https://www.kernel.org/doc/html/v5.8/arm64/memory.html#bit-u...

      • bigstrat2003 a day ago

        > Maybe other OSes are less nice, but generally an OS doesn't really have a reason to force down the full 57-or-whatever-bit range before the program actually requests 256 terabytes of virtual address space.

        The fact that Linux does this isn't nice, it's a huge mistake. It means that the kernel can't automatically use 5-level page tables on processors that support it, because backwards compatibility guarantees mean the programs must be able to use those bits in a pointer. AMD was smart enough to throw an exception if programs use those bits in a pointer (thus guaranteeing forward compatibility), so why Linux didn't follow suit is puzzling.

        • dzaima a day ago

          Indeed most architectures error if the top bits are non-zero, but that's trivially worked around by just a manual `x & 0xffffffffffff` or `x << 16 >> 16` or similar, and software already utilizes that (and indeed must do that to not immediately break on x86).

          It is somewhat unfortunate to gate the larger address space behind specific mmap usage, but it's hard for me to imagine many programs actually needing more than 256TB of virtual memory that aren't doing so in a very specialized way.

          Certainly much less frequent than the already-infrequent (but very real and significant, including both Firefox/SpiderMonkey and WebKit/JavaScriptCore) cases of programs relying on the top 16 bits being zero.

          Then there's the option of mmap returning ranges from the low 2^48 while possible, and using larger addresses only when that completely runs out; that should mean existing software works fine until it needs more than 256TB, and, if the software checks that the top bits of mmap's result are zeroes, it's not negatively affected anyway.

          Really the proper solution is to go back in time and make mmap have separate lower and upper bounds addresses though.

    • orlp a day ago

      > Reliable pointer tagging is not trivial.

      It is if you use alignment bits. Not always possible if you don't control the data though.

    • 2 days ago
      [deleted]
    • forrestthewoods 2 days ago

      Can you share details? What modern platforms/environments does this not work on? Are you saying the intersection of available bits on all platforms is just 6? Or are there platforms that actually use 58 bits?

      Would be great to hear some actionable details.

      • jandrewrogers 2 days ago

        It wasn’t anything clever. A couple years ago I did a dive into x86 and ARM literature to determine what bits of a pointer were in use in various environments or were on a roadmap to be used in the future. To be honest, it was more bits than I was expecting.

        Note also that this is the intersection of bits that are available on both ARM and x86. If you want it to be portable, you need both architectures. Just because ARM64 doesn’t use a bit doesn’t mean that x86 doesn’t and vice versa.

        Both x86 and ARM have proposed standards for pointer tagging in the high bits. However, those bits don’t perfectly overlap. Also, some platforms don’t fully conform to this reservation of high bits for pointer tagging, so there is a backward compatibility issue.

        Across all of that, I found six high bits that were guaranteed to be safe for all current and future platforms. In practice you can probably use more but there is a portability risk.

        • loeg 2 days ago

          > Note also that this is the intersection of bits that are available on both ARM and x86. If you want it to be portable, you need both architectures. Just because ARM64 doesn’t use a bit doesn’t mean that x86 doesn’t and vice versa.

          Your mask/tag doesn't need to use the same bits on x86 and ARM to be portable, though.

          • jandrewrogers 2 days ago

            It depends on the application, those bits may be materialized across architectures. The objective was maximizing safety in all contexts.

            My perspective is biased by the requirements of high-assurance systems.

            • rhdjebejdbd 2 days ago

              It doesn't depend on the application unless the application shares the same pointers between x86 and ARM, which doesn't make any sense to me.

              Otherwise they're right: it's not the intersection that matters, just the total bits available.

              • jandrewrogers a day ago

                Eh? The values you can store in the tags are absolutely dependent on the number of available bits. That’s a simple type safety problem. This requirement is architecture independent.

                You can’t cram 8 bits of tag in 7 bits if the latter is all the architecture has available. Hence why you have to design for the smallest reliable target.

                • injidup a day ago

                  You are agreeing with each other whilst still arguing

            • forrestthewoods a day ago

              If one platform uses the upper 56 bits and another uses the lower 56 bits that doesn't mean you have 0 bits available for tagging. It means you have 8 bits and have to go through a conversion when moving from one platform to another. This is perhaps annoying but perfectly fine.

              Kinda weird to materialize pointers across architectures rather than indices.

              But in any case surely the relevant consideration is “fewest number of free pointer bits on any single platform”. And not “intersection of free bits across all platforms”. Right?

              • loeg a day ago

                Right.

        • menaerus 2 days ago

          How about taking advantage of the max_align_t pointer alignment guarantees, which on x86-64 Linux (glibc) give 16-byte alignment? This would leave the 4 lowest bits free to use.

          • jandrewrogers a day ago

            Unfortunately, the number of free low bits varies considerably, from quite a few down to zero. If you need a minimum number of bits to be reliably available then you have to look at the high bits of a pointer. Naturally, implementations use the low bits when alignment makes them available.

            • menaerus a day ago

              Hm, what I have seen so far is that pointers returned by system malloc are usually aligned either to an 8-byte boundary (Windows) or a 16-byte boundary (Linux). I think jemalloc interprets the C standard guarantees a bit differently and will return an 8-byte-aligned pointer for allocations whose size <= 8. But even this still leaves us with 3 LSBs to use.

      • jandrewrogers a day ago

        Page tables can optionally consume a very large number of bits on x86 (57?). Not every platform enables it but your code may run on a platform that uses it. There are a bunch of proposals from Intel, AMD, ARM, et al about officially recognizing some set of high bits in 64-bit pointers as tags in user space, with implementations. Unfortunately, these “standards” don’t agree on which high bits can be safely reserved for tagging.

        IIRC, the 6 high bits I mentioned was the intersection of every tag reservation implementation and/or proposal. In other words, it was the set of bits that Intel, AMD, and ARM agreed would be safe for tagging for the foreseeable future.

        Fewer bits than I would like and can probably exploit, but nonetheless the number I can reasonably rely on. If a consistent standard is ever agreed upon, the number of bits may increase.

      • joz1-k a day ago

        > are there platforms that actually use 58 bits?

        The original article already contains a note that "Some more recent x64 CPUs use 5-level paging, which would increase this number to 57 bits [0]"

        Apparently server-level "Sunny Cove" Intel CPUs implement this extension [1].

        [0]: https://en.wikipedia.org/wiki/Intel_5-level_paging

        [1]: <https://en.wikipedia.org/wiki/Sunny_Cove_(microarchitecture)>

      • nervoir 2 days ago

        If you include ARM then PAC and MTE will consume a few of those precious bits. Don’t think any platforms use PAC for pointers to allocated objects though unless they’re determined to be exceptionally important like creds structure pointers in task structures in the kernel.

  • jbn a day ago

    This uses the most significant bits to pack extra information into pointers.

    Another technique is to use the least significant bits: since pointers to 8-byte-aligned data are always multiples of 8, the lowest 3 bits are always zero, so it is possible to pack 3 bits of information there.

    And there's a third technique in C++, which is to use placement new to morph an object in place. This can be used to toggle between 2 classes with a boolean method, where one class has this method return true and the other has it return false. This creates per-object state that really uses bits in the vptr to store this boolean. Obviously this can be extended to a whole set of classes to store more bits. I have used this successfully to store refcount values (each AddRef/Release using placement new to morph between RefCounted1/RefCounted2/../RefCountedN classes) to implement reference counting without a counter.

  • sema4hacker 2 days ago

    In the 70's when memory was always small and expensive, I had to keep things as packed and tight as possible. But now memory is so huge and cheap that it's been a long time since I had to worry about things like packing bits, which is incredibly bug-prone anyway.

    • RossBencina a day ago

      Cache is not huge. If you want good performance you need to keep your working set in cache. Sometimes that requires storing offsets (say 32-bit) rather than pointers. If you can also pack info into spare bits, all the better. Perhaps not everyone needs to care about this, but there are still reasons to care.

      • gpderetta a day ago

        Indeed, you might have hundreds of GBs, or even TBs, of memory, but caches are still (relatively) small. L1 is going to be on the order of 64 KB!

    • archi42 a day ago

      Depending on what you're building, memory is still scarce. Of course in the past that was much more common, but even today just doubling memory doesn't always fly: embedded is still not a land of plenty, and not every computationally expensive task is a money earner, which might put you in a difficult spot when arguing with controlling about a bigger server or at least a memory upgrade.

      Also: Caching.

    • orlp a day ago

      I take it you never wrote code involving atomic pointers. Regardless of memory usage, a lot of platforms only provide single-word atomics (efficiently), making bit packing crucial for lock-free algorithms.

    • dh2022 2 days ago

      A benefit of packing data into pointers is when the data you need is already in the packed bits: this avoids a pointer dereference.

  • lentil_soup a day ago

    One cool technique is to use the packed bits to check if a raw pointer is still valid.

    Say you have a pool of objects, when you allocate one you give out a pointer, but in the packed bits you add a 'generation' counter. Every time you free that object from the pool you increment the counter. When you dereference a pointer you can check the generation in the packed pointer and the one in the pool. If they don't match you're using a pointer to an invalid object.

    • simonask a day ago

      It's a cool technique, but only useful in very limited scenarios. In particular, it is unsuitable when the number of reallocations is unbounded, because the generation counter can easily overflow.

      Even if you can squeeze in 32 tag bits (you would need to leverage the virtual memory system in some platform-specific way), counting to 4-billion-and-change takes no time on modern CPUs, so you would realistically overflow the tag in milliseconds.

      But if your system is aware of virtual memory, you can still do interesting things, like remapping a whole arena with a different "tag" and use the information in GC write barriers.

      • lentil_soup a day ago

        ah for sure, the context will matter but I'd say that if you're generating and destroying 4 billion objects very fast then probably storing pointers to those objects is already a bad idea!

        • simonask a day ago

          It's mostly a question of whether a user can accidentally or maliciously cause 4 billion objects to be constructed in the same location. Maybe it happens accidentally after the system has been running for a year, and everything comes tumbling down.

    • RossBencina a day ago

      A similar technique is sometimes used to mitigate the ABA problem in lock-free data structures.

      • j_seigh a day ago

        Only using way more bits. The original IBM lock-free stack algorithm assumed 32 bits was safe because it would take 100 years for a 32 bit counter to wrap at the time. Now it's less than 1 second.

        There's some Bugblatter Beast logic here. They assume that if they can't imagine a counter/pointer value reuse ever occurring, then malicious actors won't be able to imagine it either.

        Here's an idea: how about fixing the buffer overflows and lack of data sanitization that allow code injection to occur in the first place.

  • gblargg 2 days ago

    What's old is new again. The original 68000 processor only had a 24-bit physical address bus, so classic Mac OS used the upper 8 bits for tag information, and didn't even need to clear them when accessing memory. Once Macs started using CPUs and more RAM that needed these upper bits, programs had to be made "32-bit clean".

    I wonder whether you could use the MMU to ignore these upper bits, by mapping each combination of bits to the same address as with them clear.

  • int_19h 2 days ago

    I find it rather concerning that an article of this nature has not a single mention of the words "undefined behavior", even though that is something one has to be very much aware of when doing such shenanigans.

    • wavemode a day ago

      I don't think pointer tagging (in and of itself, when implemented correctly for a particular platform) is ever undefined behavior. Rather, it is implementation-defined behavior - the result of converting a pointer to an integer is not specified, and could in theory be anything, so you have to know how your compiler and platform behave.

      • RossBencina a day ago

        We pray that it is implementation defined. According to some [1] even arithmetic on intptr_t is UB, let alone bit operations.

        [1] https://stackoverflow.com/a/64863331/2013747

        • a day ago
          [deleted]
        • gpderetta a day ago

          Eh, that's about strictly conforming programs. In practice the vast majority of programs are going to use platform facilities and extensions (POSIX for example) while still being conforming.

    • RossBencina a day ago

      What exactly do you propose such articles say about potential UB? If you are aware of UB, you are also aware that (1) it is extremely difficult (if not impossible) to conclusively determine whether such a thing is UB without retaining a language lawyer at great expense, or ending up in an infinite regress of bickering, (2) simultaneously you know that in practice, arithmetic and bit-wise pointer manipulations need to work for many intended C use cases, so it is highly unlikely that such a thing will stop working in production compilers. Perhaps this places unrealistic trust in the ISO working groups, but the only valid alternative, in my view, is to give up and go back to writing low-level code in assembly where such things have well-defined semantics.

    • tialaramex a day ago

      IIUC, because C++ has pointer provenance but no explicit mechanism for tagging, nor clear instructions about what is guaranteed, this is basically always undefined behaviour, and yet in practice lots of it will work on real C++ compilers.

      Today in C++ the way to avoid technical UB for such tricks is to go via an integer type which is promised to have the same width as a pointer. In practice the compiler won't do anything different (types evaporate during compilation), but it isn't UB.

  • Panzerschrek 2 days ago

    Sadly, on 32-bit platforms there are not many unused bits in pointers, typically only the 3 least significant. But for some cases even 3 bits may be useful.

  • billpg a day ago

    I wonder if the language could support a tagged pointer type, where you ask the compiler for n extra bits along with the pointer. If the pointer's internal structure has space for those bits, it uses them, otherwise the compiler allocates space for the pointer and the tag as a larger structure.

  • mlhpdx 2 days ago

    Brings back memories. I cut my teeth in C++ working on a system that used pointer tagging and transactional memory. Good times. That mind mending experience perhaps made my career.

    • brcmthrowaway 2 days ago

      Whatever happened to STM?

      • jandrewrogers 2 days ago

        My understanding is that it didn’t provide much practical benefit that could justify the cost. I was interested at the time but the zeitgeist was that no one could come up with a scenario where it was worth it.

        Most of what hardware STM provided could be achieved by designing the software better.

      • RossBencina a day ago

        Clojure has transactional memory primitives. I've no idea how often they are used.

  • fooker 2 days ago

    This sort of stuff is starting to have hardware support now, so you no longer have to perform this wizardry.

    Enabled on most iPhones, even!