39 comments

  • hardwaresofton 20 hours ago

    Been doing some IPC experiments recently following the 3tilley post[0], because there just isn't enough definitive information (even if it's a snapshot in time) out there.

    Shared memory is crazy fast, and I'm surprised that there aren't more things that take advantage of it. Super odd that gRPC doesn't do shared memory, and basically never plans to?[1].

    All that said, the constructive criticism I can offer for this post is that in mass-consumption announcements like this one for your project, you should include:

    - RPC throughput (with the usual caveats/disclaimers)

    - Comparison (ideally graphed) to an alternative approach (ex. domain sockets)

    - Your best/most concise & expressive usage snippet

    100ns is great to know, but I would really like to know how many RPC/s this translates to without doing the math myself, or to see it with realistic de-serialization on the other end.
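
    (Back-of-the-envelope: 100ns per round trip works out to roughly 10M calls/s on a single core, since 1s / 100ns = 10^7, before any serialization or real work on either end.)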

    [0]: https://3tilley.github.io/posts/simple-ipc-ping-pong/

    [1]: https://github.com/grpc/grpc/issues/19959

    • a_t48 12 hours ago

      In my experience shared memory is really hard to implement well and manage:

      1. Unless you're using either fixed sized or specially allocated structures, you end up paying for serialization anyhow (zero copy is actually one copy).

      2. There's no way to reference count the shared memory - if a reader crashes, it holds on to the memory it was reading. You can get around this with some form of watchdog process, or by other schemes with a side channel, but it's not "easy".

      3. Similar to 2, if a writer crashes, it will leave behind junk in whatever filesystem you are using to hold the shared memory.

      4. There are other separate questions around how to manage the shared memory segments you are using (one big ring buffer? a segment per message?), and how to communicate between processes that different segments are in use and that new messages are available for subscribers. Doable, but also not simple.

      It's a tough pill to swallow - you're taking on a lot of complexity in exchange for that low latency. If you can, it's better to put things in the same process space - you can use smart pointers and a queue and go just as fast, with less complexity. Anything CUDA will want to be single process anyhow (ignoring CUDA IPC). The number of places where you need (a) ultra low latency (b) high bandwidth/message size (c) can't put everything in the same process (d) are using data structures suited to shared memory and finally (e) are okay with taking on a bunch of complexity just isn't that high. (It's totally possible I'm missing a Linux feature that makes things easy, though).
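
      To make the "smart pointers and a queue" point concrete, here is a rough sketch (my own illustration, nothing iceoryx2-specific) of in-process zero-copy hand-off in Rust - only the Arc is moved through the channel, the payload buffer is never copied:

        use std::sync::{mpsc, Arc};
        use std::thread;

        struct Frame { pixels: Vec<u8> }

        fn main() {
            let (tx, rx) = mpsc::channel::<Arc<Frame>>();
            let consumer = thread::spawn(move || {
                while let Ok(frame) = rx.recv() {
                    // Only the Arc pointer crossed the queue; the pixel buffer was not copied.
                    println!("got frame with {} bytes", frame.pixels.len());
                }
            });
            for _ in 0..3 {
                tx.send(Arc::new(Frame { pixels: vec![0u8; 1920 * 1080] })).unwrap();
            }
            drop(tx); // close the channel so the consumer's recv() loop ends
            consumer.join().unwrap();
        }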

      I plan on integrating iceoryx into a message passing framework I'm working on now (users will ask for SHM), but honestly either "shared pointers and a queue" or "TCP/UDS" are usually better fits.

      • elBoberido an hour ago

        > In my experience shared memory is really hard to implement well and manage:

        I second that. It took us quite some time to get the architecture right. After all, iceoryx2 is the third incarnation of this piece of software, with elfenpiff and me working on the last two.

        > 1. Unless you're using either fixed sized or specially allocated structures, you end up paying for serialization anyhow (zero copy is actually one copy).

        Indeed, we are using fixed-size structures with a bucket allocator. We have ideas on how to enable usage with types that support custom allocators, and even with raw pointers, but that is a crazy idea which might not pan out.
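
        As an illustration of what such a fixed-size, shared-memory-compatible payload can look like (hypothetical example, not the actual iceoryx2 API):

          // No Vec/String/Box - fixed capacity and a plain layout, so the struct can be
          // placed directly into a shared memory segment and read by another process as-is.
          #[repr(C)]
          #[derive(Copy, Clone)]
          struct LidarScan {
              timestamp_ns: u64,
              point_count: u32,
              points: [[f32; 3]; 1024],
          }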

        > 2. There's no way to reference count the shared memory - if a reader crashes, it holds on to the memory it was reading. You can get around this with some form of watchdog process, or by other schemes with a side channel, but it's not "easy".
        >
        > 3. Similar to 2, if a writer crashes, it will leave behind junk in whatever filesystem you are using to hold the shared memory.

        Indeed, this is a complicated topic and support from the OS would be appreciated. We found a few ways to make this feasible, though.

        The origins of iceoryx are in automotive, where it is required to split functionality up into multiple processes. When one process goes down, the system can still operate in a degraded mode or just restart the faulty process. For this, one needs an efficient, low-latency solution, or else the CPU spends more time copying data than doing real work.

        Of course there are issues like the producer mutating data after delivery, but there are also solutions for this. They will of course affect the latency but should still be better than using e.g. unix domain sockets.

        Fun fact: for iceoryx1 we supported only 4GB memory chunks, and some time ago someone asked if we could lift this limitation since he wanted to transfer a 92GB large language model via shared memory.

      • hardwaresofton 10 hours ago

        Thanks for sharing here -- yeah these are definitely huge issues that make shared memory hard -- the when-things-go-wrong case is quite hairy.

        I wonder if it would work well as a sort of opt-in specialization? Start with TCP/UDS/STDIN/whatever, and then maybe graduate, and if anything goes wrong, report errors via the fallback?

        I do agree it's rarely worth it (and same-machine UDS is probably good enough), but with essentially a 10x gain on the table, I'm quite surprised.

        One thing I've also found that actually performed very well is ipc-channel[0]. I tried it because I wanted to see how something I might actually use would perform, and it was basically 1/10th the perf of shared memory.

        [0]: https://crates.io/crates/ipc-channel

        • a_t48 4 hours ago

          The other thing is that a 10x improvement on basically nothing is still quite small. Whatever time it takes for a message to be processed is going to be dominated by actually consuming the message. If you have a great abstraction, cool - use it anyhow, but it's probably not worth developing a shared memory library yourself.

    • elBoberido an hour ago

      Thanks for the tips. We have a comparison with message queues and unix domain sockets [1] in the repo on GitHub [2].

      ~~It's nice to see that independent benchmarks are in the same ballpark as the ones we performed.~~ Edit: sorry, I confused your link with another one which also has ping-pong in its title.

      We provide data types which are shared memory compatible, which means one does not have to serialize/deserialize. For image or lidar data, one also does not have to serialize and this is where copying large data really hurts. But you are right, if your data structures are not shared memory compatible, one has to serialize the data first and this has its cost, depending on what serialization format one uses. iceoryx is agnostic to this though and one can select what's the best for a given use case.

      [1]: https://raw.githubusercontent.com/eclipse-iceoryx/iceoryx2/r...

      [2]: https://github.com/eclipse-iceoryx/iceoryx2

    • abhirag 15 hours ago

      At $work we are evaluating different IPC strategies in Rust. My colleague expanded upon 3tilley's work; they have updated benchmarks with iceoryx2 included here[0]. I suppose the current release should perform even better.

      [0]: https://pranitha.rs/posts/rust-ipc-ping-pong/

      • elBoberido an hour ago

        Sweet. Can we link to your benchmark from the main iceoryx2 readme?

      • nh2 13 hours ago

        Interesting that on Linux Unix Domain Sockets are not faster than TCP.

        People often say that the TCP stack overhead is high but this benchmark does not confirm that.

        • jcelerier 10 hours ago

          I'm curious about the benchmark. In my own benchmarks for another network IPC library (https://GitHub.com/ossia/libossia) Unix sockets were consistently faster than the alternatives when sending the same payloads.

        • billywhizz 8 hours ago

          the linux results are for a vm running on macos. not sure how useful that is. i certainly wouldn't draw any wider conclusions from them without trying to reproduce yourself. pretty sure they will be very different on bare metal.

        • billywhizz 8 hours ago

          i couldn't resist reproducing on bare metal linux (8th gen core i5, ubuntu 22.04):

            cargo run --release -- -n 1000000 --method unixstream
            cargo run --release -- -n 1000000 --method tcp
          
          ~9μs/op for unixstream, ~14μs/op for TCP.

          unixstream utilizes two cores at ~78% each, tcp only utilizes ~58% of each core. so there is also something wrong in the benchmarks where blocking is happening and cores are not being fully utilized.

      • hardwaresofton 9 hours ago

        Excellent writeup! I performed just about the same test, but I didn't see 13M rps in my testing; shared memory went up to about 1M.

        That said, I made sure to include serialization/deserialization (and JSON at that) to see what a realistic workload might be like.

    • pjmlp 12 hours ago

      Yeah, I think it is about time we re-focus on multi-processing as an extension mechanism, given the hardware we have available nowadays.

      Loading in-process plugins was a great idea 20-30 years ago; however, it has been proven that it isn't such a great idea in regards to host stability or exposure to possible security exploits.

      And shared memory is a good compromise between both models.

      • elBoberido an hour ago

        Indeed. That's our goal :)

  • emmanueloga_ a day ago

    Looks great! From a quick glance it seems like it is a cross platform shared memory library. Maybe similar to this? [1].

    Suggestion: would be cool to have a quick description of the system calls involved for each supported platform [2]. I'm guessing mmap on linux/osx and CreateFileMapping on Windows?

    --

    1: https://github.com/LiveAsynchronousVisualizedArchitecture/si...

    2: https://github.com/eclipse-iceoryx/iceoryx2?tab=readme-ov-fi...

    • elfenpiff 11 hours ago

      You guessed right. We have a layered architecture that abstracts this away for every platform. With this, we can support every OS as long as it has a way of sharing memory between processes (or tasks, as some RTOSes call them) and a way of sending notifications.
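
      For the curious, here is a minimal POSIX-only sketch of what such a layer boils down to on Linux/macOS (shm_open + mmap, using the libc crate); on Windows the same role would be played by CreateFileMapping/MapViewOfFile behind the same interface. This is illustrative only, not the actual iceoryx2 platform code:

        use std::ptr;

        // Create (or open) a named shared memory segment and map it into this process.
        fn map_shared_segment(name: &std::ffi::CStr, len: usize) -> std::io::Result<*mut u8> {
            unsafe {
                let fd = libc::shm_open(name.as_ptr(), libc::O_CREAT | libc::O_RDWR, 0o600 as libc::mode_t);
                if fd < 0 { return Err(std::io::Error::last_os_error()); }
                if libc::ftruncate(fd, len as libc::off_t) < 0 {
                    let err = std::io::Error::last_os_error();
                    libc::close(fd);
                    return Err(err);
                }
                let addr = libc::mmap(ptr::null_mut(), len, libc::PROT_READ | libc::PROT_WRITE, libc::MAP_SHARED, fd, 0);
                libc::close(fd); // the mapping keeps the segment alive
                if addr == libc::MAP_FAILED { return Err(std::io::Error::last_os_error()); }
                Ok(addr as *mut u8)
            }
        }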

  • fefe23 9 hours ago

    This smells like they are using shared memory, which is almost certainly a security nightmare. The way they are selling it makes me fear they aren't aware of what a time bomb they are sitting on.

    Shared memory works as a transport if you either assume that all parties are trusted (in which case why do IPC in the first place? Just put them in a monolith), or you do hardcore capturing (make a copy of each message in the framework before handing it off). Their web page mentions zero copy, so it's probably not the second one.

    Also, benchmarks are misleading.

    It's easy to get good latency if your throughput is so high that you can do polling or spin locks, like for example in benchmarks. But that's probably not a good assumption for general usage because it will be very inefficient and waste power and require more cooling as well.

    • zbentley 9 hours ago

      > Shared memory works as a transport if you either assume that all parties are trusted (in which case why do IPC in the first place? Just put them in a monolith)

      There are all sorts of domains where mutually-trusted parties need IPC. Off the top of my head and in no particular order:

      - Applications that pass validated data to/from captive subprocesses. Not everything is available as a natively-linked library. Not every language's natively-linked libraries are as convenient to reliably install as external binaries.

      - Parallelism/server systems farming work out to forked (but not exec'd) subprocesses. Not everything needs setuid. Sometimes you just want to parallelize number crunching without the headache of threads (or are on a platform like Python which limits threads' usefulness).

      - Replatforming/language transitions in data-intensive applications. Running the new runtime/platform in the same address space as the legacy platform can bring some hairy complexity, which is sidestepped (especially given the temporary-ness of the transitional state) with careful use of shared memory.

      And aren't systems like Postgres counterpoints to your claim? My memory isn't the greatest, but IIRC postgres's server-side connections are subprocesses which interact with the postmaster via shared memory, no?

      • fefe23 6 hours ago

        If you use shared memory with a captive process, that process can probably hack you if it gets taken over by an attacker.

        I agree with your parallelism counter-argument in principle. However, even there it would probably make sense for the parties not to trust each other, to limit the blast radius of successful attacks.

        In your next point the "careful" illustrates exactly my point. Using shared memory for IPC is like using C or C++ and saying "well I'll be careful then". It can work but it will be very dangerous and most likely there will be security issues. You are much better off not doing it.

        Postgres is a beautiful argument in that respect. Yes you can write a database in C or C++ and have it use shared memory. It's just not recommended because you need professionals of the caliber of the Postgres people to pull it off. I understand many organizations think they have those. I don't think they actually do though.

    • gnulinux 3 hours ago

      > Shared memory works as a transport if you either assume that all parties are trusted (in which case why do IPC in the first place? Just put them in a monolith), or you do hardcore capturing (make a copy of each message in the framework before handing it off). Their web page mentions zero copy, so it's probably not the second one.

      This is an extremely puzzling comment. I can think of thousands of such cases.

      First, there are many reasons to split your program into processes instead of threads (e.g. look at browsers), so even if you have a monolith, you may need IPC between trusted parties simply because of software engineering practices. As a more extreme example, if you're writing code in a language like Python, where multi-threading is a huge liability due to the GIL and the standard solution is to just use multi-processing, you'll need a channel between your processes (even if they're just fork()'d), and so you need to use something like the filesystem, a unix pipe, postgresql, redis, some ipc lib (e.g. TFA)... whatever, as a way to communicate.

      Second, your comment implies there is no scenario where implementing two separate programs is preferable to building a monolith. Even though you believe monoliths are better in general, it doesn't follow that they are always the right approach for every piece of software. You may have a program that requires extremely different computational techniques, e.g. one part written in Prolog because it needs logical constraint satisfaction solving, or one part that needs language X because you have to use a specialized library only available in X, or you may need one part of your program to be in C/C++/Go/Rust for improved latency, or you may need part of your program in "slow" language Y because every other codebase in your company is written in Y. This language barrier is simply one reason; I can come up with many others. For example, parts of the software may be developed by two separate teams, and the IPC is decided as the interface between them.

      Long story short, it's pretty normal to have a monolithic codebase but N processes running at the same time. In such cases, since all N processes are written by you, running on hopefully-trusted hardware, using an IPC framework like this is a good idea. This is not necessarily the most common problem in software engineering, but if you do enough systems programming you'll see that a need for IPC between trusted processes is hardly niche. I personally reach for tools like iceoryx quite frequently.

    • elfenpiff 4 hours ago

      > This smells like they are using shared memory, which is almost certainly a security nightmare.

      Yes, we are using shared memory, and I agree that shared memory is a challenge but there are some mechanisms that can make it secure.

      The main problem with shared memory is that one process can corrupt the data structure while another process is consuming it. Even verifying the contents of the data structure is insufficient since it can always be corrupted afterwards. We have named this the "modify-after-delivery problem" - a sender modifies the data after it has been delivered to a receiver.

      This can be handled with:

      1. memfd: The sender acquires it, writes its payload, seals it so that it is read-only and then transfers the file descriptor to all receivers. The receiver can verify the read-only seal with fcntl. Since Linux guarantees that the seal cannot be reverted, the receiver can now safely consume the data. This allows it to be used even in a zero-trust environment. [1] provides a good introduction (see the File-Sealing IPC subsection).

      2. Memory protection keys [2]: I do not have too much experience with them, but as far as I understand, they solve the problem with mprotect, meaning the sender can call mprotect and make the segment read-only for itself, but the receiver has no way of verifying it or of preventing the sender from calling mprotect again and granting itself read/write access to corrupt the data.

      So, the approach is that a sender acquires shared memory, writes its payload into it, makes it read-only, and then transfers it to the receivers.
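
      A rough sketch of that memfd/sealing flow in Rust (my illustration using the libc crate, Linux-only, not iceoryx2's actual implementation):

        use std::io::Write;
        use std::os::fd::{FromRawFd, RawFd};

        // Sender: create a sealable memfd, write the payload, then seal it read-only forever.
        fn create_sealed_payload(payload: &[u8]) -> std::io::Result<std::fs::File> {
            unsafe {
                let fd = libc::memfd_create(
                    b"payload\0".as_ptr() as *const libc::c_char,
                    libc::MFD_CLOEXEC | libc::MFD_ALLOW_SEALING,
                );
                if fd < 0 { return Err(std::io::Error::last_os_error()); }
                let mut file = std::fs::File::from_raw_fd(fd);
                file.write_all(payload)?;
                // After this call, not even the sender can modify or resize the contents.
                let seals = libc::F_SEAL_WRITE | libc::F_SEAL_SHRINK | libc::F_SEAL_GROW | libc::F_SEAL_SEAL;
                if libc::fcntl(fd, libc::F_ADD_SEALS, seals) < 0 {
                    return Err(std::io::Error::last_os_error());
                }
                Ok(file) // the fd is then passed to the receivers, e.g. over a unix domain socket
            }
        }

        // Receiver: verify the seals before consuming, so the data cannot change underneath us.
        fn seals_are_trustworthy(fd: RawFd) -> bool {
            let required = libc::F_SEAL_WRITE | libc::F_SEAL_SEAL;
            let seals = unsafe { libc::fcntl(fd, libc::F_GET_SEALS) };
            seals >= 0 && (seals & required) == required
        }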

      > Shared memory works as a transport if you either assume that all parties are trusted (in which case why do IPC in the first place?

      Robustness is another use case. In mission-critical systems you trust each process, but a crash caused by a bug in one sub-system must not bring down the whole system. So you split up the monolith into many processes, and the overall system survives if one process goes down or deadlocks, assuming you have a shared memory library that itself is safe. If you detect a process crash, you can restart it and continue operations.

      [1] https://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

      [2] https://www.kernel.org/doc/html/latest/core-api/protection-k...

  • boolit2 10 hours ago

    I'm in robotics education and we mostly work with Python, to make life easier for students. I'd love to push for more Rust, but so far there's no point to it.

    Multiprocess communication is something that we found lacking in Python (we want everything to be easily pip installable) and we ended up using shared memory primitives, which is a lot of code to maintain.

    What is the main roadblock for iceoryx2 Python bindings? Is it something you are looking for contributors for?

    • orecham 3 hours ago

      The only real roadblock for this is time. We'd welcome any contributor, especially for such an impactful contribution as this one. Making iceoryx2 an option for Python would be amazing. If you or someone you know wants to take it on, we would support you as much as possible.

      So far for this topic, we have only done some brief research on the options available to us, such as going over the C API or using something like PyO3.

  • npalli a day ago

    Congrats on the release.

    What's the difference between iceoryx and iceoryx2? I don't want to use Rust and want to stick to C++ if possible.

    • elBoberido a day ago

      Besides being written in Rust, the big difference is the decentralized approach. With iceoryx1 a central daemon is required, but with iceoryx2 this is not the case anymore. Furthermore, there is more fine-grained control over resources like memory and endpoints like publishers. Overall, the architecture is more modular, and it should be easier to port iceoryx2 to even more platforms and to customize it with 3rd party extensions.

      With this release we have initial support for C and C++. Not all features of the Rust version are supported yet, but the plan is to finish the bindings with the next release. Furthermore, with an upcoming release we will make it trivial to communicate between Rust, C and C++ applications and all the other language bindings we are going to provide, with Python being probably the next one.

      • sebastos 17 hours ago

        I've been looking around for some kind of design documents that explain how you were able to ditch the central broker, but I haven't found much. Do you have breadcrumbs?

        • elfenpiff 11 hours ago

          This is a longer story, but I'll try to provide the essence.

          * All IPC resources are represented in the file system and have a global naming scheme. So if you would like to perform service discovery, you take a look at `/tmp/iceoryx2/services`, list all service toml files that you are allowed to access, and handle them.

          * Connecting to a service means, under the hood, opening a specific shared memory segment identified via a naming scheme, adding yourself to the participant list, and receiving/sending data.

          * Crash handling/resource cleanup is done decentrally by every process that has the permissions to perform it.

          * In a central/broker architecture you would have the central broker that checks this in a loop.

          * In a decentralized architecture, we defined certain sync points where this is checked. These points are placed so that you check for the misbehavior before it would affect you. For instance, when a sender is supposed to send you a message every second but you do not receive it, you would actively check if it is still alive. Other sync points are when an iceoryx2 node is created or when you connect to or disconnect from a service.

          The main point is that the API is decentralized, but you can always use it with a central daemon if you like - you just don't have to. It is optional.
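
          As a toy illustration of the discovery step (not the actual iceoryx2 code), listing the service files boils down to something like:

            use std::fs;

            // Enumerate the per-service files under the global iceoryx2 directory.
            fn discover_services() -> std::io::Result<Vec<String>> {
                let mut services = Vec::new();
                for entry in fs::read_dir("/tmp/iceoryx2/services")? {
                    let path = entry?.path();
                    if path.extension().map_or(false, |ext| ext == "toml") {
                        services.push(path.display().to_string());
                    }
                }
                Ok(services)
            }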

        • simfoo 17 hours ago

          Same here. Shared memory is one of those things where the kernel could really help some more with reliable cleanup (1). Until then you're mostly doomed to have a rock-solid cleanup daemon, or are limited to eventual cleanup by restarting processes. I have my doubts that it isn't possible to get into a situation where segments are being exhausted and you're forced to intervene.

          (1) I'm referring to automatic refcounting of shm segments using posix shm (not sys v!) when the last process dies or unmaps

    • tbillington a day ago

      > Language bindings for C and C++ with CMake and Bazel support right out of the box. Python and other languages are coming soon.

    • tormeh a day ago

      Looks like it has significantly lower latency.

      > want to stick to C++ if possible

      The answer to that concern is in the title of the submission.

  • westurner 21 hours ago

    How does this compare to and/or integrate with Apache Arrow, which had "arrow plasma IPC" and is supported by pandas with dtype_backend="pyarrow", lancedb/lancedb, and Serde.rs? https://serde.rs/#data-formats

    • zxexz 18 hours ago

      The other commenter answering you is, I think, trying to point out that the Arrow plasma store is deprecated (and no longer present in the arrow project).

      I think it's worth being a little more clear here - Arrow IPC is _not_ deprecated, and has massive momentum - so much so that it's more or less already become the default IPC format for many libraries.

      To me it remains unclear what the benefits of Iceoryx2 over the Arrow ecosystem are, what the level of interoperability is, and what the tradeoffs of either are relative to each other. Within a single machine, you can mmap the IPC file. You can use Arrow Flight for inter-node or inter-process communication. You can use Arrow with Ray, which is where Plasma went.

      I love anything new in this space though - if/when I have time I'll check this out. Would love it if somebody could actually elaborate on the differences.

  • MuffinFlavored 8 hours ago

    Pretty cool. I see publish + subscribe example on GitHub but no request/response. Am I missing something?

    • elBoberido an hour ago

      Request-response is on our todo list and will be introduced in an upcoming release :)

      What are you needing request-response for?

  • forrestthewoods 16 hours ago

    Why is Windows target support tier 2 and not tier 1?

    • elfenpiff 11 hours ago

      Tier 1 also means all security/safety features. Windows is not used in mission-critical systems like cars or planes, so we do not need to add those to Windows.

      We aim to support Windows so iceoryx2 can be used safely and securely in a desktop environment.