Show HN: Slipstream – A Python library for stateful stream processing

(slipstream.readthedocs.io)

25 points | by menziess a day ago ago

6 comments

  • skadamat 20 hours ago ago

    Interesting! How do you see this comparing with Bytewax - https://github.com/bytewax/bytewax

    • menziess 10 hours ago ago

      Compared to Bytewax, Slipstream puts the emphasis on freedom, at the cost of having to implement certain features yourself.

      This freedom let's you do things that other libraries may not offer within the bounds of their API. For instance, I do see that joins and windows are supported in Bytewax, but is it possible to do more complex stateful joins based timestamps (temporal joins) or other arbitrary conditions?

      If it does, then that's great. But I've had experiences where limitations became apparent during an end-phase of a project. When the API starts to reach its limits, but you're already invested in it quite deeply.

  • woile 20 hours ago ago

    Looks interesting! Great work! When using cache, if you have 2 topics, each with multiple partitions. How does the join operation works? what if the partitions don't have the same id?

    • menziess 10 hours ago ago

      Thanks! This example uses nearby-joins (based on event time): https://slipstream.readthedocs.io/en/1.0.1/cookbook.html#joi....

      I hope I get your question right, but if you're joining on a fixed key, the solution should be simpler. If the data is partitioned by id (partition 0 has id's 0 - 9, and 1 has 10 - 19), each message still passes through the handler that stores each message in the cache. So in the cache it will no longer be partitioned. Let's say each weather update has a fixed `id` we could join on, we'd simply use the following logic instead:

      ``` if w := weather_cache[id]: return f'The weather during {a["value"]} was {w["value"]}'

      return a['value'], '?' ```

      It may be the case that the weather updates come in late, in which case we may be joining with stale data. For that we can use Synchronization: https://slipstream.readthedocs.io/en/1.0.1/cookbook.html#syn...

      Here's a full example that sends out corrections: https://gist.github.com/Menziess/22d8a511f61c04a8142d81510a0...

      Instead, you could also wait by pausing the activity stream by setting the Checkpoints Dependeny downtime_threshold to 0. Perhaps negative values may also work, although I haven't tried yet.

  • menziess 21 hours ago ago

    The link leads to the readthedocs. The library can be used to parallelize data processing. `AsyncIterables` can be used a data sources, and any `Callable` can be used as a sink. RocksDB is used to preserve state. Checkpoints are used to detect stream downtimes, which will pause dependent streams until the dependency streams have recovered and have caught up.

  • menziess a day ago ago