4 comments

  • SachitRafa 6 hours ago

    The payments background makes total sense as an origin for this. I've seen teams reach for Ray or Prefect and spend weeks configuring things they didn't need, when the actual problem was just 'how do I get work off this one machine cleanly.'

    One thing I'm curious about: what happens when a gRPC worker goes quiet mid-execution?

    Does the caller find out, or is it purely fire and forget? I hit a similar decision point building a memory layer for AI agents and ended up skipping retry logic entirely, because the coordination overhead just wasn't worth it for my use case. Wondering if you landed in the same place or have a different take.

    Sub-millisecond dispatch locally is a good sign. The number I'd really want to see is how that holds up once you've got 20-30 workers in the mesh; that's usually where the interesting degradation starts.

  • takahitoyoneda a day ago

    As a solo dev, I usually avoid distributed Python runtimes entirely because managing the infrastructure overhead of Celery or Ray is a massive time sink. If Wool genuinely abstracts away those complex locking mechanisms without requiring a heavy Redis or Postgres cluster just to manage state, that is a huge win for smaller teams. How does your scheduler handle node failures mid-execution when exactly-once processing is strictly required?

    • bzurak a day ago

      I wouldn't say it abstracts the locking mechanisms away - if you need synchronization in your app, it's probably best to leave how that's achieved up to the user. What it does is make it possible to contain your business logic end-to-end in a single application/codebase without obfuscating it with distribution boundaries (e.g., calls out to other REST APIs or message queues). There are also still worker nodes to manage, but the architecture is much simpler in the sense that there are only workers to deal with - no control plane, scheduler, or other services involved.

      Regarding failures - Wool workers are simple gRPC services under the hood, and connections are long-lived HTTP/2 connections that persist for the life of the request. Worker-side failures simply manifest as Python exceptions on the client side, with the added nicety of preserving the FULL stack trace across worker boundaries (achieved with tbpickle). A core tenet of Wool is that it makes no assumptions about your workload - I leave it up to you to write a try/except block and handle exceptions in a manner appropriate to your use case. The goal is to keep Wool as unopinionated about this sort of thing as possible.
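      To be clear about the shape of that pattern (this is NOT Wool's API - it's a stdlib `concurrent.futures` stand-in, since exceptions propagate through futures the same way they propagate across Wool's worker boundary): a worker-side failure just surfaces as an ordinary exception at the call site, so the caller decides whether to retry, fall back, or re-raise.

```python
# Sketch of the failure-handling pattern described above, using the
# stdlib ThreadPoolExecutor as a stand-in for a remote worker pool.
# NOT Wool's API - purely illustrative.
from concurrent.futures import ThreadPoolExecutor


def flaky_task(x: int) -> int:
    # Simulate a worker-side failure for odd inputs.
    if x % 2:
        raise ValueError(f"worker failed on input {x}")
    return x * 2


def call_with_retry(x: int, attempts: int = 3) -> int:
    with ThreadPoolExecutor(max_workers=2) as pool:
        for attempt in range(attempts):
            try:
                # A worker-side exception re-raises here, at the caller.
                return pool.submit(flaky_task, x).result()
            except ValueError:
                # Caller-chosen policy: retry, then give up and re-raise.
                if attempt == attempts - 1:
                    raise
    raise RuntimeError("unreachable")


print(call_with_retry(4))  # worker succeeds: prints 8
```

      The retry policy lives entirely in user code, which is the "unopinionated" part - the runtime's only job is to deliver the exception faithfully.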

      I'm not sure about your specific needs, but I'm considering adding a simple CLI-based worker management tool for users who don't want or need a full service orchestrator like Kubernetes in their stack.

    • bzurak a day ago

      I should add - Wool supports ephemeral worker pools, i.e., pools that are spawned by your application directly and that live for the life of the WorkerPool context. The limitation right now is that there's no remote worker factory - you would need to implement a factory that spawns a remote worker, as well as a truly remote discovery protocol. These are things I plan to add in future updates, but for now only machine-local and LAN discovery is implemented.