19 comments

  • MidasTools an hour ago

    Running a solo dev business on top of multi-agent Claude Code workflows (OpenClaw stack) -- the cascading context drift problem is real and the state partitioning approach in this thread is the right instinct.

    The failure mode that bit us hardest: agents sharing a single context window where early tool outputs pollute later reasoning. Fixed it by treating each agent turn as append-only -- worker writes output to a structured log, reviewer reads only that log (not the raw conversation history). That isolated the drift. Night and day.

    The confidence score idea is underutilized. We log tool call outcomes as: {action, result, confidence: 0-3}. The reviewer agent pattern-matches on low-confidence streaks before they compound into something unfixable.
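
    A minimal sketch of the shape in Python (field names, threshold, and streak length here are illustrative, not our exact schema):

        import json

        def log_outcome(log_path, action, result, confidence):
            """Append one tool-call outcome to the worker's structured log (JSONL)."""
            with open(log_path, "a") as f:
                f.write(json.dumps({"action": action, "result": result,
                                    "confidence": confidence}) + "\n")

        def low_confidence_streak(log_path, streak_len=3, threshold=1):
            """Reviewer-side check: True if the last few entries are all low-confidence."""
            with open(log_path) as f:
                entries = [json.loads(line) for line in f if line.strip()]
            tail = entries[-streak_len:]
            return len(tail) == streak_len and all(e["confidence"] <= threshold for e in tail)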

    On the multi-model review question from another commenter: different models catch different failure types. Claude catches logical inconsistencies; a smaller/faster model catches format errors and incomplete outputs. Cheap pre-check before the expensive reviewer saves a lot of token burn.

    What's your retry strategy when the reviewer blocks -- exponential backoff on the same worker context, or fresh context each retry? We do fresh context after 2 failures.

    • unohee an hour ago

      OpenSwarm isolates context at the agent level — each worker is spawned via Claude Code’s -p flag, so there’s no shared conversation history between agents. The only shared state is written artifacts and a global work memory layer (CLAUDE.md + structured output). Each instance treats that as its single source of truth, rather than reading other agents’ raw context.

      One thing I’m actively formalizing: a CONFIDENCE-HALT mechanism. Currently it lives as a defined concept in CLAUDE.md, but the next revision will have OpenSwarm inject it explicitly into each worker context — so low-confidence streaks trigger a halt before they compound. Your {action, result, confidence: 0-3} logging pattern is basically the same instinct. Still early, but converging fast.

      Curious how you handle the structured log schema — do you version it across runs?
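
      A rough sketch of that spawn-and-artifact flow in Python (paths and wiring are illustrative, not the actual OpenSwarm source):

          import json, pathlib, subprocess

          def run_worker(task_id, prompt, artifacts_dir="artifacts"):
              # Fresh `claude -p` invocation per worker: no shared conversation history.
              out = subprocess.run(["claude", "-p", prompt],
                                   capture_output=True, text=True, check=True)
              # The only shared state is the written artifact, not the worker's transcript.
              artifact = pathlib.Path(artifacts_dir) / f"{task_id}.json"
              artifact.parent.mkdir(parents=True, exist_ok=True)
              artifact.write_text(json.dumps({"task": task_id, "output": out.stdout}))
              return artifact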

  • das-bikash-dev 31 minutes ago

    the context isolation approach is smart — cascading drift between agents is a real problem. i run 10 microservices with claude code and solved a similar issue by maintaining curated reference docs that agents read on-demand per task area instead of loading everything. the model escalation on failure (haiku → sonnet) is a nice touch too. do you find the lancedb memory layer actually helps with repeated similar tasks, or is it more useful for the code knowledge graph side?

  • jamiecode 4 hours ago

    The reviewer/worker pattern gets tricky when they share state. The pattern I've found that works: each agent owns a separate state partition, and they communicate through a shared message queue (even a simple append-only JSONL file works). Worker writes output + confidence score. Reviewer reads, adds a decision record, worker reads that before retrying.
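
    For concreteness, a sketch of that JSONL handoff in Python (record fields are just examples):

        import json, time

        QUEUE = "pipeline.jsonl"

        def append(record):
            record["ts"] = time.time()
            with open(QUEUE, "a") as f:
                f.write(json.dumps(record) + "\n")

        # worker posts its result
        append({"role": "worker", "task": "T-42", "output": "draft patch", "confidence": 2})
        # reviewer posts a decision record
        append({"role": "reviewer", "task": "T-42", "decision": "block", "reason": "missing tests"})

        def latest_decision(task):
            """Worker reads the most recent reviewer decision for its task before retrying."""
            with open(QUEUE) as f:
                records = [json.loads(line) for line in f]
            decisions = [r for r in records if r.get("task") == task and r.get("role") == "reviewer"]
            return decisions[-1] if decisions else None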

    The key thing to get right: make the retry idempotent. If worker retries the same task, it should produce the same side effects as a fresh run, not double them. This is harder than it sounds when agents are calling real APIs or writing files.
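
    One simple way to get there (a sketch; the keying scheme is just an example): derive every side effect from a deterministic task key, so a retry overwrites instead of duplicating.

        import json, pathlib

        def idempotent_write(task_id, payload, out_dir="outputs"):
            # Same task id always maps to the same file, so a retry overwrites
            # the previous attempt's output rather than adding a second one.
            path = pathlib.Path(out_dir) / f"{task_id}.json"
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(payload))
            return path

    For real API calls, the analogue is sending an idempotency key where the provider supports one, so a retried request gets deduplicated server-side.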

    How does OpenSwarm handle the case where worker keeps failing reviewer? Is there a max retry count, and if so, what happens to the Linear issue?

    • unohee 3 hours ago

      For the current build, OpenSwarm uses a max retry count with an escalation scheme: the first worker starts with Haiku, and if the tester/reviewer blocks enough times, it escalates to Sonnet. Each pipeline step updates Linear's updates tab with iteration count and total cost, so there's a full audit trail per issue. Failed jobs stay as 'in progress' or 'in review' in Linear rather than being auto-closed.

      I'm currently working on an 'Auditor' layer that analyzes why jobs failed — and longer term, the goal is for OpenSwarm to maintain itself using its own agents. That said, not every failure should be resolved automatically. Some errors genuinely need human judgment, and the dashboard chat interface and Discord are there for exactly that. I think knowing when to hand off to a human is part of what makes an autonomous system actually trustworthy.
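
      The loop is conceptually something like this (sketch only, not the actual source; worker_run, reviewer_run, and post_linear_update are placeholder stand-ins for the real pipeline steps):

          MODELS = ["haiku", "sonnet"]   # escalate to the stronger model after repeated blocks
          MAX_ATTEMPTS = 4
          ESCALATE_AFTER = 2

          # Placeholder stubs, only so the sketch runs on its own.
          def worker_run(task, model):      return f"candidate patch for {task} ({model})"
          def reviewer_run(task, result):   return "approve" if "patch" in result else "block"
          def post_linear_update(task, attempt, verdict):  print(task, attempt, verdict)

          def run_issue(task):
              for attempt in range(MAX_ATTEMPTS):
                  model = MODELS[min(attempt // ESCALATE_AFTER, len(MODELS) - 1)]
                  result = worker_run(task, model=model)
                  verdict = reviewer_run(task, result)
                  post_linear_update(task, attempt=attempt + 1, verdict=verdict)  # audit trail per issue
                  if verdict == "approve":
                      return result
              return None   # issue stays 'in progress' / 'in review' for a human to triage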

  • csto12 7 hours ago

    Is there a new agent orchestrator posted every day? Is this the new JS framework?

    • guessmyname 4 hours ago

      Yes. Everyone and their grandma wants to build the ultimate panacea of AI so of course you’ll see a myriad of AI-powered products and services on a daily basis until the tech industry as a whole is done with the topic.

    • himata4113 6 hours ago

      Everyone has different needs. I've made one for oh-my-pi that has file-backed tasks which accept natural language to create jobs (parallelizing them whenever relevant).

      Haven't felt the need to show the world tho.

      • avoutic 4 hours ago

        This! I have one with Linear, Nanobot, Claude Code, all automated in a way that works for me.

        Welcome to the age of selfware! Where everybody makes what they need! :)

        • verdverm 4 hours ago

          I'll chime in that I use CUE, ADK-Go, Dagger, and Gemini-flash to build a Copilot alternative that is much better.

          The best part of building your own is all the things you will learn along the way.

    • unohee 4 hours ago

      Kind of. My point is that agent orchestrators become actually useful when the framework is specific about what's safe to delegate to machines — things that reduce friction in CI/CD operations, not agents that shoot iMessages, click around in browsers, or delete files without approval.

    • verdverm 4 hours ago

      life with tools like openclaw means life with ns;nt abundance

      hopefully it dies down as people realize there's more to it than the code

    • reconnecting 4 hours ago

      The timeline is always the same.

      Day one: Develop a new agent orchestration with 70K LOC from Claude.

      Day three: Post it on Show HN.

      Day four: Get 50–150 stars on GitHub.

      Day seven: Never open this repo again.

      • verdverm 3 hours ago

        That's slow; plenty of Claw HN posts pull off the first half in a couple of hours. Best I've seen is 25m.

  • vladgur 2 hours ago

    have you considered having different models (e.g. codex) do the reviews? i wonder if it presents an opportunity to catch more issues than the same model

    • unohee an hour ago

      For the collaboration between two different models — I’d love to explore that. Expanding model compatibility to broader providers (Codex, Aider, and other API models) is already on my roadmap. I’m planning to add a reviewer feature that supports multiple models, configurable simply by adding an API key to the .env file. Thanks for the suggestion!

      • kaicianflone an hour ago

        I’ve been running OpenClaw Docker agents in Slack in a similar setup, using Gemini 2.5 Flash Lite through OpenRouter for most tasks, then Opus 4.6 and Codex 5.3 for heavier lifts. They share context via embeddings right now, but I’m going to try parameterizing them like you suggested, because they can drift pretty hard once a hallucinated idea takes off. I’m trying to get to a point where I don’t have to babysit them.

        I’ve also been thinking about giving them some “democracy” under the hood with a consensus policy engine. I’ve started tinkering with an open-source version of that called consensus-tools that I can swap between agentic frameworks. Checking whether I can get it working with OpenSwarm too.
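
        At its simplest the consensus piece is just a vote across models; toy Python sketch (consensus-tools itself may work differently):

            from collections import Counter

            def consensus(answers, quorum=2):
                """answers: {model_name: proposed_action}. Return the agreed action, or None to escalate."""
                action, votes = Counter(answers.values()).most_common(1)[0]
                return action if votes >= quorum else None

            # e.g. consensus({"gemini-flash": "merge", "opus": "merge", "codex": "block"}) -> "merge"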

  • mihneadevries 7 hours ago

    the reviewer/worker pipeline is honestly the part I'm most curious about. like how do you handle disagreements between agents, does the reviewer just block and the worker retries, or is there a loop with a hard cutoff?

    the failure mode I'd worry about most is cascading context drift, where each agent in the chain slightly misunderstands the task and by the time you get to the test agent it's validating the wrong thing entirely. fwiw I think the LanceDB memory is the right call for this kind of setup, keeping shared context grounded is probably what prevents most of those drift issues.

    • unohee 4 hours ago

      The worker-reviewer pipeline typically runs 1–2 self-revision iterations. In my experience, agents handle most tasks fine, but they tend to miss quality gates — docstrings, minor business logic edge cases, that kind of thing. The reviewer catches what slips through on the code quality side. This is all based on observed behavior from daily Claude Code CLI usage, where I've added hooks specifically to catch systematic failure patterns. OpenSwarm is essentially a productized version of that scaffolding from my actual workflow — packaged into a more reusable architecture.

      On context drift — good call, and yeah, that's exactly why the shared memory layer matters. LanceDB keeps the grounding consistent across the chain so each agent isn't just working off its own drifting interpretation.

      As for disagreements: right now the reviewer blocks and the worker retries with feedback, with a hard cutoff to prevent infinite loops. It's simple but it works — the revision depth rarely needs to go beyond 2 rounds. And when it does fail, that's actually the useful signal — especially when you're triaging larger projects, the points where agents break down are exactly where a human engineer needs to step in.

      At this point, what OpenSwarm really needs is broader testing from other users to validate these patterns outside my own workflow.