3 comments

  • uchibeke 6 hours ago

    Since October I've been building APort — an authorization layer that intercepts every AI agent tool call before execution and evaluates it against a versioned policy. The problem I kept running into: internal tests always passed. My test suite maps the space I imagined, which is exactly what an adversarial input tries to escape.

    So I built this CTF to find the gaps I couldn't find myself.

    A few things we learned before opening it publicly — we spent two weeks breaking it ourselves first:

    • Prompt injection worked better than expected. Not because detection was weak, but because we were matching content, not intent. Reframing "retrieve the restricted file" as "open the user-requested file" shifted the evaluator's judgment. We fixed this by mapping semantic equivalence — every synonym of a blocked operation routes to the same evaluation path.

    • Policy ambiguity was a free pass. Any undefined term in a policy is exploitable. "Don't read sensitive files" left "sensitive" undefined. We moved to explicit default-deny: if the policy doesn't explicitly allow it, it's denied.

    • Multi-step chaining went undetected. Our guardrail evaluated each call independently. A denied macro-action split into ten individually-approved micro-actions passed clean. We only caught it by looking at the full session replay. This is the same composability problem as transaction laundering in fintech — each transaction passes compliance, the composed behavior doesn't.
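    To make the second and third fixes concrete, here's a minimal sketch of default-deny plus session-level accumulation. All the names here (Policy, SessionGuard, the limit shape) are illustrative, not APort's actual API:

```python
# Hypothetical sketch: default-deny per-call policy plus a session-level
# budget that catches a denied macro-action split into micro-actions.
from dataclasses import dataclass, field

@dataclass
class Policy:
    # Explicit allow-list with per-session limits; anything not listed
    # here is denied (default-deny), so no term is left undefined.
    allowed: dict  # action -> max calls per session, e.g. {"read_file": 3}

@dataclass
class SessionGuard:
    policy: Policy
    counts: dict = field(default_factory=dict)

    def evaluate(self, action: str) -> bool:
        if action not in self.policy.allowed:      # default-deny
            return False
        self.counts[action] = self.counts.get(action, 0) + 1
        # Session-level check: ten "approved" micro-actions still trip
        # the aggregate limit, even though each call passes on its own.
        return self.counts[action] <= self.policy.allowed[action]

guard = SessionGuard(Policy(allowed={"read_file": 3}))
results = [guard.evaluate("read_file") for _ in range(5)]
print(results)                       # first three pass, the rest are denied
print(guard.evaluate("delete_db"))   # undefined action -> denied
```

    The point is that the guard carries state across the session instead of evaluating each call in isolation, which is exactly what the chaining exploit relied on.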

    We fixed what we found before launch. Level 5 (full system bypass) hasn't been cracked yet. I'm genuinely uncertain if the architecture has a systemic weakness — that's the point of opening it up.

    Runs on a Hetzner VPS, ~$10/month. Levels 1 and 2 are free, no sign-up. Levels 3-5 pay out $500/$1,000/$5,000.

    Happy to go deep on the policy engine design, the evaluation architecture, or anything about how the levels were constructed.

  • ollybrinkman 5 hours ago

    Interesting approach to security testing. One angle we've been exploring: what if the authentication layer itself was the guardrail?

    With x402, every API call requires a signed payment. No API keys to steal, no credentials to leak. The economic cost of each call is itself a rate limiter and audit trail.
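    To illustrate the general shape (this is a toy, not the actual x402 spec or SDK — real x402 uses on-chain settlement, not a shared HMAC key): each call carries a one-time signed payment envelope instead of a reusable credential, so there's nothing long-lived to steal and replays are rejected.

```python
# Illustrative only: a generic "signed payment per call" check in the
# spirit of x402, not the real protocol. Names and shapes are made up.
import hmac, hashlib, json

def sign_call(payer_key: bytes, method: str, amount_usd: float, nonce: str) -> dict:
    payload = json.dumps({"method": method, "amount": amount_usd, "nonce": nonce},
                         sort_keys=True)
    sig = hmac.new(payer_key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}

def verify_call(payer_key: bytes, envelope: dict, seen_nonces: set) -> bool:
    expected = hmac.new(payer_key, envelope["payload"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["sig"]):
        return False
    nonce = json.loads(envelope["payload"])["nonce"]
    if nonce in seen_nonces:      # a replayed payment is rejected
        return False
    seen_nonces.add(nonce)
    return True

key, seen = b"shared-settlement-key", set()
env = sign_call(key, "tools/read_file", 0.001, nonce="n-1")
print(verify_call(key, env, seen))  # True: fresh, correctly signed
print(verify_call(key, env, seen))  # False: replayed nonce
```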

    Not a replacement for proper guardrails, but it eliminates the credential-based attack surface entirely.

    • uchibeke 5 hours ago

      That's a genuinely useful distinction to draw. x402 solves the "who is authorized to make this call" problem: removes credential theft as an attack vector, adds economic friction. APort is trying to solve a different layer: "what is this call actually doing in the context of everything else in the session."

      The multi-step chaining issue from my post still fires even when every call is authenticated and paid for. Ten individually-approved calls, each costing a fraction of a cent, composing into a full exfiltration: each one passes x402, the composed behavior doesn't.

      The AML analogy maps directly: transaction monitoring doesn't care if each payment was legitimate. It cares whether the pattern of payments looks like structuring. x402 is the per-call check. You still need session-level behavioral evaluation on top.
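      The structuring analogy can be sketched as a rolling-window check (thresholds and window size here are illustrative, not any real AML rule or APort's implementation):

```python
# Illustrative structuring check: each payment is individually fine,
# but many small payments inside one window trip an aggregate threshold.
from collections import deque

def monitor(payments, window=5, per_call_limit=100.0, aggregate_limit=250.0):
    recent = deque(maxlen=window)   # sliding window of recent amounts
    flags = []
    for amount in payments:
        recent.append(amount)
        per_call_ok = amount <= per_call_limit       # the x402-style check
        aggregate_ok = sum(recent) <= aggregate_limit  # the session-level check
        flags.append(per_call_ok and aggregate_ok)
    return flags

# Five $90 payments: every call passes the per-call check,
# but the composed pattern does not.
print(monitor([90.0] * 5))  # [True, True, False, False, False]
```

      Same structure as the tool-call case: the per-call predicate never fires, only the windowed aggregate does.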

      Genuinely curious how x402 handles replay attacks across sessions: is the payment itself the audit trail, or is there preserved session context?