New comment by dmagog in "What happened after 2k people tried to hack my AI assistant"

dmagog — Fri, 26 Jun 2026 04:31:09 +0000

Nice experiment, but I'd temper the optimism. "Zero breaches in 6k attempts" puts a bound on the success rate; it doesn't show the rate is zero. The model is nondeterministic, so a failed jailbreak isn't proof the attack is blocked, only that it didn't fire on that sample. And 6k different prompts isn't 6k tries of the worst one: an attack with a 0.1% success rate will usually show zero hits in a handful of attempts, and that tail is what bites in production.

Also, this is direct user injection, the easy case. The channel people actually lose to is indirect: untrusted content arriving via a tool result or fetched doc, which Fiu never had in the loop.
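To put rough numbers on the sampling point, here's a throwaway Python sketch. It assumes every attempt is an independent trial with a fixed per-attempt success rate p, which real prompt sets aren't. The helper names are just mine for illustration.

```python
# Sketch: treats each attempt as an independent Bernoulli trial with fixed
# per-attempt success rate p (an assumption; real prompt sets are not i.i.d.).

def p_zero_hits(p: float, attempts: int) -> float:
    """Chance an attack with per-attempt success rate p never fires in `attempts` tries."""
    return (1.0 - p) ** attempts


def upper_bound_after_zero(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on p after observing 0 successes in n trials.

    Solves (1 - p)^n = 1 - confidence for p; for large n this is close to the
    "rule of three" approximation 3/n at 95% confidence.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    # A 0.1% attack tried a handful of times usually shows zero hits.
    print(f"P(0 hits | p=0.1%, 5 tries)       = {p_zero_hits(0.001, 5):.4f}")
    # The same single attack tried 6,000 times would very likely show at least one hit.
    print(f"P(0 hits | p=0.1%, 6000 tries)    = {p_zero_hits(0.001, 6000):.4f}")
    # 0/6000 of one attack still only caps its rate; it doesn't make it zero.
    print(f"95% upper bound on p after 0/6000 = {upper_bound_after_zero(6000):.6f}")
    print(f"rule of three, 3/6000             = {3 / 6000:.6f}")
```

Under that i.i.d. assumption, a 0.1% attack has roughly a 99.5% chance of showing zero hits in 5 tries, but would almost certainly fire somewhere in 6,000 tries of the same prompt. That's why 6k different prompts says little about the worst one. And even 0/6000 of a single attack only caps its rate at about 0.05% with 95% confidence.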

Hacker News: dmagog
