New comment by TerryBenedict in "The Policy Puppetry Attack: Novel bypass for major LLMs"

TerryBenedict — Sat, 26 Apr 2025 01:15:07 +0000

Right, so a filter that sits behind the model and blocks certain undesirable responses. Which you have to assume is something the creators already have, but products built on top of it would want the knobs turned differently. Fair enough.

I'm personally somewhat surprised that things like system prompts get through, as that's literally a known string, not a vague "such and such are taboo concepts". I also don't see much harm in it, but given _that_ you want to block it, do you really need a whole other network for that?
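To make the "known string" point concrete, here is a minimal sketch of a non-neural output check, assuming the deployer has the exact system prompt on hand. The names `normalize`, `leaks_system_prompt`, the `LEET` table, and the 20-character window are all hypothetical choices for illustration, not anything a vendor is known to ship. It catches verbatim leaks and simple leetspeak or punctuation obfuscation only; paraphrases, translations, and other encodings would pass straight through.

```python
import re

# Hypothetical leetspeak digit substitutions; a real attacker could use many more.
LEET = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "7": "t"})


def normalize(text: str) -> str:
    """Lowercase, map common leetspeak digits to letters, drop everything non-alphanumeric."""
    return re.sub(r"[^a-z0-9]", "", text.lower().translate(LEET))


def leaks_system_prompt(response: str, system_prompt: str, window: int = 20) -> bool:
    """Return True if any `window`-character run of the normalized system prompt
    appears in the normalized response (assumed heuristic, not a guarantee)."""
    needle = normalize(system_prompt)
    haystack = normalize(response)
    if not needle:
        return False
    if len(needle) <= window:
        return needle in haystack
    # Slide a fixed-size window over the prompt so partial leaks are caught too.
    return any(
        needle[i:i + window] in haystack
        for i in range(len(needle) - window + 1)
    )
```

Both sides go through the same normalization, so digits that legitimately occur in the prompt are mangled consistently and still match.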

FWIW by "input" I was referring to what the other commenter mentioned: it's almost certainly explicitly present in the training set. Maybe that's why "leetspeak" works -- because that's how the original authors got it past the filters of reddit, forums, etc?

If the model can really work out how to make a bomb from first principles, then these models are way more capable than I thought. And, come to think of it, probably also clever enough to encode the message so that it gets through...

New comment by TerryBenedict in "The Policy Puppetry Attack: Novel bypass for major LLMs"

TerryBenedict — Fri, 25 Apr 2025 17:50:58 +0000

And how exactly does this company's product prevent such heinous attacks? A few extra guardrail prompts that the model creators hadn't thought of?

Anyway, how does the AI know how to make a bomb to begin with? Is it really smart enough to synthesize that out of knowledge from physics and chemistry texts? If so, that seems the bigger deal to me. And if not, then why not filter the input?

Hacker News: TerryBenedict
