Hacker News: jamiejones1

New comment by jamiejones1 in "The Policy Puppetry Attack: Novel bypass for major LLMs"

jamiejones1 — Sat, 26 Apr 2025 00:46:51 +0000

If a company discloses vulnerabilities, they can't also then write that their product can actually help mitigate those vulnerabilities? So, you want them to offer problems without solutions?

I get that ideally the company would point to a slew of solutions from many companies, but this is still good, no?

I mean, it looks like finding vulnerabilities is central to this company's goal, which is why they employ many researchers. I'd imagine they also incorporate mitigations for those vulns into their product. So it's sort of weird to be "against" this. Like, do you just not want companies that sell cybersecurity solutions to also be involved in finding vulnerabilities?

New comment by jamiejones1 in "The Policy Puppetry Attack: Novel bypass for major LLMs"

jamiejones1 — Fri, 25 Apr 2025 21:00:03 +0000

God forbid a company tries to advertise a solution to a real problem!

New comment by jamiejones1 in "The Policy Puppetry Attack: Novel bypass for major LLMs"

jamiejones1 — Fri, 25 Apr 2025 20:57:17 +0000

They're focused on making their models better at answering questions accurately. They still have a long way to go. Until they get to that magical terminal velocity of accuracy and efficiency, they will not have time to focus on security and safety. Security is, as always, an afterthought.

New comment by jamiejones1 in "The Policy Puppetry Attack: Novel bypass for major LLMs"

jamiejones1 — Fri, 25 Apr 2025 20:54:51 +0000

Not really. If HiddenLayer sold its own models for commercial use, then sure, but it doesn't. It only sells security.

So, it's more like a window glass company advertising its windows are unsmashable, and another company comes along and runs a commercial easily smashing those windows (and offers a solution on how to augment those windows to make them unsmashable).

New comment by jamiejones1 in "The Policy Puppetry Attack: Novel bypass for major LLMs"

jamiejones1 — Fri, 25 Apr 2025 20:43:02 +0000

The company's product has its own classification model entirely dedicated to detecting unusual or dangerous model responses, and it will redact or entirely block a response before it gets to the user. That's what their AIDR (AI Detection and Response) for runtime advertises, according to the datasheet I'm looking at on their website. It seems the classification model runs as a proxy that sits between the model and the application, inspecting inputs and outputs and blocking or redacting responses as it sees fit. Filtering the input alone wouldn't always work, because attackers get really creative with their prompts. Regardless of how good your model is at detecting malicious prompts, or how good your guardrails are, there will always be a way for a user to write prompts creatively (and "creatively" is an understatement considering what the researchers did in this case), so redaction at the output is necessary.
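
To make the proxy idea concrete, here is a minimal sketch of output-side filtering. This is not HiddenLayer's API or implementation: every name here (`inspect_output`, `guarded_call`, `Verdict`, the pattern lists) is hypothetical, and a couple of regular expressions stand in for what would really be a trained classification model. It only shows the shape of the thing: the application never sees the raw model response, only what the proxy lets through.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Assumption: regex lists stand in for a real classification model.
# These patterns are illustrative only, not a real policy.
BLOCK_PATTERNS = [re.compile(r"\bdetonator\b", re.IGNORECASE)]
REDACT_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # SSN-like strings


@dataclass
class Verdict:
    action: str  # one of "allow", "redact", "block"
    text: str    # what the application is allowed to see


def inspect_output(text: str) -> Verdict:
    """Classify a model response and block or redact it if needed."""
    # Blocking wins over redaction: drop the whole response.
    if any(p.search(text) for p in BLOCK_PATTERNS):
        return Verdict("block", "[response blocked by policy]")

    # Otherwise mask only the offending spans.
    redacted = text
    for p in REDACT_PATTERNS:
        redacted = p.sub("[REDACTED]", redacted)
    if redacted != text:
        return Verdict("redact", redacted)

    return Verdict("allow", text)


def guarded_call(model: Callable[[str], str], prompt: str) -> Verdict:
    """Hypothetical proxy: forward the prompt, inspect the response on the way back."""
    return inspect_output(model(prompt))
```

Here `model` is any callable that takes a prompt and returns text, so the check runs no matter how creatively the prompt was written. An input-side check could sit in `guarded_call` before the model is invoked, but the output check is the part that still works when the prompt slips past.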

Often, models know how to make bombs because they are LLMs trained on a vast range of data so that they can answer any question a user might have. For specialized or smaller models (MLMs, SLMs), this is not as big of an issue. But with these foundational models, it will always be an issue. Even if they have no training data on bomb-making, if they are trained on physics at all (which is practically a requirement for most general-purpose models), they will still offer answers on bomb-making.