Hacker News: veganmosfet

New comment by veganmosfet in "What happened after 2k people tried to hack my AI assistant"

veganmosfet — Fri, 26 Jun 2026 14:27:46 +0000

Thanks! I tried to submit the posts but for some reason my submissions are not published in HN any more. I tried to reach out to HN admins but no response so far.

New comment by veganmosfet in "What happened after 2k people tried to hack my AI assistant"

veganmosfet — Fri, 26 Jun 2026 06:41:06 +0000

It would be nice to publish the exact setup used (workspace dump, OpenClaw version, ...) to be able to reproduce and try out more payloads.

In general I have mixed feelings about this result: sure, opus4.6 is excellent at following user intent and recognise potential prompt injection attempts. But: Is the "security" prompt used realistic for a generic use-case (processing of emails)? I guess not.

In my experiments - without this specific prompt - I was able to derail the user intent to make opus4.8 download and execute a malicious script [0] just by asking "Summarize my new emails".

[0] https://itmeetsot.eu/posts/2026-06-04-openclaw_opus48/

New comment by veganmosfet in "Prompt Injection as Role Confusion"

veganmosfet — Tue, 23 Jun 2026 06:34:21 +0000

Very interesting research. I would be interested to know how closed source AI labs implement the role thing in their inference. Is it still only a separation token? Frontier closed source LLMs are quite good at flagging any spoofing attempt from tool call results.

However, in some prompt injection experiments [0], I found it's possible to "derail" the user intent only with tool call results, here are some tricks:

* Frame the injection as a challenge. * Always use "soft" instructions ("You may", "Try to", ...). Hard instructions are almost always flagged. * Force the model to do multiple tool calls. * Bloat the context. * In the injection payload, better use LLM output (which correlates somehow with this research). I like using LLM generated poems but that's probably irrelevant. * Use multiple encoding steps to force the model to use tools, but this may be detected by the external guardrails (Anthropic does this in my experience). * Hide malicious code payload from the model context. * Last but not least, understand the agent harness used and its weaknesses (e.g., in OpenClaw, they injected emails as user message - not tool call results [1]).

[0] https://itmeetsot.eu/posts/2026-06-14-yolo_harness/ [1] https://itmeetsot.eu/posts/2026-02-02-openclaw_mail_rce/

New comment by veganmosfet in "Ramp's Sheets AI Exfiltrates Financials"

veganmosfet — Thu, 30 Apr 2026 04:34:25 +0000

I think a better comparison is humans versus LLMs - not computer programs. However, most of the non-technical 'countermeasures' used for humans (contracts, laws,...) do not work for LLMs because they are not accountable.

New comment by veganmosfet in "Show HN: BrokenClaw Part 5: GPT-5.4 Edition (Prompt Injection)"

veganmosfet — Fri, 10 Apr 2026 03:59:10 +0000

Thanks!

I did (not extensively) tried hackmyclaw but no success. The challenge is a complete black box and the user intent (e.g., "summarize my emails") is not known - this is critical for the prompt injection payload. I also suspect that batch processing of "malicious" emails (every 3 hours) adds a bias to the model behaviour (a lot of potential and detected prompt injection payloads are injected in context). That's why I always start my experiments with a fresh context. Moreover, "hacking" the VPS is not allowed.

Imho the author shall disclose more info about the setup (version, user intent, exact config) to make it more realistic. I read people saying "OpenClaw is secure against prompt injection" because nobody was able to solve the challenge - it's not.

Show HN: BrokenClaw Part 5: GPT-5.4 Edition (Prompt Injection)

veganmosfet — Thu, 09 Apr 2026 16:32:25 +0000

Some prompt injection experiments with OpenClaw and GPT-5.4. Last part of the BrokenClaw series.

Comments URL: https://news.ycombinator.com/item?id=47705780

Points: 10

# Comments: 2

New comment by veganmosfet in "OpenClaw privilege escalation vulnerability"

veganmosfet — Sun, 05 Apr 2026 07:20:47 +0000

Good question.

CaMeL is imho safer, but hard to implement into modern agents like OpenClaw. Its core idea is that a privileged LLM plans from the (trusted) user request only, while a restricted interpreter executes that plan (and enforces policies). Untrusted content is parsed separately and is not fed back into the privileged LLM.

Modern agents are useful exactly because they run a feedback loop (observe, reason, adapt, use tools, repeat). CaMeL breaks that loop, which improves security but makes it a poor fit for highly general agents like OpenClaw.

New comment by veganmosfet in "OpenClaw privilege escalation vulnerability"

veganmosfet — Sat, 04 Apr 2026 09:48:40 +0000

You're welcome!

My main takeaway message is: models (even opus4.6) do not follow security "instructions" reliably. In OpenClaw, they added security warnings, tags, random IDs... None of these countermeasures work reliably. Even sandboxing can be escaped (not in the classical sense using vulnerabilities, but using multi-layered prompt injection payload with natural language only)[0]. As soon as untrusted content is injected in the context, do not trust any actions downstream.

[0] https://itmeetsot.eu/posts/2026-02-15-openclaw_sandbox/

New comment by veganmosfet in "OpenClaw privilege escalation vulnerability"

veganmosfet — Fri, 03 Apr 2026 20:24:39 +0000

I am experimenting prompt injection on OpenClaw [0][1], quite exciting.

[0] https://itmeetsot.eu/posts/2026-03-27-openclaw_webfetch/

[1] https://itmeetsot.eu/posts/2026-03-03-openclaw3/

Show HN: Prompt Injection Experiments in OpenClaw with Opus4.6

veganmosfet — Sun, 29 Mar 2026 08:13:42 +0000

Article URL: https://veganmosfet.codeberg.page/posts/2026-03-27-openclaw_webfetch/

Comments URL: https://news.ycombinator.com/item?id=47561273

Points: 2

# Comments: 0

New comment by veganmosfet in ""Disregard That" Attacks"

veganmosfet — Thu, 26 Mar 2026 05:55:37 +0000

This! And even more, the role model extends beyond system and user: system > user > tool > assistant. This reflects "authority" and is one of the best "countermeasure": never inject untrusted content in "user" messages, always use "tool".

New comment by veganmosfet in "BrokenClaw Part 3: Remote Code Execution in OpenClaw via Email Again"

veganmosfet — Wed, 04 Mar 2026 18:15:10 +0000

Third part of the BrokenClaw saga! This time, Remote Code Execution via email using gogcli as email tool, no hook. Based only on the email content from a tool, the model (opus4.6) ignores all warnings, pipes a python script to the python3 interpreter and says: "I did not execute it".

BrokenClaw Part 3: Remote Code Execution in OpenClaw via Email Again

veganmosfet — Wed, 04 Mar 2026 18:15:10 +0000

Article URL: https://veganmosfet.codeberg.page/posts/2026-03-03-openclaw3/

Comments URL: https://news.ycombinator.com/item?id=47251524

Points: 1

# Comments: 1

New comment by veganmosfet in "BrokenClaw – RCE in OpenClaw via Gmail Hook"

veganmosfet — Tue, 24 Feb 2026 10:13:40 +0000

I experimented with OpenClaw (using opus4.6 and gpt5.2) and found this interesting way to get silent Remote Code Execution via email when using Gmail pub/sub Hook, exploiting prompt injection (out of scope from the security policy of OpenClaw) and insecure plugin design (properly documented as such). Works only with the full Gmail pub/sub hook. If your agent uses gogcli without the webhook, it is not affected.

Main issue: OpenClaw injects untrusted content in user messages instead of using the tool channel (less authoritative) when using the Gmail webhook.

Original links:

https://veganmosfet.codeberg.page/posts/2026-02-02-openclaw_...

https://veganmosfet.codeberg.page/posts/2026-02-15-openclaw_...

BrokenClaw – RCE in OpenClaw via Gmail Hook

veganmosfet — Tue, 24 Feb 2026 10:13:40 +0000

Article URL: https://veganmosfet.codeberg.page/posts/2026-02-02-openclaw_mail_rce/

Comments URL: https://news.ycombinator.com/item?id=47135210

Points: 2

# Comments: 1

New comment by veganmosfet in "HackMyClaw"

veganmosfet — Wed, 18 Feb 2026 08:08:19 +0000

Nice idea! But OpenClaw is not stateless - it learns it's under attack / plays a CTF and gets overparanoid (and opus 4.6 is already paranoid). It seems now it summarizes all emails with "Thread contains 1 me" (a new personality disorder for llm?). Imho it's not a realistic scenario. Better would be to reset the agent (context / md files) between each email to draw conclusions (slow). I was able to prompt inject OpenClaw (2026.2.14) with opus4.6 using gmail pub/sub automation. The issue: OpenClaw injects untrusted content in user channel (message role), it's possible to confuse the model. Better would be to use tool.

New comment by veganmosfet in "0-Click Remote Code Execution in OpenClaw with GPT5.2 via Gmail Hook"

veganmosfet — Sun, 15 Feb 2026 13:27:01 +0000

Second part of the saga, now escaping sandbox with multi-layered prompt injection:

BrokenClaws: Escape the Sub-Agent Sandbox with Indirect Prompt Injection in OpenClaw (via Gmail Hook, 0-Click RCE)

https://veganmosfet.github.io/2026/02/15/openclaw_sandbox.ht...

New comment by veganmosfet in "OpenClaw is changing my life"

veganmosfet — Sun, 08 Feb 2026 16:31:15 +0000

The issue with filtering tags: LLM still react to tags with typos or otherwise small changes. It makes sanitization an impossible problem (!= standard programs). Agree with policies, good idea.

New comment by veganmosfet in "OpenClaw is changing my life"

veganmosfet — Sun, 08 Feb 2026 16:28:31 +0000

There is no silver bullet, but my point is: it's possible to lower the risk. Try out by yourself with a frontier model and an otherwise 'secure' system: the "ignore previous instructions" and co. are not working any more. This is getting quite difficult to confuse a model (and I am the last person to say prompt injection is a solved problem, see my blog).

New comment by veganmosfet in "OpenClaw is changing my life"

veganmosfet — Sun, 08 Feb 2026 16:12:07 +0000

Agree for a general AI assistant, which has the same permissions and access as the assisted human => Disaster. I experimented with OpenClaw and it has a lot of issues. The best: prompt injection attacks are "out of scope" from the security policy == user's problem. However, I found the latest models to have much better safety and instruction following capabilities. Combined with other security best practices, this lowers the risk.