<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: xdotli</title><link>https://news.ycombinator.com/user?id=xdotli</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 23 Apr 2026 15:25:44 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=xdotli" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by xdotli in "ClawsBench shows GPT-5.4 tries to reward hack 80% of the time"]]></title><description><![CDATA[
<p>Author here. We built 5 high-fidelity mock Google Workspace + Slack services and ran 7,224 trials across 6 frontier models and 4 agent harnesses.<p>The headline finding that surprised us most: scaffolding (skills + meta prompt) gives a 39-63pp lift, while the top 5 models are statistically indistinguishable (53-63% TSR; no pairwise comparison survives correction). Your choice of scaffolding matters ~6x more than your choice of model.<p>The safety findings are darker: Opus leads on task success (63%) but ties for most unsafe (23% UAR), while GPT-5.4 is the safest (7% UAR) but mid-tier on tasks. There's no capability-safety tradeoff; the two are decoupled.<p>Also, I'm a reviewer for Terminal Bench 3.0, and here's what I've heard from contributors there:<p>> I noticed this when I was building tasks with Harbor: Claude is a good student that generally follows the instructions, but GPT always tries to find a shortcut to cheat, like reversing the binary directly instead of interacting with it.<p>Another friend added a way to address this: <a href="https://x.com/xeophon/status/2041772210562511080?s=20" rel="nofollow">https://x.com/xeophon/status/2041772210562511080?s=20</a>
> Just ask codex to not reward hack
> It literally works. And it works even better when you state which things you consider reward hacking, e.g. wrapping a CLI or something<p>Paper: <a href="https://arxiv.org/abs/2604.05172" rel="nofollow">https://arxiv.org/abs/2604.05172</a>
Traces (7,834 on HF): <a href="https://huggingface.co/datasets/benchflow/ClawsBench" rel="nofollow">https://huggingface.co/datasets/benchflow/ClawsBench</a></p>
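<p>The mitigation quoted above amounts to spelling out, in the agent's system prompt, exactly which behaviors the evaluator counts as reward hacking. A minimal sketch of that idea (the helper name and wording here are illustrative, not from the paper or any of the harnesses we tested):

```python
# Illustrative sketch: append an explicit anti-reward-hacking clause to an
# agent's system prompt, listing concrete behaviors that count as hacking.
# Function name and wording are hypothetical, not from ClawsBench.

def with_no_reward_hacking(system_prompt: str, examples: list[str]) -> str:
    """Return the system prompt with an appended instruction telling the
    agent not to reward hack, enumerating disallowed shortcuts."""
    clause = (
        "Do not reward hack. The following count as reward hacking:\n"
        + "\n".join(f"- {e}" for e in examples)
    )
    return f"{system_prompt}\n\n{clause}"

prompt = with_no_reward_hacking(
    "You are a coding agent. Complete the task in the sandbox.",
    [
        "wrapping the target CLI instead of implementing it",
        "reversing the checker binary instead of interacting with it",
    ],
)
```

In our experience the specificity is what matters: naming the shortcut (e.g. "wrapping a CLI") works better than a generic "be honest" instruction.</p>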
]]></description><pubDate>Wed, 08 Apr 2026 17:34:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47693530</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47693530</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47693530</guid></item><item><title><![CDATA[ClawsBench shows GPT-5.4 tries to reward hack 80% of the time]]></title><description><![CDATA[
<p>Article URL: <a href="https://arxiv.org/abs/2604.05172">https://arxiv.org/abs/2604.05172</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47693529">https://news.ycombinator.com/item?id=47693529</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 08 Apr 2026 17:34:06 +0000</pubDate><link>https://arxiv.org/abs/2604.05172</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47693529</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47693529</guid></item><item><title><![CDATA[New comment by xdotli in "Chaos of Agent"]]></title><description><![CDATA[
<p>A two-week study of autonomous language model agents deployed in a live multi-party environment with persistent memory, email, shell access, and real human interaction — tested by twenty researchers interacting both benignly and adversarially.</p>
]]></description><pubDate>Fri, 13 Mar 2026 05:07:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=47360904</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360904</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360904</guid></item><item><title><![CDATA[Chaos of Agent]]></title><description><![CDATA[
<p>Article URL: <a href="https://agentsofchaos.baulab.info/">https://agentsofchaos.baulab.info/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47360903">https://news.ycombinator.com/item?id=47360903</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 13 Mar 2026 05:07:50 +0000</pubDate><link>https://agentsofchaos.baulab.info/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360903</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360903</guid></item><item><title><![CDATA[New comment by xdotli in "Native CLI scaffolds consistently outperform OpenCode when using the same model"]]></title><description><![CDATA[
<p>Agent scaffold comparison: we additionally evaluate OpenCode, an open-source scaffold that supports multiple model providers. Native CLI scaffolds consistently outperform OpenCode when using the same underlying model. GPT-5.1 Codex Max achieves 20.2% on Codex CLI but only 7.7% on OpenCode. Similarly, Gemini 3 Pro scores 18.3% on Gemini CLI versus 14.9% on OpenCode. The one exception is Claude Opus 4.5, which scores 17.1% on Claude Code and 17.3% on OpenCode: effectively equivalent, and the only case where the open-source scaffold matches or slightly exceeds the native one.</p>
]]></description><pubDate>Fri, 13 Mar 2026 04:17:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47360639</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360639</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360639</guid></item><item><title><![CDATA[Native CLI scaffolds consistently outperform OpenCode when using the same model]]></title><description><![CDATA[
<p>Article URL: <a href="https://arxiv.org/abs/2603.08640">https://arxiv.org/abs/2603.08640</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47360638">https://news.ycombinator.com/item?id=47360638</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 13 Mar 2026 04:17:26 +0000</pubDate><link>https://arxiv.org/abs/2603.08640</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360638</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360638</guid></item><item><title><![CDATA[We compare model quality in Cursor]]></title><description><![CDATA[
<p>Article URL: <a href="https://cursor.com/blog/cursorbench">https://cursor.com/blog/cursorbench</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47360528">https://news.ycombinator.com/item?id=47360528</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 13 Mar 2026 03:57:42 +0000</pubDate><link>https://cursor.com/blog/cursorbench</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360528</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360528</guid></item><item><title><![CDATA[Automatically Learning Skills for Coding Agents]]></title><description><![CDATA[
<p>Article URL: <a href="https://gepa-ai.github.io/gepa/blog/2026/02/18/automatically-learning-skills-for-coding-agents/">https://gepa-ai.github.io/gepa/blog/2026/02/18/automatically-learning-skills-for-coding-agents/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47131862">https://news.ycombinator.com/item?id=47131862</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 24 Feb 2026 01:52:26 +0000</pubDate><link>https://gepa-ai.github.io/gepa/blog/2026/02/18/automatically-learning-skills-for-coding-agents/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47131862</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47131862</guid></item><item><title><![CDATA[We Reached 74.8% on terminal-bench with Terminus-KIRA]]></title><description><![CDATA[
<p>Article URL: <a href="https://krafton-ai.github.io/blog/terminus_kira_en/">https://krafton-ai.github.io/blog/terminus_kira_en/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47131828">https://news.ycombinator.com/item?id=47131828</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 24 Feb 2026 01:47:42 +0000</pubDate><link>https://krafton-ai.github.io/blog/terminus_kira_en/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47131828</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47131828</guid></item><item><title><![CDATA[New comment by xdotli in "Self-generated skills don't do much for AI agents, but human-curated skills do"]]></title><description><![CDATA[
<p>Yeah, we didn't give agents access to the internet when creating their domain-knowledge skills.</p>
]]></description><pubDate>Mon, 23 Feb 2026 08:46:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=47119712</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47119712</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47119712</guid></item><item><title><![CDATA[New comment by xdotli in "Self-generated skills don't do much for AI agents, but human-curated skills do"]]></title><description><![CDATA[
<p>The Register wrote about our work on SkillsBench.ai.</p>
]]></description><pubDate>Mon, 23 Feb 2026 08:13:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=47119465</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47119465</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47119465</guid></item><item><title><![CDATA[Self-generated skills don't do much for AI agents, but human-curated skills do]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.theregister.com/2026/02/19/ai_agents_cant_teach_themselves/">https://www.theregister.com/2026/02/19/ai_agents_cant_teach_themselves/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47119464">https://news.ycombinator.com/item?id=47119464</a></p>
<p>Points: 2</p>
<p># Comments: 3</p>
]]></description><pubDate>Mon, 23 Feb 2026 08:13:40 +0000</pubDate><link>https://www.theregister.com/2026/02/19/ai_agents_cant_teach_themselves/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47119464</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47119464</guid></item><item><title><![CDATA[New comment by xdotli in "SkillsBench: Benchmarking how well agent skills work across diverse tasks"]]></title><description><![CDATA[
<p>No worries, it's totally fine! There is indeed work to be done on the feedback-generated skills. Thanks for helping us submit this on Hacker News. And as for
> a lot of Skills on GitHub are just AI-generated without any feedback or deliberative refinement. Many thought those would still be valuable, but you've shown evidence otherwise.
we do find most skills on the internet to be useless. Thanks to the generosity of the <a href="https://skillsmp.com/" rel="nofollow">https://skillsmp.com/</a> author, we were able to get the metadata for all 99k skills indexed on his website. After a lot of filtering and deduping, we found that ~40k+ skills were still relevant at the time we did the study.</p>
]]></description><pubDate>Wed, 18 Feb 2026 07:47:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=47058384</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47058384</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47058384</guid></item><item><title><![CDATA[New comment by xdotli in "First Agent Skills Hackathon by the Authors of SkillsBench"]]></title><description><![CDATA[
<p>20+ Anthropic default Skills, 200k+ community skills on skillsmp. People talk about Skills without knowing how well they work.
We're hosting the largest Agent Skills hackathon at Founders, Inc. (March 7-8), built on our lessons learned from SkillsBench.
No sims. No slides. No flops.</p>
]]></description><pubDate>Wed, 18 Feb 2026 00:26:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=47055429</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47055429</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47055429</guid></item><item><title><![CDATA[First Agent Skills Hackathon by the Authors of SkillsBench]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.skillathon.ai/">https://www.skillathon.ai/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47055428">https://news.ycombinator.com/item?id=47055428</a></p>
<p>Points: 2</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 18 Feb 2026 00:26:38 +0000</pubDate><link>https://www.skillathon.ai/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47055428</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47055428</guid></item><item><title><![CDATA[New comment by xdotli in "SkillsBench: Benchmarking how well agent skills work across diverse tasks"]]></title><description><![CDATA[
<p>Did you check our repos and sites? The repo is Skills-native. Also, please don't be misled by the original title: we use this configuration to eliminate the impact of LLMs' internal knowledge. It's in the paper.</p>
]]></description><pubDate>Tue, 17 Feb 2026 21:14:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47053442</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47053442</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47053442</guid></item><item><title><![CDATA[New comment by xdotli in "The First Agent Skills Benchmark"]]></title><description><![CDATA[
<p>We collected 86 tasks from 105 domain experts across 11 domains; every task is verifiable, human-created, and comes with verified Skills. SOTA models score ~30% without Skills.<p>We found a few interesting things:
1. Skills substitute for model scale: Haiku 4.5 with Skills (27.7%) beats Opus 4.5 without (22.0%). The right procedural knowledge can be worth more than a bigger model.
2. Skills' improvement is not explained by LLMs' internal knowledge. We ran an ablation in which no Skills are provided to the agent, but the agent is prompted to generate relevant procedural knowledge before solving the task; this isolates the impact of LLMs' latent domain knowledge. The results:
Curated Skills: +16.2pp average improvement across all 7 agent configs.
Self-generated Skills: -1.3pp. Models can't write their own procedural knowledge without trajectory feedback.</p>
]]></description><pubDate>Tue, 17 Feb 2026 20:56:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47053218</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47053218</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47053218</guid></item><item><title><![CDATA[The First Agent Skills Benchmark]]></title><description><![CDATA[
<p>Article URL: <a href="https://huggingface.co/papers/2602.12670">https://huggingface.co/papers/2602.12670</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47053217">https://news.ycombinator.com/item?id=47053217</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 17 Feb 2026 20:56:25 +0000</pubDate><link>https://huggingface.co/papers/2602.12670</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47053217</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47053217</guid></item><item><title><![CDATA[New comment by xdotli in "SkillsBench: Benchmarking how well agent skills work across diverse tasks"]]></title><description><![CDATA[
<p>We didn't create that headline, but yeah, thanks for liking it!</p>
]]></description><pubDate>Tue, 17 Feb 2026 20:17:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=47052705</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47052705</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47052705</guid></item><item><title><![CDATA[New comment by xdotli in "SkillsBench: Benchmarking how well agent skills work across diverse tasks"]]></title><description><![CDATA[
<p>Thanks @dang for moderating! This is indeed not our main finding; it's a sub-conclusion from an ablation we ran to remove the confound of LLMs' internal domain knowledge. Thanks for submitting for us, @mustaphah. Here are a few more details on how we approached this:<p>> I would frame the 'post-trajectory generated skills' as feedback-generated skills, as Letta does: <a href="https://www.letta.com/blog/skill-learning" rel="nofollow">https://www.letta.com/blog/skill-learning</a>. We haven't seen existing research debating whether the Skills improvement might come from the skill prompts themselves activating knowledge latent in the LLM. That's why we added the 'pre-trajectory generated skills' ablation: we had that hypothesis, and this seemed a very clean way to test it. It's also quite logical that feedback-generated skills help, because they almost certainly encode the agent's failure modes on those specific tasks.</p>
]]></description><pubDate>Tue, 17 Feb 2026 20:16:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=47052689</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47052689</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47052689</guid></item></channel></rss>