<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: xdotli</title><link>https://news.ycombinator.com/user?id=xdotli</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 23 Apr 2026 15:25:44 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=xdotli" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by xdotli in "ClawsBench shows GPT-5.4 tries to reward hack 80% of the time"]]></title><description><![CDATA[
<p>Author here. We built 5 high-fidelity mock Google Workspace + Slack services and ran 7,224 trials across 6 frontier models and 4 agent harnesses.<p>The headline finding that surprised us most: scaffolding (skills + meta prompt) gives a 39-63pp lift, while the top 5 models are statistically indistinguishable (53-63% TSR; no pairwise comparison survives correction). Your choice of scaffolding matters ~6x more than your choice of model.<p>The safety findings are darker: Opus leads on task success (63%) but ties for most unsafe (23% UAR), while GPT-5.4 is the safest (7% UAR) but mid-tier on tasks. There's no capability-safety tradeoff; the two are decoupled.<p>Also, I'm a reviewer for Terminal Bench 3.0, and here's what I've heard from contributors there:<p>> I noticed this when I was building tasks with Harbor: Claude is a good student that generally follows the instructions, but GPT always tries to find a shortcut to cheat, like reversing the binary directly instead of interacting with it.<p>Another friend added a way to address this: <a href="https://x.com/xeophon/status/2041772210562511080?s=20" rel="nofollow">https://x.com/xeophon/status/2041772210562511080?s=20</a>
> Just ask codex to not reward hack
> It literally works. And it works even better when you state which things you consider reward hacking, e.g. wrapping a CLI or something<p>Paper: <a href="https://arxiv.org/abs/2604.05172" rel="nofollow">https://arxiv.org/abs/2604.05172</a>
Traces (7,834 on HF): <a href="https://huggingface.co/datasets/benchflow/ClawsBench" rel="nofollow">https://huggingface.co/datasets/benchflow/ClawsBench</a></p>
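<p>The mitigation quoted above amounts to spelling out, in the agent's system prompt, exactly which behaviors the evaluator counts as reward hacking. A minimal sketch of that idea (the helper name and wording here are illustrative, not from the paper or any of the harnesses we tested):

```python
# Illustrative sketch: append an explicit anti-reward-hacking clause to an
# agent's system prompt, listing concrete behaviors that count as hacking.
# Function name and wording are hypothetical, not from ClawsBench.

def with_no_reward_hacking(system_prompt: str, examples: list[str]) -> str:
    """Return the system prompt with an appended instruction telling the
    agent not to reward hack, enumerating disallowed shortcuts."""
    clause = (
        "Do not reward hack. The following count as reward hacking:\n"
        + "\n".join(f"- {e}" for e in examples)
    )
    return f"{system_prompt}\n\n{clause}"

prompt = with_no_reward_hacking(
    "You are a coding agent. Complete the task in the sandbox.",
    [
        "wrapping the target CLI instead of implementing it",
        "reversing the checker binary instead of interacting with it",
    ],
)
```

In our experience the specificity is what matters: naming the shortcut (e.g. "wrapping a CLI") works better than a generic "be honest" instruction.</p>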
]]></description><pubDate>Wed, 08 Apr 2026 17:34:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47693530</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47693530</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47693530</guid></item><item><title><![CDATA[ClawsBench shows GPT-5.4 tries to reward hack 80% of the time]]></title><description><![CDATA[
<p>Article URL: <a href="https://arxiv.org/abs/2604.05172">https://arxiv.org/abs/2604.05172</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47693529">https://news.ycombinator.com/item?id=47693529</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 08 Apr 2026 17:34:06 +0000</pubDate><link>https://arxiv.org/abs/2604.05172</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47693529</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47693529</guid></item><item><title><![CDATA[New comment by xdotli in "Chaos of Agent"]]></title><description><![CDATA[
<p>A two-week study of autonomous language model agents deployed in a live multi-party environment with persistent memory, email, shell access, and real human interaction — tested by twenty researchers interacting both benignly and adversarially.</p>
]]></description><pubDate>Fri, 13 Mar 2026 05:07:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=47360904</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360904</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360904</guid></item><item><title><![CDATA[Chaos of Agent]]></title><description><![CDATA[
<p>Article URL: <a href="https://agentsofchaos.baulab.info/">https://agentsofchaos.baulab.info/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47360903">https://news.ycombinator.com/item?id=47360903</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 13 Mar 2026 05:07:50 +0000</pubDate><link>https://agentsofchaos.baulab.info/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360903</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360903</guid></item><item><title><![CDATA[New comment by xdotli in "Native CLI scaffolds consistently outperform OpenCode when using the same model"]]></title><description><![CDATA[
<p>Agent scaffold comparison: we additionally evaluate OpenCode, an open-source scaffold that supports multiple model providers. Native CLI scaffolds consistently outperform OpenCode when using the same underlying model. GPT-5.1 Codex Max achieves 20.2% on Codex CLI but only 7.7% on OpenCode. Similarly, Gemini 3 Pro scores 18.3% on Gemini CLI versus 14.9% on OpenCode. The one exception is Claude Opus 4.5, which scores 17.1% on Claude Code and 17.3% on OpenCode: effectively equivalent, and the only case where the open-source scaffold matches or slightly exceeds the native one.</p>
]]></description><pubDate>Fri, 13 Mar 2026 04:17:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47360639</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360639</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360639</guid></item><item><title><![CDATA[Native CLI scaffolds consistently outperform OpenCode when using the same model]]></title><description><![CDATA[
<p>Article URL: <a href="https://arxiv.org/abs/2603.08640">https://arxiv.org/abs/2603.08640</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47360638">https://news.ycombinator.com/item?id=47360638</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 13 Mar 2026 04:17:26 +0000</pubDate><link>https://arxiv.org/abs/2603.08640</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360638</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360638</guid></item><item><title><![CDATA[We compare model quality in Cursor]]></title><description><![CDATA[
<p>Article URL: <a href="https://cursor.com/blog/cursorbench">https://cursor.com/blog/cursorbench</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47360528">https://news.ycombinator.com/item?id=47360528</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 13 Mar 2026 03:57:42 +0000</pubDate><link>https://cursor.com/blog/cursorbench</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47360528</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47360528</guid></item><item><title><![CDATA[Automatically Learning Skills for Coding Agents]]></title><description><![CDATA[
<p>Article URL: <a href="https://gepa-ai.github.io/gepa/blog/2026/02/18/automatically-learning-skills-for-coding-agents/">https://gepa-ai.github.io/gepa/blog/2026/02/18/automatically-learning-skills-for-coding-agents/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47131862">https://news.ycombinator.com/item?id=47131862</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 24 Feb 2026 01:52:26 +0000</pubDate><link>https://gepa-ai.github.io/gepa/blog/2026/02/18/automatically-learning-skills-for-coding-agents/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47131862</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47131862</guid></item><item><title><![CDATA[We Reached 74.8% on terminal-bench with Terminus-KIRA]]></title><description><![CDATA[
<p>Article URL: <a href="https://krafton-ai.github.io/blog/terminus_kira_en/">https://krafton-ai.github.io/blog/terminus_kira_en/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47131828">https://news.ycombinator.com/item?id=47131828</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 24 Feb 2026 01:47:42 +0000</pubDate><link>https://krafton-ai.github.io/blog/terminus_kira_en/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47131828</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47131828</guid></item><item><title><![CDATA[New comment by xdotli in "Self-generated skills don't do much for AI agents, but human-curated skills do"]]></title><description><![CDATA[
<p>Yeah, we didn't give agents access to the internet when creating their domain-knowledge skills.</p>
]]></description><pubDate>Mon, 23 Feb 2026 08:46:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=47119712</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47119712</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47119712</guid></item><item><title><![CDATA[New comment by xdotli in "Self-generated skills don't do much for AI agents, but human-curated skills do"]]></title><description><![CDATA[
<p>The Register wrote about our work on SkillsBench.ai.</p>
]]></description><pubDate>Mon, 23 Feb 2026 08:13:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=47119465</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47119465</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47119465</guid></item><item><title><![CDATA[Self-generated skills don't do much for AI agents, but human-curated skills do]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.theregister.com/2026/02/19/ai_agents_cant_teach_themselves/">https://www.theregister.com/2026/02/19/ai_agents_cant_teach_themselves/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47119464">https://news.ycombinator.com/item?id=47119464</a></p>
<p>Points: 2</p>
<p># Comments: 3</p>
]]></description><pubDate>Mon, 23 Feb 2026 08:13:40 +0000</pubDate><link>https://www.theregister.com/2026/02/19/ai_agents_cant_teach_themselves/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47119464</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47119464</guid></item><item><title><![CDATA[New comment by xdotli in "SkillsBench: Benchmarking how well agent skills work across diverse tasks"]]></title><description><![CDATA[
<p>No worries, it's totally fine! There is indeed work to be done on the feedback-generated skills. Thanks for helping us submit this on Hacker News. And as for
> a lot of Skills on GitHub are just AI-generated without any feedback or deliberative refinement. Many thought those would still be valuable, but you've shown evidence otherwise.
we do find most skills on the internet to be useless. Thanks to the generosity of the <a href="https://skillsmp.com/" rel="nofollow">https://skillsmp.com/</a> author, we were able to get the metadata for all 99k skills indexed on his website. After a lot of filtering and deduping, we found that ~40k+ skills were still relevant at the time we did the study.</p>
]]></description><pubDate>Wed, 18 Feb 2026 07:47:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=47058384</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47058384</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47058384</guid></item><item><title><![CDATA[New comment by xdotli in "First Agent Skills Hackathon by the Authors of SkillsBench"]]></title><description><![CDATA[
<p>20+ Anthropic default Skills, 200k+ community skills on skillsmp. People talk about Skills without knowing how well they work.
We're hosting the largest Agent Skills hackathon at Founders, Inc. (March 7-8), built on our lessons learned from SkillsBench.
No sims. No slides. No flops.</p>
]]></description><pubDate>Wed, 18 Feb 2026 00:26:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=47055429</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47055429</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47055429</guid></item><item><title><![CDATA[First Agent Skills Hackathon by the Authors of SkillsBench]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.skillathon.ai/">https://www.skillathon.ai/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47055428">https://news.ycombinator.com/item?id=47055428</a></p>
<p>Points: 2</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 18 Feb 2026 00:26:38 +0000</pubDate><link>https://www.skillathon.ai/</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47055428</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47055428</guid></item><item><title><![CDATA[New comment by xdotli in "SkillsBench: Benchmarking how well agent skills work across diverse tasks"]]></title><description><![CDATA[
<p>Did you check our repos and sites? The repo is Skills-native. Also, please don't be misled by the original title: we use this configuration to eliminate the impact of LLMs' internal knowledge. It's in the paper.</p>
]]></description><pubDate>Tue, 17 Feb 2026 21:14:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47053442</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47053442</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47053442</guid></item><item><title><![CDATA[New comment by xdotli in "The First Agent Skills Benchmark"]]></title><description><![CDATA[
<p>We collected 86 tasks from 105 domain experts across 11 domains; every task is verifiable, human-created, and comes with verified Skills. SOTA models score ~30% without Skills.<p>We found a few interesting things:
1. Skills substitute for model scale: Haiku 4.5 with Skills (27.7%) beats Opus 4.5 without (22.0%). The right procedural knowledge can be worth more than a bigger model.
2. Skills' improvement is not explained by LLMs' internal knowledge. We ran an ablation in which no Skills are provided to the agent, but the agent is prompted to generate relevant procedural knowledge before solving the task; this isolates the impact of LLMs' latent domain knowledge. The results:
Curated Skills: +16.2pp average improvement across all 7 agent configs.
Self-generated Skills: -1.3pp. Models can't write their own procedural knowledge without trajectory feedback.</p>
]]></description><pubDate>Tue, 17 Feb 2026 20:56:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47053218</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47053218</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47053218</guid></item><item><title><![CDATA[The First Agent Skills Benchmark]]></title><description><![CDATA[
<p>Article URL: <a href="https://huggingface.co/papers/2602.12670">https://huggingface.co/papers/2602.12670</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47053217">https://news.ycombinator.com/item?id=47053217</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 17 Feb 2026 20:56:25 +0000</pubDate><link>https://huggingface.co/papers/2602.12670</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47053217</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47053217</guid></item><item><title><![CDATA[New comment by xdotli in "SkillsBench: Benchmarking how well agent skills work across diverse tasks"]]></title><description><![CDATA[
<p>We didn't create that headline, but yeah, thanks for liking it!</p>
]]></description><pubDate>Tue, 17 Feb 2026 20:17:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=47052705</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47052705</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47052705</guid></item><item><title><![CDATA[New comment by xdotli in "SkillsBench: Benchmarking how well agent skills work across diverse tasks"]]></title><description><![CDATA[
<p>Thanks @dang for moderating! This is indeed not our main finding; it's a sub-conclusion from an ablation we ran to remove the confound of LLMs' internal domain knowledge. Thanks for submitting for us, @mustaphah. Here are a few more details on how we approached this:<p>> I would frame the 'post-trajectory generated skills' as feedback-generated skills, as Letta does: <a href="https://www.letta.com/blog/skill-learning" rel="nofollow">https://www.letta.com/blog/skill-learning</a>. We haven't seen existing research debating whether the Skills improvement might come from the skill prompts themselves activating knowledge latent in the LLM. That's why we added the 'pre-trajectory generated skills' ablation: we had that hypothesis, and this seemed a very clean way to test it. It's also quite logical that feedback-generated skills help, because they almost certainly encode the agent's failure modes on those specific tasks.</p>
]]></description><pubDate>Tue, 17 Feb 2026 20:16:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=47052689</link><dc:creator>xdotli</dc:creator><comments>https://news.ycombinator.com/item?id=47052689</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47052689</guid></item></channel></rss>