<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: alex_metacraft</title><link>https://news.ycombinator.com/user?id=alex_metacraft</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 29 May 2026 19:12:10 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=alex_metacraft" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by alex_metacraft in "AGENTS.md outperforms skills in our agent evals"]]></title><description><![CDATA[
<p>Good catch on the numbers. 29/33 vs 33/33 is the kind of gap that could easily be noise with that sample size. You'd need hundreds of runs to draw any meaningful conclusion about a 4-task difference (roughly 12 percentage points), especially given how non-deterministic these models are.<p>This is a recurring problem with LLM benchmarking: small sample sizes presented with high confidence. The underlying finding (always-in-context > lazy-loaded) is probably directionally correct, but the specific numbers don't support the strength of the claims in the article.</p>
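<p>For a rough sense of how weak the evidence is, here is a minimal sketch of a two-sided Fisher exact test on the two pass counts, using only the Python standard library. The choice of test and the function name are my own assumptions, not something taken from the article.</p>

```python
from math import comb


def fisher_exact_two_sided(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided Fisher exact test p-value for two pass/fail counts.

    Hypothetical helper for illustration; it sums the probability of every
    2x2 table (with the same margins) that is at most as likely as the
    observed one.
    """
    fails_total = (n_a - pass_a) + (n_b - pass_b)
    total = n_a + n_b

    def prob(fails_a: int) -> float:
        # Hypergeometric probability that group A gets exactly `fails_a` failures.
        return (
            comb(fails_total, fails_a)
            * comb(total - fails_total, n_a - fails_a)
            / comb(total, n_a)
        )

    observed = prob(n_a - pass_a)
    lowest = max(0, fails_total - n_b)
    highest = min(fails_total, n_a)
    probs = [prob(k) for k in range(lowest, highest + 1)]
    # Small relative tolerance so ties with the observed table are included.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


p_value = fisher_exact_two_sided(29, 33, 33, 33)
print(round(p_value, 3))  # 0.114
```

<p>With these counts the p-value is about 0.11, which is above the conventional 0.05 threshold, so 33 runs per arm cannot reliably separate the two setups.</p>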
]]></description><pubDate>Wed, 11 Feb 2026 04:39:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=46970901</link><dc:creator>alex_metacraft</dc:creator><comments>https://news.ycombinator.com/item?id=46970901</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46970901</guid></item><item><title><![CDATA[New comment by alex_metacraft in "Show HN: Askill – A package manager for AI agent skills with AI safety scoring"]]></title><description><![CDATA[
<p>Hey HN. I've been working on askill, a CLI package manager for agent skills (SKILL.md files used by Claude Code, Codex, Cursor, etc.).<p>There are already several skill directories and installers out there (skills.sh, skillregistry.io, and others). I saw the Show HN for skills.sh a few weeks ago and noticed comments asking for version management, proper uninstalls, and more transparency around what gets installed. Those are exactly the problems I'd been working on, so I figured it was worth sharing.<p>What askill does differently:<p>1. AI safety scoring. Every skill indexed on askill.sh gets an automated review across five dimensions: safety, clarity, completeness, actionability, and reusability. The full breakdown is visible before you install. This was motivated by a simple concern: a SKILL.md tells your agent what to do, what commands to run, and how to behave. Trusting random files from GitHub without any review felt like the early days of npm before anyone thought about supply chain security.<p>2. Real package management. askill publish lets authors release versioned skills with semver. askill add @scope/name@^1.0 resolves versions. askill update and askill remove do what you'd expect. Skills can declare dependencies on other skills. None of the existing tools I've seen handle versioning or dependency resolution.<p>3. Precise installs. askill add @scope/name installs one skill. Most alternatives operate at the repo level: if a repo has 12 skills and you only want 1, you still get all 12. askill also lets you install from GitHub directly (askill add gh:owner/repo@skill-name) if the skill hasn't been published.<p>4. Cross-agent symlinks. Skills are written to .agents/skills/ (canonical location) and symlinked into each agent's expected directory (.claude/skills/, .codex/skills/, .cursor/skills/, etc.). One install, all agents see it. This also means removal is clean: delete the canonical copy and all symlinks go away.<p>5. Open indexing. An automated crawler finds SKILL.md files across public GitHub repos and indexes them. Authors can also run askill submit &lt;github-url&gt; to trigger indexing of a specific repo. No manual curation.<p>The AI scoring pipeline runs hourly and re-evaluates a skill whenever its source SKILL.md content changes. The scoring is done by an LLM with 11 heuristic rules as guardrails (detecting auto-generated content, internal config paths, hardcoded secrets, etc.). I'm under no illusions that LLM-based review is perfect, but it's a starting point and better than nothing.<p>The CLI is open source (MIT): <a href="https://github.com/avibe-bot/askill" rel="nofollow">https://github.com/avibe-bot/askill</a><p>Browse indexed skills: <a href="https://askill.sh" rel="nofollow">https://askill.sh</a><p>Happy to answer questions about the architecture, the scoring system, or anything else.</p>
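<p>To make the cross-agent symlink layout from point 4 concrete, here is a minimal Python sketch of the idea. It is not askill's actual implementation: the function names, the fixed list of agent directories, and the use of absolute link targets are all assumptions made for illustration. Only the directory paths come from the description above.</p>

```python
import shutil
from pathlib import Path

# Directory names taken from the post; treating them as a fixed list is an assumption.
CANONICAL_DIR = Path(".agents/skills")
AGENT_DIRS = [Path(".claude/skills"), Path(".codex/skills"), Path(".cursor/skills")]


def install_skill(project: Path, name: str, skill_md: str) -> Path:
    """Write one canonical copy of a skill and symlink it into each agent directory."""
    canonical = project / CANONICAL_DIR / name
    canonical.mkdir(parents=True, exist_ok=True)
    (canonical / "SKILL.md").write_text(skill_md, encoding="utf-8")

    for agent_dir in AGENT_DIRS:
        link = project / agent_dir / name
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier install
        # Raises FileExistsError if a real directory already occupies the path.
        link.symlink_to(canonical.resolve(), target_is_directory=True)
    return canonical


def remove_skill(project: Path, name: str) -> None:
    """Remove the per-agent symlinks, then the canonical copy."""
    for agent_dir in AGENT_DIRS:
        link = project / agent_dir / name
        if link.is_symlink():
            link.unlink()
    shutil.rmtree(project / CANONICAL_DIR / name, ignore_errors=True)
```

<p>The sketch unlinks each symlink explicitly because, on POSIX filesystems, a symlink whose target is deleted is left dangling rather than removed automatically.</p>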
]]></description><pubDate>Wed, 11 Feb 2026 04:05:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=46970692</link><dc:creator>alex_metacraft</dc:creator><comments>https://news.ycombinator.com/item?id=46970692</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46970692</guid></item><item><title><![CDATA[Show HN: Askill – A package manager for AI agent skills with AI safety scoring]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/avibe-bot/askill">https://github.com/avibe-bot/askill</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46970689">https://news.ycombinator.com/item?id=46970689</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 11 Feb 2026 04:04:50 +0000</pubDate><link>https://github.com/avibe-bot/askill</link><dc:creator>alex_metacraft</dc:creator><comments>https://news.ycombinator.com/item?id=46970689</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46970689</guid></item><item><title><![CDATA[New comment by alex_metacraft in "AGENTS.md outperforms skills in our agent evals"]]></title><description><![CDATA[
<p>I think this experiment has a fundamental flaw in its comparison setup.<p>What they're comparing is: (A) a skill with a short description in the frontmatter, which the agent may or may not decide to invoke, vs. (B) a massive compressed index of documentation paths dumped directly into AGENTS.md, which is always in context.<p>This isn't really "AGENTS.md vs skills." It's "always-in-context with high token count vs. lazy-loaded with a decision point." Of course the always-in-context version wins: you're giving the model way more information upfront. The agent literally can't miss it. That's not a surprising finding; it's almost tautological.<p>The more interesting question they don't address: what did their skill descriptions actually look like? In my experience, the quality of the frontmatter description is the single biggest factor in whether a skill gets invoked. A vague "Documentation lookup skill" will get ignored. A specific "Use this when the user asks about API endpoints, authentication, rate limits, or SDK usage for the Vercel platform" will get picked up reliably.<p>If you wrote equally detailed compressed pointers in AGENTS.md and equally detailed descriptions in skill frontmatter, the gap would likely be much smaller. The real takeaway isn't "skills are worse." It's "if you don't invest effort in writing good skill descriptions, the agent won't know when to use them."</p>
]]></description><pubDate>Tue, 10 Feb 2026 13:26:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=46959424</link><dc:creator>alex_metacraft</dc:creator><comments>https://news.ycombinator.com/item?id=46959424</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46959424</guid></item><item><title><![CDATA[New comment by alex_metacraft in "AGENTS.md outperforms skills in our agent evals"]]></title><description><![CDATA[
<p>This is a really interesting finding. It makes sense when you think about what the training data looks like: first-person statements in a system prompt pattern-match to "internal monologue" or "chain of thought" examples, which the model has been heavily trained to follow through on. Second-person commands pattern-match to user instructions, which the model has also been trained to sometimes push back on or reinterpret.<p>There's probably a related effect with imperative vs. declarative framing in skills too. "When the user asks about X, do Y" seems to work worse than "This project uses Y for X" in my experience. The declarative version reads like a fact about the world rather than a command to obey, and models seem to treat facts as more reliable context.<p>Would be curious if someone has tested this systematically across different models. The optimal framing might vary quite a bit between Claude, Gemini, and GPT.</p>
]]></description><pubDate>Tue, 10 Feb 2026 13:16:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46959318</link><dc:creator>alex_metacraft</dc:creator><comments>https://news.ycombinator.com/item?id=46959318</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46959318</guid></item></channel></rss>