<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: fdefitte</title><link>https://news.ycombinator.com/user?id=fdefitte</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 10 Jun 2026 05:37:52 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=fdefitte" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by fdefitte in "I'm helping my dog vibe code games"]]></title><description><![CDATA[
<p>The dog ships faster because it has zero opinions about the architecture.</p>
]]></description><pubDate>Wed, 25 Feb 2026 01:15:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=47146015</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47146015</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47146015</guid></item><item><title><![CDATA[Show HN: Cobalt – Unit tests for AI agents, like Jest but for LLMs]]></title><description><![CDATA[
<p>Hey HN, I built Cobalt, an open-source testing framework for AI agents and LLM apps.<p>Most eval tools (Braintrust, Arize, LangSmith) want you to live in their UI. Dashboards, manual reviews, clicking through results. That's fine for exploration, but it doesn't catch regressions. We needed something that runs in CI like any other test suite, lives in code, and fails the build when quality drops.<p><pre><code>  npm install @basalt-ai/cobalt
  npx cobalt init
  npx cobalt run
</code></pre>
Write experiments as code:<p><pre><code>  import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

  const dataset = Dataset.fromLangfuse('support-tickets')

  experiment('support-agent', dataset, async ({ item }) => {
    const result = await myAgent(item.input)
    return { output: result }
  }, {
    evaluators: [
      new Evaluator({ name: 'Helpful', type: 'llm-judge', prompt: 'Is this response helpful and accurate? {{output}}' }),
      new Evaluator({ name: 'No hallucination', type: 'llm-judge', prompt: 'Does this contain fabricated info? {{output}}' }),
    ]
  })
</code></pre>
`npx cobalt run --ci` exits with code 1 if thresholds are violated. The GitHub Action posts score tables on PRs and auto-compares against base branch.<p>The part I'm most excited about: Cobalt ships with a built-in MCP server, so you can drive it entirely from Claude Code. Just tell it "compare GPT 5.2 with 5.1 on my support agent" or "run my experiments, find the failing cases, and fix the prompt." It runs the experiments, diffs the results, and iterates on your code without you leaving the terminal. Turns eval from a chore into a conversation.<p>Pull datasets from Langfuse, LangSmith, Braintrust, or plain JSON/JSONL/CSV. Results stored locally in SQLite. No accounts, no dashboards, no vendor lock-in.</p>
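<p>To make the CI gate idea concrete, here is a rough sketch of what "exit with code 1 if thresholds are violated" could look like. This is an illustration only, not Cobalt's actual code: the <code>Scores</code> type and the <code>findViolations</code> and <code>exitCodeFor</code> helpers are hypothetical names, and it assumes each evaluator produces an average score between 0 and 1.</p>

```typescript
// Hypothetical sketch: these names are illustrative, not part of Cobalt's API.
// Maps an evaluator name to a score (assumed here to be an average in 0..1).
type Scores = { [evaluator: string]: number };

// Return the names of evaluators whose score falls below its minimum.
function findViolations(scores: Scores, thresholds: Scores): string[] {
  return Object.keys(thresholds).filter((name) => {
    const score: number | undefined = scores[name];
    const minimum: number | undefined = thresholds[name];
    // A missing score counts as a violation, so a skipped evaluator cannot pass silently.
    if (score === undefined || minimum === undefined) return true;
    // Negating >= also treats NaN as a violation.
    return !(score >= minimum);
  });
}

// Map violations to a process exit code: 0 passes the build, 1 fails it.
function exitCodeFor(scores: Scores, thresholds: Scores): number {
  return findViolations(scores, thresholds).length === 0 ? 0 : 1;
}

// Example: "No hallucination" scores 0.6 against a minimum of 0.7, so the gate fails.
const exampleCode = exitCodeFor(
  { Helpful: 0.9, "No hallucination": 0.6 },
  { Helpful: 0.8, "No hallucination": 0.7 },
);
```

<p>In a CI script the resulting code would be handed to <code>process.exit</code>, which is what makes the build fail when quality drops.</p>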
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47091182">https://news.ycombinator.com/item?id=47091182</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 20 Feb 2026 17:43:05 +0000</pubDate><link>https://github.com/basalt-ai/cobalt</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47091182</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47091182</guid></item><item><title><![CDATA[New comment by fdefitte in "Micropayments as a reality check for news sites"]]></title><description><![CDATA[
<p>Good point</p>
]]></description><pubDate>Fri, 20 Feb 2026 05:05:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=47083995</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47083995</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47083995</guid></item><item><title><![CDATA[New comment by fdefitte in "BarraCUDA Open-source CUDA compiler targeting AMD GPUs"]]></title><description><![CDATA[
<p>Agreed on ZLUDA being the practical choice. This project is more impressive as a "build a GPU compiler from scratch" exercise than as something you'd actually use for ML workloads. The custom instruction encoding without LLVM is genuinely cool though, even if the C subset limitation makes it a non-starter for most real CUDA codebases.</p>
]]></description><pubDate>Wed, 18 Feb 2026 00:28:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47055442</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47055442</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47055442</guid></item><item><title><![CDATA[New comment by fdefitte in "Is Show HN dead? No, but it's drowning"]]></title><description><![CDATA[
<p>The filter used to be effort. You had to care enough to spend weeks on something, which meant you probably understood the problem deeply. Now that filter is gone and we get a flood of "I prompted this in 20 minutes" posts where the author can't answer a single follow-up about their own code. The interesting Show HNs still exist, they're just buried under noise.</p>
]]></description><pubDate>Wed, 18 Feb 2026 00:27:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=47055440</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47055440</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47055440</guid></item><item><title><![CDATA[New comment by fdefitte in "Claude Sonnet 4.6"]]></title><description><![CDATA[
<p>The 8% one-shot number is honestly better than I expected for a model this capable. The real question is what sits around the model. If you're running agents in production you need monitoring and kill switches anyway; the model being "safe enough" is necessary but never sufficient. Nobody should be deploying computer-use agents without observability around what they're actually doing.</p>
]]></description><pubDate>Wed, 18 Feb 2026 00:27:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47055434</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47055434</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47055434</guid></item><item><title><![CDATA[New comment by fdefitte in "Qwen3.5: Towards Native Multimodal Agents"]]></title><description><![CDATA[
<p>The "native multimodal agents" framing is interesting. Everyone's focused on benchmark numbers but the real question is whether these models can actually hold context across multi-step tool use without losing the plot. That's where most open models still fall apart imo.</p>
]]></description><pubDate>Mon, 16 Feb 2026 20:15:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47039728</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47039728</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47039728</guid></item><item><title><![CDATA[New comment by fdefitte in ""Token anxiety", a slot machine by any other name"]]></title><description><![CDATA[
<p>That 95% payout only works if you already know what good looks like. The sketchy part is when you can't tell the diff between correct and almost-correct. That's where stuff goes sideways.</p>
]]></description><pubDate>Mon, 16 Feb 2026 20:12:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=47039694</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47039694</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47039694</guid></item><item><title><![CDATA[New comment by fdefitte in "WebMCP Proposal"]]></title><description><![CDATA[
<p>Skills are great for static stuff but they kinda fall apart when the agent needs to interact with live state. WebMCP actually fills a real gap there imo.</p>
]]></description><pubDate>Mon, 16 Feb 2026 20:11:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=47039691</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47039691</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47039691</guid></item><item><title><![CDATA[New comment by fdefitte in "Anthropic tries to hide Claude's AI actions. Devs hate it"]]></title><description><![CDATA[
<p>Agent teams working autonomously sounds cool until you actually try it. We've been running multi-agent setups and honestly the failure modes are hilarious. They don't crash, they just quietly do the wrong thing and act super confident about it.</p>
]]></description><pubDate>Mon, 16 Feb 2026 20:11:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47039686</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=47039686</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47039686</guid></item><item><title><![CDATA[New comment by fdefitte in "Show HN: We built Cobalt, Open source unit testing for AI Agents"]]></title><description><![CDATA[
<p>Hi everyone! Super happy to release this package. We feel that evals belong in CI, like unit tests, and should be easy to set up and run automatically. Can't wait to get your feedback!</p>
]]></description><pubDate>Thu, 12 Feb 2026 22:16:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=46996047</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=46996047</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46996047</guid></item><item><title><![CDATA[Show HN: We built Cobalt, Open source unit testing for AI Agents]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/basalt-ai/cobalt">https://github.com/basalt-ai/cobalt</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46995995">https://news.ycombinator.com/item?id=46995995</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Thu, 12 Feb 2026 22:11:55 +0000</pubDate><link>https://github.com/basalt-ai/cobalt</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=46995995</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46995995</guid></item><item><title><![CDATA[New comment by fdefitte in "AI Evaluation Methods by Use Case"]]></title><description><![CDATA[
<p>Free guide, enjoy! Made with love by the Basalt team.</p>
]]></description><pubDate>Fri, 13 Jun 2025 07:04:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=44266342</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=44266342</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44266342</guid></item><item><title><![CDATA[AI Evaluation Methods by Use Case]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.notion.so/hexacc/AI-Evaluation-Methods-by-Use-Case-2041cc0bd5bc80558ba6fd032f297891?source=copy_link">https://www.notion.so/hexacc/AI-Evaluation-Methods-by-Use-Case-2041cc0bd5bc80558ba6fd032f297891?source=copy_link</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44266341">https://news.ycombinator.com/item?id=44266341</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 13 Jun 2025 07:04:45 +0000</pubDate><link>https://www.notion.so/hexacc/AI-Evaluation-Methods-by-Use-Case-2041cc0bd5bc80558ba6fd032f297891?source=copy_link</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=44266341</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44266341</guid></item><item><title><![CDATA[Show HN: Free Prompt Grading Tool]]></title><description><![CDATA[
<p>A small tool that gives you a score and recommendations instantly. What do you think?</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44064089">https://news.ycombinator.com/item?id=44064089</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Thu, 22 May 2025 17:03:02 +0000</pubDate><link>https://www.getbasalt.ai/grade-my-prompt</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=44064089</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44064089</guid></item><item><title><![CDATA[New comment by fdefitte in "Letting the AIs Judge Themselves: A One Creative Prompt: The Coffee-Ground Test"]]></title><description><![CDATA[
<p>Love it. AI is actually much better at judging the quality of content than it is at producing it. Kind of like humans, actually :)</p>
]]></description><pubDate>Mon, 19 May 2025 01:41:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=44025844</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=44025844</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44025844</guid></item><item><title><![CDATA[AI Hedge Fund]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/virattt/ai-hedge-fund">https://github.com/virattt/ai-hedge-fund</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44025172">https://news.ycombinator.com/item?id=44025172</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Sun, 18 May 2025 23:44:48 +0000</pubDate><link>https://github.com/virattt/ai-hedge-fund</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=44025172</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44025172</guid></item><item><title><![CDATA[New comment by fdefitte in "France Endorses UN Open Source Principles"]]></title><description><![CDATA[
<p>This makes total sense. When a country is creating public software, it should be open source by default. This is the only way to create trust. In the long run, whether government software is open or closed source will probably differentiate democracies from dictatorships.</p>
]]></description><pubDate>Sun, 18 May 2025 23:34:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=44025114</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=44025114</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44025114</guid></item><item><title><![CDATA[New comment by fdefitte in "France Endorses UN Open Source Principles"]]></title><description><![CDATA[
<p>I think it's more of a guiding principle for public software, for example the apps citizens use to declare taxes, renew IDs...</p>
]]></description><pubDate>Sun, 18 May 2025 23:31:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=44025095</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=44025095</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44025095</guid></item><item><title><![CDATA[New comment by fdefitte in "France Endorses UN Open Source Principles"]]></title><description><![CDATA[
<p>France has an undeserved bad reputation for this stuff. As a French citizen, I'm amazed at how easy it has become to do anything administrative online, with great tools such as France Connect, which provides a single login for any administrative tool.</p>
]]></description><pubDate>Sun, 18 May 2025 23:19:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=44025038</link><dc:creator>fdefitte</dc:creator><comments>https://news.ycombinator.com/item?id=44025038</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44025038</guid></item></channel></rss>