Hacker News: fdefitte

New comment by fdefitte in "I'm helping my dog vibe code games"

fdefitte — Wed, 25 Feb 2026 01:15:52 +0000

The dog ships faster because it has zero opinions about the architecture.

Show HN: Cobalt – Unit tests for AI agents, like Jest but for LLMs

fdefitte — Fri, 20 Feb 2026 17:43:05 +0000

Hey HN, I built Cobalt, an open-source testing framework for AI agents and LLM apps.

Most eval tools (Braintrust, Arize, LangSmith) want you to live in their UI. Dashboards, manual reviews, clicking through results. That's fine for exploration, but it doesn't catch regressions. We needed something that runs in CI like any other test suite, lives in code, and fails the build when quality drops.

  npm install @basalt-ai/cobalt
  npx cobalt init
  npx cobalt run

Write experiments as code:

  import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

  const dataset = Dataset.fromLangfuse('support-tickets')

  experiment('support-agent', dataset, async ({ item }) => {
    // myAgent is your own agent under test; it is not imported from Cobalt
    const result = await myAgent(item.input)
    return { output: result }
  }, {
    evaluators: [
      new Evaluator({ name: 'Helpful', type: 'llm-judge', prompt: 'Is this response helpful and accurate? {{output}}' }),
      new Evaluator({ name: 'No hallucination', type: 'llm-judge', prompt: 'Does this contain fabricated info? {{output}}' }),
    ]
  })

`npx cobalt run --ci` exits with code 1 if thresholds are violated. The GitHub Action posts score tables on PRs and auto-compares against the base branch.
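
To make the CI gate concrete, here is a minimal sketch of how a threshold check can map to an exit code. This is not Cobalt's implementation: the `Scores` and `Thresholds` types and the `findViolations` and `ciExitCode` functions are hypothetical names used only for illustration.

```typescript
// Hypothetical shapes for illustration only; these are not Cobalt's API.
type Scores = Record<string, number>;     // evaluator name -> mean score in [0, 1]
type Thresholds = Record<string, number>; // evaluator name -> minimum acceptable score

// Return the name of every evaluator whose score is missing or below its threshold.
function findViolations(scores: Scores, thresholds: Thresholds): string[] {
  return Object.entries(thresholds)
    .filter(([name, min]) => {
      const score = scores[name];
      return score === undefined || score < min;
    })
    .map(([name]) => name);
}

// Map violations to a process exit code: 0 passes the build, 1 fails it.
function ciExitCode(scores: Scores, thresholds: Thresholds): number {
  return findViolations(scores, thresholds).length > 0 ? 1 : 0;
}
```

In a Node script you would assign the result to `process.exitCode`, so the CI runner marks the job as failed whenever any evaluator falls below its minimum.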

The part I'm most excited about: Cobalt ships with a built-in MCP server, so you can drive it entirely from Claude Code. Just tell it "compare GPT 5.2 with 5.1 on my support agent" or "run my experiments, find the failing cases, and fix the prompt." It runs the experiments, diffs the results, and iterates on your code without you leaving the terminal. Turns eval from a chore into a conversation.

Pull datasets from Langfuse, LangSmith, Braintrust, or plain JSON/JSONL/CSV. Results stored locally in SQLite. No accounts, no dashboards, no vendor lock-in.
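
For the plain-file case, a JSONL dataset is just one JSON object per line. The sketch below shows one way to load it; the `DatasetItem` shape and the `parseJsonl` helper are assumptions made for this example, and Cobalt's real dataset schema and loader may differ.

```typescript
// Hypothetical item shape; Cobalt's real dataset schema may differ.
interface DatasetItem {
  input: string;
  expected?: string;
}

// Parse JSONL text: one JSON object per line, blank lines skipped.
function parseJsonl(text: string): DatasetItem[] {
  const items: DatasetItem[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === "") return; // tolerate blank lines and a trailing newline
    const value: unknown = JSON.parse(trimmed); // throws SyntaxError on malformed JSON
    const hasStringInput =
      typeof value === "object" &&
      value !== null &&
      typeof (value as { input?: unknown }).input === "string";
    if (!hasStringInput) {
      throw new Error(`line ${index + 1}: expected an object with a string "input" field`);
    }
    items.push(value as DatasetItem);
  });
  return items;
}
```

Validating each line up front means a malformed dataset fails with a line number instead of surfacing later as a confusing evaluator score.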

Comments URL: https://news.ycombinator.com/item?id=47091182

Points: 3

# Comments: 0

New comment by fdefitte in "Micropayments as a reality check for news sites"

fdefitte — Fri, 20 Feb 2026 05:05:36 +0000

Good point

New comment by fdefitte in "BarraCUDA Open-source CUDA compiler targeting AMD GPUs"

fdefitte — Wed, 18 Feb 2026 00:28:06 +0000

Agreed on ZLUDA being the practical choice. This project is more impressive as a "build a GPU compiler from scratch" exercise than as something you'd actually use for ML workloads. The custom instruction encoding without LLVM is genuinely cool though, even if the C subset limitation makes it a non-starter for most real CUDA codebases.

New comment by fdefitte in "Is Show HN dead? No, but it's drowning"

fdefitte — Wed, 18 Feb 2026 00:27:49 +0000

The filter used to be effort. You had to care enough to spend weeks on something, which meant you probably understood the problem deeply. Now that filter is gone and we get a flood of "I prompted this in 20 minutes" posts where the author can't answer a single follow-up about their own code. The interesting Show HNs still exist, they're just buried under noise.

New comment by fdefitte in "Claude Sonnet 4.6"

fdefitte — Wed, 18 Feb 2026 00:27:25 +0000

The 8% one-shot number is honestly better than I expected for a model this capable. The real question is what sits around the model. If you're running agents in production you need monitoring and kill switches anyway; the model being "safe enough" is necessary but never sufficient. Nobody should be deploying computer-use agents without observability around what they're actually doing.

New comment by fdefitte in "Qwen3.5: Towards Native Multimodal Agents"

fdefitte — Mon, 16 Feb 2026 20:15:06 +0000

The "native multimodal agents" framing is interesting. Everyone's focused on benchmark numbers but the real question is whether these models can actually hold context across multi-step tool use without losing the plot. That's where most open models still fall apart imo.

New comment by fdefitte in ""Token anxiety", a slot machine by any other name"

fdefitte — Mon, 16 Feb 2026 20:12:11 +0000

That 95% payout only works if you already know what good looks like. The sketchy part is when you can't tell the diff between correct and almost-correct. That's where stuff goes sideways.

New comment by fdefitte in "WebMCP Proposal"

fdefitte — Mon, 16 Feb 2026 20:11:48 +0000

Skills are great for static stuff but they kinda fall apart when the agent needs to interact with live state. WebMCP actually fills a real gap there imo.

New comment by fdefitte in "Anthropic tries to hide Claude's AI actions. Devs hate it"

fdefitte — Mon, 16 Feb 2026 20:11:25 +0000

Agent teams working autonomously sounds cool until you actually try it. We've been running multi-agent setups and honestly the failure modes are hilarious. They don't crash, they just quietly do the wrong thing and act super confident about it.

New comment by fdefitte in "Show HN: We built Cobalt, Open source unit testing for AI Agents"

fdefitte — Thu, 12 Feb 2026 22:16:32 +0000

Hi everyone! Super happy to release this package. We feel that evals belong in CI, like unit tests, and should be easy to set up and run automatically. Can't wait to get your feedback!

Show HN: We built Cobalt, Open source unit testing for AI Agents

fdefitte — Thu, 12 Feb 2026 22:11:55 +0000

Article URL: https://github.com/basalt-ai/cobalt

Comments URL: https://news.ycombinator.com/item?id=46995995

Points: 3

# Comments: 1

New comment by fdefitte in "AI Evaluation Methods by Use Case"

fdefitte — Fri, 13 Jun 2025 07:04:45 +0000

Free guide, enjoy! Made by the Basalt team.

AI Evaluation Methods by Use Case

fdefitte — Fri, 13 Jun 2025 07:04:45 +0000

Article URL: https://www.notion.so/hexacc/AI-Evaluation-Methods-by-Use-Case-2041cc0bd5bc80558ba6fd032f297891?source=copy_link

Comments URL: https://news.ycombinator.com/item?id=44266341

Points: 1

# Comments: 1

Show HN: Free Prompt Grading Tool

fdefitte — Thu, 22 May 2025 17:03:02 +0000

Small tool to get a score + recommendations instantly. What do you think?

Comments URL: https://news.ycombinator.com/item?id=44064089

Points: 1

# Comments: 1

New comment by fdefitte in "Letting the AIs Judge Themselves: A One Creative Prompt: The Coffee-Ground Test"

fdefitte — Mon, 19 May 2025 01:41:13 +0000

Love it. AI is actually much better at judging the quality of content than at producing it. Kind of like humans, actually :)

AI Hedge Fund

fdefitte — Sun, 18 May 2025 23:44:48 +0000

Article URL: https://github.com/virattt/ai-hedge-fund

Comments URL: https://news.ycombinator.com/item?id=44025172

Points: 3

# Comments: 0

New comment by fdefitte in "France Endorses UN Open Source Principles"

fdefitte — Sun, 18 May 2025 23:34:36 +0000

This makes total sense. When a country is creating public software, it should be open source by default. This is the only way to create trust. In the long run, open source and closed source government software will probably differentiate dictatorships from democracies.

New comment by fdefitte in "France Endorses UN Open Source Principles"

fdefitte — Sun, 18 May 2025 23:31:21 +0000

I think it's more of a guiding principle for public software, for example apps that citizens use to declare taxes, renew IDs...

New comment by fdefitte in "France Endorses UN Open Source Principles"

fdefitte — Sun, 18 May 2025 23:19:56 +0000

France has an undeserved bad reputation for this stuff. As a French citizen, I'm amazed to see how easy it has become to do anything administrative online, with great tools such as France Connect, which allows a single login for any administrative service.