Hacker News: LeoStehlik

New comment by LeoStehlik in "Ask HN: What are you working on? (May 21)"

LeoStehlik — Thu, 21 May 2026 19:35:01 +0000

Thank you so much, mate!

New comment by LeoStehlik in "Ask HN: What are you working on? (May 21)"

LeoStehlik — Thu, 21 May 2026 17:18:16 +0000

Started working on WrenLore, a Knowledge Base / Knowledge Management system built for both humans and AI agents.

Humans review and use the clean UI; AI agents write the documentation, keep it up to date, summarize and synthesize it, and can chat with a specific space or part of the documentation, within the user's boundaries.

The first release was mostly cleanup of Docmost artefacts, plus provider config for AI models, the AskAI and AI Help features, and the security basics that are a must: MFA for username/password users, and SAML/SSO authentication/authorization via Entra ID (I needed it first; more options to follow).

https://github.com/wrenlore/wrenlore

New comment by LeoStehlik in "Coding is solved? Software is not"

LeoStehlik — Thu, 21 May 2026 17:05:32 +0000

Exactly. What percentage of designing, building, shipping, deploying, integrating, maintaining and evolving software is actually coding? My take is 20-30%. It's still great to have the coding part solved; it democratizes software, letting smaller teams build their ideas. The successful business of the next 5 years probably has a visionary software architect cum business consultant who translates what the business needs into specific software tools.

New comment by LeoStehlik in "Show HN: Proof Loop – I make my coding agents prove they finished the task"

LeoStehlik — Thu, 21 May 2026 16:41:23 +0000

Same, but that follows. I wanted a proof first so that I don’t waste time running tests on code that is still far from finished. Especially early this year, I’d get an agent confirming “I did this”, and later I’d uncover that it had struggled to use tools and just said it was done. Only when I receive the evidence of “I’ve done it” (iterating if anything is missing) do I trigger the round of unit tests. I know this may sound like too much careful handholding, but I got burned so many times that it pays off.

Show HN: Proof Loop – I make my coding agents prove they finished the task

LeoStehlik — Thu, 21 May 2026 16:04:15 +0000

I built this because my coding agent kept telling me it had completed the task, but when I verified, that was not the case.

I made Proof Loop fairly light, intentionally. It’s basically a protocol helper script for AI agent tasks:

- set acceptance criteria before coding/implementation
- keep the builder and verifier roles separate
- test each criterion, with a result of PASS, FAIL or UNKNOWN
- attach evidence of done
- keep the proof evidence in the repo, so that the next agent / run can inspect it and see what was already done

You can try it from the command line: in the cloned repo, go to the proof-loop directory and run make demo.

The demo creates a task, checks the proof bundle, fails if evidence is missing, then passes when acceptance criteria have evidence attached.
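To make the demo flow concrete, here is a minimal sketch in Python of the "fail if evidence is missing, pass once it is attached" check. It is not the actual Proof Loop code: the `Criterion` class, its field names, the helper functions and the evidence paths are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One acceptance criterion (hypothetical shape, not Proof Loop's format)."""
    description: str
    evidence: list[str] = field(default_factory=list)  # e.g. paths to logs or screenshots


def missing_evidence(criteria: list[Criterion]) -> list[str]:
    """Return the descriptions of criteria that have no evidence attached."""
    return [c.description for c in criteria if not c.evidence]


def bundle_passes(criteria: list[Criterion]) -> bool:
    """A bundle passes only if it has criteria and none of them lacks evidence."""
    return bool(criteria) and not missing_evidence(criteria)


# Criteria are written down before any implementation work starts.
task = [
    Criterion("login form rejects empty password"),
    Criterion("session cookie is set after login"),
]
first_run = bundle_passes(task)   # no evidence attached yet

# Hypothetical evidence files, attached after the work is done.
task[0].evidence.append("evidence/login-empty-password.log")
task[1].evidence.append("evidence/session-cookie.png")
second_run = bundle_passes(task)  # every criterion now has evidence
```

An empty bundle is treated as not passing here, on the assumption that a task with no acceptance criteria has proven nothing.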

There is also an OpenClaw skill version now, so the easiest use is

openclaw skills install proof-loop

In the GitHub repo, there is a harness-agnostic version and examples.

I would especially like criticism and/or any feedback from people who run Codex, Claude Code or OpenCode on long-running multi-step tasks.

Note this is a utility that I use myself: free of charge, MIT licensed, open source, with no intention of any commercialization.

Comments URL: https://news.ycombinator.com/item?id=48224992

Points: 2

# Comments: 2

New comment by LeoStehlik in "Decisions that eroded trust in Azure – by a former Azure Core engineer"

LeoStehlik — Fri, 03 Apr 2026 17:56:06 +0000

Back in 2011 at Fujitsu, I ran one of the earliest Azure production subscriptions outside Microsoft. Windows Azure, mid-2011. I've watched this platform for 15 years from the outside.

Part 1 barely scratches the surface. Read parts 2 through 6.

The 173 agents story, the 200 manual node interventions per day, the WireServer sitting on the secure host side with unencrypted tenant memory mixed in shared address space, the letters to the EVP, the CEO, the Board - not a single acknowledgment.

The most damning thing in this series ... except for technical debt ... is the silence at the top when someone handed them the diagnosis on a plate.

Cutler's original vision was "no human touch." The gap between that and what Azure actually became is where the trillion dollars went.

Go read the rest. It's worth it.

Meanwhile on LinkedIn, there are still comments about how adorable Microsoft leadership under Satya is... a carefully crafted PR image.

New comment by LeoStehlik in "Show HN: Real-time dashboard for Claude Code agent teams"

LeoStehlik — Wed, 01 Apr 2026 21:54:16 +0000

Both, as it proved neither is enough on its own.

The structural fix is the obsession about separating roles: the agent that builds is never the one that verifies. I run a reviewer agent (I call her Iris), and a tester (Rex) — they live in separate sessions with no shared context with the builder. Iris' brief explicitly says "we require a live browser test, code review is not enough" — and that is where role separation was key; agents reviewing their own output tend to confirm what they already believe.

The explicit result/verdict format helps too. Each acceptance criterion gets a PASS/FAIL/UNKNOWN verdict with evidence attached. UNKNOWN is the one with gravitas: you force the agent to say "I could not verify this" rather than quietly pretending it was a PASS.
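A rough sketch of how per-criterion verdicts could roll up into one overall result, assuming the rule that UNKNOWN is never promoted to PASS. The `Verdict` enum and the `overall` function are illustrative names, not something taken from my tooling.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"


def overall(verdicts: list[Verdict]) -> Verdict:
    """Combine per-criterion verdicts; UNKNOWN is never counted as PASS."""
    if not verdicts:
        return Verdict.UNKNOWN  # nothing verified means nothing proven
    if Verdict.FAIL in verdicts:
        return Verdict.FAIL     # any hard failure fails the whole task
    if Verdict.UNKNOWN in verdicts:
        return Verdict.UNKNOWN  # "could not verify" blocks a PASS
    return Verdict.PASS
```

The ordering is a design choice: FAIL outranks UNKNOWN so that a known failure is reported as such, while a single unverified criterion is enough to keep the task from passing.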

But diff-level verification is where it still leaks. I don't have a systematic diff check yet. It's mostly Iris catching "agent replaced the whole file rather than extending it" by noticing the git diff is suspiciously clean. That's still more pattern matching than proper instrumentation — room for improvement... when I figure out how. Not there yet, to be honest.
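One possible starting point for a systematic check, offered as an untested idea rather than something I run today: measure how many of the original lines survive in the new version of a file, and flag the change when too few do. The function names and the 0.5 threshold are assumptions picked for illustration; only Python's standard `difflib` is used.

```python
import difflib


def retained_fraction(before: str, after: str) -> float:
    """Fraction of the original lines that survive unchanged in the new version."""
    old = before.splitlines()
    new = after.splitlines()
    if not old:
        return 1.0  # an empty or brand-new file has nothing to lose
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(old)


def looks_like_rewrite(before: str, after: str, threshold: float = 0.5) -> bool:
    """Flag a change when fewer than `threshold` of the original lines remain."""
    return retained_fraction(before, after) < threshold


original = "a\nb\nc\nd\n"
extended = original + "e\n"   # agent appended to the file
replaced = "x\ny\nz\n"        # agent rewrote the file from scratch
```

With these inputs, `looks_like_rewrite(original, extended)` is false and `looks_like_rewrite(original, replaced)` is true. A legitimate large refactor would also trip this, so it could only ever be a prompt for the reviewer to look closer, not a verdict.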

The sanitised optimism problem is deep — it's not always dishonesty, but quite often a genuine model confusion about whether a suppressed error counts as a fix. The agent believes... voila, success. The only way around it I've found is that the verifier has to be skeptical by default, not reviewing in good faith.

This tool's live timeline is the missing piece in that loop. Being able to see the actual tool calls rather than the curated (and falsely optimistic) summary could change verdict quality rather significantly.

New comment by LeoStehlik in "Show HN: Real-time dashboard for Claude Code agent teams"

LeoStehlik — Wed, 01 Apr 2026 20:17:24 +0000

This is what I've been missing running multi-agent ops through OpenClaw.

The opacity problem is the one I hit hard: when a coordinator spawns 3-4 agents in parallel (builder, reviewer, tester, each with their own tool calls), the only visibility you have is what they choose to report back. Which is often sanitised and … dangerously optimistic.

The role separation / independent verification structure I run helps catch bad outputs, but it doesn't give me the live timeline of HOW an agent got to a conclusion. That's why I find this genuinely useful.

Noticed OpenClaw is already on the roadmap - had my hands tingling to fork and adapt it. Starring it for now and adding it to my watchlist. The hook architecture should translate … OpenClaw fires session events that could feed the same pipeline. Looking forward to seeing that happen.