Hacker News: mrothroc

New comment by mrothroc in "Ask HN: Are LLMs creating busy work?"

mrothroc — Fri, 22 May 2026 16:37:35 +0000

Some of it can be busywork, but for me the intermediate artifacts (plans, design docs, etc) serve a real purpose: they create a verification surface where you can check that the agent is creating the right thing before it goes all the way. It's exactly the same reason we created short sprints: if the team misunderstood the requirements and built the wrong thing, you only lost a sprint. We lost months of work when we did waterfall because the product did not match what the customer had in mind.

I have deterministic and stochastic tests that run on each artifact. For those that have a high risk of "not the right thing", I manually review the artifacts. But if it's bog standard I just rely on the auto-gates to reject and get the agent to retry the artifact.

This gets me a high-volume pipeline that, yes, uses a lot of tokens, but doesn't overwhelm me. I only deal with things that genuinely need my attention. That's worth it for me, and not busywork.
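A minimal sketch of the triage described in this comment, in Python. Everything here is a hypothetical illustration rather than the author's actual tooling: the `Artifact` type, the `high_risk` flag, the `route` function, and the two toy checks are all assumptions, and so is the ordering (automated gates first, human review only for high-risk artifacts that pass them).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Artifact:
    name: str
    body: str
    high_risk: bool  # assumption: risk is tagged up front by a person or a rule


Check = Callable[[Artifact], bool]


def route(artifact: Artifact, checks: List[Check]) -> str:
    """Return 'retry', 'human_review', or 'accept' for one artifact."""
    # Automated gates run on every artifact; any failure goes back to the agent.
    if not all(check(artifact) for check in checks):
        return "retry"
    # Only artifacts that pass the gates AND are high risk reach a person.
    if artifact.high_risk:
        return "human_review"
    return "accept"


# Toy deterministic checks (hypothetical).
def non_empty(a: Artifact) -> bool:
    return bool(a.body.strip())


def has_acceptance_criteria(a: Artifact) -> bool:
    return "acceptance criteria" in a.body.lower()


checks = [non_empty, has_acceptance_criteria]
print(route(Artifact("plan", "", False), checks))
print(route(Artifact("plan", "Acceptance criteria: login works", True), checks))
```

The point of the shape is that the human only appears on one of the three branches, which is what keeps a high-volume pipeline from overwhelming the reviewer.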

New comment by mrothroc in "Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks"

mrothroc — Fri, 22 May 2026 16:09:10 +0000

Thanks, glad you find it useful! Feel free to ping me if you have any questions.

New comment by mrothroc in "Testing distributed systems with AI agents"

mrothroc — Thu, 21 May 2026 14:30:29 +0000

I've been specializing in distributed systems for nearly 35 years. I've read your work, and it's shaped my thinking. When you say you have a person in mind when you write, I am that person. Thank you for what you've done.

I don't think this replaces you. The hard part of reliability is understanding the failure modes in the context of the business. No one has unlimited time or money, so we always have to make tradeoffs. Only experienced humans have both the ability to interrogate the stakeholders and a vision broad enough to understand what to pursue versus what to give up.

Tools like this make the grind part of the job easier. They do not replace the holistic view you need to be able to confidently tell someone "worry about X, do not worry about Y".

New comment by mrothroc in "Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks"

mrothroc — Thu, 21 May 2026 13:45:49 +0000

Definitely stacks. The thing that made it clear for me was being explicit about the stages, and where/what you can verify with a guardrail, or gate. I wrote up the framework I use here: https://michael.roth.rocks/research/trust-topology/

Being explicit about the space between the stages is critical, because that's your enforcement point.

New comment by mrothroc in "Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks"

mrothroc — Wed, 20 May 2026 15:47:48 +0000

Yes, "guardrails" is a squishy term. But it gets clearer if you ask what transition is being guarded.

Some of this is inside the model, like topic refusals. Forge sits at the tool call level.

My personal workflow uses guardrails at the SDLC level: I have a standard pipeline (plan, design, code, build, test). I use gates between each pair of stages, and the right composition leads to much higher quality in the final product.

Also worth mentioning that gate failures are given to the agent that produced the artifact, so it has a chance to fix it. That means that I don't have to review obviously wrong output.
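The feedback loop in the last paragraph could be sketched like this. It is an illustration under stated assumptions, not the author's harness: `run_stage`, the `Producer` and `Gate` signatures, the retry limit, and the toy agent and gate are all hypothetical names invented for the example.

```python
from typing import Callable, List, Optional, Tuple

# A gate returns a list of failure messages; an empty list means "pass".
Gate = Callable[[str], List[str]]
# A producer receives the previous attempt's failures and returns a new artifact.
Producer = Callable[[List[str]], str]


def run_stage(produce: Producer, gate: Gate,
              max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Retry a stage until its gate passes or attempts run out.

    Returns (artifact, attempts_used); artifact is None if every attempt failed.
    """
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        artifact = produce(feedback)   # the same agent sees its own failures
        feedback = gate(artifact)
        if not feedback:
            return artifact, attempt
    return None, max_attempts


# Toy stand-ins: a real producer would call an agent with the feedback.
def toy_agent(feedback: List[str]) -> str:
    return "def add(a, b): return a + b" if feedback else "def add(a, b): pass"


def toy_gate(artifact: str) -> List[str]:
    return [] if "return" in artifact else ["function never returns a value"]


artifact, attempts = run_stage(toy_agent, toy_gate)
print(attempts, artifact)
```

In this toy run the first attempt is rejected by the gate and the second one passes, so no human ever sees the obviously wrong first draft.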

New comment by mrothroc in "Learn Harness Engineering"

mrothroc — Tue, 19 May 2026 15:35:26 +0000

I fully agree with the idea: above a model capability threshold, the power comes from the harness far more than the model. Engineers can get tremendous power from learning how to do CI/CD and automation. If you view the models and agentic code pipelines as a natural evolution of this, you see the benefit.

I took a quick look at the content, and it seems verbose and AI generated but conceptually OK. I learn by tinkering, so it's not a good fit for me, but if you learn by reading, maybe this is for you.

My view is that human time is more precious than computer time. If something can be automated, then automate it. I don't lint code by hand, I get the linter to do it. Similarly, LLMs expand the list of things that computers can do. That's what you get from the harness, however you learn to do it.

New comment by mrothroc in "The last six months in LLMs in five minutes"

mrothroc — Tue, 19 May 2026 14:01:01 +0000

I have the same experience. I've been running sequential agents in my own harness that is a standard SDLC pipeline (plan, design, code, build, test). It has gates between each stage to control quality.

The big benefit of automating this for so long is that I have lots of data. I analyzed it and found that I can change the models out without much of a change in the output quality.

For one-off tasks, where there is no harness and you're just YOLOing with the TUI, yes, big difference. You need a harness.

The pipeline controls the quality far more than the model, empirically.

New comment by mrothroc in "Vibe coding and agentic engineering are getting closer than I'd like"

mrothroc — Thu, 07 May 2026 16:09:06 +0000

The "blurring" framing makes Simon's tension sound intrinsic when it is actually structural. Vibe coding and agentic engineering aren't on a continuum. They're distinguished by the process.

Engineering is always about a defined process. We follow it to produce predictable artifacts that meet the specifications. Even though code is somewhat "squishy" in that it is an art just as much as a science, it still has to meet the spec.

This has always been true, even before agents started writing code for us. We've all dealt with spaghetti code because of undisciplined practices. That's exactly why we came up with the standard SDLC process: plan, design, code, test, deploy. Repeat.

The part people seem to forget about when looking at this is the space between the steps: the gates. We review the artifacts produced at each stage. If the reviewer does not approve, the engineer has to fix it until it passes. True for human coders, doubly true for agentic coders.

Agentic engineering still follows the process. Artifacts are now cheap to produce, which means we have to adjust it so we don't overwhelm the humans in the loop. For me, this means augmenting my review step with agentic reviewers to catch the dumb stuff. It only escalates to me when either a) it passes clean or b) there is something that genuinely needs my experience.

This is agentic engineering, not vibe coding.

New comment by mrothroc in "Lessons for Agentic Coding: What should we do when code is cheap?"

mrothroc — Tue, 05 May 2026 23:45:40 +0000

The list in the article looks like verification practices. Document intent, develop taste, find the hard stuff, etc. It assumes that when code is cheap the bottleneck shifts to knowing whether what you generated is actually right.

e2e tests can do a lot, but in my experience they're not enough. By the time the test fails you've already burned a generation cycle on an artifact that came from a flawed spec or design. I've gotten more mileage from having checks at stage boundaries (standard SDLC: plan, design, code, test). We all know the earlier you catch the mistake, the cheaper the fix.

The "implement to learn" is the same idea: you need to know enough about both where you want to go AND the path to get there to guide the agents to a proper implementation. You have contact with the world, both the users and the operational considerations that come from running software. Agents do not. We do the same thing with spikes, but now our spikes become far more sophisticated.

Code being cheap doesn't remove verification, it moves it earlier.
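To make the "earlier is cheaper" arithmetic concrete, here is a small worked example. The stage list comes from the comment; the per-stage costs are invented numbers in arbitrary units, and `wasted_work` is a hypothetical helper, so treat this as an illustration of the reasoning rather than measured data.

```python
STAGES = ["plan", "design", "code", "test"]


def wasted_work(costs: dict, flawed_stage: str, caught_at: str) -> int:
    """Cost of everything generated from the flawed stage through the
    stage whose check finally caught the flaw."""
    start = STAGES.index(flawed_stage)
    end = STAGES.index(caught_at)
    if end < start:
        raise ValueError("a flaw cannot be caught before it is introduced")
    return sum(costs[stage] for stage in STAGES[start:end + 1])


# Hypothetical generation cost per stage, in arbitrary units.
costs = {"plan": 5, "design": 10, "code": 40, "test": 20}

print(wasted_work(costs, "plan", "plan"))  # 5: caught at the plan boundary
print(wasted_work(costs, "plan", "test"))  # 75: caught only by the e2e tests
```

With these made-up numbers, a flawed plan caught at its own boundary wastes 5 units, while the same flaw caught by the final tests wastes 75, because the design and code built on it are also thrown away.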

New comment by mrothroc in "SWE-bench Verified no longer measures frontier coding capabilities"

mrothroc — Mon, 27 Apr 2026 15:51:36 +0000

From a verification-topology angle, what makes algotune.io contamination-resistant? Is it because the correctness oracle is a performance metric (which can't be memorized) rather than a fixed test that can?

New comment by mrothroc in "Show HN: Mixlab, an ML arch lab in Go. JSON config, Metal and CUDA, 1.6s builds"

mrothroc — Wed, 22 Apr 2026 15:50:19 +0000

Simple example to show how configs are defined:

{
  "name": "plain_3L",

  // Minimal causal transformer baseline: 3 attention layers plus 3 SwiGLU layers.
  "model_dim": 128,
  "vocab_size": 1024,
  "seq_len": 128,

  // Blocks execute sequentially, alternating token mixing and feed-forward mixing.
  "blocks": [
    {"type": "plain", "heads": 4},
    {"type": "swiglu"},
    {"type": "plain", "heads": 4},
    {"type": "swiglu"},
    {"type": "plain", "heads": 4},
    {"type": "swiglu"}
  ],

  // Slightly longer than smoke-test configs so the baseline loss moves visibly.
  "training": {
    "steps": 200,
    "lr": 3e-4,
    "grad_clip": 1.0,
    "weight_decay": 0.01,
    "seed": 42,
    "batch_tokens": 1024
  }
}
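The config above contains `//` comment lines, which a strict JSON parser rejects. The sketch below shows one way to read such a file from Python for inspection. It does not describe how mixlab itself parses configs, which the post does not say; `load_config` is a hypothetical helper, and it only strips whole-line comments, so trailing comments after a value would still fail.

```python
import json

CONFIG_TEXT = """
{
  "name": "plain_3L",
  // Minimal causal transformer baseline: 3 attention layers plus 3 SwiGLU layers.
  "model_dim": 128,
  "vocab_size": 1024,
  "seq_len": 128,
  // Blocks execute sequentially, alternating token mixing and feed-forward mixing.
  "blocks": [
    {"type": "plain", "heads": 4},
    {"type": "swiglu"},
    {"type": "plain", "heads": 4},
    {"type": "swiglu"},
    {"type": "plain", "heads": 4},
    {"type": "swiglu"}
  ],
  // Slightly longer than smoke-test configs so the baseline loss moves visibly.
  "training": {
    "steps": 200,
    "lr": 3e-4,
    "grad_clip": 1.0,
    "weight_decay": 0.01,
    "seed": 42,
    "batch_tokens": 1024
  }
}
"""


def load_config(text: str) -> dict:
    """Parse JSON after dropping lines that contain only a // comment."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


config = load_config(CONFIG_TEXT)

# Count blocks by type to confirm the layout matches the comment in the config.
counts = {}
for block in config["blocks"]:
    counts[block["type"]] = counts.get(block["type"], 0) + 1

print(config["name"], counts)  # plain_3L {'plain': 3, 'swiglu': 3}
```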

Show HN: Mixlab, an ML arch lab in Go. JSON config, Metal and CUDA, 1.6s builds

mrothroc — Wed, 22 Apr 2026 15:47:53 +0000

I built a tool for quickly testing different ML architectures. Define a model in JSON, train on your Mac (Metal) or ship the same config to a cloud GPU (CUDA). No code changes between platforms.

Why: I wanted to compare attention vs Mamba vs GQA at different parameter budgets without writing PyTorch for each experiment. Edit a JSON config, hit enter, see loss numbers. It will race different configs for you. The number one goal is iteration speed.

The JSON config lets you chain together common ML blocks (attention, GQA, Mamba, RetNet, and several more) and optimizers (Muon, AdamW), and compiles them to MLX IR, which can run on either the Metal or the CUDA backend.

Why Go: 1.6s builds, built-in profiling (mixlab -cpuprofile gives you a flame graph), import-based extensibility for custom blocks. No C++ extensions, no custom build systems. And personally I prefer strongly-typed, compiled languages.

On a Shakespeare benchmark matching nanoGPT (6L, 6H, d=384, 10.8M params): val loss 1.5527 on M1 Max, 1.5588 on A40. PyTorch numerical parity confirmed to 8 decimal places.

brew install mrothroc/tap/mixlab

https://github.com/mrothroc/mixlab

Comments URL: https://news.ycombinator.com/item?id=47865322

New comment by mrothroc in "543 Hours: What happens when AI runs while you sleep"

mrothroc — Mon, 20 Apr 2026 16:18:13 +0000

I addressed this in my reply to kelseyfrog above. The short version: the production work is proprietary, the tooling I used to do the analysis is open source.

New comment by mrothroc in "543 Hours: What happens when AI runs while you sleep"

mrothroc — Mon, 20 Apr 2026 16:16:45 +0000

Hi, I'm the original author and I can clarify a few things.

The 543 hours are the agent compute hours, not me at the keyboard. The pipeline runs autonomously, the agents execute in parallel, and the gates verify the output. Most of the prompts are agent-to-agent, not human-to-agent.

On the timeline: I have a BSCS (1995) and MSCS (1997) with a specialty in distributed systems. I actually worked my way through school doing this work so I didn't need loans. Let's call it almost 35 years.

The terminology has evolved but the architecture hasn't changed as much as people think.

New comment by mrothroc in "543 Hours: What happens when AI runs while you sleep"

mrothroc — Mon, 20 Apr 2026 16:11:17 +0000

Thank you for your feedback. These are fair points.

I get that "top performer" is off-putting. You're right that authority has to be earned in the text (and I hope I do that), not declared.

On the structure: yes, it's a novel format and I can see how that would be hard to parse. It won't work for everyone.

Both of these are artifacts of trying to blend research into the modern social-media driven world.

New comment by mrothroc in "543 Hours: What happens when AI runs while you sleep"

mrothroc — Mon, 20 Apr 2026 16:07:17 +0000

I'm the author of that post. Thank you for your feedback.

The production code is proprietary work for clients, so I can't link to it directly. But the tooling I built to support the pipeline is open source: the log analyzer that computed these statistics.

There are a couple of other in-flight projects I will open source soon, created by this process, but they aren't out yet.

The research page is about the methodology because that's what generalizes. The specific microservices I ship are just microservices.

New comment by mrothroc in "Multi-Agentic Software Development Is a Distributed Systems Problem"

mrothroc — Wed, 15 Apr 2026 15:54:47 +0000

I did the same with my own orchestrator. That's where I get my data.

It's amazing how much power a simple workflow with automatic gate enforcement brings to agentic coding.

New comment by mrothroc in "Multi-Agentic Software Development Is a Distributed Systems Problem"

mrothroc — Wed, 15 Apr 2026 15:53:09 +0000

I created my own framework. Long ago it started as shell scripts that I used in conjunction with aider. It was a very manual process.

It's grown over time to be a full MCP and CLI with stages and gates defined in YAML. I was thinking about open sourcing it but since the code grew organically I would need to do extensive cleanup to make it presentable.

But I do walk through the process on page 9: https://michael.roth.rocks/research/trust-topology/#9

New comment by mrothroc in "Multi-Agentic Software Development Is a Distributed Systems Problem"

mrothroc — Tue, 14 Apr 2026 16:02:45 +0000

Agreed that full consensus is overkill.

But I think the coordination problem is subtler than version control implies. In the (plan, design, code) pipeline they aren't collaborating on the same artifact. They're producing different artifacts that are all expressions of the same intent in different spaces: a plan in natural language, a design in a structured spec, code in a formal language.

These are different artifacts, each a projection at a different level of the Chomsky hierarchy, but all derived from the same thing: user intent.

The coordination challenge is keeping these consistent with each other as each stage transforms the prior projection into the new one. That's where the gates earn their place: they verify that each transformation preserves the intent from the previous stage.

New comment by mrothroc in "Multi-Agentic Software Development Is a Distributed Systems Problem"

mrothroc — Tue, 14 Apr 2026 14:10:01 +0000

I've been running a multi-agent software development pipeline for a while now and I've reached the same conclusion: it's a distributed systems problem.

My approach has been more pragmatic than theoretical: I break work into sequential stages (plan, design, code) with verification gates. Each gate has deterministic checks (compile, lint, etc) and an agentic reviewer for qualitative assessment.

Collectively, this looks like a distributed system. The artifacts reflect the shared state.

The author's point about external validation converting misinterpretations into detectable failures is exactly what I've found empirically. You can't make the agent reliable on its own, but you can make the protocol reliable by checking at every boundary.

The deterministic gates provide a hard floor of guarantees. The agentic gates provide soft probabilistic assertions.
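One way to picture a gate that combines both kinds of check is sketched below. This is an assumption-laden illustration, not the framework from the write-up: `GateResult`, `run_gate`, the `threshold` value, and the stubbed reviewer are hypothetical, and a real agentic reviewer would call a model rather than return a constant.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class GateResult:
    passed: bool
    reasons: List[str] = field(default_factory=list)


def compiles(source: str) -> bool:
    """Deterministic check: does the artifact parse as Python?"""
    try:
        compile(source, "<artifact>", "exec")
        return True
    except SyntaxError:
        return False


def run_gate(artifact: str,
             deterministic_checks: List[Tuple[str, Callable[[str], bool]]],
             reviewer: Callable[[str], float],
             threshold: float = 0.8) -> GateResult:
    # Hard floor: every deterministic check must pass, and no reviewer
    # score can override a failure here.
    failures = [name for name, check in deterministic_checks
                if not check(artifact)]
    if failures:
        return GateResult(False, [f"failed: {name}" for name in failures])
    # Soft assertion: the reviewer returns a confidence in [0, 1]
    # (hypothetical interface) that is compared against a threshold.
    score = reviewer(artifact)
    if score < threshold:
        return GateResult(False,
                          [f"reviewer score {score:.2f} below {threshold:.2f}"])
    return GateResult(True)


checks = [("compile", compiles)]
print(run_gate("def f(:", checks, reviewer=lambda s: 1.0))
print(run_gate("x = 1", checks, reviewer=lambda s: 0.9))
```

Running the deterministic checks first means the probabilistic reviewer is only consulted on artifacts that already meet the hard guarantees, which also avoids spending tokens on output that does not even compile.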

I wrote up the data and the framework I use: https://michael.roth.rocks/research/trust-topology/