Hacker News: mulmboy

New comment by mulmboy in "System Card: Claude Mythos Preview [pdf]"

mulmboy — Tue, 07 Apr 2026 20:07:52 +0000

There are a few hints in the doc around this

> Importantly, we find that when used in an interactive, synchronous, “hands-on-keyboard” pattern, the benefits of the model were less clear. When used in this fashion, some users perceived Mythos Preview as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities. (p201)

^^ From the surrounding context, this could just be because the model tends to do a lot of work in the background which naturally takes time.

> Terminal-Bench 2.0 timeouts get quite restrictive at times, especially with thinking models, which risks hiding real capabilities jumps behind seemingly uncorrelated confounders like sampling speed. Moreover, some Terminal-Bench 2.0 tasks have ambiguities and limited resource specs that don’t properly allow agents to explore the full solution space — both being currently addressed by the maintainers in the 2.1 update. To exclusively measure agentic coding capabilities net of the confounders, we also ran Terminal-Bench with the latest 2.1 fixes available on GitHub, while increasing the timeout limits to 4 hours (roughly four times the 2.0 baseline). This brought the mean reward to 92.1%. (p188)

> ...Mythos Preview represents only a modest accuracy improvement over our best Claude Opus 4.6 score (86.9% vs. 83.7%). However, the model achieves this score with a considerably smaller token footprint: the best Mythos Preview result uses 4.9× fewer tokens per task than Opus 4.6 (226k vs. 1.11M tokens per task). (p191)

New comment by mulmboy in "There is an AI code review bubble"

mulmboy — Tue, 27 Jan 2026 00:20:39 +0000

What I'm saying is that a corporate or professional environment can make people communicate in weird ways due to various incentives. Reading into people's communication is an important skill in these kinds of environments, and looking superficially at their words can be misleading.

New comment by mulmboy in "There is an AI code review bubble"

mulmboy — Mon, 26 Jan 2026 19:38:43 +0000

People more often say that to save face by implying the issue you identified would be reasonable for the author to miss because it's subtle or tricky or whatever. It's often a proxy for embarrassment

New comment by mulmboy in "Things I've learned in my 10 years as an engineering manager"

mulmboy — Mon, 26 Jan 2026 09:30:24 +0000

Because it's a good heuristic for a functional and resilient team. People don't usually mean it literally, more like "if I disappeared it should be pretty painless for the team to continue along for a month or so and to find and onboard a replacement".

New comment by mulmboy in "We put Claude Code in Rollercoaster Tycoon"

mulmboy — Sun, 18 Jan 2026 02:21:46 +0000

LLMs aren't like you or me. They can comprehend large quantities of code quickly and piece things together easily from scattered fragments, so go-to-reference and the like become much less important. Of course, things change as the number of usages of a symbol becomes large, but in most cases the LLM can make perfect sense of things via grep.

To provide it access to refactoring as a tool also risks confusing it via too many tools.

It's the same reason that waffling for a few minutes via speech to text with tangents and corrections and chaos is just about as good as a carefully written prompt for coding agents.

New comment by mulmboy in "Running out of places to move the goalposts to"

mulmboy — Fri, 02 Jan 2026 07:14:09 +0000

I'm well aware that they can be sycophantic, and I structure things to avoid that, like asking "what do you think of this problem" and seeing the idea fall out rather than providing anything that would suggest it. In one of these two cases it took an idea that I had an inkling of, fleshed it out, and expanded it to be much better than what I had.

And I'm not bragging. I'm expressing awe, and humility that I am finding a machine can match me on things that I find quite difficult. Maybe those things aren't so difficult after all.

By steering I mean more steering to flesh out the context of the problem and to find relevant code and perform domain-specific research. Not steering toward a specific solution.

New comment by mulmboy in "Running out of places to move the goalposts to"

mulmboy — Thu, 01 Jan 2026 10:37:20 +0000

> AI seems to have caught up to my own intelligence even in those narrow domains where I have some expertise. What is there left that AI can’t do that I would be able to verify?

The last few days I've been working on some particularly tricky problems, tricky in the domain and in backwards compatibility with our existing codebase. For both these problems GPT 5.2 has been able to come to the same ideas as my best, which took me quite a bit of brain racking to get to. Granted it's required a lot of steering and context management from me as well as judgement to discard other options. But it's really getting to the point that LLMs are a good sparring partner for (isolated technical) problems at the 99th percentile of difficulty

New comment by mulmboy in "Codex vs. Claude Code (today)"

mulmboy — Fri, 26 Dec 2025 19:43:05 +0000

Is it just me or is codex slow?

With claude code I'll ask it to read a couple of files and do x similar to existing thing y. It takes a few moments to read files and then just does it. All done in a minute or so.

I tried something similar with codex and it took 20 minutes reading around bits of file and this and that. I didn't bother letting it finish. Is this normal? Do I have something misconfigured? This was a couple of months ago.

New comment by mulmboy in "Using LLMs at Oxide"

mulmboy — Sun, 07 Dec 2025 08:09:43 +0000

What do these look like?

New comment by mulmboy in "Python Data Science Handbook"

mulmboy — Wed, 03 Dec 2025 06:05:33 +0000

> Everything it does can be done reasonable well with list comprehensions and objects that support type annotations and runtime type checking (if needed).

I see this take somewhat often, and usually with similar lack of nuance. How do you come to this? In other cases where I've seen this it's from people who haven't worked in any context where performance or scientific computing ecosystem interoperability matters - missing a massive part of the picture. I've struggled to get through to them before. Genuine question.

New comment by mulmboy in "Post-mortem of Shai-Hulud attack on November 24th, 2025"

mulmboy — Sun, 30 Nov 2025 01:38:29 +0000

Yes and anyone who knows anything about software dev knows that the first thing you should do with an important repo is set up branch protections to disallow that, and require reviews etc. Basic CI/CD.

This incident reflects extremely poorly on PostHog because it demonstrates a lack of thought about security beyond the surface level. It tells us that any dev at PostHog can publish packages at any time, without review (because we know that the secret to do this is a plain GHA secret, readable from any GHA run, and those runs presumably trigger on any internal dev's PR). The most charitable interpretation is that they consciously justify this because it reduces friction, in which case I would say it demonstrates poor judgement, a bad balance.

A casual audit would have revealed this and suggested something like restricting the secret to a specific GHA environment and requiring reviews to push to that env. Or something like that.

New comment by mulmboy in "Post-mortem of Shai-Hulud attack on November 24th, 2025"

mulmboy — Sun, 30 Nov 2025 00:57:18 +0000

It does largely avoid the issue if you configure to allow only specific environments AND you require reviews before pushing/merging to branches in that environment.

https://docs.pypi.org/trusted-publishers/adding-a-publisher/

Publishing a malicious version would then require a full merge, which is a fairly high bar.

AWS allows similar

New comment by mulmboy in "Structured outputs on the Claude Developer Platform"

mulmboy — Sat, 15 Nov 2025 02:10:43 +0000

Along with a bunch of limitations that make it useless for anything but trivial use cases https://docs.claude.com/en/docs/build-with-claude/structured...

I've found structured output APIs to be a pain across various LLMs. Now I just ask for json output and pick it out between first/last curly brace. If validation fails just retry with details about why it was invalid. This works very reliably for complex schemas and works across all LLMs without having to think about limitations.

And then you can add complex pydantic validators (or whatever, I use pydantic) with super helpful error messages to be fed back into the model on retry. Powerful pattern
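
A minimal sketch of the extract, validate, retry loop described above, using only the standard library. Everything named here is an assumption for illustration: `call_model` stands in for whatever LLM client you use, and `validate_rate` is a hypothetical hand-rolled check where a pydantic model with custom validators would normally sit.

```python
import json
from typing import Callable


def extract_json(text: str) -> dict:
    """Parse the span between the first '{' and the last '}' in the reply."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])


def validate_rate(data: dict) -> dict:
    """Hypothetical schema check; a pydantic model could replace this."""
    if not isinstance(data.get("unit"), str):
        raise ValueError("'unit' must be a string such as '$/widget'")
    amount = data.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount < 0:
        raise ValueError("'amount' must be a non-negative number")
    return data


def ask_for_json(call_model: Callable[[str], str], prompt: str,
                 max_attempts: int = 3) -> dict:
    """Call the model, and on invalid output retry with the error details."""
    message = prompt
    for _ in range(max_attempts):
        reply = call_model(message)
        try:
            return validate_rate(extract_json(reply))
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            message = (
                f"{prompt}\n\nYour previous reply was invalid: {err}\n"
                "Reply with corrected JSON only."
            )
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")
```

The validator's error message is what gets fed back to the model, so it is worth making it specific about what was wrong and what a valid value looks like.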

New comment by mulmboy in "A postmortem of three recent issues"

mulmboy — Thu, 18 Sep 2025 01:45:14 +0000

Big missing piece - what was the impact of the degraded quality?

Was it 1% worse / unnoticeable? Did it become useless? The engineering is interesting but I'd like to see it tied to actual impact

New comment by mulmboy in "Ask HN: What are you actually using LLMs for in production?"

mulmboy — Sun, 29 Jun 2025 07:34:07 +0000

We operate a saas where a common step is inputting rates of widgets in $/widget, $/widget/day, $/1kwidgets, etc etc. These are incredibly tedious and error prone to enter. And usually the source of these rates is an invoice which presents them in ambiguous ways e.g. rows with "quantity" and "charge" from which you have to back calculate the rate. And these invoices are formatted in all different ways.
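
To make the back-calculation concrete, here is a rough sketch of turning an invoice row's "quantity" and "charge" into a rate. The function name, the `per` parameter, and the sample figures are made up for illustration and are not from our product.

```python
from decimal import Decimal


def back_calculate_rate(charge: str, quantity: str, per: int = 1) -> Decimal:
    """Return the rate in $ per `per` widgets from an invoice row.

    `charge` and `quantity` are decimal strings as they might appear on an
    invoice; Decimal avoids binary floating point rounding on money values.
    """
    qty = Decimal(quantity)
    if qty == 0:
        raise ValueError("quantity must be non-zero")
    return Decimal(charge) / qty * per


# Hypothetical row: 5000 widgets charged at $125.00 in total.
per_widget = back_calculate_rate("125.00", "5000")           # $/widget
per_thousand = back_calculate_rate("125.00", "5000", 1000)   # $/1kwidgets
```

The ambiguity in real invoices is mostly about which unit the row implies (per widget, per day, per thousand), which is the part the arithmetic alone cannot settle.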

We offer a feature to upload the invoice and we pull out all the rates for you. Uses LLMs under the hood. Fundamentally it's a "chatgpt wrapper" but there's a massive amount of work in tweaking the prompts based on evals, splitting things up into multiple calls, etc.

And it works great! Niche software, but for power users we're saving them tens of minutes of monotonous work per day and in all likelihood entering things more accurately. This complements the manual entry process with full ability to review the results. Accuracy is around 98-99 percent.

New comment by mulmboy in "Gemini CLI"

mulmboy — Thu, 26 Jun 2025 01:36:55 +0000

I gave it a shot just now with a fairly simple refactor. +19 lines, -9 lines, across two files. Totally ballsed it up. Defined one of the two variables it was meant to, referred to the non-implemented one. I told it "hey you forgot the second variable" and then it went and added it in twice. Added comments (after prompting it to) which were half-baked, ambiguous when read in context.

Never had anything like this with claude code.

I've used Gemini 2.5 Pro quite a lot and like most people I find it's very intelligent. I've bent over backwards to use Gemini 2.5 Pro in another piece of work because it's so good. I can only assume it's the gemini CLI itself that's using the model poorly. Keen to try again in a month or two and see if this poor first impression is just a teething issue.

I told it that it did a pretty poor job and asked it why it thinks that is, told it that I know it's pretty smart. It gave me a wall of text and I asked for the short summary

> My tools operate on raw text, not the code's structure, making my edits brittle and prone to error if the text patterns aren't perfect. I lack a persistent, holistic view of the code like an IDE provides, so I can lose track of changes during multi-step tasks. This led me to make simple mistakes like forgetting a calculation and duplicating code.

New comment by mulmboy in "Show HN: Index – New Open Source browser agent"

mulmboy — Thu, 24 Apr 2025 04:36:45 +0000

Nice.

Can run with `uvx --from lmnr-index --python 3.12 index run`

New comment by mulmboy in "Python's new t-strings"

mulmboy — Mon, 21 Apr 2025 08:15:23 +0000

Are there string prefixes for i18n stuff?

New comment by mulmboy in "What the heck is MCP? And why is everybody talking about it?"

mulmboy — Wed, 26 Mar 2025 09:59:23 +0000

What MCP servers do you use?

New comment by mulmboy in "Diagrams AI can, and cannot, generate"

mulmboy — Thu, 20 Mar 2025 08:11:13 +0000

I have found LLMs to be very good at the kind of code -> diagram task presented here. Fire up superwhisper[1] and stream-of-consciousness away about why you want the diagram, which bits are important, who the audience is, and so on. Then iterate a few times. Works brilliantly for even very complex things, including 5000 line CDK files.

It's disingenuous to conclude that AI is no good at diagramming after using an impotent prompt AND refusing to iterate with it. A human would do no better with the same instructions, LLMs aren't magic.

This is the same as my previous comment https://news.ycombinator.com/item?id=42524125

[1] https://superwhisper.com/