Hacker News: MattRogish

New comment by MattRogish in "GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2"

MattRogish — Sat, 20 Jun 2026 16:36:35 +0000

I'm not saying they are not trying - I'm saying we're inventing new problems faster than any Lab can:

1) Identify the gaps

2) Determine how to fix them

3) Implement a fix (especially if that fix is: identify and find experts)

4) And judge the result

How do they know [person] is an expert in [some field]? How do they find that person? How many experts are necessary to give the right information? How do we evaluate the results, especially if it's novel?

You can find a lot of people who disagree on many topics, and those turtles go all the way down.

I'm not in disagreement that your work will help reduce hallucinations and improve model performance! It will.

I predict (I hope I'm wrong!) that we're going to hit some asymptote that is not at 0% hallucinations (and I would even put a substantial nonzero probability that "overall" hallucination rate bottoms out at some minimum and then slowly grows because we just can't keep up with the new garbage we throw at it).

New comment by MattRogish in "GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2"

MattRogish — Sat, 20 Jun 2026 15:22:03 +0000

I'd have to imagine there are wildly diminishing marginal returns to additional SFT/post-training passes.

There are a bounded number of (useful) derivations/combinations of Duff's device.

If Frontier Labs wish to reduce hallucinations on factual things, they will have to hire people (or the data providers will need to) to do fundamental research above and beyond what is available in extant literature and the web. I.e., if the LLMs want to lower precision error, they need to go out and actually find more expertise. If the Wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?

Yes, they can digitize more books but that is untrustworthy data - if there were enough eyeballs on a particular work, it would be on the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?

I dunno, it doesn't seem to me "more data" is the magic bullet here. Yeah, it will "help" but we're already on the flat part of the S shaped curve.

My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!

New comment by MattRogish in "Expertise in the age of AI"

MattRogish — Fri, 29 May 2026 16:52:58 +0000

Yes, for fun I tried to make a Mahjong solver and NONE of the SOTA frontier models could understand what they were looking at to determine tile occlusion/geometry to build the DAG.

I had to spoon feed it an algorithm - here's how you determine if a tile is on top of another one, etc. etc.

Anything that involves, well, "3D space", they don't seem to do very well at all (which makes sense, of course).
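To give a flavor of the kind of rule I mean, here is a minimal sketch, not my actual solver. It assumes each tile is an axis-aligned rectangle on a half-tile grid with an integer layer, and it only covers the "is this tile on top of that one" rule (real Mahjong solitaire also has left/right blocking). The names `Tile`, `covers`, `build_dag`, and `uncovered` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    name: str
    x: int      # left edge, in half-tile units (assumed coordinate system)
    y: int      # top edge, in half-tile units
    z: int      # layer, 0 = bottom
    w: int = 2  # footprint width in half-tile units
    h: int = 2  # footprint height in half-tile units


def overlaps(a: Tile, b: Tile) -> bool:
    """True if the two footprints share any area (touching edges do not count)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def covers(top: Tile, bottom: Tile) -> bool:
    """True if `top` sits on the layer directly above `bottom` and overlaps it."""
    return top.z == bottom.z + 1 and overlaps(top, bottom)


def build_dag(tiles: list[Tile]) -> dict[str, list[str]]:
    """Map each tile name to the names of the tiles sitting directly on it."""
    return {t.name: sorted(u.name for u in tiles if covers(u, t)) for t in tiles}


def uncovered(tiles: list[Tile]) -> list[str]:
    """Names of tiles with nothing on top of them."""
    return [name for name, above in build_dag(tiles).items() if not above]


# Two tiles side by side on the bottom layer, one tile straddling both.
layout = [Tile("a", 0, 0, 0), Tile("b", 2, 0, 0), Tile("c", 1, 0, 1)]
dag = build_dag(layout)
```

The edges point from a covered tile to whatever has to be removed first, which is the dependency order a solver would walk.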

New comment by MattRogish in "Claude Opus 4.8"

MattRogish — Thu, 28 May 2026 17:54:55 +0000

Agreed, my vibes tell me 4.6 is a better coder than 4.7. 4.7 is a much better strategic thinker and maintains overall "better architecture" than 5.5. 5.5 is way better than either at coding, but more expensive. So I have 4.7 do the planning/architecture, 4.6 does the coding, then 5.5 critiques and fixes it.

New comment by MattRogish in "Disagreement among frontier LLMs on real-world fact-checks"

MattRogish — Thu, 28 May 2026 15:05:19 +0000

The other thing I suspect is that "Just give me True/False" cuts off a large amount of the search space a modern-day LLM uses to help it answer questions (you can see it in reasoning traces but the act of writing the explanation helps guide it toward a better answer and gives it better likelihood it backtracks on a bad decision).

If you let it spew out an explanation along with the answer, I'm curious if the accuracy will improve (I suspect it will).

New comment by MattRogish in "Tech CEOs are apparently suffering from AI psychosis"

MattRogish — Wed, 27 May 2026 18:16:43 +0000

Also, this is why investors and CEOs are so in love with "LLMs are the route to AGI!"

When some rich/powerful person says "I have to go to Davos, figure it out" their workers know so much context that no LLM is going to ever be able to incorporate, because it isn't written down and is idiosyncratic. (Really, though, the assistant will just say "you're going to Davos next week, the helicopter will pick you up at 3p on Friday" but you know..)

The rich person's assistant knows who else is on the corporate jet, and that X doesn't like Y, and so they should take a different plane. Or get a different accommodation. Oh, Person X doesn't like to fly on an empty stomach, so they should eat first, and that changes all sorts of other downstream implications. Oh, your best friend lives in this city, and I know you love to see them, so I'm going to send you a day or two early so you can meet up with them. etc. etc. etc.

The investor dream of "AGI" is modeled off of the army of employees that make investors'/CEOs'/etc. lives easier, and there is a nearly insurmountable gap between what LLMs can do, the context they can get, and the availability of all of that information. (To me, the magnitude of this investor <> fundamental reality gap is the entirety of the "bubble". I love AI coding, but it's never gonna do the things investors think it can, to justify the crazy valuations)

New comment by MattRogish in "The worst job interview I ever had"

MattRogish — Wed, 27 May 2026 01:49:01 +0000

Thank you; you’re right - context matters and now more than ever there are a ton of folks looking involuntarily. Grace is always needed, but now especially.

New comment by MattRogish in "Use boring languages with LLMs"

MattRogish — Wed, 27 May 2026 00:21:24 +0000

Yes! It reminds me of the tabs vs spaces argument. Yeah, you have your preference, but as long as everyone on your team uses the same convention it doesn’t matter

The Rails apps I’ve programmed with LLMs seem to work a LOT better than arbitrary Python or Ruby or JavaScript apps. I chalk that up to “there are a gazillion examples of OmniAuth in Rails, so the LLM can’t really stray off the path. It just works.”

That means I let the agent do things the way it wants to, not because I have a preference. So we’re using turbo and Hotwire and whatever it is it’s doing. And I’m using React for some other problems. Not because I know React, but because the LLM does.

In golf it is said to “let the club do the work”. Over control leads to disaster. Same with LLMs. Not saying let it do whatever but if there are widely baked in conventions you’ll be far better off letting those do the work.

New comment by MattRogish in "The worst job interview I ever had"

MattRogish — Wed, 27 May 2026 00:04:37 +0000

This… was a mistake on both you and the interviewer.

All interview questions - unless it’s impossible to twist your answer to fit this - are scoped to “… at work”. Nobody who asks “tell me about yourself” is asking you to talk about how you met your partner, how many cats you have, or that experience you had, that one time, at band camp. It would be redundant and awkward to literally say “… at work” at the end of every question. It’s totally 100% the intent of the interviewer.

This is interviewing 101 and unless this is your first ever interview I would find it odd, and stop you immediately and say “I meant, worst day at work”. They should’ve done that.

Unless they explicitly and unambiguously say “tell me about the day your mom and dog died in the same day when you found out you had cancer” they mean “tell me about your worst day _at work_.” And even if they ask about the time your dog died (they won’t), they are not asking you “tell me about the worst day you’ve had in your life”. They are asking “tell me about a time you experienced adversity and overcame it, exhibiting problem solving, resilience, and grit AT WORK. (Or - if you are operating in executive mode or you like to live dangerously - some non-work context that maps obviously and unambiguously to a work context).”

You failed the “knows how to interact with people in a professional setting” part of the interview. Or the “this person knows how to interview” part (which generally, but not always, correlates with experience and emotional maturity). Or the “read between the lines” part.

Yeah, inartfully asked questions - but also totally flubbed the answers.

Sorry, chalk it up to you had a bad interview or day or whatever, and never, ever forget the entire thing is scoped to “… at work”.

New comment by MattRogish in "Enough with the AI FOMO, go slow-mo, says Domo CDO"

MattRogish — Mon, 18 May 2026 14:45:12 +0000

"futurist for data platform" - whatever happened to that Shingy guy?

New comment by MattRogish in "Where Are the Vibecoded Photoshops?"

MattRogish — Mon, 18 May 2026 14:42:27 +0000

Yes, absolutely.

People keep confusing single player vs. multiplayer and forgetting "jobs to be done".

Photoshop files are the lingua franca of the design world. If you're working with designers, they'll likely give you a PSD (replace PSD with other examples in other domains, etc.)

Sure, I could vibe code "make a tool that lets me create multi-layered canvasses etc." but if I want to use it with anyone other than me, I have to make sure that it's binary compatible, bit for bit, with PSDs (or whatever's required to open it in Photoshop and maintain the layers).

It makes no sense to do this unless I'm targeting Photoshop specifically, and plenty of tools already do that.

Other than a way to burn money, I am completely unsurprised such a thing isn't widely available (I'm sure someone, somewhere is/has tried this).

For most people, single-player Photoshop is a means to an end. If I have sufficiently advanced AI I can just describe what I want and get the end result (a button, image, whatever). Or even just point it at an image and say "add gaussian blur here".

I would never try and vibe-code a new editor just to make images.

Image-blaster: Creates 3D environments, SFX, and meshes from a single image

MattRogish — Fri, 15 May 2026 15:42:37 +0000

Article URL: https://github.com/neilsonnn/image-blaster

Comments URL: https://news.ycombinator.com/item?id=48150069

Points: 197

# Comments: 40

New comment by MattRogish in "SQL: Incorrect by Construction"

MattRogish — Tue, 12 May 2026 22:39:24 +0000

“I intentionally wrote the example code the way a beginner might. More experienced users would probably reach for solutions like: updating and checking the balance in a single UPDATE; using [check constraints] to ensure an account balance can never be negative.”

Not the author, but presumably: `ALTER TABLE foo ADD CONSTRAINT BalanceCantBeNegative CHECK (balance >= 0)`
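For the curious, a rough sketch of both suggestions together, using Python's built-in `sqlite3` so it runs anywhere. Caveats: the `accounts` table and the `withdraw` / `unchecked_withdraw` helpers are made up for illustration, and SQLite has no `ALTER TABLE ... ADD CONSTRAINT`, so the constraint is declared at table creation here instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL,"
    " CONSTRAINT BalanceCantBeNegative CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
conn.commit()


def balance_of(conn, account_id):
    """Read the current balance for one account."""
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


def withdraw(conn, account_id, amount):
    """Check and update in a single UPDATE; return True if money was taken."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        return cur.rowcount == 1


def unchecked_withdraw(conn, account_id, amount):
    """Beginner-style update with no guard; the CHECK constraint is the backstop."""
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
```

The guarded `withdraw` simply matches zero rows when funds are short, while the unguarded version gets rejected by the database with `sqlite3.IntegrityError` instead of leaving a negative balance behind.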

Where Are All the Data Centers?

MattRogish — Tue, 12 May 2026 16:21:44 +0000

Article URL: https://www.wheresyoured.at/where-are-all-the-data-centers/

Comments URL: https://news.ycombinator.com/item?id=48110417

Points: 33

# Comments: 13

New comment by MattRogish in "Show HN: Agent-desktop – Native desktop automation CLI for AI agents"

MattRogish — Sun, 03 May 2026 03:04:15 +0000

The major limitation is that macOS apps do not have to use the API and so there will always need to be a fallback to something like screen scraping for controls that don’t use it.

Zoom Desktop app is a prime example of this. Many of the windows (join a meeting, settings etc) are normal macOS ones, and those use AX buttons, but many are poorly / weirdly labeled (if at all).

But once the Zoom meeting appears, that’s all (?) custom, and so the best you can do is whatever Zoom decided to offer. The dreaded “this meeting is being recorded” pop up is a custom control and so doesn’t have AX at all; I have automation that basically looks for an appearing window and if it has “OK” just blindly click it and hope for the best.

New comment by MattRogish in "HERMES.md in commit messages causes requests to route to extra usage billing"

MattRogish — Wed, 29 Apr 2026 20:16:54 +0000

Same.

Back in December the iOS app had a bug (https://status.claude.com/incidents/6rrnsb1y0kbn) in which buying a subscription thru the Apple App Store would not register with the backend, so you’d be charged but not receive the plan entitlement.

I discovered this because I wanted to upgrade from free plan to the regular plan. I was charged, but remained in the free tier. Thinking it was a temporary bug, I tried buying the max plan. Same result.

I tried cancelling the plan and restarting, but when I went to buy the regular plan again, I was forever tagged as an “Apple” user and so could only manage the billing plan on the iOS app. I tried one more time, same result.

I tried interacting with the support bot and although it agreed that there was a bug and that it should be fixed and I should get a refund, my account never was able to get unstuck nor refunded. I lodged a refund request with Apple, which was relatively quickly refunded. The Bot never did escalate to a human as promised.

Even though the bug was ostensibly fixed, my account (personal email) remains in permanent limbo, unable to upgrade from Free to anything else (I tried again recently and same result - paid but stuck on free plan). I had to create a new gmail just to pay for Anthropic / Claude.

New comment by MattRogish in "If America's so rich, how'd it get so sad?"

MattRogish — Thu, 23 Apr 2026 17:36:01 +0000

I find it interesting that all the trend lines start going negative around 2001. I wonder why that's not remarked upon? 9/11 itself was - obviously - epically terrible, but the impact of the event was recoverable.

Our response to it (Iraq war, forever wars, etc.), combined with the realization that the USA might be "the baddies" and that we've been lied to since forever, may well have been the thing that set all the dominos up.

COVID was the straw that broke the camel's back. Had we _not_ had the disastrous response to 9/11, I suspect we could've weathered COVID better (like the rest of the world has.)

Agents and the Era of Overproduction

MattRogish — Wed, 22 Apr 2026 17:43:03 +0000

Article URL: https://mattrogish.com/blog/2026/03/11/agents-and-the-era-of-overproduction/

Comments URL: https://news.ycombinator.com/item?id=47866803

Points: 2

# Comments: 0

LLMs and Agents: How do they Work?

MattRogish — Thu, 16 Apr 2026 21:27:40 +0000

Article URL: https://mattrogish.com/blog/2026/03/20/llms-agents-how-do-they-work/

Comments URL: https://news.ycombinator.com/item?id=47799714

Points: 3

# Comments: 1

New comment by MattRogish in "Do you even need a database?"

MattRogish — Wed, 15 Apr 2026 14:33:03 +0000

"Do not cite the deep magic to me witch, I was there when it was written"

If you want to do this for fun or for learning? Absolutely! I did my CS Masters thesis on SQL JOINS and tried building my own new JOIN indexing system (tl;dr: mine wasn't better). Learning is fun! Just don't recommend people build production systems like this.

Is this article trolling? It feels like trolling. I struggle to take an article seriously that conflates databases with database management systems.

A JSON file is a database. A CSV is a database. XML (shudder) is a database. PostgreSQL data files, I guess, are a database (and indexes and transaction logs).

They never actually posit a scenario in which rolling your own DBMS makes sense (the only pro is "hand rolled binary search is faster than SQLite"), and their "When you might need" a DBMS misses all the scenarios, the addition of which would cause the conclusion to round to "just start with SQLite".

It should basically be "if you have an entirely read-only system on a single server/container/whatever" then use JSON files. I won't even argue with that.

Nobody - and I mean nobody - is running a production system processing hundreds of thousands of requests per second off of a single JSON file. I mean, if req/sec is the only consideration, at that point just cache everything to flat HTML files! Node, TypeScript, and any code at all are unnecessary complexity.

PostgreSQL (MySQL, et al) is a DBMS (DataBase Management System). It might sound pedantic but the "MS" part is the thing you're building in code:

concurrency, access controls, backups, transactions: recovery, rollback, committing, etc., ability to do aggregations, joins, indexing, arbitrary queries, etc. etc.

These are not just "nice to have" in the vast, vast majority of projects.

"The cases where you'll outgrow flat files:"

Please add "you just want to get shit done and never have to build your own database management system". Which should be just about everybody.

If your app is meaningfully successful - and I mean more than just like a vibe-coded prototype - it will break. It will break in both spectacular ways that wake you up at 2AM and it will break in subtle ways that you won't know about until you realize something terrible has happened and you lost your data.

Didn't we just have this discussion like yesterday (https://ultrathink.art/blog/sqlite-in-production-lessons)?

It feels like we're throwing away 50 years of collective knowledge, skills, and experience because it "is faster" (and in the same breath note that nobody is gonna hit these req/sec.)

I know, it's really, really hard to type `yarn add sqlite3` and then `SELECT * FROM foo WHERE bar='baz'`. You're right, it's so much easier writing your own binary search and indexing logic and reordering files and query language.
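To make the point concrete, here is roughly what the "hard" way looks like. This is a sketch in Python with the standard-library `sqlite3` module standing in for the npm package; the `qty` column and the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, bar TEXT, qty INTEGER)")
# One line of indexing, instead of a hand-rolled binary search.
conn.execute("CREATE INDEX idx_foo_bar ON foo (bar)")
conn.executemany(
    "INSERT INTO foo (bar, qty) VALUES (?, ?)",
    [("baz", 1), ("qux", 2), ("baz", 3)],
)
conn.commit()

# The query from above, parameterized.
rows = conn.execute(
    "SELECT id, bar, qty FROM foo WHERE bar = ? ORDER BY id", ("baz",)
).fetchall()

# Querying "in a different way" is another statement, not more generated code.
totals = conn.execute(
    "SELECT bar, SUM(qty) FROM foo GROUP BY bar ORDER BY bar"
).fetchall()
```

Lookup, indexing, and aggregation come out of the box, and that is before touching transactions or concurrency.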

Not to mention now you need an AGENTS.md that says "We use our own home-grown database nonsense; if you want to query the JSON file in a different way, just generate more code." - NOT using standard components that LLMs know backwards-and-forwards? Gonna have a bad time. Enjoy burning your token budget on useless, counter-productive code.

This is madness.