Hacker News: dudeinhawaii

New comment by dudeinhawaii in "Apple accidentally left Claude.md files Apple Support app"

dudeinhawaii — Fri, 01 May 2026 16:35:42 +0000

I'm not an Alexa user myself but I have watched my wife interact with it for around 5years now.

The new Alexa powered by an LLM is objectively better that previous Alexa in a few ways. This much was apparently from day one and has only gotten smoother.

1. It can reliably execute direct or vague-ish commands "play X movie in app Y" or "play x show" and can infer X movie is only available in app Z so use that.

2. Speech recognition seems better (less instances of 5x round trips)

3. Conversational with multi-turn --- my wife can have a back and forth clarifying a topic.

4. Seems to understand intent a bit better. (user asked A so they are probably thinking about B)

Those may seem small but they were a tremendous source of annoyance for her -- and thus for me -- "Alexa is not listening, do something!"

New comment by dudeinhawaii in "Nvidia NemoClaw"

dudeinhawaii — Thu, 19 Mar 2026 19:05:23 +0000

This has been my approach and of course what you lose is the "random and surprising" (maybe good) but also the "evolutionary" aspect.

So, if you write strong tooling (even with AI) around the connection points - you can create blackboxes tht are secure and only allow the agent to perform certain actions. The blackbox email service calls out to a secure store (for keys/etc) and accesses your emails in a read-only way, etc (for example).

Everything is then much more intentional. You're writing tools for your agent but you also can't do fun or evolutionary things which is most of the fun behind OpenClaw. That and many people seem to genuinely see them as 'pets' or 'strange Ai friends' but that's a different problem and it's due to the interesting methods OpenClaw uses to give the illusion of intelligence, always on, and memories. These are all well know (variations on RAG, markdowns, etc)

New comment by dudeinhawaii in "Nvidia NemoClaw"

dudeinhawaii — Thu, 19 Mar 2026 18:32:22 +0000

Why would I want non-deterministic behavior here though?

If I want to max uptime, I write a tool to track/monitor. Then write a small agent (non-ai) that monitors those outputs and performs your remediation actions (reset something, clear something, etc, depends on service).

Do I want Claude re-writing and breaking subscription flow because it detected an issue? No.

New comment by dudeinhawaii in "Digg is gone again"

dudeinhawaii — Sat, 14 Mar 2026 20:13:36 +0000

It's not, hence the "don't post AI slop as your comment" posting a few days back that had 1000+ comments.

Currently an unsolved problem - just stealthier on some platforms than others. Trigger the right topic on HN and the bots come out in-force together with humans sloppily copy/pasting LLM content.

New comment by dudeinhawaii in "Elon Musk pushes out more xAI founders as AI coding effort falters"

dudeinhawaii — Sat, 14 Mar 2026 16:31:36 +0000

I don't see what you're seeing, in any dimension. But here's a fair take.

I wrote several very specialized benchmarks that I've used over time, that surface "model personalities" and their effects on decision making (as well as measuring the outcomes).

Grok 4.1 Fast Reasoning is/was a solid model. It's also fundamentally different from the pack.

I call it a smart, aggressive, Claude Haiku. That is, its "thinking" is quite chaotic and sometimes short-hand and its output can be as well (relate to other models).

Its aggressiveness can allow it to punch above in competitive scenarios that I have in some of my benchmarks. Its write-ups and documentation are often replete with "dominate", "relentless" and a general high energy that skirts the limits of 'cringe bro'. That said, it has generally performed just beneath the SOTA (at the time: GPT-5.2, Gemini-3-Flash, Claude Opus 4.5). Angry Sonnet perhaps.

The latest release feels quite similar but also underperforms the same older crowd (so far) so it hasn't quite made the leap that Claude's 4.6 and GPT's 5.3/5.4 series made. It's also now priced the same as its peers but does not deliver SOTA capabilities (at least not consistently in my opinion).

New comment by dudeinhawaii in "Debian decides not to decide on AI-generated contributions"

dudeinhawaii — Tue, 10 Mar 2026 17:00:03 +0000

I don't see why we can't have AI powered reviews as a verification of truth and trust score modifier. Let me explain.

1. You layout policy stating that all code, especially AI code has to be written to a high quality level and have been reviewed for issues prior to submission.

2. Given that even the fastest AI models do a great job of code reviews, you setup an agent using Codex-Spark or Sonnnet, etc to scan submissions for a few different dimensions (maintainability, security, etc).

3. If a submission comes through that fails review, that's a strong indication that the submitter hasn't put even the lowest effort into reviewing their own code. Especially since most AI models will flag similar issues. Knock their trust score down and supply feedback.

3a. If the submitter never acts on the feedback - close the submission and knock the trust score down even more.

3b. If the submitter acts on the feedback - boost trust score slightly. We now have a self-reinforcing loop that pushes thoughtful submitters to screen their own code. (Or ai models to iterate and improve their own code)

4. Submission passes and trust score of submitter meets some minimal threshold. Queued for human review pending prioritization.

I haven't put much thought into this but it seems like you could design a system such that "clout chasing" or "bot submissions" would be forced to either deliver something useful or give up _and_ lose enough trust score that you can safely shadowban them.

New comment by dudeinhawaii in "GPT-5.4"

dudeinhawaii — Thu, 05 Mar 2026 23:59:37 +0000

This is marketing. The same way Apple cares about your privacy so long as they can wall you in their garden.

Not a value judgment, just saying that the CEO of a company making a statement isn't worth anything. See Googles "don't be evil" ethos that lasted as long as it was corporately useful.

If Anthropic can lure engineers with virtue signaling, good on them. They were also the same ones to say "don't accelerate" and "who would give these models access to the internet", etc etc.

"Our models will take everyone's jobs tomorrow and they're so dangerous they shouldn't be exported". Again all investor speak.

New comment by dudeinhawaii in "Relicensing with AI-Assisted Rewrite"

dudeinhawaii — Thu, 05 Mar 2026 13:54:48 +0000

It usually refers to situations without access to the source code.

I've always taken "clean room" to be the kind of manufacturing clean room (sealed/etc). You're given a device and told "make our version". You're allowed to look, poke, etc but you don't get the detailed plans/schematics/etc.

In software, you get the app or API and you can choose how to re-implement.

In open source, yes, it seems like a silly thing and hard to prove.

New comment by dudeinhawaii in "Nobody gets promoted for simplicity"

dudeinhawaii — Wed, 04 Mar 2026 15:49:16 +0000

True, but I think the implication (as I read it) is that AI may be providing more complex solutions than were needed for the problem and perhaps more complex than a human engineer would have provided.

New comment by dudeinhawaii in "An autopsy of AI-generated 3D slop"

dudeinhawaii — Wed, 25 Feb 2026 21:35:54 +0000

Somehow this article explains perfectly, visually, how AI generated code differs from human generated code as well.

You see the exact same patterns. AI uses more code to accomplish the same thing, less efficiently.

I'm not even an AI hater. It's just a fact.

The human then has to go through and cleanup that code if you want to deliver a high-quality product.

Similarly, you can slap that AI generated 3D model right into your game engine, with its terrible topology and have it perform "ok". As you add more of these terrible models, you end up with crap performance but who cares, you delivered the game on-time right? A human can then go and slave away fixing the terrible topology and textures and take longer than they would have if the object had been modeled correctly to begin with.

The comparison of edge-loops to "high quality code" is also one that I mentally draw. High quality code can be a joy to extend and build upon.

Low quality code is like the dense mesh pictured. You have a million cross interactions and side-effects. Half the time it's easier to gut the whole thing and build a better system.

Again, I use AI models daily but AI for tools is different from AI for large products. The large products will demand the bulk of your time constantly refactoring and cleaning the code (with AI as well) -- such that you lose nearly all of the perceived speed enhancements.

That is, if you care about a high quality codebase and product...

New comment by dudeinhawaii in "Gemini 3.1 Pro"

dudeinhawaii — Sat, 21 Feb 2026 22:40:27 +0000

After 2 days of giving it a go, I find that Gemini CLI is still considerably worse than both Codex and Claude Code.

The model itself also has strange behaviors that seem like it gets randomly replaced with Gemini-3-Flash or something else. I'll explain.

Once agentic coding was a bust, I gave it a run as a daily driver for AI assistant. It performed fairly well but then began behaving strangely. It would lose context mid conversation. For instance, I said "In san francisco I'm looking for XYZ". Two turns later I'm asking about food and it gives me suggestions all over the world.

Another time, I asked it about the likelihood of the pending east coast winter storm of affecting my flight. I gave it all the details (flight, stops, time, cities).

Both GPT-5.2 and Claude crunched and came back with high quality estimations and rationale. Gemini 3.1 Pro... 5 times, returned a weather forecast widget for either the layover or final destination. This was on "Pro" reasoning, the highest exposed on the Gemini App/WebApp. I've always suspected Google swaps out models randomly so this.. wasn't surprising.

I then asked Gemini 3.1 Pro via the API and it returned a response similar to Claude and GPT-5.2 -- carefully considering all factors.

This tells me that a Google AI Ultra subscription gives me a sub-par coding agent which often swaps in Flash models, a sub-par web/app AI experience that also isn't using the advertised SOTA models, and a bunch of preview apps for video gen, audio gen (crashed every time I attempted), and world gen (Genie was interesting but a toy).

This will be a quick cancel as soon as the intro rate is done.

It's like Google doesn't ACTUALLY want to be the leader in AI or serve people their best models. They want to generate hype around benchmarks and then nerf the model and go silent.

Gemini 3 Pro Preview went from exceptional in the first month to mediocre and then out of my rotation within a month.

New comment by dudeinhawaii in "Nvidia and OpenAI abandon unfinished $100B deal in favour of $30B investment"

dudeinhawaii — Fri, 20 Feb 2026 16:49:59 +0000

My take has been...

Gemini 3.1 (and Gemini 3) are a lot smarter than Claude Opus 4.6

But...

Gemini 3 series are both mediocre at best in agentic coding.

Single shot question(s) about a code problem vs "build this feature autonomously".

Gemini's CLI harness is just not very good and Gemini's approach to agentic coding leaves a lot to be desired. It doesn't perform the double-checking that Codex does, it's slower than Claude, it runs off and does things without asking and not clearly explaining why.

New comment by dudeinhawaii in "Nvidia and OpenAI abandon unfinished $100B deal in favour of $30B investment"

dudeinhawaii — Fri, 20 Feb 2026 16:39:25 +0000

My experience is that on large codebases that get tricky problems, you eventually get an answer quicker if you can send _all_ the context to a relevant large model to crunch on it for a long period of time.

Last night I was happily coding away with Codex after writing off Gemini CLI yet again due to weirdness in the CLI tooling.

I ran into a very tedious problem that all of the agents failed to diagnose and were confidently patching random things as solutions back and forth (Claude Code - Opus 4.6, GPT-5.3 Codex, Gemini 3 Pro CLI).

I took a step back, used python script to extract all of the relevant codebase, and popped open the browser and had Gemini-3-Pro set to Pro (highest) reasoning, and GPT-5.2 Pro crunch on it.

They took a good while thinking.

But, they narrowed the problem down to a complex interaction between texture origins, polygon rotations, and a mirroring implementation that was causing issues for one single "player model" running through a scene and not every other model in the scene. You'd think the "spot the difference" would make the problem easier. It did not.

I then took Gemini's proposal and passed it to GPT-5.3-Codex to implement. It actually pushed back and said "I want to do some research because I think there's a better code solution to this". Wait a bit. It solved the problem in the most elegant and compatible way possible.

So, that's a long winded way to say that there _is_ a use for a very smart model that only works in the browser or via API tooling, so long as it has a large context and can think for ages.

New comment by dudeinhawaii in "Anthropic officially bans using subscription auth for third party use"

dudeinhawaii — Thu, 19 Feb 2026 06:41:17 +0000

I just ran some numbers and it works out if you're a prolific user.

Over 9 days I would have spent roughly $63 dollars on Codex with 11.5M input tokens plus 141M cached input tokens and 1.3M output tokens.

That roughly mirrors the $100-200/wk in API spending that drove me to the subscription.

  | Category | Tokens | Rate (/1M) | Estimated Cost |
  |---|---:|---:|---:|
  | Input (uncached) | 11,568,331 | $1.75 | $20.24 |
  | Cached input | 141,566,720 | $0.175 | $24.77 |
  | Output | 1,301,078 | $14.00 | $18.22 |
  | Total | 154,436,129 | — | $63.23 |

BUT... like a typical gym user. This is a 30/d window and I only used it for 9 days, $63 worth. OpenAI kept the other $137.

It makes sense though for heavy use.

New comment by dudeinhawaii in "Anthropic officially bans using subscription auth for third party use"

dudeinhawaii — Thu, 19 Feb 2026 06:19:12 +0000

I think you've stated this in reverse.

API limits are infinite but you'd blow through $20 of usage in a maybe 1 hours or less of intense Opus use.

The subscription at $20/mo (or $200) allows for vastly more queries than $20 would buy you via API but you are constrained by hourly/weekly limits.

The $20/mo sub user will take a lot longer to complete a high token count task (due to start/stop) BUT they will cap their costs.

New comment by dudeinhawaii in "Is Show HN dead? No, but it's drowning"

dudeinhawaii — Tue, 17 Feb 2026 21:56:46 +0000

This is something that I was thinking about today. We're at the point where anyone can vibe code a product that "appears" to work. There's going to be a glut of garbage.

It used to be that getting to that point required a lot of effort. So, in producing something large, there were quality indicators, and you could calibrate your expectations based on this.

Nowadays, you can get the large thing done - meanwhile the internal codebase is a mess and held together with AI duct-tape.

In the past, this codebase wouldn't scale, the devs would quit, the project would stall, and most of the time the things written poorly would die off. Not every time, but most of the time -- or at least until someone wrote the thing better/faster/more efficiently.

How can you differentiate between 10 identical products, 9 of which were vibecoded, and 1 of which wasn't. The one which wasn't might actually recover your backups when it fails. The other 9, whoops, never tested that codepath. Customers won't know until the edge cases happen.

It's the app store affect but magnified and applied to everything. Search for a product, find 200 near-identical apps, all somehow "official" -- 90% of which are scams or low-effort trash.

New comment by dudeinhawaii in "Is Show HN dead? No, but it's drowning"

dudeinhawaii — Tue, 17 Feb 2026 21:40:39 +0000

The other element here is that the vibecoder hasn't done the interesting thing, they've pulled other people's interesting things.

Let's see, how to say this less inflamatory..

(just did this) I sit here in a hotel and I wondered if I could do some fancy video processing on the video feed from my laptop to turn it into a wildlife cam to capture the birds who keep flying by.

I ask Codex to whip something up. I iterate a few times, I ask why processing is slow, it suggests a DNN. I tell it to go ahead and add GPU support while its at it.

In a short period of time, I have an app that is processing video, doing all of the detection, applying the correct models, and works.

It's impressive _to me_ but it's not lost on me that all of the hard parts were done by someone else. Someone wrote the video library, someone wrote the easy python video parsers, someone trained and supplied the neural networks, someone did the hard work of writing a CUDA/GPU support library that 'just works'.

I get to slap this all together.

In some ways, that's the essence of software engineering. Building on the infinite layers of abstractions built by others.

In other ways, it doesn't feel earned. It feels hollow in some way and demoing or sharing that code feels equally hollow. "Look at this thing that I had AI copy-paste together!"

New comment by dudeinhawaii in ""Token anxiety", a slot machine by any other name"

dudeinhawaii — Tue, 17 Feb 2026 18:56:50 +0000

If you visualize it as AI Agents throwing a rope to wrangle a problem, and then visualize a dozen of these agents throwing their ropes around a room, and at each other -- very quickly you'll also visualize the mess of code that a collections of agents creates without oversight. It might even run, some might say that's the only true point but... at what cost in code complexity, performance waste, cascading bugs, etc.

Is it possible? Yes, I've had success with having a model output a 100 step plan that tried to deconflict among multiple agents. Without re-creating 'Gas town', I could not get the agents to operate without stepping on toes. With _me_ as the grand coordinator, I was able to execute and replicate a SaaS product (at a surface level) in about 24hrs. Output was around 100k lines of code (without counting css/js).

Who can prove that it works correctly though? An AI enthusiasts will say "as long as you've got test coverage blah blah blah". Those who have worked large scale products know that tests passing is basically "bare minimum". So you smoke test it, hope you've got all the paths, and toss it up and try to collect money from people? I don't know. If _this_ is the future, this will collapse under the weight of garbage code, security and privacy breaches, and who knows what else.

New comment by dudeinhawaii in "I spent two days gigging at RentAHuman and didn't make a single cent"

dudeinhawaii — Fri, 13 Feb 2026 18:15:55 +0000

Right but, do you or the founder have actual responses to the story posted? It seemed to give RentAhuman the benefit of the doubt every step of the way. The site doesn't work as advertised, appears to be begging for hype, got a reporter to check it out, and it didn't work.

That's life. Can't win them all. Lesson here is the product wasn't ready for primetime and you were given a massive freebie for free press both via Wired _and_ this crosspost.

Better strategy is to actually layout what works, what's the roadmap so anyone partially interested might see it when they stumble into this post.

Or jot it down as a failed experiment and move on.

New comment by dudeinhawaii in "Apple patches decade-old iOS zero-day, possibly exploited by commercial spyware"

dudeinhawaii — Thu, 12 Feb 2026 15:58:12 +0000

So the exploiters have deprecated that version of spyware and moved on I see. This has been the case every other time. The state actors realize that there's too many fingers in the pie (every other nation has caught on), the exploit is leaked and patched. Meanwhile, all actors have moved on to something even better.

Remember when Apple touted the security platform all-up and a short-time later we learned that an adversary could SMS you and pwn your phone without so much as a link to be clicked.

KSIMET: 2020, FORCEDENTRY: 2021, PWNYOURHOME, FINDMYPWN: 2022, BLASTPASS: 2023

Each time NSO had the next chain ready prior to patch.

I recall working at a lab a decade ago where we were touting full end-to-end exploit chain on the same day that the target product was announcing full end-to-end encryption -- that we could bypass with a click.

It's worth doing (Apple patching) but a reminder that you are never safe from a determined adversary.