<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: boole1854</title><link>https://news.ycombinator.com/user?id=boole1854</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 17 Apr 2026 04:29:58 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=boole1854" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by boole1854 in "The Connection Machine CM-1 "Feynman" T-shirt"]]></title><description><![CDATA[
<p>I ordered one of these a while back. Be warned that it <i>will</i> shrink if put in the dryer.</p>
]]></description><pubDate>Tue, 03 Feb 2026 02:02:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=46865400</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=46865400</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46865400</guid></item><item><title><![CDATA[New comment by boole1854 in "GPT-5.2"]]></title><description><![CDATA[
<p><a href="https://openai.com/index/hello-gpt-4o/" rel="nofollow">https://openai.com/index/hello-gpt-4o/</a><p>I see evaluations comparing against Claude, Gemini, and Llama there on the GPT-4o post.</p>
]]></description><pubDate>Thu, 11 Dec 2025 19:11:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=46235745</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=46235745</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46235745</guid></item><item><title><![CDATA[New comment by boole1854 in "Building more with GPT-5.1-Codex-Max"]]></title><description><![CDATA[
<p>Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI.<p>- As a general observation, Gemini is harder to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of rewriting my prompt in various ways before I could get it to <i>not code</i> but just answer the question.<p>- Subjectively, it seemed to me that the code that Gemini wrote was closer to the code that I, as a senior-level developer, would have written than what I have come to expect from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.<p>- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.<p>I did two concrete head-to-head comparisons where both models had the same code and the same prompt.<p>First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex's plan was more comprehensive, and Codex better understood how to integrate the new functionality naturally into the existing code.<p>Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. 
And, again, all three of us agreed that Codex's implementation was better.<p>Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.<p>Overall, Codex was the clear winner today. Hallucinations and ignored requirements are <i>big</i> problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.</p>
]]></description><pubDate>Wed, 19 Nov 2025 21:36:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=45985606</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=45985606</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45985606</guid></item><item><title><![CDATA[New comment by boole1854 in "Blue Prince (1989)"]]></title><description><![CDATA[
<p>Ok, so this post is a joke of some kind (there was no 1989 version of Blue Prince).<p>But it raises an interesting question: would it have been possible to implement that upside down floppy disk puzzle in a game?<p>1. Was it even possible to insert floppy disks upside down? I lived through the floppy disk era in my childhood, but I have to admit I can't remember if the drives would even let you do this.<p>2. If the answer to #1 is yes, would there be any way of programmatically detecting the floppy-disk-was-inserted-the-wrong-way state?</p>
]]></description><pubDate>Wed, 05 Nov 2025 14:44:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=45823381</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=45823381</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45823381</guid></item><item><title><![CDATA[New comment by boole1854 in "You are the scariest monster in the woods"]]></title><description><![CDATA[
<p>If anyone knows of a steelman version of the "AGI is not possible" argument, I would be curious to read it. I also have trouble understanding what goes into that point of view.</p>
]]></description><pubDate>Wed, 15 Oct 2025 14:45:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=45593436</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=45593436</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45593436</guid></item><item><title><![CDATA[New comment by boole1854 in "Grok Code Fast 1"]]></title><description><![CDATA[
<p>It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed.<p>I would have thought it an uncontroversial view among software engineers that token quality is much more important than token output speed.</p>
]]></description><pubDate>Fri, 29 Aug 2025 14:21:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=45064512</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=45064512</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45064512</guid></item><item><title><![CDATA[New comment by boole1854 in "O3 Turns Pro"]]></title><description><![CDATA[
<p>Here are my own anecdotes from using o3-pro recently.<p>My primary use case where I am willing to wait 10-20 minutes for an answer from the "big slow" model (o3-pro) is code review of large amounts of code. I have been comparing results on this task from the three models above.<p>Oddly, I see many cases where each model will surface issues that the other two miss. In previous months when running this test (e.g., Claude 3.7 Sonnet vs o1-pro vs earlier Gemini), that wasn't the case. Back then, the best model (o1-pro) would almost always find all the issues that the other models found. But now it seems they each have their own blind spots (although they are also all better than the previous generation of models).<p>With that said, I am seeing Claude Opus 4 (w/ extended thinking) be distinctly worse, missing problems that o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst of the three (despite sometimes noticing things the others do not).<p>Whether o3-pro or Gemini 2.5 Pro is better is less clear. o3-pro will report <i>more</i> issues, but it also has a tendency to confabulate problems. My workflow involves providing the model with a diff of all changes, plus the full contents of the files that were changed. o3-pro seems to have a tendency to imagine and report problems in files that were not provided to it. It also has an odd new failure mode, which is very consistent: it gets confused by the fact that I provide both the diff and the full file contents. It "sees" parts of the same code twice and will usually report that some code has accidentally been duplicated. Base o3 does this as well. 
None of the other models get confused in that way, and I also do not remember seeing that failure mode with o1-pro.<p>Nevertheless, o3-pro seems to find real issues that Gemini 2.5 Pro and Opus 4 miss more often than vice versa.<p>Back in the o1-pro days, it was fairly straightforward in my testing for this use case that o1-pro was simply better across the board. Now with o3-pro, particularly compared with Gemini 2.5 Pro, it's no longer clear whether the bonus of occasionally finding a problem that Gemini misses is worth the trouble of (1) waiting <i>way</i> longer for an answer and (2) sifting through more false positives.<p>My other common code-related use case is actually writing code. Here, Claude Code (with Opus 4) is amazing and has replaced all my other use of coding models, including Cursor. I now code almost exclusively by pair programming with Claude Code, allowing it to be the code writer while I oversee and review. The OpenAI competitor to Claude Code, called Codex CLI, feels distinctly undercooked. It has a recurring problem where it seems to "forget" that it is an agent that needs to go ahead and edit files, and it will instead start to offer me suggestions about how I can make the change. It also hallucinates running commands on a regular basis (e.g., I tell it to commit the changes we've made, and it outputs that it has done so, but it has not).<p>So where will I spend my $200 monthly model budget? Answer: Claude, for nearly unlimited use of Claude Code. For highly complex tasks, I switch to Gemini 2.5 Pro, which is still free in AI Studio. If I can wait 10+ minutes, I may hand it to o3-pro. But once my ChatGPT Pro subscription expires this month, I may either stop using o3-pro altogether, or I may occasionally use it as a second opinion by paying on-demand through the API.</p>
]]></description><pubDate>Tue, 17 Jun 2025 18:34:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=44302297</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=44302297</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44302297</guid></item><item><title><![CDATA[New comment by boole1854 in "The Gentle Singularity"]]></title><description><![CDATA[
<p>You can hover over places on the chart to get exact values. In January 1980, the index was at 37.124. In April 2025, it was at 125.880.<p>Then calculate cumulative inflation as the proportional change in the price level, like this:<p>(P_final - P_initial) / P_initial
= (125.880 - 37.124) / 37.124
= 2.39<p>This shows that the overall price level (the cumulative inflation embodied in the PCEPI) has increased by about 239% over the period, i.e., prices are roughly 3.39 times their January 1980 level.</p>
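The arithmetic above can be checked in a couple of lines (index values are the ones read off the FRED chart):

```python
# PCEPI index levels from the FRED chart referenced above
p_initial = 37.124   # January 1980
p_final = 125.880    # April 2025

# Cumulative inflation as the proportional change in the price level
cumulative = (p_final - p_initial) / p_initial
print(round(cumulative, 2))  # 2.39, i.e. about 239%
```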
]]></description><pubDate>Wed, 11 Jun 2025 17:32:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=44249857</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=44249857</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44249857</guid></item><item><title><![CDATA[New comment by boole1854 in "The Gentle Singularity"]]></title><description><![CDATA[
<p>It rose 2.75% per year (239% over 45 years).<p>Source with details: <a href="https://fred.stlouisfed.org/graph/?g=1JxIa" rel="nofollow">https://fred.stlouisfed.org/graph/?g=1JxIa</a></p>
]]></description><pubDate>Wed, 11 Jun 2025 15:37:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=44248783</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=44248783</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44248783</guid></item><item><title><![CDATA[New comment by boole1854 in "The Gentle Singularity"]]></title><description><![CDATA[
<p>Even without including employer health insurance costs, real wages are up 67% since 1980.<p>Source: <a href="https://fred.stlouisfed.org/graph/?g=1JxBn" rel="nofollow">https://fred.stlouisfed.org/graph/?g=1JxBn</a><p>Details: uses the "Wage and salary accruals per full-time-equivalent employee" time series, which is the broadest wage measure for FTE employees, and adjusts for inflation using the PCE price index, which is the most economically meaningful measure of "how much did prices change for consumers" (and is the inflation index that the Fed targets)</p>
]]></description><pubDate>Wed, 11 Jun 2025 14:11:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=44247830</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=44247830</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44247830</guid></item><item><title><![CDATA[New comment by boole1854 in "OpenAI o3-pro"]]></title><description><![CDATA[
<p>I also don't have that tweet saved, but I do remember it.</p>
]]></description><pubDate>Tue, 10 Jun 2025 21:13:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=44241499</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=44241499</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44241499</guid></item><item><title><![CDATA[New comment by boole1854 in "OpenAI o3-pro"]]></title><description><![CDATA[
<p>No, this doesn't seem to be correct, although confusion regarding model names is understandable.<p>o4-mini-high is the label on chatgpt.com for what in the API is called o4-mini with reasoning={"effort": "high"}. Whereas o4-mini on chatgpt.com is the same thing as reasoning={"effort": "medium"} in the API.<p>o3 can also be run via the API with reasoning={"effort": "high"}.<p>o3-pro is <i>different</i> than o3 with high reasoning. It has a separate endpoint, and it runs for much longer.<p>See <a href="https://platform.openai.com/docs/guides/reasoning?api-mode=responses" rel="nofollow">https://platform.openai.com/docs/guides/reasoning?api-mode=r...</a></p>
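To make the naming concrete, the effort level is just a request parameter in the API. A minimal sketch of the two payloads (the prompt text is illustrative; nothing here makes a network call):

```python
# Request parameters for the OpenAI Responses API (sketch only).
# Passing either dict to client.responses.create(**params) selects the effort level.
o4_mini_high = {
    "model": "o4-mini",
    "reasoning": {"effort": "high"},    # what chatgpt.com labels "o4-mini-high"
    "input": "Explain the halting problem.",
}
o4_mini_default = {
    "model": "o4-mini",
    "reasoning": {"effort": "medium"},  # plain "o4-mini" on chatgpt.com
    "input": "Explain the halting problem.",
}
```

Note that both payloads use the same underlying model; only the `reasoning.effort` value differs, which is why "o4-mini-high" is not a separate model the way o3-pro is.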
]]></description><pubDate>Tue, 10 Jun 2025 21:10:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=44241469</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=44241469</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44241469</guid></item><item><title><![CDATA[New comment by boole1854 in "Google AI Ultra"]]></title><description><![CDATA[
<p>They are working on it: <a href="https://jules.google/" rel="nofollow">https://jules.google/</a></p>
]]></description><pubDate>Tue, 20 May 2025 19:15:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=44044876</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=44044876</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44044876</guid></item><item><title><![CDATA[New comment by boole1854 in "ChatGPT Saved My Life (no, seriously, I'm writing this from the ER)"]]></title><description><![CDATA[
<p>According to the story, the ChatGPT conversation that led to the ER visit happened on a Sunday. In my part of the world, all local pharmacies are closed on Sundays, so going to a pharmacy and showing the results would not have been an option.</p>
]]></description><pubDate>Tue, 25 Feb 2025 15:17:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43172967</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=43172967</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43172967</guid></item><item><title><![CDATA[New comment by boole1854 in "Launch HN: A0.dev (YC W25) – React Native App Generator"]]></title><description><![CDATA[
<p>> which involves alot of stuff outside of code-gen that we're working on<p>Could you elaborate on what extra stuff you are working on that will be a value-add over standalone Cursor?</p>
]]></description><pubDate>Tue, 11 Feb 2025 21:53:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=43018838</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=43018838</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43018838</guid></item><item><title><![CDATA[New comment by boole1854 in "Delaware faces exodus of tech companies"]]></title><description><![CDATA[
<p>That article states:<p>> Although some scholars and practitioners have long argued that officers should or do owe a duty of oversight, and as a practical matter many officers likely assume that they have such an obligation, McDonald’s marks the first time this duty was explicitly acknowledged by a Delaware court.<p>To me, this seems to imply that the ruling incorporated an already-existing practice into law, which would suggest it isn't a big shift.</p>
]]></description><pubDate>Sat, 01 Feb 2025 20:20:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=42901851</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=42901851</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42901851</guid></item><item><title><![CDATA[New comment by boole1854 in "DeepSeek-R1"]]></title><description><![CDATA[
<p>In their paper, they explain that "in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be
used to generate feedback based on predefined test cases."<p>Basically, they have an external source-of-truth that verifies whether the model's answers are correct or not.</p>
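For the math case, that kind of rule-based check can be sketched in a few lines (the \boxed{} convention matches the paper's description; the helper name and sample strings are illustrative):

```python
import re

def verify_boxed_answer(model_output: str, expected: str) -> bool:
    """Rule-based reward check: extract the final \\boxed{...} answer from the
    model's output and compare it to a known-correct result. No learned reward
    model is involved; the ground truth is the external source of correctness."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    if not matches:
        return False  # no answer in the required format counts as incorrect
    return matches[-1].strip() == expected.strip()

print(verify_boxed_answer(r"... so the answer is \boxed{42}", "42"))  # True
print(verify_boxed_answer("The answer is 42", "42"))                  # False
```

The LeetCode case works the same way in spirit, except the "verifier" is a compiler plus predefined test cases rather than a string comparison.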
]]></description><pubDate>Tue, 21 Jan 2025 18:08:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=42783343</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=42783343</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42783343</guid></item><item><title><![CDATA[New comment by boole1854 in "No Calls"]]></title><description><![CDATA[
<p>Ah ha! Makes sense. Thank you.</p>
]]></description><pubDate>Thu, 16 Jan 2025 15:32:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=42726633</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=42726633</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42726633</guid></item><item><title><![CDATA[New comment by boole1854 in "No Calls"]]></title><description><![CDATA[
<p>The post is about how they have a no-calls policy, even for enterprise sales. The author brags, "I nuked the 'book a call' button from my pricing page".<p>...But their pricing page actually has a big "Schedule a Call" button when you drag the pricing slider into enterprise territory: <a href="https://keygen.sh/pricing/" rel="nofollow">https://keygen.sh/pricing/</a><p>What am I missing?</p>
]]></description><pubDate>Thu, 16 Jan 2025 15:28:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=42726556</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=42726556</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42726556</guid></item><item><title><![CDATA[New comment by boole1854 in "Narcolepsy is weird but I didn't notice"]]></title><description><![CDATA[
<p>Oddly, the author compares their cataplexy experience to sleep paralysis and says they are <i>not</i> similar because in sleep paralysis "you can't feel" whereas in cataplexy "you can feel all your limbs and it feels like they're all ready to obey you".<p>I have experienced sleep paralysis several times, and I have always apparently retained the ability to feel my body/limbs, as I think most people do. It would seem that the author's experience of sleep paralysis is different from most people's.</p>
]]></description><pubDate>Sat, 11 Jan 2025 20:03:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=42668537</link><dc:creator>boole1854</dc:creator><comments>https://news.ycombinator.com/item?id=42668537</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42668537</guid></item></channel></rss>