<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: hansonw</title><link>https://news.ycombinator.com/user?id=hansonw</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 04 May 2026 16:23:26 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=hansonw" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by hansonw in "GPT-5.4"]]></title><description><![CDATA[
<p>The skill source is here: <a href="https://github.com/openai/skills/blob/main/skills/.curated/playwright-interactive/SKILL.md" rel="nofollow">https://github.com/openai/skills/blob/main/skills/.curated/p...</a><p>$skill-installer playwright-interactive in Codex! The model writes normal JS Playwright code in a Node REPL.</p>
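<p>For a concrete picture, here is a minimal sketch of the kind of code it writes in that REPL (the URL and selector below are made up for illustration):</p>
<pre><code>// Hypothetical example of JS Playwright code typed into a Node REPL
// (top-level await works in the REPL; URL and selector are placeholders).
const { chromium } = require('playwright');
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
await page.click('text=More information');  // Playwright text selector
console.log(await page.title());
await browser.close();
</code></pre>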
]]></description><pubDate>Fri, 06 Mar 2026 03:59:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47270692</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=47270692</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47270692</guid></item><item><title><![CDATA[New comment by hansonw in "Building more with GPT-5.1-Codex-Max"]]></title><description><![CDATA[
<p>Rest assured that we are better at training models than naming them ;D<p>- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0<p>- Natively trained to work for many hours across multiple context windows via compaction<p>- 30% more token-efficient at the same reasoning level across many tasks<p>Let us know what you think!</p>
]]></description><pubDate>Wed, 19 Nov 2025 18:19:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=45982917</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=45982917</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45982917</guid></item><item><title><![CDATA[Building more with GPT-5.1-Codex-Max]]></title><description><![CDATA[
<p>Article URL: <a href="https://openai.com/index/gpt-5-1-codex-max/">https://openai.com/index/gpt-5-1-codex-max/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45982649">https://news.ycombinator.com/item?id=45982649</a></p>
<p>Points: 483</p>
<p># Comments: 319</p>
]]></description><pubDate>Wed, 19 Nov 2025 18:01:59 +0000</pubDate><link>https://openai.com/index/gpt-5-1-codex-max/</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=45982649</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45982649</guid></item><item><title><![CDATA[New comment by hansonw in "A Research Preview of Codex"]]></title><description><![CDATA[
<p>More about that here! <a href="https://platform.openai.com/docs/codex#advanced-configuration" rel="nofollow">https://platform.openai.com/docs/codex#advanced-configuratio...</a></p>
]]></description><pubDate>Fri, 16 May 2025 19:08:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=44008833</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=44008833</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44008833</guid></item><item><title><![CDATA[New comment by hansonw in "An embarrassingly simple approach to recover unlearned knowledge for LLMs"]]></title><description><![CDATA[
<p>The ELI5 of the paper is that most "unlearning" methods can be regarded as adding some delta `w` to the parameters of the network, but most of `w` just gets "rounded away" during quantization (i.e. `quantize(X+w) ~= quantize(X)`). Pretty clever idea, as a lot of the cited methods explicitly optimize/regularize to keep `w` small to avoid degrading evaluation accuracy.<p>To your point, it does call into question whether these methods can actually be considered true "unlearning" from an information-theoretic perspective (or whether it is the equivalent of, e.g., just putting `if (false)` around the still-latent knowledge).</p>
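<p>A toy numerical sketch of the effect (made-up weights and a crude uniform quantizer, just to show the rounding):</p>
<pre><code>// Toy demo: a small "unlearning" delta w disappears under coarse quantization.
const scale = 0.1;                             // hypothetical quantization step
const quantize = (x) => Math.round(x / scale) * scale;

const X  = [0.83, -0.42, 0.17];                // original weights (made up)
const w  = [0.004, -0.003, 0.002];             // small "unlearning" delta
const Xw = X.map((x, i) => x + w[i]);

console.log(X.map(quantize));   // [ 0.8, -0.4, 0.2 ]
console.log(Xw.map(quantize));  // [ 0.8, -0.4, 0.2 ]  -- quantize(X+w) == quantize(X)
</code></pre>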
]]></description><pubDate>Mon, 04 Nov 2024 07:42:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=42039413</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=42039413</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42039413</guid></item><item><title><![CDATA[New comment by hansonw in "Fine-tuning now available for GPT-4o"]]></title><description><![CDATA[
<p>It looks like they didn't want to make a public submission in order to avoid disclosing the model internals: <a href="https://cosine.sh/blog/genie-technical-report#:~:text=SWE%2DBench%20has,SWE%2DBench%20tasks">https://cosine.sh/blog/genie-technical-report#:~:text=SWE%2D...</a>.</p>
]]></description><pubDate>Tue, 20 Aug 2024 19:14:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=41303025</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=41303025</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41303025</guid></item><item><title><![CDATA[New comment by hansonw in "Prompt Caching"]]></title><description><![CDATA[
<p>It’s probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K across ~100 attention layers (roughly the scale of Llama 3.1 405B), that’s already 1M 16-bit floats per token = 2MB. They have likely needed to implement some kind of KV compression (like DeepSeek) to make this even feasible.</p>
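<p>The back-of-envelope math in code, with every number a rough assumption rather than a known architecture:</p>
<pre><code>// Rough KV-cache sizing per token (all figures are guesses for illustration).
const kvWidthPerLayer = 10_000;  // combined K+V values per token, per layer
const layers = 100;              // attention layers (~Llama 3.1 405B scale)
const bytesPerValue = 2;         // fp16/bf16

const bytesPerToken = kvWidthPerLayer * layers * bytesPerValue;
console.log(bytesPerToken / 1e6);        // 2   -- MB per token
console.log(bytesPerToken * 1e5 / 1e9);  // 200 -- GB for a 100k-token prompt
</code></pre>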
]]></description><pubDate>Sun, 18 Aug 2024 23:25:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=41286396</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=41286396</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41286396</guid></item><item><title><![CDATA[New comment by hansonw in "AI Search: The Bitter-Er Lesson"]]></title><description><![CDATA[
<p><a href="https://news.ycombinator.com/item?id=40675577">https://news.ycombinator.com/item?id=40675577</a></p>
]]></description><pubDate>Sat, 15 Jun 2024 05:10:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=40687651</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=40687651</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40687651</guid></item><item><title><![CDATA[New comment by hansonw in "What can LLMs never do?"]]></title><description><![CDATA[
<p>This is also a good paper on the subject:<p>What Algorithms can Transformers Learn? A Study in Length Generalization <a href="https://arxiv.org/abs/2310.16028" rel="nofollow">https://arxiv.org/abs/2310.16028</a></p>
]]></description><pubDate>Sat, 27 Apr 2024 16:09:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=40181032</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=40181032</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40181032</guid></item><item><title><![CDATA[New comment by hansonw in "Ask HN: How does deploying a fine-tuned model work"]]></title><description><![CDATA[
<p><a href="https://predibase.com" rel="nofollow">https://predibase.com</a></p>
]]></description><pubDate>Wed, 24 Apr 2024 03:07:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=40140048</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=40140048</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40140048</guid></item><item><title><![CDATA[New comment by hansonw in "GPT-4 Turbo with Vision Generally Available"]]></title><description><![CDATA[
<p>Yes. But also note that the new function calling is actually “tool calling” where the model is also fine-tuned to expect and react to the <i>output</i> of the function (and there are various other nuances like being able to call multiple functions in parallel and matching up the outputs to function calls precisely).<p>When used in multi-turn “call/response” mode it actually does start to unlock some new capabilities.</p>
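<p>For reference, a minimal sketch of that multi-turn call/response shape with the OpenAI Node SDK (the tool name, arguments, and output here are placeholder stubs):</p>
<pre><code>// Minimal sketch of multi-turn tool calling (placeholder tool, stubbed output).
import OpenAI from 'openai';
const client = new OpenAI();

const tools = [{
  type: 'function',
  function: {
    name: 'get_weather',  // hypothetical tool
    parameters: { type: 'object', properties: { city: { type: 'string' } } },
  },
}];

const messages = [{ role: 'user', content: 'Weather in Paris?' }];
const first = await client.chat.completions.create({ model: 'gpt-4-turbo', messages, tools });

// Append the assistant turn, then one tool message per call, matched by tool_call_id.
messages.push(first.choices[0].message);
for (const call of first.choices[0].message.tool_calls ?? []) {
  messages.push({ role: 'tool', tool_call_id: call.id, content: '{"temp_c": 18}' }); // stub
}
const second = await client.chat.completions.create({ model: 'gpt-4-turbo', messages, tools });
console.log(second.choices[0].message.content);
</code></pre>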
]]></description><pubDate>Tue, 09 Apr 2024 20:23:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=39983832</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=39983832</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39983832</guid></item><item><title><![CDATA[New comment by hansonw in "How we built Text-to-SQL at Pinterest"]]></title><description><![CDATA[
<p>Not the author, but really nice that they shared some real data points:<p>> Once our Text-to-SQL solution was in production, we were also able to observe how users interacted with the system. As our implementation improved and as users became more familiar with the feature, our first-shot acceptance rate for the generated SQL increased from 20% to above 40%. In practice, most queries that are generated require multiple iterations of human or AI generation before being finalized. In order to determine how Text-to-SQL affected data user productivity, the most reliable method would have been to experiment. Using such a method, previous research has found that AI assistance improved task completion speed by over 50%. In our real world data (which importantly does not control for differences in tasks), we find a 35% improvement in task completion speed for writing SQL queries using AI assistance.</p>
]]></description><pubDate>Mon, 08 Apr 2024 16:22:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=39971265</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=39971265</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39971265</guid></item><item><title><![CDATA[How we built Text-to-SQL at Pinterest]]></title><description><![CDATA[
<p>Article URL: <a href="https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff">https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=39971231">https://news.ycombinator.com/item?id=39971231</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 08 Apr 2024 16:19:09 +0000</pubDate><link>https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=39971231</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39971231</guid></item><item><title><![CDATA[New comment by hansonw in "Big Post About Big Context"]]></title><description><![CDATA[
<p>If you think about it, RAG is a relatively primitive “first-pass attention layer” that is binary and semi-heuristic. I think it’s fairly safe to say that in the long term RAG will be integrated into the model architecture somehow, just a matter of when :)</p>
]]></description><pubDate>Fri, 01 Mar 2024 04:00:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=39558400</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=39558400</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39558400</guid></item><item><title><![CDATA[New comment by hansonw in "Big Post About Big Context"]]></title><description><![CDATA[
<p>If sub-quadratic architectures (e.g. Mamba) become a thing, it will become feasible to precompute most of the work for a fixed prefix (i.e. system prompt) and the latency can be pretty minimal. Even with current transformers, if you have a fixed system prompt, you can save the KV cache and it helps a lot (though the inference time of each incremental token is still linear in context length).</p>
]]></description><pubDate>Fri, 01 Mar 2024 03:49:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=39558336</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=39558336</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39558336</guid></item><item><title><![CDATA[New comment by hansonw in "Mamba: The Easy Way"]]></title><description><![CDATA[
<p>Indeed: <a href="https://arxiv.org/pdf/2402.01032.pdf" rel="nofollow">https://arxiv.org/pdf/2402.01032.pdf</a>
Perhaps future iterations of SSMs will accommodate dynamically sized (but still non-linearly-growing) hidden states / memories!</p>
]]></description><pubDate>Sat, 24 Feb 2024 08:51:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=39490189</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=39490189</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39490189</guid></item><item><title><![CDATA[New comment by hansonw in "Mamba: The Easy Way"]]></title><description><![CDATA[
<p>“RNN-mode inference” is also extremely exciting because you can precompute the hidden state of any prompt prefix (i.e. a long system prompt, or statically retrieved context) and continued generations pay the same cost irrespective of the prefix length.</p>
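<p>A toy recurrence makes the point (made-up scalar parameters; real SSMs use structured state updates):</p>
<pre><code>// Toy linear recurrence h_t = A*h_{t-1} + B*x_t: the state after a fixed
// prefix can be computed once, cached, and reused for every continuation.
const A = 0.9, B = 0.1;                  // made-up scalar "SSM" parameters
const step = (h, x) => A * h + B * x;

const prefix = [1, 2, 3, 4];             // e.g. a long system prompt
const hPrefix = prefix.reduce(step, 0);  // precompute once, cache

const continuation = [5, 6];             // cost is independent of prefix length
const hFinal = continuation.reduce(step, hPrefix);
console.log(hPrefix.toFixed(3), hFinal.toFixed(3));
</code></pre>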
]]></description><pubDate>Fri, 23 Feb 2024 20:45:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=39485868</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=39485868</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39485868</guid></item><item><title><![CDATA[New comment by hansonw in "Grist is a modern, relational spreadsheet"]]></title><description><![CDATA[
<p>Our startup is building <a href="https://arcwise.app" rel="nofollow noreferrer">https://arcwise.app</a>, which allows you to embed full-fledged SQL tables inside Google Sheets! We’re in the process of building out support for joins & subqueries, would be curious what people think.</p>
]]></description><pubDate>Thu, 02 Nov 2023 16:31:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=38116045</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=38116045</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38116045</guid></item><item><title><![CDATA[New comment by hansonw in "I built Excel for Uber and they ditched it"]]></title><description><![CDATA[
<p>I’m building a solution that works like this - we directly connect spreadsheet models to company databases (even converting pivots/formulas to SQL). Would love to chat with anyone that might find this valuable: <a href="https://arcwise.app" rel="nofollow noreferrer">https://arcwise.app</a></p>
]]></description><pubDate>Sat, 16 Sep 2023 02:14:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=37531366</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=37531366</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37531366</guid></item><item><title><![CDATA[New comment by hansonw in "Persimmon-8B"]]></title><description><![CDATA[
<p>This is the best comparison I've found that benchmarks the current OSS inference solutions: <a href="https://hamel.dev/notes/llm/inference/03_inference.html" rel="nofollow noreferrer">https://hamel.dev/notes/llm/inference/03_inference.html</a><p>IME the streaming API in text-generation-inference works fine in production (though some of the other solutions may be better). I've used it with StarCoder (15B) and the time-to-first-token / tokens per second both seem quite reasonable out of the box.</p>
]]></description><pubDate>Fri, 08 Sep 2023 11:26:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=37432159</link><dc:creator>hansonw</dc:creator><comments>https://news.ycombinator.com/item?id=37432159</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37432159</guid></item></channel></rss>