Hacker News: twotwotwo

New comment by twotwotwo in "Prompt Injection as Role Confusion"

twotwotwo — Tue, 23 Jun 2026 14:51:25 +0000

This is great--LLMs 'forgetting who they are' is one of the most uncanny things they do, and the note about why static benchmarks underperform human attackers is on point.

One sort of wild idea: 'give words a color'. That is, the harness/API adds a signal to the input vector (using a few 'role' dimensions or just adding some other vector to the embedding vector) to tell the model the role of an individual input token. It'd be kind of like how positional info is added. It might make some things a little weird--its output will be 'snapped' to the "tool call" or "assistant output" color when it's read back in, for example, regardless of what 'color' came out of the network. A lot of weird stuff happens in models already, though, and this may be less weird than trying to make them behave as formal grammar parsers reliably with security at stake.

A while back I'd dreamed about this as a way to keep models from confusing different kinds of training data: not all input can be high-quality sources, but knowing that a phrase was seen in a scientific paper/encyclopedia, an opinion piece, a work of fiction, a conversation, etc. reduces the chance of confusion. I know they can pick that kind of thing up from other signals like writing style or context, but exactly those signals that lead them astray in prompt injection, and sometimes even leads humans astray when something's written like a credible source but isn't!

New comment by twotwotwo in "The only scalable delete in Postgres is DROP TABLE"

twotwotwo — Sun, 14 Jun 2026 16:18:02 +0000

Years ago work was bit by the analogous thing in MySQL. Like it usually does, it took a chain of events:

- We wrote a cronjob to periodically DELETE for a retention policy on a table we'd just created. Most senior person on the team reviewed it, looked fine.

- Unusually for us, we prioritize QA'ing a different feature for release, delaying the release of this cronjob and a bunch of other code.

- During that delay, the new table accumulated many times more rows to be deleted than we'd expected during review.

- Release happens. All looks well since the initial delete wasn't a migration and cronjob hasn't run yet; engineer doing the release signs off.

- Cronjob runs, deleting hundreds of millions of rows quickly.

- Next day, replica lag's high and MySQL's transaction history is very high. MySQL keeps transaction history around until purge threads have visited all the affected pages on disk.

- The bad cluster conditions last for days and lead to other problems.

This omits detail and the 'noise' of everything else we were watching. But it gets across how the code and MySQL behaved.

Like most exciting events, it led to multiple changes to avoid a repeat. For retention policies, our new approach was one at the end of PlanetScale's post, to partition and drop old partitions. Transitioning to this from a huge unpartitioned table can be fun!

If a table is append-only and already huge, with lots of rows already past the retention threshold, you might only copy the rows to be kept to the new partitioned table: copy what you can, lock tables, do a last catch-up copy and swap tables. (Roughly the blog's 'performant one-off delete'.)

If the table's merely kind of big, gh-ost or such could allow you to ALTER without causing lag, locking, etc.

At a scale below that, you could run a slow incremental 'nibble' delete while watching server stats, and a step below that, plain ALTERs or DELETEs are fine.

Using partitioning has fun bits, too. In MySQL, the partition key has to be part of any unique index, understandably. But you have to keep that in mind when you're using INSERT..ON DUPLICATE KEY UPDATE and relying on uniqueness to trigger the update. Things stay interesting!

I hear Vitess shops like PlanetScale usually don't run multi-terabyte myqsld instances in the first place: even when physical nodes are big, they run many smaller mysqlds on them. That wouldn't make all this fully irrelevant--huge deletes would still sometimes be worse than copy-swap-drop--but it does seem real handy for taming issues that tend to worsen with mysqld size, like replication lag. All to say, little bit jelly of their setup over there!

New comment by twotwotwo in "FrontierCode: An eval to measure whether you would actually merge the code"

twotwotwo — Tue, 09 Jun 2026 17:31:16 +0000

To repeat, not a dig at FrontierCode, which is substantial progress in benchmarking. But I'd argue modeling the rest of process is tha(aaa)t valuable and becomes more so as coding capability progresses:

Async agents interact on a longer timescale, but they interact. Again, experienced SWEs, consulting agencies, etc. ask questions before and after implementation, accept notes, etc.; they vary at how good they are at it; and how well they do it is a big factor in the success and failure of projects.

LLM interaction ability isn't saturated or mature; asking for point edits mostly works, but e.g. when I try to get Opus to ask clarifying questions or surface tricky bits to focus review, it's not close to a human-level response -- it's both noisy and misses key stuff. (Handling uncertainty has been a weak point for LLMs since early on, which might not help.) Other aspects of good interaction are even harder, like digging into a potentially mistaken request, or proposing a good 80/20 tweak to the spec.

There's a different, shorter-term reason to model interaction: it better tells users the value to expect now. It turns out my employer doesn't love infinite Opus use. (Go figure.) Kimi and Sonnet do comparably on FrontierCode. Are they about the same to use, or is one flailing while the other one just needs a couple rounds of fixups? If I saw a benchmark that credibly approximated 'this model will save you this much time vs. that one' that would put it well above existing ones.

I do think a bunch of discussion, investment, etc. is based on the idea the industry will essentially be replaced with successful one-shotting with little interaction. The mistake there is to assume back-and-forth is inessential and only happens because the agents aren't that good at coding yet. For a long time lots back-and-forths were driven by the models' limitations at raw coding, which might've made that idea more appealing.

As the coding side gets better, drawing the rest of the owl becomes the hard part. The world is messy and so is one's software's boundary with it. (I'm not saying the tasks don't get longer, I'm saying interaction gets more important as they do.) My conviction here might partly because in my sort of work the requirements and big picture were always thornier than typing the code; I'm suspicious that as raw coding gets easier for everybody they will hit something analogous.

Anyway, again, what y'all are doing is progress. I do want to stick up for the idea that a lot of critical things aren't raw coding ability. (I'm not alone in that, I don't think!) I'm definitely not here to say someone's Doing It Wrong as they do it more correctly than I've seen it done--just asking "would the patch get accepted?" is a huge step.

New comment by twotwotwo in "FrontierCode"

twotwotwo — Tue, 09 Jun 2026 05:55:20 +0000

I'm liking the effort to make new, no-longer-saturated benchmarks. I'll also be a bit suspicious if some model aces it -- matching OSS maintainers' taste more often is a plausible improvement in quality but if they nail it every time they've been memorizing.

Not saying FrontierCode should've done this, but benchmarking the interaction would be interesting. That is, if I get a diff with a blocking problem but writing a comment gets fixed, that's a lot different from if the model has hit a wall. Better, if there's a problem but the model flagged it in a short list of questions or worries to me before or after coding, it can get sorted without taking much of my time. Stick an LLM in the loop instructed to behave like a user or reviewer with some rubric-ish info that wasn't in the prompt. Then, look at how much the pretend user has to do to get to a quality result with a given model, if they can get to one at all.

You could say 'why worry about interaction? the goal is the model just gets it perfect' but I think that imagined end state just is not a thing: tasks will get bigger but there will still be interaction. Handling comments and asking good clarifying questions when needed are real capabilities. Human SWEs interact plenty and real engineering has a certain density of questions about requirements, taste, and other big vague things.

New comment by twotwotwo in "DeepSeek V4 Pro beats GPT-5.5 Pro on precision"

twotwotwo — Mon, 08 Jun 2026 15:45:53 +0000

If you worry about sending your data off for inference, Fireworks is one of the companies serving open models with solid performance and compliance/zero data retention sorted out. OpenCode supports them and many others. Cursor uses them. They don't have the super-cheap cache reads deal that DeepSeek's own endpoint does, but are still well below Anthropic API rates. (Though crucially you're not paying API rates now!)

DeepSeek and Xiaomi's deals on cache reads go with their models' latest gens making caching cheaper (using less space for KVs). No open-model inference provider has decided to match the pricing. I'm sure that says something about how inference pricing works, but not completely sure what.

Agree with others that top open models aren't on the frontier, and I would expect differences doing big-picture planning or anywhere you're only giving broad brushstrokes and looking for a lot to be guessed. But they do seem fine at coding from a a concrete plan! No experience in huge codebases because I only use them outside work, but they seem good enough about gathering info before they dive in that I'd expect them to grep around as they need.

An annoying caveat: individual subscription plans, used heavily, are much cheaper than the API -- see https://she-llac.com/claude-limits -- which complicates any argument about cost. I still think open models are worth playing with. They're one of the things that let us treat this as a technology rather than just as the product offerings of one of a few companies.

New comment by twotwotwo in "Human-Like Neural Nets by Catapulting"

twotwotwo — Sun, 07 Jun 2026 07:17:13 +0000

We have a lot of synapses, but (agreeing with you) I don't find that sufficient to explain why humans (or animals!) do what we do. If you throw zillions of parameters at a problem with a weak architecture, you get really high-fidelity memorization, and we're not awesome at memorization compared to machines.

Humans can do an impressive amount of generalization from one error or surprise, and as is often rightly noted, don't need trillions of words to get going. And it all seems to happen some 'forward-only' way, without backpropagation -- we don't have AdamW or MuonClip helpfully nudging our synaptic connections towards whatever would have scored well on our most recent test. It is relevant that we're creatures with goals -- reinforcement learning is the only stage where there's a taste of that for neural nets -- but the learning differences seem at least partly independent of that.

I suppose it could turn out that, even if not sufficient, the large number of synapses is necessary to all this, like we're effectively buying a lot of lottery tickets that give us a shot at fishing interesting hypotheses out of the experiences flowing by. But I'm still awfully suspicious that we don't have the right mathematical model for learning messy ideas all worked out yet.

New comment by twotwotwo in "Did DeepSeek v4 suddenly become more expensive?"

twotwotwo — Sat, 30 May 2026 04:42:12 +0000

Whatever is the darker shade of blue in the bottom-right graph had a bump at the same time cost did. Perhaps that's output tokens (which include reasoning)?

New comment by twotwotwo in "Unknowable Math Can Help Hide Secrets"

twotwotwo — Sun, 17 May 2026 06:16:52 +0000

The fielded systems require something that wasn't there in the original model of zero-knowledge proofs. That could be as little as a trusted-enough public source of randomness: the prover makes their initial commitments, plays the verification game with a verifier whose challenges are controlled by the next outputs of the public RNG, and as long as the other party trusts that the RNG and prover aren't in cahoots, that's enough. Doing a trusted setup process beforehand is another tool used by a bunch of deployed systems.

That doesn't mean anything's practically wrong with the fielded ZK proof systems, just that's how you reconcile the article's "no non-interactive proofs under these assumptions" with people out in the real world using non-interactive proofs.

This paper brings up another logical possibility, that there could be a non-interactive proof with no RNG or setup that doesn't meet the precise original definition of zero-knowledge proofs but is zero-knowledge practically speaking. I don't know whether we'll actually see better fielded ZK proof systems come out of this approach!

New comment by twotwotwo in "Kioxia and Dell cram 10 PB into slim 2RU server"

twotwotwo — Sun, 17 May 2026 02:02:56 +0000

It is kinda neat how the density can trickle down. When an individual SSD can hold tens of TBs, recent-gen drives can do millions of random reads/s each, and one socket can handle lots of RAM and many cores, it doesn't take the fancier chassis with two sockets or lots of storage bays to handle pretty substantial data work.

On the other hand, current part prices are not neat; a commodity platform only helps so much if none of what you want put in it is affordable! And other factors like power and cooling can push you away from optimizing for density. I just like that along with the ludicrious becoming possible, merely great stuff becomes more feasible.

New comment by twotwotwo in "Kimi K2.6: Advancing open-source coding"

twotwotwo — Mon, 20 Apr 2026 17:02:30 +0000

Kagi has it as an option in its Assistant thing, where there is naturally a lot of searching and summarizing results. I've liked its output there and in general when asked for prose that isn't in the list/Markdown-heavy "LLM style." It's hard to do a confident comparison, but it's seemed bold in arranging the output to flow well, even when that took surgery on the original doc(s). Sometimes the surgery's needed e.g. to connect related ideas the inputs treated as separate, or to ensure it really replies to the request instead of just dumping info that's somehow related to it.

New comment by twotwotwo in "Are the costs of AI agents also rising exponentially? (2025)"

twotwotwo — Sat, 18 Apr 2026 18:06:03 +0000

You could model more of the process: the dev's work as well as the model's, and the cost of catching a bug later or deploying it live. Those tasks push me further towards smaller tasks in general. (And they make the Gas Town type stuff seem more baffling.)

- Smaller chunks make review much easier and more effective at finding bugs, as we've known since long before LLMs.

- Greater certainty provides a better development experience. I've heard people talk about how LLM development can be tiring. One way that happens, I think, is the win-or-lose drama of feeding in huge tasks with a substantial chance of failure. I think if you're succeeding 95% of the time instead of 70%, and the 5% are easier to deal with (smaller chunks to debug), it's a better experience.

- Everything is harder about real-world tasks because they aren't clean verifiable-reward benchmarks. Developers have context that models don't, so it's common that a problem traces to an detail not in the spec where the model guessed wrong. For real-world tasks "failures" are also sometimes "that UI is bad" or "that way of coding it is hard to maintain." And it's possible to have problems the dev simply doesn't notice. The benchmarks' fully computer-checkable outcomes are 'easy mode' compared to the real world.

- Fixing agents' mess becomes more work as task sizes increase. (Like the certainty thing, but about cost in hours than the experience.) Again, if the model has spat out 1000 lines and stumped itself debugging a failure, it'll take you some time to figure out: more time than debugging 250-line patch, and the larger patch is more likely to have bugs. And if an issue bug makes it out to peer review, you can add communication and context-switching cost (point out bug, fix, re-review) on top of that.

- Bugs that reach prod are really expensive. More of a problem when a prod bug can lose you customers vs., say, on most hobby things. Ord's post gestures at it: there are "cases where failure is much worse than not having tried at all." That magnifies how important it is the review be good, and how much of a problem bugs that sneak through are, which points towards doing smaller chunks.

How significant each factor is depends on details: how easy the task is to verify, how well-specified it is (and more generally how much it's in the models' wheelhouse, and how much in mine), how bad a bug would be (fun thing? internal tool? user facing? can lose data?).

I think the dynamics above apply across a range of model strengths, but that doesn't mean the changes from say Sonnet 3.7 to Opus 4.5 didn't mean anything; the machine getting better at getting the info it needs and checking itself still helps at shorter task lengths. Harness improvements can help, e.g. they could help keep models of the 'too much context, model got silly' zone (may be less severe than it once was, but I suspect will remain a thing), build better context, and clean up code as well as spitting results out.

Besides taking more of your time up front, involving yourself more also tends to drift towards you making more of the lower-level decisions about how the code will look, which I find double-edged. You have better broad context, and you know what you find maintainable. But the implementer, model or another person, is closer to the code, which helps it make some mid-to-low-level decisions well.

Plan modes and Spec-Kit type things can help with the balance of getting involved but letting the model do its thing. I've liked asking the LLM to ask a lot of questions and surface doubts. A colleague messed with Spec-Kit so it would pick one change on its fine-grained to-do list at a time, which is a neat hack I'd like to try sometime.

New comment by twotwotwo in "Dependency cooldowns turn you into a free-rider"

twotwotwo — Wed, 15 Apr 2026 04:38:35 +0000

The topic of cooldowns just shifting the problem around got some discussion on an earlier post about them -- what I said there is at https://lobste.rs/s/rygog1/we_should_all_be_using_dependency... and here's something similar:

- One idea is for projects not to update each dep just X hours after release, but on their own cycles, every N weeks or such. Someone still gets bit first, of course, but not everyone at once, and for those doing it, any upgrade-related testing or other work also ends up conveniently batched.

- Developers legitimately vary in how much they value getting the newest and greatest vs. minimizing risk. Similar logic to some people taking beta versions of software. A brand new or hobby project might take the latest version of something; a big project might upgrade occasionally and apply a strict cooldown. For users' sake, there is value in any projects that get bit not being the widely-used ones!

- Time (independent of usage) does catch some problems. A developer realizes they were phished and reports, for example, or the issue is caught by someone looking at a repo or commit stream.

As I lamented in the other post, it's unfortunate that merely using an upgraded package for a test run often exposes a bunch of a project's keys and so on. There are more angles to attack this from than solely when to upgrade packages.

New comment by twotwotwo in "Claude mixes up who said what"

twotwotwo — Thu, 09 Apr 2026 18:43:45 +0000

There is nothing specific to the role-switching here (as opposed to other mistakes), but I also notice them sometimes 1) realizing mistakes with "-- wait, that won't work" even mid-tool-call and 2) torquing a sentence around to maintain continuity after saying something wrong (amusingly blaming "the OOM killer's cousin" for a process dying, probably after outputting "the OOM killer" then recognizing it was ruled out).

Especially when thinking's off they can sometimes start with a wrong answer then talk their way around to the right one, but never quite acknowledge the initial answer as wrong, trying to finesse the correction as a 'well, technically' or refinement.

Anyhow, there are subtleties, but I wonder about giving these things a "restart sentence/line" mechanism. It'd make the '--wait,' or doomed tool-call situations more graceful, and provide a 'face-saving' out after a reply starts off incorrect. (It also potentially creates a sort of backdoor thinking mechanism in the middle of non-thinking replies, but maybe that's a feature.) Of course, we'd also need to get it to recognize "wait, I'm the assistant, not the user" for it to help here!

New comment by twotwotwo in "Claude mixes up who said what"

twotwotwo — Thu, 09 Apr 2026 18:32:20 +0000

I agree with the addition at the end -- I think this is a model limitation not a harness bug. I've seen recent Claudes act confused about who they are when deep in context, like accidentally switching to the voice of the authors of a paper it's summarizing without any quotes or an indication it's a paraphrase ("We find..."), or amusingly referring to "my laptop" (as in, Claude's laptop).

I've also seen it with older or more...chaotic? models. Older Claude got confused about who suggested an idea later in the chat. Gemini put a question 'from me' in the middle of its response and went on to answer, and once decided to answer a factual social-science question in the form of an imaginary news story with dateline and everything. It's a tiny bit like it forgets its grounding and goes base-model-y.

Something that might add to the challenge: models are already supposed to produce user-like messages to subagents. They've always been expected to be able to switch personas to some extent, but now even within a coding session, "always write like an assistant, never a user" is not necessarily a heuristic that's always right.

New comment by twotwotwo in "What makes Intel Optane stand out (2023)"

twotwotwo — Sun, 15 Mar 2026 17:16:42 +0000

One potential application I briefly had hope for was really good power loss protection in front of a conventional Flash SSD. You only need a little compared to the overall SSD capacity to be able to correctly report the write was persisted, and it's always running, so there's less of a 'will PLP work when we really need it?' question. (Maybe there's some use as a read cache too? Host RAM's probably better for that, though.) It's going to be rewritten lots of times, but it's supposed to be ready for that.

It seems like there's a very small window, commercially, for new persistent memories. Flash throughput scales really cost-efficiently, and a lot is already built around dealing with the tens-of-microseconds latencies (or worse--networked block storage!). Read latencies you can cache your way out of, and writers can either accept commit latency or play it a little fast and loose (count a replicated write as safe enough or...just not be safe). You have to improve on Flash by enough to make it worth the leap while remaining cheaper than other approaches to the same problem, and you have to be confident enough in pulling it off to invest a ton up front. Not easy!

New comment by twotwotwo in "Show HN: How I topped the HuggingFace open LLM leaderboard on two gaming GPUs"

twotwotwo — Wed, 11 Mar 2026 05:13:22 +0000

This is fascinating, and makes me wonder what other things that 'should' be impossible might just be waiting for the right configuration to be tried.

For example, we take for granted the context model of LLMs is necessary, that all you can do is append and anything that changes the beginning requires a recalculation of whatever comes after it. And that does match how training works.

But all sorts of things would become possible if it were possible to shift things in and out of context without recomputing it all; conservatively you could avoid compaction, optimistically it might be a way to get info to the model that's both more deeply integrated than search and more efficient than training larger and larger models.

New comment by twotwotwo in "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

twotwotwo — Mon, 16 Feb 2026 18:39:17 +0000

For folks that like this kind of question, SimpleBench (https://simple-bench.com/ ) is sort of neat. From the sample questions (https://github.com/simple-bench/SimpleBench/blob/main/simple... ), a common pattern seems to be for the prompt to 'look like' a familiar/textbook problem (maybe with detail you'd need to solve a physics problem, etc.) but to get the actually-correct answer you have to ignore what the format appears to be hinting at and (sometimes) pull in some piece of human common sense.

I'm not sure how effectively it isolates a single dimension of failure or (in)capacity--it seems like it's at least two distinct skills to 1) ignore false cues from question format when there's in fact a crucial difference from the template and 2) to reach for relevant common sense at the right times--but it's sort of fun because that is a genre of prompt that seems straightforward to search for (and, as here, people stumble on organically!).

New comment by twotwotwo in "Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation"

twotwotwo — Thu, 05 Feb 2026 06:58:07 +0000

Yeah, this(-ish): there are shipping models that don't eliminate N^2 (if a model can repeat your code back with edits, it needs to reference everything somehow), but still change the picture a lot when you're thinking about, say, how resource-intensive a long-context coding session is.

There are other experiments where model designers mix full-attention layers with limited-memory ones. (Which still doesn't avoid N^2, but if e.g. 3/4 of layers use 'light' attention, it still improves efficiency a lot.) The idea is the model can still pull information from far back in context, just not in every layer. Use so far is limited to smaller models (maybe it costs too much model capability to use at the high end?) but it seems like another interesting angle on this stuff.

New comment by twotwotwo in "Exe.dev"

twotwotwo — Sat, 27 Dec 2025 21:30:50 +0000

FWIW, here are (mostly) their agent's tips for other agents from exploring a mostly-new system including tidbits like how to get recent Node: https://s3.us-east-1.amazonaws.com/1FV6XMQKP2T0D9M8FF82-cach...

It's very much a snapshot of what happens to come on a new VM today, and I put a little disclaimer in it to try to help tools get unstuck if anything there proves to be outdated or a flat-out (accidental) lie.

New comment by twotwotwo in "Exe.dev"

twotwotwo — Sat, 27 Dec 2025 01:50:48 +0000

I have played with it and it's so easy get started with that now I want a quick-project idea as an excuse to use it!

I'm sure you've thought of this, but: lots of people have some amount of 'free' (or really: zero incremental cost to users) access to some coding chat tool through a subscription or free allowance like Google's.

If you wanted to let those programs access your custom tools (browser!) and docs about the environment, a low-fuss way might be to drop a skills/ dir of info and executables that call your tools into new installs' homedirs, and/or a default AGENTS.md with the basic info and links to more.

And this seems like more fuss, but if you wanted to be able to expose to the Web whatever coding tool people 'bring', similar to how you expose your built-in chat, there's apparently an "agent control protocol" used as a sort of cross-vendor SDK by projects like https://willmcgugan.github.io/toad-released/ that try to put a nice interface on top of everything. Not saying this'd be easy at all, but you could imagine the choice between a few coding tools and auth info for them as profile-level settings pushed to new VMs. Or maybe no special settings, and bringing your own tools is just a special case of bringing your own image or setup script.

But, as y'all note, it's a VM. You can install whatever and use it through the terminal (or VSCode remoting or something else). "It's a computer" is quite a good open standard to build on.

Is the chat descended from Sketch?