<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: anarazel</title><link>https://news.ycombinator.com/user?id=anarazel</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 12 Apr 2026 16:45:43 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=anarazel" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by anarazel in "Cirrus Labs to join OpenAI"]]></title><description><![CDATA[
<p>The ability to trivially use custom VM images was quite nice. The amount of CI time spent installing dependencies or copying a cache of installed stuff is nontrivial. Particularly for Windows the time difference is often very substantial. But even for plain Linux, there's no point in running apt-get update && apt-get install for the same set of things in every run (when using containers, Cirrus could build them on-demand too, with little notational overhead).<p>Defaulting to throw-away VMs for everything is also the right choice for something where the threat model includes attackers submitting patches/PRs. I'll never understand why folks were ok with just container separation for that (and often have <i>no</i> separation in runners).</p>
]]></description><pubDate>Sun, 12 Apr 2026 11:48:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47738588</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47738588</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47738588</guid></item><item><title><![CDATA[New comment by anarazel in "The effects of caffeine consumption do not decay with a ~5 hour half-life"]]></title><description><![CDATA[
<p>Weirdly enough, I loved coffee from the first time I tried it, at maybe 13. Even though, looking back, it must have been terrible coffee: it was at some vaguely Model-UN-like event that our entire class went to on an overnight trip. Obviously not enough sleep was had. A vending machine (in the late 90s) provided coffee...</p>
]]></description><pubDate>Fri, 10 Apr 2026 14:37:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47718826</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47718826</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47718826</guid></item><item><title><![CDATA[New comment by anarazel in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>> ... so that leaves me confused. My understanding is that the regression is triggered with the 7.0+ kernel and can be mitigated with huge pages turned on.<p>It gets a bit worse with preempt_lazy - for me just 15% or so - because the lock holder is scheduled out a bit more often. But it was bad before.<p>> My question therefore was how come this regression hasn't been visible with huge pages turned off with older kernel versions? You say that it was but I can't find this data point.<p>I mean it wasn't a regression before, because this is how it has behaved for a long time.<p>This workload is not a realistic thing that anybody would encounter in this form in the real world. Even without the contention - which only happens the first time the buffer pool is filled - you lose so much by not using huge pages with a 100GB buffer pool that you will have many other issues.<p>We (postgres, and me personally) were concerned enough about potential contention in this path that we did get rid of that lock half a year ago (buffer replacement selection has been lock free for close to a decade; just unused buffers were found via a list protected by this lock).<p>But the performance gains we saw were relatively small; we didn't measure large buffer pools without huge pages, though.<p>And at least I didn't test with this many connections doing small random reads into a cold buffer pool, just because it doesn't seem that interesting.</p>
]]></description><pubDate>Mon, 06 Apr 2026 14:59:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=47661809</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47661809</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47661809</guid></item><item><title><![CDATA[New comment by anarazel in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>The contention <i>does</i> exist in older kernels and is quite substantial.</p>
]]></description><pubDate>Mon, 06 Apr 2026 12:22:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=47659984</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47659984</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47659984</guid></item><item><title><![CDATA[New comment by anarazel in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>> That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty.<p>Turns out to be pretty crucial for performance though... Not manipulating them with a single atomic leads to way, way worse performance.<p>For quite a while it was a 32-bit atomic, but I recently made it a 64-bit one, to allow the content lock (i.e. protecting the buffer contents, rather than the buffer header) to be in the same atomic var. For one, that's nice for performance - it's e.g. very common to release a pin and a lock at the same time - and there are more fun perf things we can do in the future. But the real motivation was work on adding support for async writes: an exclusive locker might need to consume an IO completion for an in-flight write that is preventing it from acquiring the lock. And that was hard to do with a separate content lock and buffer state...<p>> And there are like ten open coded spin waits around the uses... you certainly have my empathy :)<p>Well, nearly all of those are there to avoid needing to hold a spinlock, which, as lamented a lot around this issue, doesn't perform that well when really contended :)<p>We're on our way to barely ever needing the spinlock for the buffer header, which then should allow us to get rid of many of those loops.<p>> This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?<p>It'd be pretty nice to have. There are a lot of cases where one needs more lock state than one can really encode into 32 bits.<p>I'm quite keen to experiment with the rseq time slice extension stuff. I think it'll help with some important locks (which are not spinlocks...).</p>
]]></description><pubDate>Sun, 05 Apr 2026 18:38:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=47652504</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47652504</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47652504</guid></item><item><title><![CDATA[New comment by anarazel in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>> > On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).<p>> Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) a release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.<p>Hah.<p>> ...
> But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.<p>I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release is just a store, whereas a futex release has to be an RMW operation.<p>I'm talking out of my ass here, but my guess is that the performance gain of the plain-store-as-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line. Because it's a somewhat contended simple spinlock, often embedded on the same line as the to-be-protected data, it's common for the line to no longer be in modified ownership at release.</p>
]]></description><pubDate>Sun, 05 Apr 2026 16:57:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=47651340</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47651340</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47651340</guid></item><item><title><![CDATA[New comment by anarazel in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>Addendum big enough to warrant a separate post: The fact that the contended lock is a spinlock rather than a futex is unrelated to the "regression".<p>A quick hack shows the contended performance to be nearly indistinguishable with a futex-based lock. Which makes sense: non-PI futexes don't transfer the scheduler slice to the lock owner, because they don't know who the lock owner is. Postgres' spinlocks use randomized exponential backoff, so they don't prevent the lock owner from getting scheduled.<p>Thus the contention is worse with PREEMPT_LAZY even with non-PI futexes (which is what typical lock implementations are based on), because the lock holder gets scheduled out more often.<p>Probably worth repeating: This contention is due to an absurd configuration that should never be used in practice.</p>
]]></description><pubDate>Sun, 05 Apr 2026 14:29:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=47649826</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47649826</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47649826</guid></item><item><title><![CDATA[New comment by anarazel in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>I really dislike the use of spinlocks in postgres (and have been replacing a lot of uses over time), but it's not always easy to replace them from a performance angle.<p>On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake). Turns out that that increase in memory barriers causes regressions that are nontrivial to avoid.<p>Another difficulty is that most of the remaining spinlocks are just a single bit in a larger 8-byte atomic. Futexes still don't support anything but 4 bytes (we could probably get away with using it on part of the 8-byte atomic with some reordering), and unfortunately postgres still supports platforms with no 8-byte atomics (which I think is supremely silly), and the support for a fallback implementation makes it harder to use futexes.<p>The spinlock triggering the contention in the report was just stupid, and we only recently got around to removing it, because it isn't used during normal operation.<p>Edit: forgot to add that the spinlock contention is not measurable on much more extreme workloads when using huge pages. A 100GB buffer pool with 4KB pages doesn't make much sense.</p>
]]></description><pubDate>Sun, 05 Apr 2026 13:22:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=47649201</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47649201</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47649201</guid></item><item><title><![CDATA[New comment by anarazel in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>I don't fully know, but I suspect it's just that due to the minor faults and TLB misses there is terrible contention on the spinlock, regardless of PREEMPT_LAZY, when using 4k pages (that's easily reproducible). Which is then made worse by preempting more with the lock held.</p>
]]></description><pubDate>Sun, 05 Apr 2026 13:00:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=47648973</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47648973</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47648973</guid></item><item><title><![CDATA[New comment by anarazel in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does.<p>With what we know so far, I expect that just about no real-world workloads that aren't already completely falling over will be affected.</p>
]]></description><pubDate>Sun, 05 Apr 2026 05:20:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=47646332</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47646332</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47646332</guid></item><item><title><![CDATA[New comment by anarazel in "RISC-V Is Sloooow"]]></title><description><![CDATA[
<p>Cross building is possible, but it's rather useful to be able to test the software you just built... And often enough, tests take more resources than the build.</p>
]]></description><pubDate>Wed, 11 Mar 2026 01:35:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=47330886</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47330886</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47330886</guid></item><item><title><![CDATA[New comment by anarazel in "Story of XZ Backdoor [video]"]]></title><description><![CDATA[
<p>Just German, not European, but still a start: 
<a href="https://en.wikipedia.org/wiki/Sovereign_Tech_Agency" rel="nofollow">https://en.wikipedia.org/wiki/Sovereign_Tech_Agency</a></p>
]]></description><pubDate>Thu, 26 Feb 2026 14:57:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=47166974</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=47166974</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47166974</guid></item><item><title><![CDATA[New comment by anarazel in "Why E cores make Apple silicon fast"]]></title><description><![CDATA[
<p>It's very heavily dependent on what your processes are doing. I've seen extreme cases where the gains of pinning were large (well over 2x when cooperative tasks were pinned to the same core), but that's primarily about preventing the CPU from idling long enough to enter deeper idle states.</p>
]]></description><pubDate>Sun, 08 Feb 2026 16:42:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=46935946</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=46935946</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46935946</guid></item><item><title><![CDATA[New comment by anarazel in "Unconventional PostgreSQL Optimizations"]]></title><description><![CDATA[
<p>> PostgreSQL shares other caches between processes so they probably could have a global plan cache if they wanted. I wonder why they don’t though.<p>> One possible reason is that the planner configuration can be different per connection, so the plans might not transfer<p>That's part of it, another big part is that the transactional DDL makes it more complicated, as different sessions might require different plans.</p>
]]></description><pubDate>Wed, 21 Jan 2026 13:14:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=46705280</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=46705280</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46705280</guid></item><item><title><![CDATA[New comment by anarazel in "Generate QR Codes with Pure SQL in PostgreSQL"]]></title><description><![CDATA[
<p>I would hope it's at least 3-4 orders of magnitude slower than a "traditional QR library". It'd be quite the indictment for such libraries, if not.</p>
]]></description><pubDate>Thu, 15 Jan 2026 17:44:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=46636303</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=46636303</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46636303</guid></item><item><title><![CDATA[New comment by anarazel in "LLVM: The bad parts"]]></title><description><![CDATA[
<p>I know, but even if it's not breaking promises, the constant stream of changes still makes it rather painful to utilize LLVM. Not helped by the fact that unless you embed LLVM you have to deal with a lot of different LLVM versions out there...</p>
]]></description><pubDate>Mon, 12 Jan 2026 19:03:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=46592739</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=46592739</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46592739</guid></item><item><title><![CDATA[New comment by anarazel in "LLVM: The bad parts"]]></title><description><![CDATA[
<p>FWIW, the article says "Frontends are somewhat insulated from this because they can use the largely stable C API." but that's not been my/our experience. There are parts of the API that are somewhat stable, but other parts (e.g. Orc) that change wildly.</p>
]]></description><pubDate>Mon, 12 Jan 2026 18:17:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=46592119</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=46592119</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46592119</guid></item><item><title><![CDATA[New comment by anarazel in "Vacuum Is a Lie: About Your Indexes"]]></title><description><![CDATA[
<p>I'm certainly very biased (having worked on postgres for way too long), so it's entirely plausible that I've over-observed and over-analyzed the criticism, leading to my description.<p>> I (we?) think Postgres is incredibly important, and I think we have properly contextualized our use of it. Moreover, I think it is unfair to simply deny us our significant experience with Postgres because it was not unequivocally positive -- or to dismiss us recounting some really difficult times with the system as "bashing" it. Part of being a consequential system is that people will have experience with it; if one views recounting that experience as showing insufficient "respect" to its developers, it will have the effect of discouraging transparency rather than learning from it.<p>I agree that criticism is important and worthwhile!  It's helpful though if it's at least somewhat actionable. We can't travel back in time to fix the problems you had in the early 2010s...  My experience of the criticism of the last years from the "oxide corner" was that it sometimes felt somewhat unrelated to the context and to today's postgres.<p>> if one views recounting that experience as showing insufficient "respect" to its developers<p>I should really have come up with a better word, but I'm still blanking on choosing a really apt word, even though I know it exists. I could try to blame ESL for it, but I can't come up with a good German word for it either... Maybe "goodwill". Basically believing that the other party is trying to do the right thing.</p>
]]></description><pubDate>Mon, 15 Dec 2025 17:33:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=46277574</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=46277574</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46277574</guid></item><item><title><![CDATA[New comment by anarazel in "Vacuum Is a Lie: About Your Indexes"]]></title><description><![CDATA[
<p>> First, although I work at Oxide, please don't think I speak for Oxide. None of this happened at Oxide. It informed some of the choices we made at Oxide and we've talked about that publicly. I try to remember to include the caveat that this information is very dated (and I made that edit immediately after my initial comment above).<p>I said oxide, because it's come up so frequently and at such length on the oxide podcast... Without that I probably wouldn't have commented here. It's one thing to comment on bad experiences, but at this point it feels more like bashing. And I feel like an open source focused company should treat other folks working on open source with a bit more, idk, respect (not quite the right word, but I can't come up with a better one right now).<p>I probably shouldn't have commented on this here. But I read the message after just having spent a Sunday morning looking into a problem and I guess that made me more thin-skinned than usual.<p>> For most of that time (and several years earlier), the community members we reached out to were very dismissive, saying either these weren't problems, or they were known problems and we were wrong for not avoiding them, etc.<p>I agree that the wider community sometimes has/had the issue of excusing away postgres problems. While I try to avoid doing that, I certainly have fallen prey to that myself.<p>Leaving fandom like stuff aside, there's an aspect of having been told over and over we're doing xyz wrong and things would never work that way, and succeeding (to some degree) regardless. While ignoring some common wisdom has been advantageous, I think there's also plenty where we just have been high on our own supply.<p>> What remains is me feeling triggered when it feels like users' pain is being casually dismissed.<p>Was that done in this thread?</p>
]]></description><pubDate>Mon, 15 Dec 2025 16:29:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=46276660</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=46276660</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46276660</guid></item><item><title><![CDATA[New comment by anarazel in "Avoid UUID Version 4 Primary Keys in Postgres"]]></title><description><![CDATA[
<p>The issue is more fundamental - if you have purely random keys, there's basically no spatial locality for the index data. Which means that for decent performance your entire index needs to be in memory, rather than just recent data. And it means that you have much bigger write amplification, since it's rare that the same index page is modified multiple times close enough in time to avoid a second write.</p>
]]></description><pubDate>Mon, 15 Dec 2025 15:15:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=46275555</link><dc:creator>anarazel</dc:creator><comments>https://news.ycombinator.com/item?id=46275555</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46275555</guid></item></channel></rss>