<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: pgaddict</title><link>https://news.ycombinator.com/user?id=pgaddict</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 18 Apr 2026 12:25:51 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=pgaddict" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by pgaddict in "AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy"]]></title><description><![CDATA[
<p>So why does it happen only with hugepages? Is the extra overhead / TLB pressure enough to trigger the issue in some way? Or is it because the regular pages get swapped out (which hugepages can't be)?</p>
]]></description><pubDate>Sun, 05 Apr 2026 12:54:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=47648919</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=47648919</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47648919</guid></item><item><title><![CDATA[New comment by pgaddict in "The real cost of random I/O"]]></title><description><![CDATA[
<p>True. Unfortunately, that's what index scans in Postgres do right now - they're the last "major" scan type without some form of prefetching (posix_fadvise or AIO). We're working on it; hopefully it'll land in PG19.</p>
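To make the missing prefetch concrete, here is a minimal sketch (not Postgres code) of what fadvise-style prefetching of random 8 KB block reads looks like. The scratch file, block numbers, and prefetch distance are all made up for illustration, and `os.posix_fadvise` is only available on Linux/POSIX systems:

```python
import os
import tempfile

BLOCK = 8192  # Postgres-style 8 KB pages

# Scratch file standing in for a table/index segment (64 blocks).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * 64))
    path = f.name

def read_blocks_with_prefetch(path, block_numbers, distance=8):
    """Read 8 KB blocks in the given (possibly random) order,
    hinting the kernel about upcoming blocks via posix_fadvise."""
    out = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for i, blkno in enumerate(block_numbers):
            # Hint the next `distance` blocks, so the kernel can start
            # fetching them while we process the current one.
            for ahead in block_numbers[i + 1:i + 1 + distance]:
                os.posix_fadvise(fd, ahead * BLOCK, BLOCK,
                                 os.POSIX_FADV_WILLNEED)
            out.append(os.pread(fd, BLOCK, blkno * BLOCK))
        return out
    finally:
        os.close(fd)

blocks = read_blocks_with_prefetch(path, [5, 42, 1, 17, 63])
print(len(blocks))
os.unlink(path)
```

An unprefetched index scan is the same loop without the `WILLNEED` hints: each random read waits out the full device latency before the next one is even issued.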
]]></description><pubDate>Tue, 03 Mar 2026 11:48:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=47231054</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=47231054</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47231054</guid></item><item><title><![CDATA[New comment by pgaddict in "The real cost of random I/O"]]></title><description><![CDATA[
<p>To the best of my knowledge, yes. Unfortunately the details of how it was calculated in ~2000 seem to be lost, but the person who did it described doing it like this. It's possible we forgot some important details, of course, but the intent was to use the same formula. Which is why I carefully described and published the scripts, so that other engineers can point out thinkos and suggest changes.</p>
]]></description><pubDate>Mon, 02 Mar 2026 11:46:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=47216754</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=47216754</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47216754</guid></item><item><title><![CDATA[New comment by pgaddict in "The real cost of random I/O"]]></title><description><![CDATA[
<p>Damn, I copied the wrong command. I wanted to copy this one:<p><code>fio --filename=/dev/md127 --direct=1 --rw=randread --bs=8k --ioengine=io_uring --iodepth=1 --runtime=120 --numjobs=1 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly</code><p>i.e. with iodepth=1 and numjobs=1, because this is what "mimics" an index scan (without prefetch) on cold data. More or less.<p>The command I posted earlier does ~10GB/s on my RAID, which matches your data.</p>
]]></description><pubDate>Mon, 02 Mar 2026 11:37:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=47216671</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=47216671</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47216671</guid></item><item><title><![CDATA[New comment by pgaddict in "The real cost of random I/O"]]></title><description><![CDATA[
<p>Good idea. It's an interesting historical question - when we picked 4.0 as the default ~25 years ago, how close was it to the calculated value? I was asking myself that. Unfortunately I don't have a machine with a traditional HDD in my homelab anymore, but I'll see if I can run the test somewhere.<p>I wouldn't be all that surprised if this was (partially) due to Postgres being less optimized back then, which might have hidden some of the random vs. sequential difference. But that's just a wild guess.</p>
]]></description><pubDate>Sun, 01 Mar 2026 22:11:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=47211236</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=47211236</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47211236</guid></item><item><title><![CDATA[New comment by pgaddict in "The real cost of random I/O"]]></title><description><![CDATA[
<p>Good point, I should have included that (the linked pgsql-hackers thread has some of this information, at least).<p>I've observed exactly this behavior on a wide range of hardware / environments; it's not specific to particular SSD models (at least not for reads, which is what the blog post was measuring). That's why I showed results from three very different systems.<p>Some information for the two physical machines:<p>1) ryzen: Ryzen 9 9900X, RAID0 with 4x Samsung 990 PRO 1TB (in an Asus Hyper M.2 Gen5 card)<p>2) xeon: E5-2699v4, WD Ultrastar DC SN640 960GB (U.3)<p>I don't know what exactly is backing the SSD storage on the Azure instance.</p>
]]></description><pubDate>Sun, 01 Mar 2026 19:43:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=47209955</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=47209955</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47209955</guid></item><item><title><![CDATA[New comment by pgaddict in "The real cost of random I/O"]]></title><description><![CDATA[
<p>There probably is some additional inefficiency when reading pages randomly (compared to sequential reads), but most of the difference is at the storage level. That is, SSDs can handle a lot of random I/O, but it's nowhere close to sequential reads.<p>For example, I have a RAID0 with 4 SSDs (Samsung 990 PRO, so consumer, but quite good for reads). And this is what fio says:<p># random reads, 8K, direct IO, depth=1<p><code>fio --filename=device name --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly</code><p>-> read: IOPS=19.1k, BW=149MiB/s (156MB/s)(4473MiB/30001msec)<p># sequential reads, 8K, direct IO, depth=1<p><code>fio --filename=/dev/md127 --direct=1 --rw=read --bs=8k --ioengine=io_uring --iodepth=1 --runtime=30 --numjobs=1 --time_based --group_reporting --name=random-1 --eta-newline=1 --readonly</code><p>-> read: IOPS=85.5k, BW=668MiB/s (700MB/s)(19.6GiB/30001msec)<p>With buffered I/O, random reads stay at ~19k IOPS, while sequential reads get to ~1M IOPS (thanks to read-ahead, either at the OS level or in the SSD).<p>So part of this is sequential reads benefiting from implicit "prefetching", which reduces the observed cost of a page. But for random I/O there's no such thing, and so it seems more expensive.<p>It's more complex than that (e.g. sequential reads allow issuing larger reads), of course.</p>
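A rough way to see the same sequential-vs-random gap without fio is the sketch below, which reads the same set of 8 KB blocks in order and then shuffled. It uses buffered I/O rather than direct I/O (unlike fio), so kernel read-ahead and the page cache will dominate the result; the file size, block size, and `DONTNEED` cache-drop hint are illustrative assumptions, and `os.posix_fadvise` is Linux/POSIX-only:

```python
import os
import random
import tempfile
import time

BLOCK = 8192
NBLOCKS = 2048  # 16 MB scratch file

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * NBLOCKS))
    path = f.name

def timed_read(path, order):
    """Read BLOCK-sized chunks in the given order, return (bytes, seconds)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Ask the kernel to drop cached pages so both runs start colder.
        # (Only a hint; a fair benchmark would use direct I/O, as fio does.)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        start = time.perf_counter()
        total = sum(len(os.pread(fd, BLOCK, b * BLOCK)) for b in order)
        return total, time.perf_counter() - start
    finally:
        os.close(fd)

seq = list(range(NBLOCKS))
rnd = seq[:]
random.shuffle(rnd)

seq_bytes, seq_t = timed_read(path, seq)
rnd_bytes, rnd_t = timed_read(path, rnd)
print(f"sequential: {seq_t:.4f}s  random: {rnd_t:.4f}s")
os.unlink(path)
```

On spinning disks or with the cache actually cold, the random pass is dramatically slower; on a warm page cache the two converge, which is exactly the "implicit prefetching" effect described above.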
]]></description><pubDate>Sun, 01 Mar 2026 19:26:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47209797</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=47209797</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47209797</guid></item><item><title><![CDATA[New comment by pgaddict in "Super fast aggregations in PostgreSQL 19"]]></title><description><![CDATA[
<p>AFAIK these two joins are exactly the same once you get past the parsing. It's just a different way to write an inner join. It's translated into the same AST and so there's no difference in planning/execution.</p>
]]></description><pubDate>Wed, 03 Dec 2025 19:56:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=46139202</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=46139202</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46139202</guid></item><item><title><![CDATA[New comment by pgaddict in "Super fast aggregations in PostgreSQL 19"]]></title><description><![CDATA[
<p>That is part of the key idea, yes. It's more elaborate, because it can split the aggregate - it can do part of it before the join, and finalize it after the join. Similar to what we do for parallel queries.<p>As for indexes, they can help, but not in this particular example - the "code" tables are tiny, and the planner adds Memoize nodes anyway, so it acts like an ad hoc index.<p>Indexes are more of a complementary improvement, not an alternative to this optimization (i.e. neither makes the other unnecessary). FWIW in this case the indexes won't help very much - if you use more data in the code tables, it'll use a hash join, not a nested loop / merge join.<p>That doesn't mean we couldn't do better with indexes; there probably are smart execution strategies for certain types of queries. But indexes also come with quite a bit of overhead (even in read-only workloads).</p>
]]></description><pubDate>Wed, 03 Dec 2025 16:10:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=46136171</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=46136171</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46136171</guid></item><item><title><![CDATA[New comment by pgaddict in "Super fast aggregations in PostgreSQL 19"]]></title><description><![CDATA[
<p>It's not about not knowing about an optimization. The challenge is knowing when to apply it, so that it does not cause regressions for cases that can't benefit from it. It may be less risky in specialized systems - BI systems, say, typically don't need to worry about regressing OLTP workloads. Postgres absolutely needs to be careful of that.<p>I believe that's one of the reasons why it took ~8 years (the original patch was proposed in 2017).</p>
]]></description><pubDate>Wed, 03 Dec 2025 15:43:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=46135772</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=46135772</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46135772</guid></item><item><title><![CDATA[New comment by pgaddict in "An Update on TinyKVM"]]></title><description><![CDATA[
<p>IMHO the whole point of Qubes is that it does <i>not</i> do the compartmentalization at the level of individual applications, but groups of applications. Otherwise you'd need to very clearly specify how/when exactly the applications can exchange data, what data, etc. I'm not saying it's impossible, but "apps in the same qube VM can do whatever" is a much easier concept.</p>
]]></description><pubDate>Sun, 26 Oct 2025 21:35:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=45715399</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=45715399</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45715399</guid></item><item><title><![CDATA[New comment by pgaddict in "Tuning async IO in PostgreSQL 18"]]></title><description><![CDATA[
<p>I'm not sure what exactly you mean by "thread" here. Postgres is not thread-based - there are people working on that, but for now it's all processes.<p>Some of these limitations are mostly due to Postgres design, no doubt about that.</p>
]]></description><pubDate>Tue, 30 Sep 2025 13:31:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=45425207</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=45425207</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45425207</guid></item><item><title><![CDATA[New comment by pgaddict in "Tuning async IO in PostgreSQL 18"]]></title><description><![CDATA[
<p>I believe there are reasons why e.g. io_uring could be inherently slower in some cases, and I tried to point some of those out.<p>With io_uring everything happens in the backend process, and so consumes some of the CPU time that might otherwise be spent executing the query. All the checksum verification, memcpy into shared buffers, etc. happen in the backend. And those things can be quite expensive. With the worker method this happens in other processes, spreading the overhead.<p>Of course, on a truly I/O-bound workload (actually waiting on the I/O), this may not make a huge difference. For warmed-up cases it may be more significant.</p>
]]></description><pubDate>Tue, 30 Sep 2025 02:30:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=45421332</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=45421332</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45421332</guid></item><item><title><![CDATA[New comment by pgaddict in "Tuning async IO in PostgreSQL 18"]]></title><description><![CDATA[
<p>Debian 12/13, with kernel 6.15.<p>Sorry, should have mentioned that in the blog post.</p>
]]></description><pubDate>Tue, 30 Sep 2025 02:21:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=45421273</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=45421273</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45421273</guid></item><item><title><![CDATA[New comment by pgaddict in "Tuning async IO in PostgreSQL 18"]]></title><description><![CDATA[
<p>> Right now async IO is used for sequential scans and bitmap scans, not for index scans. My initial guess would be that it mostly helps for complex queries (that use multiple indexes, so bitmap scans) and unoptimized queries (sequential scans), not so much for straightforward and/or optimized queries that use a single index. But this is really just a guess, I'm way out of my depth at this point. I'm curious how much it'll help once it is implemented for index scans as well.<p>Those are good guesses, IMHO.<p>For sequential scans, some of the "async" work could be done by kernel read-ahead, but AIO makes it explicit and loads the data into shared buffers, not just page cache. For bitmap scans we already had prefetching by fadvise, which is somewhat similar to read-ahead (also into page cache), and there were some bugs that made it ineffective in some cases, and AIO fixes that.<p>For index scans the difference can be an order of magnitude (say, 5-10x). Doing random I/O block by block is simply awful, prefetching data is important. I was just doing some testing on TPC-H, and on scale 50 I see Q8 going from 130s to 20s, and Q19 from 50s to 8s. And smaller improvements for a couple more queries. Of course, it depends on what else the query is doing - if it's spending 1% on the index, you won't notice a difference.</p>
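The 5-10x figure for index scans comes from keeping many random reads in flight instead of one. The sketch below illustrates the queue-depth effect using threads as a stand-in for real AIO (which Postgres implements quite differently); the scratch file, block count, and worker count are arbitrary illustration choices:

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 8192
NBLOCKS = 1024

# Scratch file standing in for cold table data.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * NBLOCKS))
    path = f.name

fd = os.open(path, os.O_RDONLY)
order = random.sample(range(NBLOCKS), NBLOCKS)  # random access pattern

def read_block(blkno):
    # pread is thread-safe: each call carries its own offset.
    return os.pread(fd, BLOCK, blkno * BLOCK)

# depth=1: one outstanding read at a time, like an unprefetched index scan -
# every read pays the full device latency before the next is issued.
serial = [read_block(b) for b in order]

# depth=16: keep many reads in flight, like AIO / prefetching - latencies
# overlap, so total time approaches throughput limits instead of latency * N.
with ThreadPoolExecutor(max_workers=16) as pool:
    overlapped = list(pool.map(read_block, order))

os.close(fd)
os.unlink(path)
print(len(serial), len(overlapped))
```

Both variants return identical data; on genuinely cold storage only the overlapped one hides the per-read latency, which is why queries that spend most of their time on index-driven random I/O see the largest wins.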
]]></description><pubDate>Mon, 29 Sep 2025 13:52:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=45413837</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=45413837</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45413837</guid></item><item><title><![CDATA[New comment by pgaddict in "Tuning async IO in PostgreSQL 18"]]></title><description><![CDATA[
<p>I did a lot of tests comparing the io_method choices, and I've yet to see a realistic query where it makes a significant difference of more than a couple percent (in either direction). I'm sure it's possible to construct such queries, and that's interesting, but for real workloads it's mostly not noticeable.<p>At least that's how I see it right now; we'll see how that works on a much wider range of hardware and systems. The github repo linked from the pgsql-hackers post has a lot more results, and some charts include results for the index prefetching patch - there it makes more of a difference in some cases. But the patch is still fairly rough, it could be a bug in it too, and it has changed a lot since August.</p>
]]></description><pubDate>Mon, 29 Sep 2025 13:40:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=45413674</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=45413674</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45413674</guid></item><item><title><![CDATA[New comment by pgaddict in "1976 Soviet edition of 'The Hobbit' (2015)"]]></title><description><![CDATA[
<p>The funny thing is the local communist newspaper "Red Truth" (as if there were non-communist ones, ...) published a review of LOTR in 1977, in which they pretty much took the side of Mordor. (It might be a made-up joke from the 90s, but the spirit of absurdity is spot on for 1977.)<p>The reasoning was roughly:<p>* Mordor is obviously meant to be the USSR, as it's in the east.<p>* The orcs are clearly heavy industry workers, building the world of the future.<p>* Bilbo is obviously a son from a bourgeois family, disgusted by hard work.<p>* The west is represented by elves = aristocracy, men = bourgeoisie, hobbits = landowners.<p>* A group of reactionaries, led by Gandalf, is afraid of a made-up "threat from the east".<p>* Gandalf = a reactionary ideologue, keeping people in a state of fear of progress and knowledge.<p>* Saruman = protector of the oppressed, declared a traitor and destroyed by the reactionaries.<p>* But socialism can't be destroyed by throwing something in the fire. All power to Mordor, surrounded by reactionary neighbors.</p>
]]></description><pubDate>Thu, 14 Aug 2025 16:39:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=44902561</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=44902561</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44902561</guid></item><item><title><![CDATA[New comment by pgaddict in "Death by AI"]]></title><description><![CDATA[
<p>The toaster mention reminded me of this: <a href="https://www.youtube.com/watch?v=LRq_SAuQDec" rel="nofollow">https://www.youtube.com/watch?v=LRq_SAuQDec</a><p>This is what "talking to AI" feels like for anything mildly complex.</p>
]]></description><pubDate>Sun, 20 Jul 2025 14:34:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=44625594</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=44625594</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44625594</guid></item><item><title><![CDATA[New comment by pgaddict in "Telum II at Hot Chips 2024: Mainframe with a Unique Caching Strategy"]]></title><description><![CDATA[
<p>I've been involved in a couple of those cases, where a large company ran into an issue and chose to solve it by migrating to something else. And while the issues certainly exist (and are being addressed), the technical reasons often turned out to be a rather tiny part of the story. In the end it was really about internal politics and incentives.<p>In several such cases, the company was repeatedly warned about how they implemented some functionality, and that it would cause severe issues with bloat/vacuuming, etc. - along with suggestions for how to modify the application to avoid those issues. Their 10x engineers chose to completely ignore that advice, because in their minds they had constructed an "ideal database" and concluded that anything that behaves differently is "wrong", and that it's not their application that should change. Add a dose of politics where a new CTO wants to rebuild everything from scratch, engineers with NIH syndrome, etc. It's about incentives - if you migrate to a new system, you can write flashy blog posts about how the new system is great and saved everything.<p>You can always argue the original system would have been worse, because everyone saw it had issues - you just leave out the details about choosing not to address those issues. The engineering team is unlikely to argue against that, because that'd be against their interests too.<p>I'm absolutely not claiming the problems don't exist. They certainly do. Nor am I claiming Postgres is the ideal database for every possible workload. It certainly is not. But the worst examples I've seen were due to deliberate choices, driven by politics - and that's not described anywhere. In public everyone pretends it's just about the tech.</p>
]]></description><pubDate>Tue, 20 May 2025 00:15:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=44036409</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=44036409</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44036409</guid></item><item><title><![CDATA[New comment by pgaddict in "Qubes OS: A reasonably secure operating system"]]></title><description><![CDATA[
<p>I only noticed the jerky scrolling on pages with a lot of images, particularly hi-res images + CSS effects (blur etc.). Everything else feels OK to me (I'm sure it could be smoother, but it's not too bad, so I haven't noticed).<p>As for backups, I don't do them the Qubes way; I do "regular" backups within the VMs using rsync/duplicity/... When moving to a new machine I prefer to set up everything from scratch (and then restore the data). And it gives me all the features like incremental backups etc.</p>
]]></description><pubDate>Mon, 13 Jan 2025 16:20:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=42685109</link><dc:creator>pgaddict</dc:creator><comments>https://news.ycombinator.com/item?id=42685109</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42685109</guid></item></channel></rss>