Hacker News: mjb

New comment by mjb in "Raft Consensus with a Minority of Nodes"

mjb — Wed, 27 May 2026 17:21:29 +0000

The 70s, if you want to be pedantic (e.g. Gifford's "Weighted Voting for Replicated Data" or Thomas's "A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases", both from '79).

New comment by mjb in "Raft Consensus with a Minority of Nodes"

mjb — Wed, 27 May 2026 17:20:41 +0000

This is cool, and a really fun reminder that "majority" isn't required for quorum systems (it just happens to be the simplest way of thinking about it, and optimal in some senses). Moving from majorities to some other definition of quorum isn't super practical all that often, but is an interesting tool when you think about systems that don't have a uniform probability of failure or disconnection. That's not infrequent - large scale networks have very variable amounts of redundancy depending on geography and distance.

The idea of non-MDS erasure codes isn't quite the same, but they're related in the way that MDS codes are the easiest to think about, and non-MDS codes come with interesting complexities while opening up some cool new options for system design and recovery.

Using "majority" as the criterion has been around for a long time (e.g. Gifford in '79 https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2015/Papers..., and Thomas also in '79 https://dl.acm.org/doi/10.1145/320071.320076). Also related is the idea of weighted voting (e.g. Peleg and Wool in '95 https://www.sciencedirect.com/science/article/pii/S089054018...).

New comment by mjb in "Simple and correct snapshot isolation"

mjb — Sun, 03 May 2026 04:44:54 +0000

Good read.

I've always been a little confused about this framing of WSI. The observation that detecting read-write conflicts is sufficient for serializability dates back to at least Kung and Robinson in '83 (IIRC). It is true, though, and the observation that it's a minor change to an already MVCC database's commit logic is theoretically correct.

It's not really practically correct, though. Writes kinda have to be resolved to updated keys, so detecting w-w conflicts is very easy. In a SQL database, though, reads can be predicates, or aggregations, or even indicate a lack of data (gaps). This makes practically implementing this scheme on real world workloads pretty tricky, both correctness-wise and performance-wise. Clearly possible, but quickly devolves into a bunch of optimizations around edge cases. Granted, it is easier in databases that don't need full SQL semantics.

We actually started here early in the design of Aurora DSQL, but changed our minds and picked SI based on data about what isolation levels people actually choose (vs what they say they choose), the difficulty that optimizing schemas and queries for good performance under serializability presents to application programmers (you have to be very very careful to read only what you need), and the general large size of read sets compared to write sets in relational workloads. We might end up doing serializability down the line, but the demand isn't there once people see the real world tradeoffs.

Amusing aside (not about the article linked here). It's super common to see people try refute the performance cost of serializability using TPC-C. That's funny because TPC-C is serializable at SI, and never experiences write skew due to the structure of it's workload.

The invisible engineering behind Lambda's network

mjb — Wed, 22 Apr 2026 19:14:49 +0000

Article URL: https://www.allthingsdistributed.com/2026/04/the-invisible-engineering-behind-lambdas-network.html

Comments URL: https://news.ycombinator.com/item?id=47867970

Points: 4

# Comments: 0

New comment by mjb in "Three Cache Layers Between Select and Disk"

mjb — Thu, 12 Feb 2026 22:29:21 +0000

Cool article!

> This is why free -h on a Linux box can look alarming. You see almost no “free” memory, but most of it is “available” - and the page cache is using it.

And other buffers and stuff too. This is a great thing on bare metal, because it's a bet that the marginal cost of using an empty memory page is zero. This is true on bare metal, always. But in containers, or multi-tenant infrastructure, that isn't true anymore. That's where stuff like DAMON come in: https://www.kernel.org/doc/html/v5.17/vm/damon/index.html

In Aurora Serverless this kind of page cache management is a critical part of what the control plane does. Essentially we need to size the page cache to be big enough for great performance, but small enough not to cost the customer unnecessarily. We go into quite a lot of detail on that in our VLDB'25 paper: https://assets.amazon.science/ee/a4/41ff11374f2f865e5e24de11...

> Linux fills free memory with page cache on purpose. It’s a bet: if someone reads this block again, I already have it.

This works because most database workloads have great temporal and spatial locality. And it works well. But it's also one of the biggest practical issues people run into with relational databases in production: performance is great until it isn't. The shared buffers and page cache keep reads to near zero, but when the working set grows even a tiny bit bigger, then the rate of reads can go up super quickly.

This is why in both Aurora Serverless and Aurora DSQL we do buffer and cache sizing very dynamically, getting rid of this cliff for most workloads.

New comment by mjb in "I wanted a camera that doesn't exist, so I built it"

mjb — Tue, 06 Jan 2026 23:45:38 +0000

Indeed. Collecting cameras, and talking about cameras, is a very different hobby from photography. That's OK! Both can be fun.

Inspired me to write this blog post: https://brooker.co.za/blog/2023/04/20/hobbies.html

New comment by mjb in "I wanted a camera that doesn't exist, so I built it"

mjb — Tue, 06 Jan 2026 23:44:14 +0000

The haters will hate, but tap guides are great (e.g. https://biggatortools.com/v-tapguide-faqs, but even a block of hard wood with a clearance hole drilled in it works fine).

Unless you're tapping something super tough (306?), Amazon taps are fine for hand tapping. Go in straight, use a good lubricant.

New comment by mjb in "$50 PlanetScale Metal Is GA for Postgres"

mjb — Mon, 15 Dec 2025 17:46:29 +0000

I don't think either is a bad choice, but Aurora has some advantages if you're not a DB expert. Starting with Aurora Serverless:

- Aurora storage scales with your needs, meaning that you don't need to worry about running out of space as your data grows. - Aurora will auto-scale CPU and memory based on the needs of your application, within the bounds you set. It does this without any downtime, or even dropping connections. You don't have to worry about choosing the right CPU and memory up-front, and for most applications you can simply adjust your limits as you go. This is great for applications that are growing over time, or for applications with daily or weekly cycles of usage.

The other Aurora option is Aurora DSQL. The advantages of picking DSQL are:

- A generous free tier to get you going with development. - Scale-to-zero and scale-up, on storage, CPU, and memory. If you aren't sending any traffic to your database it costs you nothing (except storage), and you can scale up to millions of transactions per second with no changes. - No infrastructure to configure or manage, no updates, no thinking about replicas, etc. You don't have to understand CPU or memory ratios, think about software versions, think about primaries and secondaries, or any of that stuff. High availability, scaling of reads and writes, patching, etc is all built-in.

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 04:20:18 +0000

I spoke about this exact thing at a conference (HPTS’19) a while back. This can work, but introduces modal behaviors into systems that make reasoning about availability very difficult and tends to cause meta stable behaviors and long outages.

The feedback loop is replicas slow -> traffic increases to primary -> primary slows -> replicas slow, etc. The only way out of this loop is to shed traffic.

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 04:18:12 +0000

When does AP help?

It helps in the case where clients are (a) able to contact a minority partition, and (b) can tolerate eventual consistency, and (c) can’t contact the majority partition. These cases are quite rare in modern internet-connected applications.

Consider a 3AZ cloud deployment with remote clients on the internet, and one AZ partitioned off. Most often, clients from the outside will either be able to contact the remaining majority (the two healthy AZs), or will be able to contact nobody. Rarely, clients from the outside will have a path into the minority partition but not the majority partition, but I don’t think I’ve seen that happen in nearly two decades of watching systems like this.

What about internal clients in the partitioned off DC? Yes, the trade-off is that they won’t be able to make isolated progress. If they’re web servers or whatever, that’s moot because they’re partitioned off and there’s no work to do. Same if they’re a training cluster, or other highly connected workloads. There are workloads that can tolerate a ton of asynchrony where being able to continue while disconnected is interesting, but they’re the exception rather than the rule.

Weak consistency is much more interesting as a mechanism for reducing latency (as DynamoDB does, for example) or increasing scalability (as the typical RDBMS ‘read replicas’ pattern does).

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 04:12:38 +0000

Practically, the difference in availability for typical internet connected application is very small. Partitions do happen, but in most cases its possible to route user traffic around them, given the paths that traffic tends to take into large-scale data center clusters (redundant, typically not the same paths as the cross-DC traffic). The remaining cases do exist, but are exceedingly rare in practice.

Note that I’m not saying that partitions don’t happen. They do! But in typical internet connected applications the cases where a significant proportion of clients is partitioned into the same partition as a minority of the database (i.e. the cases where AP actually improves availability) are very rare in practice.

For client devices and IoT, partitions off from the main internet are rare, and there local copies of data are a necessity.

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 04:08:57 +0000

Yes, you can do stuff like that. You might enjoy the CRAQ paper by Terrace et al, which does something similar to what you are saying (in a very different setting, chain replication rather than DBs).

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 04:06:57 +0000

(Op here) No deadlocks needed! There’s nothing about providing strong consistency (or even strong isolation) that requires deadlocks to be a thing. DSQL, for example, doesn’t have them*.

Event sourcing architectures can be great, but they also tend to be fairly complex (a lot of moving parts). The bigger practical problem is that they make it quite hard to offer clients ‘outside the architecture’ meaningful read-time guarantees stronger than a consistent prefix. That makes clients’ lives hard for the reasons I argue in the blog post.

I really like event-based architectures for things like observability, metering, reporting, and so on where clients can be very tolerant to seeing bounded stale data. For control planes, website backends, etc, I think strongly consistent DB architectures tend to be both simpler and offer a better customer experience.

* Ok, there’s one edge case in the cross-shard commit protocol where two committers can deadlock, which needs to be resolved by aborting one of them (the moral equivalent of WAIT-DIE). This never happens with single-shard transactions, and can’t be triggered by any SQL patterns.

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 01:10:38 +0000

(OP here).

The point of that section, which maybe isn’t obvious enough, is to reflect on how eventually-consistent read replicas limit the options of the database system builder (rather than the application builder). If I’m building the transaction layer of a database, I want to have a bunch of options for where to send me reads, so I don’t have the send the whole read part of every RMW workloads to the single leader.

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 01:07:37 +0000

(OP here). I don’t love leaking this kind of thing through the API. I think that, for most client/server shaped systems at least, we can offer guarantees like linearizability to all clients with few hard real-world trade-offs. That does require a very careful approach to designing the database, and especially to read scale-out (as you say) but it’s real and doable.

By pushing things like read-scale-out into the core database, and away from replicas and caches, we get to have stronger client and application guarantees with less architectural complexity. A great combination.

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 01:03:59 +0000

That’s a fair point. To be fair to the academic definitions, “eventually consistent” is a quiescent state in most definitions, and there are more specific ones (like “bounded staleness”, or “monotonic prefix”) that are meaningful to clients of the system.

But I agree with you in general - the dynamic nature of systems means, in my mind, that you need to use client-side guarantees, rather than state guarantees, to reason about this stuff in general. State guarantees are nicer to prove and work with formally (see Adya, for example) while client side guarantees are trickier and feel less fulfilling formally (see Crooks et al “Seeing is Believing”, or Herlihy and Wing).

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 00:59:48 +0000

Read-your-writes is a client guarantee, that requires stickiness (i.e. a definition of “your”) to be meaningful. It’s not a level of consistency I love, because it raises all kinds of edge-case questions. For example, if I have to reconnect, am I still the same “your”? This isn’t even the some rare edge case! If I’m automating around a CLI, for example, how is the server meant to know that the next CLI invocation from the same script (a different process) is the same “your”? Sure, I can fix that with some kind of token, but then I’ve made the API more complicated.

Linearizability, as a global guarantee, is much nicer because it avoids all those edge cases.

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 00:54:54 +0000

The point is that, in a disaggregated system, the transaction processor has less flexibility about how to route parts of the same transaction (that section is a point about internal implementation details of transaction systems).

New comment by mjb in "Why Strong Consistency?"

mjb — Fri, 28 Nov 2025 00:53:28 +0000

There is no reason a database can’t be both strongly consistent (linearizable, or equivalent) and available to clients on the majority side of a partition. This is, by far, the common case of real-world partitions in deployments with 3 data centers. One is disconnected or fails. The other two can continue, offering both strong consistency and availability to clients on their side of the partition.

The Gilbert and Lynch definition of CAP calls this state ‘unavailable’, in that it’s not available to all clients. Practically, though, it’s still available for two thirds of clients (or more, if we can reroute clients from the outside), which seems meaningfully ‘available’ to me!

If you don’t believe me, check out Phil Bernstein’s paper (Bernstein and Das) about this. Or read the Gilbert and Lynch proof carefully.

New comment by mjb in "/dev/null is an ACID compliant database"

mjb — Fri, 24 Oct 2025 01:15:01 +0000

This doesn't work as cleanly for SQL-style transactions where there are tons of RW transactions, sadly.