Hacker News: mjw_byrne

New comment by mjw_byrne in "FFmpeg 8.0 adds Whisper support"

mjw_byrne — Wed, 13 Aug 2025 15:25:59 +0000

They're pretty different in British English; I struggled to figure it out until I started thinking about how it would sound with an American accent.

New comment by mjw_byrne in "Why Building Billing Systems Is So Painful (2024)"

mjw_byrne — Thu, 07 Aug 2025 01:30:28 +0000

I've done a from-scratch billing system build. As well as the complexities in the article, some of the logic required can result in rather exotic SQL.

Answering questions like "what was the maximum number of concurrent sessions per account between these two dates" with a SQL query is interesting. Making it perform properly adds a layer of fun.
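To give a feel for the shape of that question, here is a minimal sketch in Python rather than SQL. It sweeps over start/end events per account. The function name, the (account, start, end) tuple layout and the half-open [start, end) convention are all assumptions for illustration, not details of the system described above.

```python
from collections import defaultdict


def max_concurrent_sessions(sessions, window_start, window_end):
    """Peak number of overlapping sessions per account within a window.

    sessions: iterable of (account, start, end), each covering [start, end).
    Times can be any mutually comparable values (ints, datetimes, ...).
    """
    events = defaultdict(list)
    for account, start, end in sessions:
        # Clip each session to the window; skip it if nothing is left.
        s = max(start, window_start)
        e = min(end, window_end)
        if s >= e:
            continue
        events[account].append((s, 1))
        events[account].append((e, -1))

    peaks = {}
    for account, evs in events.items():
        # At equal timestamps -1 sorts before +1, so a session ending at t
        # does not overlap one starting at t.
        evs.sort()
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[account] = peak
    return peaks
```

Roughly the same idea can be written in SQL as a running sum over +1/-1 events ordered by time, which is where the query starts to look exotic.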

New comment by mjw_byrne in "The U.K. closed a tax loophole for the global rich, now they're fleeing"

mjw_byrne — Sat, 19 Jul 2025 23:05:48 +0000

It's 40% above a threshold; the trouble is that the threshold is laughably tiny when you compare it to house prices.

It's also relatively easily defeated by transferring assets to others well in advance of your death, which is the kind of thing very wealthy people are more likely to be able to arrange.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Fri, 28 Mar 2025 18:25:22 +0000

It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.

This is a different thing altogether from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt which I was arguing is doomed to fail.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Fri, 28 Mar 2025 12:08:13 +0000

> I'm not clear why quotes prevent parallel processing?

Because of the "non-local" effect of quotes, you can't just jump into the middle of a file and start reading it, because you can't tell whether you're inside a quoted section or not. If (big if) you know something about the structure of the data, you might be able to guess. So that's why I said "tricky" instead of "impossible".

Contrast that with my escaping-only strategy, where you can jump into the middle of a file and fully understand your context by looking one char on either side.
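A rough sketch of what "jumping into the middle" could look like for a backslash-escaped format. The function names are hypothetical, and the sketch assumes records end with an unescaped newline. To cope with escaped backslashes it counts the run of backslashes immediately before a newline, which is usually zero or one character long.

```python
def is_escaped(data: bytes, i: int) -> bool:
    """True if the byte at index i is preceded by an odd number of backslashes."""
    count = 0
    j = i - 1
    while j >= 0 and data[j] == 0x5C:  # 0x5C is the backslash byte
        count += 1
        j -= 1
    return count % 2 == 1


def next_record_start(data: bytes, offset: int) -> int:
    """Index of the first record that begins at or after offset."""
    if offset <= 0:
        return 0
    # A record starts at offset if the byte just before it is an
    # unescaped newline, so begin the search one byte early.
    i = data.find(b"\n", offset - 1)
    while i != -1:
        if not is_escaped(data, i):
            return i + 1
        i = data.find(b"\n", i + 1)
    return len(data)
```

Each worker can call this on its own chunk boundary without knowing anything about the bytes that came before, which is not possible when a quote a megabyte earlier might change the answer.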

> Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!

I used to be a data analyst at a management consultancy. A very common scenario would be that I'm handed a multi-gigabyte CSV and told to "import the data". No spec, no schema, no nothing. Data loss or corruption is totally unacceptable, because we were highly risk-sensitive. So step 1 is to go through the whole thing trying to determine field types by testing them. Does column 3 always parse as a timestamp? Great, we'll call it a timestamp. That kind of thing. In that case, it's great to be able to parallelise reading.
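A minimal sketch of that "test every value" pass, assuming the rows have already been split into lists of strings. The type names, the regexes and the choice of ISO 8601 for timestamps are all assumptions for illustration; a real import would need candidates tailored to the target database.

```python
import re
from datetime import datetime

_INT = re.compile(r"[+-]?[0-9]+")
_FLOAT = re.compile(r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?")


def _is_timestamp(value):
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


# Ordered from most to least specific; "text" is the fallback.
CANDIDATES = [
    ("integer", lambda v: _INT.fullmatch(v) is not None),
    ("float", lambda v: _FLOAT.fullmatch(v) is not None),
    ("timestamp", _is_timestamp),
]


def infer_column_types(rows):
    """Return one type name per column, keeping only types every value passes."""
    alive = None
    for line_no, row in enumerate(rows, start=1):
        if alive is None:
            alive = [list(CANDIDATES) for _ in row]
        elif len(row) != len(alive):
            # Refuse to guess: a ragged row may mean the file was misparsed.
            raise ValueError(
                f"row {line_no} has {len(row)} fields, expected {len(alive)}"
            )
        for i, value in enumerate(row):
            alive[i] = [(name, ok) for name, ok in alive[i] if ok(value)]
    return [c[0][0] if c else "text" for c in (alive or [])]
```

Each chunk of rows can be tested independently and the surviving candidates intersected afterwards, which is why being able to split the file at arbitrary offsets helps here.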

> And data corruption is data corruption

Agreed, but I prefer data corruption which messes up one field, not data corruption which makes my importer sit there for 5 minutes thinking the whole file is a 10GB string value and then throw "EOF in quoted field".

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Fri, 28 Mar 2025 11:57:41 +0000

I agree that it's best to pick "unlikely" delimiters so that you don't have to pepper your data with escape chars.

But some people (plenty in this thread) really do think "pick a delimiter that won't be in the data" - and then forget quoting and/or escaping - is a viable solution.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Fri, 28 Mar 2025 11:55:21 +0000

Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go etc.) and bigwigs who wanted impressive-sounding things in Powerpoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.

Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes. And I don't mean by being pathologically abusive, I mean stuff like "let's see if it can correctly handle a UTF-8 BOM...oh no, it can't".
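For anyone who hasn't run into the BOM problem, a small illustration using Python's standard library (the sample bytes are made up). Decoding as plain UTF-8 leaves the BOM glued to the first header name, while the "utf-8-sig" codec strips it.

```python
import csv
import io

# A tiny CSV file that starts with a UTF-8 byte order mark (EF BB BF).
raw = b"\xef\xbb\xbfid,name\n1,Ada\n"

naive_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))
aware_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

# The naive header's first column is "\ufeffid", so a lookup for "id" fails.
print(naive_header, aware_header)
```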

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Fri, 28 Mar 2025 11:51:31 +0000

Right, but the original point I was responding to is that control characters are disallowed in the data and therefore don't need to be escaped. If you're going to have an escaping mechanism then you can use "normal" characters like comma as delimiters, which is better because they can be read and written normally.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Thu, 27 Mar 2025 20:01:44 +0000

It is true for everything that uses quoting; I didn't mean to imply otherwise.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Thu, 27 Mar 2025 18:40:18 +0000

If you disallow control characters so that you can use them as delimiters, then CSV itself becomes a "binary" data format - or to put it another way, you lose the ability to nest CSV.

It isn't good enough to say "but people don't/won't/shouldn't do that", because it will just happen regardless. I've seen nested CSV in real-life data.
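To illustrate what nesting means here, a small sketch using Python's csv module (the helper names and sample data are made up). Because quoting can represent any character, a whole CSV document survives a round trip as a single field of another CSV, which a "these characters are banned" scheme cannot offer for its own output.

```python
import csv
import io


def to_csv(rows):
    """Serialise rows to a CSV string using the default dialect."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse a CSV string back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


inner_rows = [["x", "y"], ["1", "2,3"]]
inner = to_csv(inner_rows)

# The inner document, commas, quotes and newlines included, is one field.
outer = to_csv([["id", "payload"], ["7", inner]])

parsed = from_csv(outer)
recovered = from_csv(parsed[1][1])
```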

Compare to the zero-terminated strings used by C, one legacy of which is that PostgreSQL doesn't quite support UTF-8 properly, because it can't handle a 0 byte in a string, because 0 is "special" in C.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Thu, 27 Mar 2025 16:48:29 +0000

The idea that binary data doesn't go in CSVs is debatable; people do all sorts of weird stuff. Part of the robustness of a format is coping with abuse.

But putting that aside, if the control chars are not text, then you sacrifice human-readability and human-writability. In which case, you may as well just use a binary format.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Thu, 27 Mar 2025 16:22:17 +0000

There's just no such thing as a delimiter which won't find its way into the data. Quoting and escaping really are the only robust way.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Thu, 27 Mar 2025 13:23:52 +0000

I used to be a data analyst at a Big 4 management consultancy, so I've seen an awful lot of this kind of thing. One thing I never understood is the inverse correlation between "cost of product" and "ability to do serialisation properly".

Free database like Postgres? Perfect every time.

Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Thu, 27 Mar 2025 12:18:43 +0000

I considered this but then went the other way - a \ before anything other than a \, newline or comma is treated as an error. This leaves room for adding features, e.g. \N to signify a SQL NULL.
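A minimal sketch of a decoder with that strict rule. The function name is hypothetical, and treating \N as NULL only when it makes up the entire field is my assumption about how such an extension might work, not something specified above.

```python
def decode_record(line: str) -> list:
    """Decode one record (without its terminating newline) into fields.

    A backslash may only precede a backslash, comma or newline. A field
    consisting solely of \\N decodes to None. Anything else is an error.
    """
    fields = []
    buf = []
    is_null = False
    i = 0
    n = len(line)
    while i < n:
        ch = line[i]
        if ch == "\\":
            if i + 1 >= n:
                raise ValueError("dangling backslash at end of record")
            nxt = line[i + 1]
            if nxt in "\\,\n":
                buf.append(nxt)
            elif nxt == "N" and not buf and (i + 2 == n or line[i + 2] == ","):
                # \N is only accepted when it is the whole field.
                is_null = True
            else:
                raise ValueError(f"invalid escape \\{nxt} at offset {i}")
            i += 2
        elif ch == ",":
            fields.append(None if is_null else "".join(buf))
            buf = []
            is_null = False
            i += 1
        elif ch == "\n":
            raise ValueError(f"unescaped newline at offset {i}")
        else:
            buf.append(ch)
            i += 1
    fields.append(None if is_null else "".join(buf))
    return fields
```

Rejecting unknown escapes up front is what keeps sequences like \N free to be given a meaning later without changing how existing valid files decode.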

Regarding quoting and escaping, there are two options that make sense to me - either use quoting, in which case quotes are self-escaped and that's that; or use escaping, in which case quotes aren't necessary at all.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Thu, 27 Mar 2025 11:49:10 +0000

Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.

I have in the past done data extractions from systems which really can't serialise properly, where the only option is to concat all the fields with some "unlikely" string like @#~!$ as a separator, then pick it apart later. Ugh.

New comment by mjw_byrne in "A love letter to the CSV format"

mjw_byrne — Wed, 26 Mar 2025 17:36:42 +0000

CSV is ever so elegant but it has one fatal flaw - quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:

1. It's tricky to parallelise processing of CSV.
2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).
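A small demonstration of the non-local effect using Python's csv module; the sample data is made up. Dropping a single closing quote on the second line swallows every later row into one field.

```python
import csv
import io

good = 'id,note\n1,"hello"\n2,world\n3,bye\n'
# Corrupt exactly one byte: remove the closing quote after hello.
bad = good.replace('"hello"', '"hello', 1)

rows_good = list(csv.reader(io.StringIO(good, newline="")))
rows_bad = list(csv.reader(io.StringIO(bad, newline="")))

# The intact file yields four rows; the corrupted one yields two, with
# rows 2 and 3 absorbed into the "note" field of row 1.
print(len(rows_good), len(rows_bad))
```

Python's reader quietly returns the swallowed rows as one field by default; other parsers report an error at end of file instead. Either way, the damage is not confined to the field that was corrupted.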

So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
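A minimal sketch of that scheme, assuming comma as the field separator and newline as the record terminator. The function names are mine and this is only one way to write it.

```python
import re

_SPECIAL = re.compile(r"([\\,\n])")


def encode_field(value: str) -> str:
    """Prefix every backslash, comma and newline with a backslash."""
    return _SPECIAL.sub(r"\\\1", value)


def encode_record(fields) -> str:
    return ",".join(encode_field(f) for f in fields) + "\n"


def decode(text: str) -> list:
    """Decode a whole document back into a list of records."""
    records, fields, buf = [], [], []
    escaped = False
    for ch in text:
        if escaped:
            # The character after a backslash is always taken literally.
            buf.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        elif ch == "\n":
            fields.append("".join(buf))
            buf = []
            records.append(fields)
            fields = []
        else:
            buf.append(ch)
    if escaped or buf or fields:
        raise ValueError("input ends in the middle of a record")
    return records
```

The decoder's only state is "was the previous character an unescaped backslash", so a corrupted byte can damage at most the fields and record boundary right next to it.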

New comment by mjw_byrne in "AI systems with 'unacceptable risk' are now banned in the EU"

mjw_byrne — Tue, 04 Feb 2025 10:52:33 +0000

As with GDPR, the spirit is admirable but the fundamental definitions have been hand-waved. So for the foreseeable future, the expensive lawyers you hire are going to answer the important questions with "well, we don't have much case law yet..."

Also, the definition of AI seems to exclude anything that doesn't "exhibit adaptiveness after deployment". So, a big neural network doing racist facial recognition crime prediction isn't AI as long as it can't learn on-the-fly? Is my naive HTTP request rate limiter "exhibiting adaptiveness" by keeping track of each customer's typical request rate in a float32?

Laws that regulate tech need to get into the weeds of exactly what is meant by the various terms up-front, even if that means loads of examples, clarification etc.

New comment by mjw_byrne in "The GPU, not the TPM, is the root of hardware DRM"

mjw_byrne — Thu, 02 Jan 2025 09:53:00 +0000

"No one wants a preboot password though" - really? Doesn't strike me as particularly inconvenient, especially given the relative rarity of actual bootups these days.

I've been using bog-standard FDE for as long as I can remember. One extra password entry per bootup for almost-perfect security seems like great value to me.

New comment by mjw_byrne in "Jack Elam and the Fly in 'Once Upon a Time in the West'"

mjw_byrne — Tue, 31 Dec 2024 00:06:11 +0000

Just adding another comment to say how brilliant this film is. So atmospheric, such great music, such a grand presentation of the Wild West and its demise. It makes other westerns feel half-baked.

New comment by mjw_byrne in "Hyrum’s Law in Golang"

mjw_byrne — Thu, 21 Nov 2024 14:01:27 +0000

The map iteration order change helps to avoid breaking changes in future, by preventing reliance on any specific ordering, but when the change was made it was breaking for anything that was relying on the previous ordering behaviour.

IMO this is a worthwhile tradeoff. I use Go a lot and love the strong backwards compatibility, but I would happily accept a (slightly) higher rate of breaking changes if it meant greater freedom for the Go devs to improve performance, add features etc.

Based on the kind of hell users of other ecosystems seem willing to tolerate (cough Python cough), I believe I am not alone in this viewpoint.