<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: mjw_byrne</title><link>https://news.ycombinator.com/user?id=mjw_byrne</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 05 Apr 2026 22:32:42 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=mjw_byrne" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by mjw_byrne in "FFmpeg 8.0 adds Whisper support"]]></title><description><![CDATA[
<p>They're pretty different in British English; I struggled to figure it out until I started thinking about how it would sound with an American accent.</p>
]]></description><pubDate>Wed, 13 Aug 2025 15:25:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=44889709</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=44889709</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44889709</guid></item><item><title><![CDATA[New comment by mjw_byrne in "Why Building Billing Systems Is So Painful (2024)"]]></title><description><![CDATA[
<p>I've done a from-scratch billing system build. As well as the complexities in the article, some of the logic required can result in rather exotic SQL.<p>Answering questions like "what was the maximum number of concurrent sessions per account between these two dates" with a SQL query is interesting. Making it perform properly adds a layer of fun.</p>
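<p>For illustration only, the peak-concurrency question can be answered with a sweep over start and end events. This is a Python sketch rather than SQL, and the (account, start, end) tuple layout and the function name are assumptions, not the schema or query from the system described above.</p>

```python
from collections import defaultdict
from datetime import datetime


def max_concurrent(sessions, window_start, window_end):
    """Return {account: peak number of overlapping sessions} inside the window.

    `sessions` is an iterable of (account, start, end) tuples (assumed layout).
    """
    events = defaultdict(list)
    for account, start, end in sessions:
        # Clip each session to the reporting window; skip sessions outside it.
        start, end = max(start, window_start), min(end, window_end)
        if start < end:
            events[account].append((start, 1))   # session opens
            events[account].append((end, -1))    # session closes

    peaks = {}
    for account, evs in events.items():
        current = peak = 0
        # Sorting the tuples puts -1 before +1 at equal timestamps, so a
        # session ending at t is not counted as overlapping one starting at t.
        for _, delta in sorted(evs):
            current += delta
            peak = max(peak, current)
        peaks[account] = peak
    return peaks


sessions = [
    ("a", datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 11, 0)),
    ("a", datetime(2025, 1, 1, 10, 30), datetime(2025, 1, 1, 12, 0)),
    ("a", datetime(2025, 1, 1, 11, 0), datetime(2025, 1, 1, 11, 30)),
    ("b", datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 9, 30)),
]
peaks = max_concurrent(
    sessions, datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 12, 0)
)
```

<p>Whether a session that ends at the same instant another begins counts as concurrent is a billing-policy decision; the sketch assumes it does not.</p>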
]]></description><pubDate>Thu, 07 Aug 2025 01:30:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=44819685</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=44819685</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44819685</guid></item><item><title><![CDATA[New comment by mjw_byrne in "The U.K. closed a tax loophole for the global rich, now they're fleeing"]]></title><description><![CDATA[
<p>It's 40% above a threshold; the trouble is that the threshold is laughably tiny when you compare it to house prices.<p>It's also relatively easily defeated by transferring assets to others well in advance of your death, which is the kind of thing very wealthy people are more likely to be able to arrange.</p>
]]></description><pubDate>Sat, 19 Jul 2025 23:05:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=44620294</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=44620294</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44620294</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.<p>This is a different thing altogether from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt which I was arguing is doomed to fail.</p>
]]></description><pubDate>Fri, 28 Mar 2025 18:25:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=43508547</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43508547</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43508547</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>> I'm not clear why quotes prevent parallel processing?<p>Because of the "non-local" effect of quotes, you can't just jump into the middle of a file and start reading it, because you can't tell whether you're inside a quoted section or not.  If (big if) you know something about the structure of the data, you might be able to guess.  So that's why I said "tricky" instead of "impossible".<p>Contrast to my escaping-only strategy, where you can jump into the middle of a file and fully understand your context by looking one char on either side.<p>> Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!<p>I used to be a data analyst at a management consultancy.  A very common scenario would be that I'm handed a multi-gigabyte CSV and told to "import the data".  No spec, no schema, no nothing.  Data loss or corruption is totally unacceptable, because we were highly risk-sensitive.  So step 1 is to go through the whole thing trying to determine field types by testing them.  Does column 3 always parse as a timestamp?  Great, we'll call it a timestamp.  That kind of thing.  In that case, it's great to be able to parallelise reading.<p>> And data corruption is data corruption<p>Agreed, but I prefer data corruption which messes up one field, not data corruption which makes my importer sit there for 5 minutes thinking the whole file is a 10GB string value and then throw "EOF in quoted field".</p>
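<p>A rough Python sketch of how a worker could resynchronise after seeking to an arbitrary offset in an escape-only file. It assumes the format described in this thread (comma, newline and \ are all \-escaped); the function names are made up for the example. One caveat the sketch makes explicit: because \ itself is escaped, deciding whether a newline is escaped means counting the run of backslashes immediately before it, so the lookback is bounded by that run rather than by the size of the file.</p>

```python
def is_escaped(data: bytes, i: int) -> bool:
    """True if the byte at index i is preceded by an odd run of backslashes."""
    n = 0
    while i - n - 1 >= 0 and data[i - n - 1] == 0x5C:  # 0x5C is the backslash
        n += 1
    return n % 2 == 1


def next_record_start(data: bytes, offset: int) -> int:
    """Return the index just after the first unescaped newline at or after offset."""
    i = data.find(b"\n", offset)
    while i != -1 and is_escaped(data, i):
        i = data.find(b"\n", i + 1)
    return len(data) if i == -1 else i + 1


# Two records; the first contains an escaped newline inside its second field.
data = b"a,b\\\nc\nd,e\n"
start = next_record_start(data, 2)
```

<p>Each parallel worker would call something like next_record_start on its chunk boundary and begin parsing from the returned index. No equivalent local check exists for quoted CSV, which is the point being made above.</p>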
]]></description><pubDate>Fri, 28 Mar 2025 12:08:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=43504355</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43504355</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43504355</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>I agree that it's best to pick "unlikely" delimiters so that you don't have to pepper your data with escape chars.<p>But some people (plenty in this thread) really do think "pick a delimiter that won't be in the data" - and then forget quoting and/or escaping - is a viable solution.</p>
]]></description><pubDate>Fri, 28 Mar 2025 11:57:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=43504257</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43504257</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43504257</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go etc.) and bigwigs who wanted impressive-sounding things in Powerpoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.<p>Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes.  And I don't mean by being pathologically abusive, I mean stuff like "let's see if it can correctly handle a UTF-8 BOM...oh no, it can't".</p>
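<p>As a small illustration of the BOM test mentioned above, here is how the failure typically shows up, sketched in Python. The sample data is invented; the utf-8 and utf-8-sig codecs are part of the standard library.</p>

```python
import codecs

# A tiny CSV payload with a UTF-8 byte order mark in front (invented data).
raw = codecs.BOM_UTF8 + "id,name\n1,Zoë\n".encode("utf-8")

naive = raw.decode("utf-8")      # the BOM survives as U+FEFF on the first header
clean = raw.decode("utf-8-sig")  # the BOM is stripped if present

naive_header = naive.splitlines()[0].split(",")
clean_header = clean.splitlines()[0].split(",")
```

<p>With the naive decode, the first column is named "\ufeffid" rather than "id", so lookups by column name can fail silently.</p>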
]]></description><pubDate>Fri, 28 Mar 2025 11:55:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43504233</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43504233</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43504233</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>Right, but the original point I was responding to is that control characters are disallowed in the data and therefore don't need to be escaped.  If you're going to have an escaping mechanism then you can use "normal" characters like comma as delimiters, which is better because they can be read and written normally.</p>
]]></description><pubDate>Fri, 28 Mar 2025 11:51:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=43504209</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43504209</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43504209</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>It is true for everything that uses quoting; I didn't mean to imply otherwise.</p>
]]></description><pubDate>Thu, 27 Mar 2025 20:01:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=43497528</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43497528</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43497528</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>If you disallow control characters so that you can use them as delimiters, then CSV itself becomes a "binary" data format - or to put it another way, you lose the ability to nest CSV.<p>It isn't good enough to say "but people don't/won't/shouldn't do that", because it will just happen regardless.  I've seen nested CSV in real-life data.<p>Compare to the zero-terminated strings used by C, one legacy of which is that PostgreSQL doesn't quite support UTF-8 properly, because it can't handle a 0 byte in a string, because 0 is "special" in C.</p>
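<p>To make the nesting point concrete, here is a minimal Python sketch using the standard csv module: because the delimiter is allowed in the data and handled by quoting, an entire CSV document can be stored as a single field of another CSV document and recovered intact. The dumps and loads helpers are names chosen for this example.</p>

```python
import csv
import io


def dumps(rows):
    """Serialise rows to CSV text using the csv module's default quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def loads(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


# The inner document contains commas, quotes and newlines of its own.
inner = dumps([["a", "b,c"], ["d", 'say "hi"']])

# Store the whole inner document as one field of the outer document.
outer = dumps([["1", inner], ["2", "plain"]])

recovered_inner = loads(outer)[0][1]
```

<p>A format that reserves a control character as its delimiter and bans it from the data cannot do this, since the inner document necessarily contains the banned character.</p>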
]]></description><pubDate>Thu, 27 Mar 2025 18:40:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=43496669</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43496669</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43496669</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>The idea that binary data doesn't go in CSVs is debatable; people do all sorts of weird stuff.  Part of the robustness of a format is coping with abuse.<p>But putting that aside, if the control chars are not text, then you sacrifice human-readability and human-writability.  In which case, you may as well just use a binary format.</p>
]]></description><pubDate>Thu, 27 Mar 2025 16:48:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=43495454</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43495454</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43495454</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>There's just no such thing as a delimiter which won't find its way into the data.  Quoting and escaping really are the only robust way.</p>
]]></description><pubDate>Thu, 27 Mar 2025 16:22:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=43495203</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43495203</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43495203</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>I used to be a data analyst at a Big 4 management consultancy, so I've seen an awful lot of this kind of thing.  One thing I never understood is the inverse correlation between "cost of product" and "ability to do serialisation properly".<p>Free database like Postgres?  Perfect every time.<p>Big complex 6-figure e-discovery system?  Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".</p>
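<p>A short Python sketch of why the 0xFF delimiter fails. In Windows-1252, byte 0xFF is the letter ÿ, which occurs in real names. The record below is invented for illustration and is not taken from any actual export.</p>

```python
DELIM = b"\xff"  # the "it won't be in the data" delimiter

# Invented two-field record; the first field legitimately contains ÿ.
record = ["Pierre Louÿs", "Paris"]

# Join without any quoting or escaping, as the system described would.
line = DELIM.join(field.encode("cp1252") for field in record)

# Splitting on the delimiter now yields three fields instead of two.
parts = [p.decode("cp1252") for p in line.split(DELIM)]
```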
]]></description><pubDate>Thu, 27 Mar 2025 13:23:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=43493362</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43493362</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43493362</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>I considered this but then went the other way - a \ before anything other than a \, newline or comma is treated as an error.  This leaves room for adding features, e.g. \N to signify a SQL NULL.<p>Regarding quoting and escaping, there are two options that make sense to me - either use quoting, in which case quotes are self-escaped and that's that; or use escaping, in which case quotes aren't necessary at all.</p>
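<p>A hedged Python sketch of the strict rule described above: a \ followed by anything other than \, comma or newline is rejected, which keeps \N free to mean SQL NULL. Treating \N as valid only when it makes up the entire field is an assumption of this sketch, as is the parse function name.</p>

```python
def parse(text):
    """Parse escape-only tabular text into rows of str, with None for NULL."""
    rows, row, buf, null = [], [], [], False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            nxt = text[i + 1 : i + 2]  # empty string at end of input
            # Assumption: \N is only valid as a complete field.
            at_field_end = text[i + 2 : i + 3] in ("", ",", "\n")
            if nxt in ("\\", ",", "\n"):
                buf.append(nxt)        # escaped literal character
            elif nxt == "N" and not buf and at_field_end:
                null = True            # the whole field is NULL
            else:
                raise ValueError(f"invalid escape at offset {i}")
            i += 2
        elif ch in (",", "\n"):
            # Unescaped comma or newline ends the current field.
            row.append(None if null else "".join(buf))
            buf, null = [], False
            if ch == "\n":
                rows.append(row)
                row = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    if buf or null or row:
        # Final record without a trailing newline.
        row.append(None if null else "".join(buf))
        rows.append(row)
    return rows


rows = parse("a\\,b,\\N,c\\\\\n1,2\\\n3,\n")
```

<p>Rejecting unknown escapes up front means a later version of the format can assign them a meaning without changing how existing valid files are read.</p>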
]]></description><pubDate>Thu, 27 Mar 2025 12:18:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=43492815</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43492815</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43492815</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>Exactly.  "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.<p>I have in the past done data extractions from systems which really can't serialise properly, where the only option is to concat all the fields with some "unlikely" string like @#~!$ as a separator, then pick it apart later.  Ugh.</p>
]]></description><pubDate>Thu, 27 Mar 2025 11:49:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43492613</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43492613</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43492613</guid></item><item><title><![CDATA[New comment by mjw_byrne in "A love letter to the CSV format"]]></title><description><![CDATA[
<p>CSV is ever so elegant but it has one fatal flaw - quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000.  This has (at least) two annoying consequences:<p>1. It's tricky to parallelise processing of CSV.
2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).<p>So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped.  It's as easy to serialise and deserialise as CSV but without the above drawbacks.</p>
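<p>A minimal Python sketch of the plain-escaping scheme described above, assuming exactly three special characters (comma, newline and \), each prefixed with \ when it appears in a field. The function names are invented for the example, and the reader is deliberately lenient.</p>

```python
import re

_SPECIAL = re.compile(r"[\\,\n]")


def escape_field(value):
    """Prefix every backslash, comma and newline with a backslash."""
    return _SPECIAL.sub(lambda m: "\\" + m.group(0), value)


def dump_row(fields):
    """Serialise one record; unescaped commas separate fields, newline ends it."""
    return ",".join(escape_field(f) for f in fields) + "\n"


def load_row(line):
    """Parse one record produced by dump_row (lenient: no error checking)."""
    fields, buf, chars = [], [], iter(line)
    for ch in chars:
        if ch == "\\":
            buf.append(next(chars, ""))  # take the escaped character literally
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        elif ch == "\n":
            break                        # unescaped newline ends the record
        else:
            buf.append(ch)
    fields.append("".join(buf))
    return fields


row = ["a,b", "line1\nline2", "back\\slash", ""]
encoded = dump_row(row)
decoded = load_row(encoded)
```

<p>Because the meaning of each comma or newline depends only on the backslashes directly in front of it, a corrupted byte damages at most the field or record it sits in, not everything after it.</p>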
]]></description><pubDate>Wed, 26 Mar 2025 17:36:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=43484641</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=43484641</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43484641</guid></item><item><title><![CDATA[New comment by mjw_byrne in "AI systems with 'unacceptable risk' are now banned in the EU"]]></title><description><![CDATA[
<p>As with GDPR, the spirit is admirable but the fundamental definitions have been hand-waved.  So for the foreseeable future, the expensive lawyers you hire are going to answer the important questions with "well, we don't have much case law yet..."<p>Also, the definition of AI seems to exclude anything that doesn't "exhibit adaptiveness after deployment".  So, a big neural network doing racist facial recognition crime prediction isn't AI as long as it can't learn on-the-fly?  Is my naive HTTP request rate limiter "exhibiting adaptiveness" by keeping track of each customer's typical request rate in a float32?<p>Laws that regulate tech need to get into the weeds of exactly what is meant by the various terms up-front, even if that means loads of examples, clarification etc.</p>
]]></description><pubDate>Tue, 04 Feb 2025 10:52:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=42930804</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=42930804</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42930804</guid></item><item><title><![CDATA[New comment by mjw_byrne in "The GPU, not the TPM, is the root of hardware DRM"]]></title><description><![CDATA[
<p>"No one wants a preboot password though" - really? Doesn't strike me as particularly inconvenient, especially given the relative rarity of actual bootups these days.<p>I've been using bog-standard FDE for as long as I can remember. One extra password entry per bootup for almost-perfect security seems like great value to me.</p>
]]></description><pubDate>Thu, 02 Jan 2025 09:53:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=42573130</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=42573130</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42573130</guid></item><item><title><![CDATA[New comment by mjw_byrne in "Jack Elam and the Fly in 'Once Upon a Time in the West'"]]></title><description><![CDATA[
<p>Just adding another comment to say how brilliant this film is. So atmospheric, such great music, such a grand presentation of the wild west and its demise. It makes other westerns feel half-baked.</p>
]]></description><pubDate>Tue, 31 Dec 2024 00:06:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=42555022</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=42555022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42555022</guid></item><item><title><![CDATA[New comment by mjw_byrne in "Hyrum’s Law in Golang"]]></title><description><![CDATA[
<p>The map iteration order change helps to avoid breaking changes in future, by preventing reliance on any specific ordering, but when the change was made it was breaking for anything that was relying on the previous ordering behaviour.<p>IMO this is a worthwhile tradeoff.  I use Go a lot and love the strong backwards compatibility, but I would happily accept a (slightly) higher rate of breaking changes if it meant greater freedom for the Go devs to improve performance, add features etc.<p>Based on the kind of hell users of other ecosystems seem willing to tolerate (<i>cough</i> Python <i>cough</i>), I believe I am not alone in this viewpoint.</p>
]]></description><pubDate>Thu, 21 Nov 2024 14:01:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=42204369</link><dc:creator>mjw_byrne</dc:creator><comments>https://news.ycombinator.com/item?id=42204369</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42204369</guid></item></channel></rss>