Hacker News: willvarfar

New comment by willvarfar in "Zero-copy protobuf and ConnectRPC for Rust"

willvarfar — Mon, 20 Apr 2026 08:22:29 +0000

Exciting!

I have been on a similar odyssey making a 'zero copy' Java library that supports protobuf, parquet, thrift (compact) and (schema'd) json. It does allocate a long[] and break out the structure for O(1) access but doesn't create a big clump of object wrappers and strings and things; internally it just references a big pool buffer or the original byte[].
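
A minimal Python sketch of the "index into the original bytes" idea, not the Java library itself. The flat layout of three integers per field (tag, start, length) and the names `read_varint`, `index_fields` and `payload` are assumptions for illustration. It only walks top-level fields and does not validate truncated input.

```python
from array import array


def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def index_fields(buf):
    """Record (tag, start, length) for each top-level field in one int64 array.

    Assumed layout: three consecutive entries per field. No payload bytes
    are copied; the index only points back into buf.
    """
    index = array("q")
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        wire_type = tag & 7
        if wire_type == 0:        # varint
            start = pos
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            start, pos = pos, pos + 8
        elif wire_type == 2:      # length-delimited (strings, bytes, submessages)
            length, start = read_varint(buf, pos)
            pos = start + length
        elif wire_type == 5:      # fixed 32-bit
            start, pos = pos, pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        index.extend((tag, start, pos - start))
    return index


def payload(buf, index, i):
    """O(1) access to the i-th field's bytes as a view, without copying."""
    _tag, start, length = index[3 * i: 3 * i + 3]
    return memoryview(buf)[start:start + length]
```

Values are only decoded when someone asks for them, e.g. `read_varint(buf, start)` for a varint field or `payload(...).tobytes()` for a string.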

The speed demons use tail calls in Rust and C++ to eat protobuf https://blog.reverberate.org/2021/04/21/musttail-efficient-i... at 2+GB/sec. In Java I'm super pleased to be getting 4 cycles per touched byte and 500MB/sec.

Currently looking at how to merge a fast footer parser like this into the Apache Parquet Java project.

New comment by willvarfar in "The Economics of Software Teams: Why Most Engineering Orgs Are Flying Blind"

willvarfar — Mon, 13 Apr 2026 08:39:23 +0000

With a long time in the industry and seeing how so many big software companies work, this really really chimed with me. Many/most teams and projects and busy work are not actually moving the bottom line, at massive opportunity cost! And there's so little awareness that most people in squads and their managers will think they are the exception.

Whereas Whatsapp with its 30 software engineers was the exception etc.

A chat with friends showed how there are parallels between how LLMs will play out in the short-term future - say the next 5 years - and the whole MapReduce mess. Back when Hadoop came along you built operators and these operators communicated through disk. It took years, even after Spark was about, for the Hadoop userbase as a whole to realise that it is orders of magnitude more efficient to communicate through disk only when two operators cannot be colocated on the same machine, and that most operators in most pipelines can be fused together.

So for a while LLMs will be in the Hadoop phase where they are acting like junior devs and making more islands that communicate in bigger bloated codebases and then there might be a realisation in about 2030 that actually the LLMs could have been used to clean up and streamline and fuse software and approach the Whatsapp style of business impact.

New comment by willvarfar in "Challenges in join optimization"

willvarfar — Thu, 22 Jan 2026 08:03:08 +0000

Yeah it's pretty obscure, sorry.

It's called cogroup in Spark and similar architectures.

It does a group-by to convert data into the format (key_col_1, ... key_col_n) -> [(other_col_1, ... other_col_n), ...]

This is useful and ergonomic in itself for lots of use-cases. A lot of Spark and similar pipelines do this just to make things easier to manipulate.

It's also especially useful if you cogroup each side before the join, which gives you the key column and two arrays of matching rows, one for each side of the join.

A quick search says it's called "group join" in academia. I'm sure I've bumped into it under another name in other DB engines but can't remember right now.

One advantage of this is that it is bounded memory. It doesn't actually iterate over the cartesian product of non-unique keys. In fact, the whole join can be done on pointers into the sides of the join, rather than shuffling and writing the values themselves.

My understanding is that a lot of big data distributed query engines do this, at least in mixer nodes. Then the discussion becomes how late they actually expand the product - are they able to communicate the cogrouped format to the next step in the plan or must they flatten it? Etc.

(In SQL big data engines sometimes you do this optimisation explicitly e.g. doing SELECT key, ARRAY_AGG(value) FROM ... on each side before join. But things are nicer when it happens transparently under the hood and users get the speedup without the boilerplate and brittleness and fear that it is a deoptimisation when circumstances change in the future.)
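
A rough Python sketch of the cogroup-then-join shape described above, working on in-memory lists rather than a real engine. The function names and the choice of the first tuple element as the key are assumptions for illustration. Each side is grouped into lists of row indices (the "pointers"), and the product is only expanded lazily at the very end.

```python
from collections import defaultdict
from itertools import product


def group_row_indices(rows, key):
    """key -> list of row positions; the rows themselves are not copied."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[key(row)].append(i)
    return groups


def cogroup(left, right, key=lambda row: row[0]):
    """key -> (left row indices, right row indices), for keys on either side."""
    lg = group_row_indices(left, key)
    rg = group_row_indices(right, key)
    keys = list(lg) + [k for k in rg if k not in lg]
    return {k: (lg.get(k, []), rg.get(k, [])) for k in keys}


def expand_inner(left, right, grouped):
    """Flatten to inner-join rows as late as possible, one row at a time."""
    for left_idx, right_idx in grouped.values():
        for i, j in product(left_idx, right_idx):
            yield left[i] + right[j][1:]


left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("c", "z"), ("a", "y")]
grouped = cogroup(left, right)
```

The cogrouped form holds one index per input row, so its size grows with the sum of the two sides, and only `expand_inner` ever touches the per-key product.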

New comment by willvarfar in "Significant US farm losses persist, despite federal assistance"

willvarfar — Thu, 22 Jan 2026 06:58:40 +0000

This video is very liberal but does a good job of explaining which companies and industries pay for breaks and which don't. And uses soy bean farmers as a prominent example of a group who haven't been giving Trump bribes https://youtu.be/RPzcGeiNYvk?si=bfy_5KEo_ZUxOBHu

New comment by willvarfar in "Challenges in join optimization"

willvarfar — Wed, 21 Jan 2026 23:27:33 +0000

Can join cardinality be tackled with cogroup and not expanding the rows until final write?

New comment by willvarfar in "Porsche sold more electrified cars in Europe in 2025 than pure gas-powered cars"

willvarfar — Tue, 20 Jan 2026 18:27:34 +0000

everyone is expecting everyone to actually go Gripen with Rolls Royce or MECA engines?

New comment by willvarfar in "Porsche sold more electrified cars in Europe in 2025 than pure gas-powered cars"

willvarfar — Tue, 20 Jan 2026 13:36:16 +0000

Having a pending order that can be cancelled is negotiation leverage?

New comment by willvarfar in "Porsche sold more electrified cars in Europe in 2025 than pure gas-powered cars"

willvarfar — Tue, 20 Jan 2026 07:27:41 +0000

And what about all those huge pending orders for F35 in ... Denmark and Canada? Etc.

New comment by willvarfar in "40% of Kids Can't Read and Teachers Are Quitting [video]"

willvarfar — Mon, 19 Jan 2026 12:25:49 +0000

The very last clip in the video says that it is kids in affluent families taking that direction.

New comment by willvarfar in "Why DuckDB is my first choice for data processing"

willvarfar — Fri, 16 Jan 2026 20:43:25 +0000

(I work a lot with BigQuery's BigLake adaptor and it's basically caching the metadata of the iceberg manifest and parquet footers in Bigtable (this is Google) so query planning is super fast etc. Really helps)

New comment by willvarfar in "Danish Armed Forces expand their presence and continue exercises in Greenland"

willvarfar — Thu, 15 Jan 2026 13:29:27 +0000

Greenland and Denmark have always been encouraging minerals deals etc, they just haven't materialized.

New comment by willvarfar in "Ask HN: What did you find out or explore today?"

willvarfar — Thu, 15 Jan 2026 08:49:06 +0000

Seriously, this is not what big data does today. Distributed query engines don't have the primitives to zip through two tables and treat them as column groups of the same wider logical table. There's a new kid on the block called LanceDB that has some of the same features but is aiming for different use-cases. My trick retrofits vertical partitioning into mainstream data lake stuff. It's generic and works on the tech stack my company uses but would also work on all the mainstream alternative stacks. Slightly slower on AWS. But anyway. I guess HN just wants to see an industrial track paper.

New comment by willvarfar in "Ask HN: What did you find out or explore today?"

willvarfar — Thu, 15 Jan 2026 06:57:49 +0000

Specifically, I've discovered how to 'trick' mainstream cloud storage and mainstream query engines, using mainstream table formats, into reading parallel arrays that are stored outside the table, without a classic join, and treating them as new columns or schema evolution. It'll work on Spark, BigQuery etc.
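
To make "parallel arrays treated as new columns" concrete, here is a toy in-memory Python sketch. It is not the actual trick, which the comment does not describe. The table contents, the column names and the helpers `read_wide` and `rows` are all hypothetical, and the sketch simply assumes the companion array is stored in the same row order as the base table.

```python
# Hypothetical column-oriented storage: each "file" is a dict of column lists.
base_table = {"id": [10, 11, 12], "country": ["SE", "DK", "NO"]}

# Enrichment written later as a separate, narrow column group.
# Assumption: same number of rows, in the same order as base_table.
companion = {"risk_score": [0.2, 0.9, 0.4]}


def read_wide(*column_groups):
    """Present row-aligned column groups as one logical table, with no join."""
    lengths = {len(col) for group in column_groups for col in group.values()}
    if len(lengths) > 1:
        raise ValueError("column groups are not row-aligned")
    wide = {}
    for group in column_groups:
        wide.update(group)  # reuses the column lists, nothing is rewritten
    return wide


def rows(table):
    """Zip the columns positionally into row dicts."""
    names = list(table)
    return [dict(zip(names, values)) for values in zip(*table.values())]


wide = read_wide(base_table, companion)
```

Rows are matched by position rather than by key lookup, which is the property a classic join does not give you.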

New comment by willvarfar in "Ask HN: What did you find out or explore today?"

willvarfar — Thu, 15 Jan 2026 06:55:25 +0000

crazy to think that soon not being able to successfully complete the captcha will be a signal that the user is human

New comment by willvarfar in "Ask HN: What did you find out or explore today?"

willvarfar — Thu, 15 Jan 2026 06:39:56 +0000

I had a great euphoric epiphany feeling today. Doesn't come along too often, will celebrate with a nice glass of wine :)

Am doing data engineering for some big data (yeah, big enough) and thinking about efficiency of data enrichment. There's this classic trilemma with data enrichment where you can have good write efficiency, good read efficiency and/or good storage cost, pick two.

E.g. you have a 1TB table and you want to add a column that, say, will take 1GB to store.

You can create a new table that is about 1.001TB and then delete the old table, but this is both write-inefficient and often breaks how normal data lake orchestration works.

You can create a new wide table that is about 1.001TB and keep it alongside the old table, but this is both write-inefficient and expensive to store.

You can create a narrow companion table that has just a join key and 1GB of data. This is efficient to write and store, but inefficient to query when you force all users to do joins on read.
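
The three classic options can be put side by side with some back-of-envelope arithmetic in Python. This takes the 1TB and 1GB figures literally (so a rewritten table is 1TB + 1GB) and uses decimal units. The size of the join key column, `KEY`, is a made-up number since the comment gives none.

```python
TB, GB = 10**12, 10**9

BASE = 1 * TB    # existing table
EXTRA = 1 * GB   # the new column
KEY = 8 * GB     # hypothetical size of the join key column (assumption)

options = {
    "rewrite, delete old": {
        "written": BASE + EXTRA, "stored": BASE + EXTRA, "join_on_read": False,
    },
    "wide copy alongside old": {
        "written": BASE + EXTRA, "stored": 2 * BASE + EXTRA, "join_on_read": False,
    },
    "narrow companion table": {
        "written": KEY + EXTRA, "stored": BASE + KEY + EXTRA, "join_on_read": True,
    },
}

for name, cost in options.items():
    print(
        f"{name:24} written={cost['written'] / GB:8.0f}GB "
        f"stored={cost['stored'] / GB:8.0f}GB join_on_read={cost['join_on_read']}"
    )
```

Each option gives up one corner: the first two rewrite the full terabyte, the second also doubles storage, and the third pushes a join onto every reader.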

And I've come up with a cunning fourth way where you write a narrow table and read a wide table so it's literally best of all worlds! Kinda staggering :) Still on a high.

Might actually be a conference paper, which is new territory for me. Let's see :)

/off dancing

New comment by willvarfar in "Network of Scottish X accounts go dark amid Iran blackout"

willvarfar — Tue, 13 Jan 2026 13:30:46 +0000

I agree that social media is a net negative, but want to also point out that before social media it was the mainstream press and TV that had been shaping society for decades. Things like buying a used car from Nixon or fighting in Vietnam etc are all mainstream press impact.

New comment by willvarfar in "Why is there a tiny hole in the airplane window? (2023)"

willvarfar — Fri, 09 Jan 2026 10:40:15 +0000

I've always noticed and wondered, so I guess it's easy to overlook but it's there.

New comment by willvarfar in "Anthropic blocks third-party use of Claude Code subscriptions"

willvarfar — Fri, 09 Jan 2026 08:01:22 +0000

Presumably there will soon be banner ads in Claude Code then?

New comment by willvarfar in "I program without syntax highlighting"

willvarfar — Thu, 08 Jan 2026 14:23:50 +0000

I remember when syntax highlighting was introduced in Borland's Turbo Pascal editor (on DOS). It was a very major usability improvement and put TP's IDE at the forefront of getting things done. Fond memories :)

New comment by willvarfar in "Lessons from Hash Table Merging"

willvarfar — Thu, 08 Jan 2026 09:17:11 +0000

Kudos, neat digging and writeup that makes us think :)

If you merge linear-probed tables by iterating in sorted hash order then you are matching the storage order and can congest particular parts of the table and cause linear probing's worst-case behaviour.

By changing the iteration order, or salting the hash, you can avoid this.

Of course chained hash tables don't suffer from this particular problem.

My quick thought is that hash tables ought to keep an internal salt hidden away. This seems good to avoid 'attacks' as well as speeding up merging etc. The only downside I can think of is that the creation of the table needs to fetch a random salt that might not be quick, although that can be alleviated by allowing it to be set externally at table creation so people who don't care can set it to 0 or whatever. What am I missing?
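
A toy Python simulation of the congestion and of the salt idea, under several assumptions that are mine rather than the article's: a murmur3-style mixer stands in for the hash, the home slot is taken from the high bits so storage order roughly follows hash order, the table doubles at half load, and `ProbeSet` and `merge_probes` are made-up names. It counts how many occupied slots get stepped over while merging one table into a fresh one.

```python
import random
import secrets


def mix32(x):
    """murmur3-style 32-bit finalizer, used here as a stand-in hash."""
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x


class ProbeSet:
    """Linear-probing set of 32-bit ints with a per-table salt."""

    def __init__(self, salt=None):
        # Random salt by default; callers who don't care can pass 0.
        self.salt = secrets.randbits(32) if salt is None else salt
        self.slots = [None] * 8
        self.size = 0
        self.probes = 0  # occupied slots stepped over during inserts

    def _home(self, key):
        # High bits pick the slot, so storage order follows hash order.
        return (mix32(key ^ self.salt) * len(self.slots)) >> 32

    def _place(self, key):
        i = self._home(key)
        while self.slots[i] is not None:
            if self.slots[i] == key:
                return False
            self.probes += 1
            i = (i + 1) % len(self.slots)
        self.slots[i] = key
        return True

    def add(self, key):
        if (self.size + 1) * 2 > len(self.slots):  # keep load at or below 1/2
            old, self.slots = self.slots, [None] * (2 * len(self.slots))
            for k in old:
                if k is not None:
                    self._place(k)
        if self._place(key):
            self.size += 1

    def __contains__(self, key):
        i = self._home(key)
        while self.slots[i] is not None:
            if self.slots[i] == key:
                return True
            i = (i + 1) % len(self.slots)
        return False

    def __iter__(self):
        # Storage order, i.e. roughly sorted by this table's salted hash.
        return (k for k in self.slots if k is not None)


def merge_probes(dest_salt, n=2000, seed=1):
    """Merge a salt-0 table into a fresh table; return the probe count."""
    keys = random.Random(seed).sample(range(2**32), n)
    source = ProbeSet(salt=0)
    for k in keys:
        source.add(k)
    dest = ProbeSet(salt=dest_salt)
    for k in source:  # iterate the source in its storage order
        dest.add(k)
    assert all(k in dest for k in keys)
    return dest.probes
```

Comparing `merge_probes(0)` (destination shares the source's hash) with `merge_probes(0x9E3779B9)` (destination has its own salt) shows the effect: with the same hash, the early keys all land at the front of the still-small destination and pile up into one long run.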