<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: mrlongroots</title><link>https://news.ycombinator.com/user?id=mrlongroots</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 25 Apr 2026 09:05:27 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=mrlongroots" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by mrlongroots in "Our eighth generation TPUs: two chips for the agentic era"]]></title><description><![CDATA[
<p>That training is compute-bound and inference is memory-bound is well-known, but I don't think Nvidia deployments typically specialize for one vs. the other.<p>One reason is that most clouds/neoclouds don't own workloads and want fungibility. Given that you're spending a lot on H200s and whatnot, it's good to also spend on the networking to make sure you can sell them to all kinds of customers. That said, Nvidia's Rubin CPX in the Vera Rubin platform is an inference-specific accelerator, and Groq and Cerebras are inference-optimized too, so specialization is starting to happen.</p>
]]></description><pubDate>Wed, 22 Apr 2026 14:43:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=47864386</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=47864386</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47864386</guid></item><item><title><![CDATA[New comment by mrlongroots in "What category theory teaches us about dataframes"]]></title><description><![CDATA[
<p>MapReduce is nice, but for one, it doesn't by itself help you reason about pushdowns. Parquet, for example, can push down select/project/filter, and that's lost if all you have is MapReduce. And a reduce is just a shuffle + map, not very different from a distributed join. MapReduce as an escape hatch over what is fundamentally still relational algebra may be a good intuition.
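<p>To make the pushdown point concrete, here's a rough sketch I'm adding (not from the original comment; the file and column names are invented) of handing select/project/filter to the scan with Arrow's C++ dataset API, so the Parquet reader can skip row groups and columns instead of materializing everything:<p><pre><code>#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

// Sketch: filter and projection are declared to the scanner, so they travel
// down to the Parquet reader instead of running over materialized rows.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWithPushdown() {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      arrow::dataset::FileSystemDatasetFactory::Make(
          fs, {"events.parquet"}, format,
          arrow::dataset::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Filter pushdown: row groups whose stats rule out the predicate are skipped.
  ARROW_RETURN_NOT_OK(builder->Filter(arrow::compute::greater(
      arrow::compute::field_ref("ts"), arrow::compute::literal(1700000000))));
  // Projection pushdown: only these columns are decoded.
  ARROW_RETURN_NOT_OK(builder->Project({"ts", "user_id"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}
</code></pre><p>A map over an opaque user function can't be inspected this way, which is exactly why the select/project/filter information is lost.</p>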
]]></description><pubDate>Fri, 03 Apr 2026 14:44:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=47627244</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=47627244</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47627244</guid></item><item><title><![CDATA[New comment by mrlongroots in "What Category Theory Teaches Us About DataFrames"]]></title><description><![CDATA[
<p>Algebras are also nice for implementations. If you can decompose a domain into a few algebraic primitives, you can write nice SIMD/CUDA kernels for those primitives.<p>To your point, I wonder if the 73 distinct transforms were just different defaults/usability wrappers over these. You may also get into situations where kernels can be fused, or batching constraints enable optimizations, in ways that clean algebraic primitives don't capture. But that's just systems: theory is useful in helping rethink API bloat and keeping us all honest.
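<p>A toy illustration of the fusion caveat (my own sum-of-squares example, not the article's): two clean primitives make two passes over memory, while the fused form makes one, which is the kind of win the algebra alone doesn't express:<p><pre><code>#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Two clean primitives: map into a temporary, then reduce. Two memory passes.
double sum_of_squares_composed(const std::vector<double>& xs) {
  std::vector<double> sq(xs.size());
  std::transform(xs.begin(), xs.end(), sq.begin(),
                 [](double x) { return x * x; });
  return std::accumulate(sq.begin(), sq.end(), 0.0);
}

// The fused kernel: one pass, no temporary, same algebraic meaning.
double sum_of_squares_fused(const std::vector<double>& xs) {
  return std::transform_reduce(xs.begin(), xs.end(), 0.0, std::plus<>{},
                               [](double x) { return x * x; });
}
</code></pre></p>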
]]></description><pubDate>Fri, 03 Apr 2026 14:40:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=47627195</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=47627195</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47627195</guid></item><item><title><![CDATA[New comment by mrlongroots in "The Waymo World Model"]]></title><description><![CDATA[
<p>Yes, GPT5-series thinking models are extremely pedantic and tedious. Any conversation with them gets derailed because they start nitpicking something random.<p>But Codex/5.2 was substantially more effective than Claude at debugging complex C++ bugs until around Fall, when I was writing a lot more code.<p>I find Gemini 3 useless. It has regressed on hallucinations from Gemini 2.5, to the point where its output is no better than a random token stream despite all its benchmark outperformance. I would use Gemini 2.5 to help write papers and such, but I can't seem to use Gemini 3 for anything. Gemini CLI is also very non-compliant and erratic.</p>
]]></description><pubDate>Fri, 06 Feb 2026 20:08:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=46917518</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=46917518</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46917518</guid></item><item><title><![CDATA[New comment by mrlongroots in "Replacing Protobuf with Rust"]]></title><description><![CDATA[
<p>While Arrow is amazing, it is only the C Data Interface that can be FFI'ed, and that is pretty low-level. If you have something higher-level, like a table or a vector of record batches, you have to write quite a bit of FFI glue yourself. It is still performant, because only a tiny amount of metadata crosses the boundary, but it can be tedious.<p>The reason is ABI compatibility. Reasoning about ABI compatibility across different C++ versions, optimization levels, and architectures can be a nightmare, let alone across programming languages.<p>It works at all for Arrow because the leaves of the data model are large contiguous columnar arrays, so reconstructing the higher layers still gets you a lot of value. The other domains where this works are tensors (DLPack) and scientific arrays (Zarr etc.). For arbitrary struct layouts across languages/architectures/versions, serdes is way more reliable than a universal ABI.
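<p>A sketch of the glue in question (my own function name; ExportRecordBatch/ImportRecordBatch are the real Arrow bridge calls): the C Data Interface moves one record batch at a time through two plain C structs, so anything above the batch level is a loop you write yourself:<p><pre><code>#include <arrow/api.h>
#include <arrow/c/bridge.h>  // the C Data Interface bridge

#include <memory>
#include <vector>

// Sketch: shipping a vector of record batches means hand-rolling the loop;
// the interface itself only understands one (schema, array) pair at a time.
arrow::Status ShipAcrossFfi(
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches) {
  for (const auto& batch : batches) {
    ArrowArray c_array;    // ABI-stable struct: buffer pointers + release callback
    ArrowSchema c_schema;  // ABI-stable struct: type metadata
    ARROW_RETURN_NOT_OK(arrow::ExportRecordBatch(*batch, &c_array, &c_schema));
    // Hand &c_array/&c_schema to the foreign runtime here; it reconstructs
    // the batch and invokes the release callbacks when done. Re-importing in
    // place just demonstrates the round trip:
    ARROW_ASSIGN_OR_RAISE(auto roundtrip,
                          arrow::ImportRecordBatch(&c_array, &c_schema));
    (void)roundtrip;
  }
  return arrow::Status::OK();
}
</code></pre></p>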
]]></description><pubDate>Fri, 23 Jan 2026 16:43:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=46734604</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=46734604</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46734604</guid></item><item><title><![CDATA[New comment by mrlongroots in "AWS Trainium3 Deep Dive – A Potential Challenger Approaching"]]></title><description><![CDATA[
<p>Hyperscalers do not need to achieve parity with Nvidia. There's (let's say) 50% headroom in terms of profit margins, and plenty of headroom in terms of complexity: custom chip efforts don't need the generality of Nvidia's chips. If a simple architecture lets them do inference at 50% of the TCO and 1/5th the complexity, and cuts their Nvidia bill by 70%, that's a solid win. I'm being fast and loose with the numbers, and Trainium clearly has ambitions beyond inference, but given the hundreds of billions each cloud vendor is investing in the AI buildout, a couple billion on IP that you will own afterwards is a no-brainer. Nvidia has good products and a solid head start, but they're not unassailable.</p>
]]></description><pubDate>Tue, 09 Dec 2025 18:19:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=46208443</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=46208443</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46208443</guid></item><item><title><![CDATA[New comment by mrlongroots in "The C++ standard for the F-35 Fighter Jet [video]"]]></title><description><![CDATA[
<p>Yeah, unfortunately no amount of manoeuvring is a substitute for a kill chain where a distributed web of sensors, relays, and weapon carriers can result in an AAM being dispatched from any direction at lightspeed.</p>
]]></description><pubDate>Mon, 08 Dec 2025 04:53:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=46188506</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=46188506</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46188506</guid></item><item><title><![CDATA[New comment by mrlongroots in "650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark"]]></title><description><![CDATA[
<p>The appropriate comparison point for aggregate cluster storage bandwidth would be its bisection bandwidth.<p>(I do HPC; IIRC ANL Aurora has < 1 PB/s of DAOS storage bandwidth vs. ~20 PB/s of bisection bandwidth.)</p>
]]></description><pubDate>Fri, 14 Nov 2025 16:30:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=45928485</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45928485</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45928485</guid></item><item><title><![CDATA[New comment by mrlongroots in "650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark"]]></title><description><![CDATA[
<p>I'm talking about cluster-scale network bisection bandwidth vs. attached storage bandwidth. Even with replication/erasure-coding overhead and the economics, the order-of-magnitude difference still prevails.<p>I think your point is a good one in that this is more economics than systems physics: we size clusters with more compute/network than storage because that is the design point that maximizes overall utility.<p>It also raises an interesting question: if we ever reach a point where the disparity no longer holds, that would justify a complete rethinking of the many Spark-like applications that are designed to exploit this asymmetry.</p>
]]></description><pubDate>Fri, 14 Nov 2025 16:28:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=45928442</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45928442</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45928442</guid></item><item><title><![CDATA[New comment by mrlongroots in "650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark"]]></title><description><![CDATA[
<p>Yep, I think the value of the experiment is not clear.<p>You want Spark for a large dataset with multiple stages: it pools memory across the cluster and uses cluster-internal network bandwidth for shuffles instead of going back to storage. In this benchmark, I/O bandwidth from S3 is about 1 GB/s, while CPU memory bandwidth is 100-200 GB/s, which is what matters for a multi-stage job.<p>Maybe with S3 as your backend the storage bandwidth bottleneck doesn't show up in perf, but it sure shows up in the bill. A crude rule of thumb: network bandwidth is 20x storage, main memory bandwidth is 20x network, and accelerator/GPU memory is 10x CPU memory. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
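<p>The rule of thumb as back-of-the-envelope arithmetic (a sketch; the absolute numbers are assumptions, only the ratios are the point):<p><pre><code>#include <cstdio>

int main() {
  const double dataset_gb = 650;         // the benchmark's working set
  const double s3_gbs = 1;               // observed S3 bandwidth (assumed)
  const double net_gbs = 20 * s3_gbs;    // network ~ 20x storage
  const double dram_gbs = 20 * net_gbs;  // main memory ~ 20x network
  const double hbm_gbs = 10 * dram_gbs;  // GPU memory ~ 10x CPU memory

  std::printf("one pass over %.0f GB: S3 %.0fs, net %.1fs, DRAM %.2fs, HBM %.3fs\n",
              dataset_gb, dataset_gb / s3_gbs, dataset_gb / net_gbs,
              dataset_gb / dram_gbs, dataset_gb / hbm_gbs);
  // A multi-stage job makes several passes over intermediates; whether those
  // passes hit S3 or stay in pooled memory/network dominates perf and the bill.
}
</code></pre></p>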
]]></description><pubDate>Fri, 14 Nov 2025 03:25:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=45923510</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45923510</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45923510</guid></item><item><title><![CDATA[New comment by mrlongroots in "Ticker: Don't die of heart disease"]]></title><description><![CDATA[
<p>> LDL-C is much much cheaper to measure. ApoB costs 36x times as much, so Insurance Companies don't like to pay for it<p>Unfortunately, American retail prices might as well be generated by a PRNG, and do not mean much.<p>On Ulta Lab Tests, a basic lipid panel and an ApoB test are $22 and $36 respectively. Looking at Indian lab prices (approx. INR->USD), both are under $10 there.<p><a href="https://www.ultalabtests.com/test/cholesterol-and-lipids-test" rel="nofollow">https://www.ultalabtests.com/test/cholesterol-and-lipids-tes...</a>
<a href="https://www.ultalabtests.com/test/cardio-iq-apolipoprotein-b-test" rel="nofollow">https://www.ultalabtests.com/test/cardio-iq-apolipoprotein-b...</a></p>
]]></description><pubDate>Sun, 09 Nov 2025 01:38:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=45862093</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45862093</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45862093</guid></item><item><title><![CDATA[New comment by mrlongroots in "Ticker: Don't die of heart disease"]]></title><description><![CDATA[
<p>Maybe 80-90% of people should take doctors at face value, but it is easy, and only getting easier thanks to LLMs, to access the knowledge to better advocate for your own healthcare, with better outcomes. Of course, this requires doctors who respect your ability to provide useful input, which in your case did not happen.<p>My advice would be to "shop around" for doctors: establish a relationship where you demonstrate openness to what they say, try not to step on their toes unnecessarily, but also bring your own data and arguments. Some of the most life-changing interventions in my own healthcare have come from my own initiative and stubbornness, but I have doctors who humor me and respect my input. Credentials/vibes help here, I think: in my case, "the PhD student from the brand-name school across the street who shows up with plots and regressions" is probably a soft signal that I mean business.</p>
]]></description><pubDate>Sun, 09 Nov 2025 01:30:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=45862052</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45862052</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45862052</guid></item><item><title><![CDATA[New comment by mrlongroots in "C++ move semantics from scratch (2022)"]]></title><description><![CDATA[
<p>Same, I don't understand the complaints against modern C++. A lambda, used for things like comparators, is much simpler than a struct with an overloaded operator defined somewhere else.<p>My only complaint is the verbosity: things like `std::chrono::nanoseconds` break even simple statements into multiple lines, and you're tempted to just use uint64_t instead. And `std::thread` is fine, but if you want to name your thread you still need to grab the underlying handle and call `pthread_setname_np`. It's hard work pulling off everything C++ tries to pull off.
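<p>For instance, a small sketch of all three points (the thread name and numbers are arbitrary, and pthread_setname_np is Linux-specific):<p><pre><code>#include <pthread.h>

#include <algorithm>
#include <chrono>
#include <thread>
#include <vector>

int main() {
  // A lambda comparator: no out-of-line struct with an overloaded operator().
  std::vector<int> v{3, 1, 2};
  std::sort(v.begin(), v.end(), [](int a, int b) { return a > b; });

  // Type-safe but verbose; the uint64_t temptation is real.
  std::chrono::nanoseconds timeout = std::chrono::milliseconds(5);
  (void)timeout;

  // Naming a thread still means dropping down to the native handle.
  std::thread t([] { /* work */ });
  pthread_setname_np(t.native_handle(), "worker");  // max 15 chars + NUL on Linux
  t.join();
}
</code></pre></p>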
]]></description><pubDate>Sat, 08 Nov 2025 20:46:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=45859821</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45859821</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45859821</guid></item><item><title><![CDATA[New comment by mrlongroots in "Falcon: A Reliable, Low Latency Hardware Transport"]]></title><description><![CDATA[
<p>>  Getting 200 Gb/s of reliable in-order bytestream per core over a unreliable, out-of-order packet-switched network using standard ethernet is not very hard with proper protocol design.<p>You also suggested that this can be done with a single CPU core. It seems to me that this proposal involves custom APIs (not sockets), and even if it is viable on a single core in the common case, it would blow up during loss/recovery/retransmission events. Falcon provides a mostly lossless fabric with loss, retransmits, and recovery handled by the fabric: the host CPU never touches those tail cases.<p>Ultimately there are two APIs for networks: sockets and verbs. The former is great for simplicity, compatibility, and portability; the latter is the standard for when you are willing to break compatibility for performance.</p>
]]></description><pubDate>Wed, 29 Oct 2025 08:27:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=45744155</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45744155</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45744155</guid></item><item><title><![CDATA[New comment by mrlongroots in "Falcon: A Reliable, Low Latency Hardware Transport"]]></title><description><![CDATA[
<p>> Is this just a cost efficiency thing?<p>Not entirely, but even that would be a justifiable reason. Tail behavior of all sorts matters a lot; sophisticated congestion control and load balancing matter a lot. ML training is all about massive collectives: a single tail-latency event in a NCCL collective means every GPU in that group idles until the last one makes it.<p>>  It only takes like 1 core to terminate 200 Gb/s of reliable bytestream using a software protocol with no hardware offload over regular old 1500-byte MTU ethernet.<p>The conventional TCP/IP stack is a lot more than ~25 GB/s of memcpys at 200 GbE: there's a DMA into kernel buffers and then a copy into user memory, there are syscalls and interrupts and back-and-forth, there's segmentation, checksums, reassembly, and retransmits, and overall a lot more work. RDMA eliminates all of that.<p>> all you need is a parallel hardware crypto accelerator
> all you need is a hardware copy/DMA engine<p>And when you add these and all the other requirements, you get a modern RDMA network :).<p>The network is what kicks in when Moore's law recedes. Jensen Huang wants you to pretend that your 10,000 GPUs are one massive GPU: that only works if you have NVLink/InfiniBand or something in that league, and even then barely. And GOOG/MSFT/AMZN are too big, and the datacenter fabric too precious, for it to be outsourced.
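<p>As a rough sanity check on the copy cost alone (a crude single-core microbenchmark I'm sketching here; buffer size and iteration count are arbitrary), before you even add the syscalls, interrupts, and reassembly:<p><pre><code>#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
  constexpr size_t kBufBytes = 64 << 20;  // 64 MiB, larger than most LLCs
  constexpr int kIters = 100;
  std::vector<char> src(kBufBytes, 1), dst(kBufBytes, 0);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; i++) {
    std::memcpy(dst.data(), src.data(), kBufBytes);  // the one copy TCP adds
  }
  auto elapsed = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();

  double gbps = kBufBytes * double(kIters) / elapsed / 1e9;
  std::printf("memcpy: %.1f GB/s on one core\n", gbps);
}
</code></pre></p>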
]]></description><pubDate>Wed, 29 Oct 2025 07:54:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=45743912</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45743912</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45743912</guid></item><item><title><![CDATA[New comment by mrlongroots in "Let's Help NetBSD Cross the Finish Line Before 2025 Ends"]]></title><description><![CDATA[
<p>Yes, unfortunately even the best-intentioned individuals have very limited ability to make meaningful carbon-minimizing decisions. A carbon tax is such a sensible solution!</p>
]]></description><pubDate>Sun, 26 Oct 2025 19:34:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=45714590</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45714590</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45714590</guid></item><item><title><![CDATA[New comment by mrlongroots in "Designing a Low Latency 10G Ethernet Core (2023)"]]></title><description><![CDATA[
<p>The other funny bit is that one-way PCIe latency is 250 ns-ish (don't quote me on the exact number), which imposes a hard ~1 us floor on request/response latency between two hosts: a round trip crosses PCIe four times (CPU to NIC and NIC to CPU, on each side), before the NICs and the wire add anything.
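<p>The arithmetic, with the 250 ns figure taken as an assumption:<p><pre><code>#include <cstdio>

int main() {
  constexpr double kPcieOneWayNs = 250.0;  // assumed per-crossing cost
  // A request/response crosses PCIe four times:
  // A: CPU->NIC, B: NIC->CPU, B: CPU->NIC, A: NIC->CPU.
  constexpr double kRttFloorNs = 4 * kPcieOneWayNs;
  std::printf("PCIe-imposed RTT floor: %.0f ns\n", kRttFloorNs);
}
</code></pre></p>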
]]></description><pubDate>Thu, 09 Oct 2025 06:41:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=45524304</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45524304</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45524304</guid></item><item><title><![CDATA[New comment by mrlongroots in "LLM Observability in the Wild – Why OpenTelemetry Should Be the Standard"]]></title><description><![CDATA[
<p>I think standard relational databases/schemas are underrated for when you need richness.<p>OTel, or anything in that domain, is fine when you have a distributed callgraph, which inference with tool calls does have. If that doesn't fit, I think the fallback layer is just, say, ClickHouse.</p>
]]></description><pubDate>Sat, 27 Sep 2025 21:58:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=45399684</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45399684</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45399684</guid></item><item><title><![CDATA[New comment by mrlongroots in "How Palantir is mapping the nation’s data"]]></title><description><![CDATA[
<p>Our social contract grants individuals freedom and governments the monopoly on violence to police the limits of that freedom.<p>If the acts of law-abiding individuals (or groups) are a net negative for society, that is not an individual failure. Fiduciary responsibility is a useful parallel: it is not the job of a sugar manufacturer to think about the public-health aspects of sugar. Their responsibility to their shareholders is to produce clean, safe, and edible sugar at competitive prices and do a good job with marketing and distribution, that's all.</p>
]]></description><pubDate>Fri, 12 Sep 2025 21:45:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=45227136</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45227136</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45227136</guid></item><item><title><![CDATA[New comment by mrlongroots in "How Palantir is mapping the nation’s data"]]></title><description><![CDATA[
<p>If a selection mechanism is orthogonal to a property, it seems weird to argue that the selected subset is distributed differently along that axis than the broader population.</p>
]]></description><pubDate>Fri, 12 Sep 2025 00:27:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=45217489</link><dc:creator>mrlongroots</dc:creator><comments>https://news.ycombinator.com/item?id=45217489</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45217489</guid></item></channel></rss>