<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: eklitzke</title><link>https://news.ycombinator.com/user?id=eklitzke</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 23 Apr 2026 17:09:18 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=eklitzke" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by eklitzke in "Our eighth generation TPUs: two chips for the agentic era"]]></title><description><![CDATA[
<p>I can't speak to what every team at Google does, but there are machines with Nvidia GPUs in Borg. However, Google charges orgs internally for CPU/memory/GPU/TPU usage, and TPUs are *way* more efficient in terms of FLOPS/$ than Nvidia GPUs, so there is a *huge* incentive for teams to use TPUs if they can, especially teams operating large products.</p>
]]></description><pubDate>Wed, 22 Apr 2026 21:11:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=47869341</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=47869341</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47869341</guid></item><item><title><![CDATA[New comment by eklitzke in "Zero-copy protobuf and ConnectRPC for Rust"]]></title><description><![CDATA[
<p>This is true, but the relative overhead is highly dependent on the structure of one's protobuf schema. For example, fixed integer fields don't need to be decoded (including repeated fixed ints), and the main idea of the "zero copy" here is avoiding copies of string and bytes fields. If your protobufs are mostly varints then yes, they all have to be decoded; if your protobufs contain a lot of string/bytes data, then most of the decoding overhead could be memory copies of that data rather than varint decoding.<p>For some message schemas, even though this isn't truly zero copy, it may be close to it in terms of actual overhead and CPU time; for others it doesn't help at all.</p>
]]></description><pubDate>Mon, 20 Apr 2026 16:53:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=47837042</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=47837042</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47837042</guid></item><item><title><![CDATA[New comment by eklitzke in "Show HN: Gemini Pro 3 imagines the HN front page 10 years from now"]]></title><description><![CDATA[
<p>Pretty much all of the history of HN front pages, posts, and comments is surely in the Gemini training corpus. It therefore seems entirely plausible that Gemini would understand HN inside jokes or sentiment beyond what's literally on the front page given in the prompt, especially since the prompt specifically stated that this is the HN front page.</p>
]]></description><pubDate>Tue, 09 Dec 2025 21:02:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=46210605</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=46210605</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46210605</guid></item><item><title><![CDATA[Cosmological Lithium Problem]]></title><description><![CDATA[
<p>Article URL: <a href="https://en.wikipedia.org/wiki/Cosmological_lithium_problem">https://en.wikipedia.org/wiki/Cosmological_lithium_problem</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46154547">https://news.ycombinator.com/item?id=46154547</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 04 Dec 2025 23:12:59 +0000</pubDate><link>https://en.wikipedia.org/wiki/Cosmological_lithium_problem</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=46154547</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46154547</guid></item><item><title><![CDATA[x86 architecture 1 byte opcodes]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.sandpile.org/x86/opc_1.htm">https://www.sandpile.org/x86/opc_1.htm</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45774724">https://news.ycombinator.com/item?id=45774724</a></p>
<p>Points: 90</p>
<p># Comments: 31</p>
]]></description><pubDate>Fri, 31 Oct 2025 17:49:16 +0000</pubDate><link>https://www.sandpile.org/x86/opc_1.htm</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=45774724</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45774724</guid></item><item><title><![CDATA[New comment by eklitzke in "A years-long Turkish alphabet bug in the Kotlin compiler"]]></title><description><![CDATA[
<p>I think it's important to point out the distinction between what POSIX mandates and what actual libc implementations, notably glibc, do. Nearly all non-reentrant POSIX functions are only truly non-reentrant if you are using a 1980s computer that for some reason has threads but doesn't have thread-local storage. Functions like strerror are implemented using TLS in glibc nowadays, so while it is technically true that you need the _r versions to be portable to computers nobody has used in 30 years, in practice you usually don't need to worry about this, especially on Linux, since the results are stored in static thread-local memory rather than static global memory.<p>As for the string.h stuff, while it is all terrible, it's at least well documented that everything is broken unless you use wchar_t, and nobody uses wchar_t because it's the worst possible localization solution. No one is seriously trying to do real localization in C (and if they were, they'd be using libicu).</p>
]]></description><pubDate>Mon, 13 Oct 2025 19:07:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=45572132</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=45572132</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45572132</guid></item><item><title><![CDATA[New comment by eklitzke in "Memory access is O(N^[1/3])"]]></title><description><![CDATA[
<p>NUMA has a huge amount of overhead (e.g. intercore latency), and NUMA server CPUs cost a lot more than single-socket boards. If you look at the servers at Google or Facebook, they will have some NUMA servers for the workloads that actually need them, but most servers will be single socket because they're cheaper and applications literally run faster on them. It's a win-win if you can fit your workload on a single-socket server, so there is a lot of motivation to make applications work in a non-NUMA way if at all possible.</p>
]]></description><pubDate>Thu, 09 Oct 2025 04:28:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=45523506</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=45523506</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45523506</guid></item><item><title><![CDATA[New comment by eklitzke in "RIP pthread_cancel"]]></title><description><![CDATA[
<p>A few reasons, I think.<p>The first is that getaddrinfo is specified by POSIX, and POSIX evolves very conservatively and at a glacial pace.<p>The second reason is that specifying a timeout breaks symmetry with a lot of other functions in Unix/C, both system calls and libc calls. For example, you can't specify a timeout when opening, reading from, or closing a file, which are all potentially blocking operations. There are ways to do these things in a non-blocking manner with timeouts using aio or io_uring, but those are already relatively complicated APIs compared to the simple blocking system calls, and getaddrinfo is much more complicated.<p>The last reason is that if you use the sockets APIs directly, it's not that hard to write a non-blocking DNS resolver (c-ares is one example). The catch is that if you write your own resolver you have to work out caching, it won't integrate with NSS on Linux, etc. You can implement these things (systemd-resolved does, and works with NSS), but they are a lot of work to do properly.</p>
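For reference, here is the standard blocking call (resolving "localhost", which should work without a network). Note there is no timeout parameter anywhere in the signature: the call blocks until the resolver succeeds, fails, or gives up on its own internal schedule.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* IPv4 only, for a predictable result */
    hints.ai_socktype = SOCK_STREAM;

    /* Blocks with no caller-controlled timeout. */
    int rc = getaddrinfo("localhost", "80", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }
    char addr[INET_ADDRSTRLEN];
    struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
    inet_ntop(AF_INET, &sin->sin_addr, addr, sizeof addr);
    printf("resolved localhost -> %s\n", addr);
    freeaddrinfo(res);
    return 0;
}
```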
]]></description><pubDate>Sat, 13 Sep 2025 21:56:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=45235651</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=45235651</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45235651</guid></item><item><title><![CDATA[New comment by eklitzke in "%CPU utilization is a lie"]]></title><description><![CDATA[
<p>I agree. If you actually know what you're doing, you can use perf and/or ftrace to get highly detailed processor metrics over short periods of time, and you can see the effects of things like CPU stalls from cache misses, CPU stalls from memory accesses, scheduler effects, and many other things. But most of these metrics are not very actionable anyway (the vast majority of people won't know what to do with their IPC, cache hit, or branch prediction numbers).<p>What most people care about is some combination of latency and utilization. As a very rough rule of thumb, for many workloads you can reach about 80% CPU utilization before you start seeing serious impacts on workload latency. Beyond that you can increase utilization further, but workload latency starts to suffer from all of the effects you mentioned.<p>To know how much latency is impacted by utilization, you need to measure your specific workload. How much you care about latency also depends on what you're doing. In many cases people care much more about throughput than latency; if that's the top metric, optimize for it. If you care about application latency as well as throughput, you need to measure both and decide what tradeoffs are acceptable.</p>
]]></description><pubDate>Wed, 03 Sep 2025 03:12:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=45111906</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=45111906</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45111906</guid></item><item><title><![CDATA[New comment by eklitzke in "John Carmack's arguments against building a custom XR OS at Meta"]]></title><description><![CDATA[
<p>Writing drivers is easy; getting vendors to write *correct* drivers is difficult. At work right now we are working with a Chinese OEM on a custom Wifi board whose chipset, firmware, and drivers are supplied by the vendor. It's not a new wifi chipset; they've used it in other products for years without issues. In conditions that are difficult to reproduce, the chipset sometimes gets "stuck" and essentially stops responding or doing any wifi work. This appears to be a firmware problem, because unloading and reloading the kernel module doesn't fix the issue. We've supplied loads of pcap dumps to the vendor, but they're mostly useless because (a) pcap can only capture what the kernel sees, not what the wifi chipset sees, (b) it's infeasible for the wifi chipset to log all of its internal state, and (c) even if all that were possible, trying to debug the driver just by reading gigabytes of low-level protocol dumps would be impossible.<p>Realistically, for the OEM to debug the issue they need a reliable repro, which we don't have for them, so we're kind of stuck.<p>This type of problem generalizes to the development of drivers and firmware for many complex pieces of modern hardware.</p>
]]></description><pubDate>Fri, 29 Aug 2025 23:11:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=45070444</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=45070444</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45070444</guid></item><item><title><![CDATA[New comment by eklitzke in "Physics of badminton's new killer spin serve"]]></title><description><![CDATA[
<p>From what I could tell from the article and the linked videos the innovation here is that it essentially lets you serve the shuttlecock while it's facing the wrong direction. Normally even if the shuttlecock has spin when it crosses the court it will move with the cork side forward, at least by the time it crosses the net. Hence I don't think this technique would be applicable to other sports that use a ball.</p>
]]></description><pubDate>Sun, 24 Aug 2025 02:56:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=45000922</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=45000922</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45000922</guid></item><item><title><![CDATA[New comment by eklitzke in "Fuse is 95% cheaper and 10x faster than NFS"]]></title><description><![CDATA[
<p>Yeah, if you were really trying to make things fast you'd have the compute and the NFS server in the same rack, connected this way. But you aren't going to get that from any cloud provider.<p>For read-only data (the original article is about serving model weights) you can also use iscsi. This is how packages/binaries are served to nearly all Borg hosts at Google (most Borg hosts don't have any local disk whatsoever; when they need to run a given binary they mount the software image using iscsi and then, I believe, mlock nearly all of the ELF sections).</p>
]]></description><pubDate>Wed, 13 Aug 2025 21:07:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=44893851</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=44893851</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44893851</guid></item><item><title><![CDATA[New comment by eklitzke in "Fuse is 95% cheaper and 10x faster than NFS"]]></title><description><![CDATA[
<p>NFS can be super fast. In a past life I had to work a lot with a large distributed system of NetApp Filers (hundreds of filers located around the globe), and they have a lot of fancy logic for locale-aware caching and clustering.<p>That said, all of the open source NFS implementations are either missing this stuff or you'd have to implement it yourself, which would be a lot of work. NetApp Filers are crazy expensive and really annoying to administer. I'm not really surprised that the cloud NFS solutions are all expensive and slow, because truly *needing* NFS is a very niche thing (like, do you really need `flock(2)` to work in a distributed way?).</p>
]]></description><pubDate>Wed, 13 Aug 2025 20:19:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=44893335</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=44893335</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44893335</guid></item><item><title><![CDATA[New comment by eklitzke in "HTTP/3 is everywhere but nowhere"]]></title><description><![CDATA[
<p>This vastly oversimplifies the problem: the difference between IPv4 and IPv6 is not just the format of the address. The protocols have different features, which is why the sockaddr_in and sockaddr_in6 types don't just differ in the address field. Plus, the vast majority of network programs use higher-level abstractions; even in C or C++, a lot of people use a network library like libevent or asio to handle these details (especially if you want code that easily works with TLS).</p>
]]></description><pubDate>Tue, 18 Mar 2025 06:19:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=43396274</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=43396274</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43396274</guid></item><item><title><![CDATA[Agenoria]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/jiuguangw/Agenoria">https://github.com/jiuguangw/Agenoria</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=42966260">https://news.ycombinator.com/item?id=42966260</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 06 Feb 2025 20:39:44 +0000</pubDate><link>https://github.com/jiuguangw/Agenoria</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=42966260</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42966260</guid></item><item><title><![CDATA[New comment by eklitzke in "Borrow Checking, RC, GC, and Eleven Other Memory Safety Approaches"]]></title><description><![CDATA[
<p>I don't understand why they say that reference counting is "slow". Slow compared to what? Atomic increments and decrements of integers are among the fastest operations you can do on modern x86 and ARM hardware, and except in pathological cases they will pretty much always be faster than the pointer chasing done in traditional mark-and-sweep VMs.<p>This isn't to say reference counting is without problems (there are plenty, the inability to collect cyclical references being the best known), but I don't normally think of it as a slow technique, particularly on modern CPUs.</p>
]]></description><pubDate>Fri, 20 Dec 2024 00:13:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=42467121</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=42467121</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42467121</guid></item><item><title><![CDATA[House committee approves bill requiring new cars to have AM radio]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.theverge.com/2024/9/18/24248137/am-radio-bill-house-energy-commerce-ev-interference">https://www.theverge.com/2024/9/18/24248137/am-radio-bill-house-energy-commerce-ev-interference</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=41596561">https://news.ycombinator.com/item?id=41596561</a></p>
<p>Points: 59</p>
<p># Comments: 53</p>
]]></description><pubDate>Thu, 19 Sep 2024 21:29:26 +0000</pubDate><link>https://www.theverge.com/2024/9/18/24248137/am-radio-bill-house-energy-commerce-ev-interference</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=41596561</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41596561</guid></item><item><title><![CDATA[New comment by eklitzke in "Intel SGX Fuse Key0, a.k.a. Root Provisioning Key Was Extracted by Researchers"]]></title><description><![CDATA[
<p>Just to expand on this, since it wasn't originally clear to me from reading your post: the contact privacy feature is about using SGX enclaves to populate your known contacts on Signal. When you log into Signal for the first time, your phone locally has all of your contacts, and the Signal app wants to know which of them already have Signal accounts. The secure enclave provides a mechanism where you upload your entire contact list from your phone to the Signal servers, and they send back the subset of those contacts that actually have Signal accounts. The point of the enclave is that this is all done in a way where Signal can't see which contacts you sent them, nor can it determine which contacts were matched and sent back to you.</p>
]]></description><pubDate>Mon, 26 Aug 2024 22:13:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=41362766</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=41362766</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41362766</guid></item><item><title><![CDATA[Yes, there are more driverless Waymos in S.F]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.sfchronicle.com/sf/article/s-f-waymo-robotaxis-19592112.php">https://www.sfchronicle.com/sf/article/s-f-waymo-robotaxis-19592112.php</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=41110972">https://news.ycombinator.com/item?id=41110972</a></p>
<p>Points: 29</p>
<p># Comments: 55</p>
]]></description><pubDate>Tue, 30 Jul 2024 16:28:43 +0000</pubDate><link>https://www.sfchronicle.com/sf/article/s-f-waymo-robotaxis-19592112.php</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=41110972</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41110972</guid></item><item><title><![CDATA[ChromeOS to Migrate to Android-Based Tech Stack]]></title><description><![CDATA[
<p>Article URL: <a href="https://chromeos.dev/en/posts/building-a-faster-smarter-chromebook-experience-with-the-best-of-google">https://chromeos.dev/en/posts/building-a-faster-smarter-chromebook-experience-with-the-best-of-google</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=40662164">https://news.ycombinator.com/item?id=40662164</a></p>
<p>Points: 50</p>
<p># Comments: 9</p>
]]></description><pubDate>Wed, 12 Jun 2024 19:43:13 +0000</pubDate><link>https://chromeos.dev/en/posts/building-a-faster-smarter-chromebook-experience-with-the-best-of-google</link><dc:creator>eklitzke</dc:creator><comments>https://news.ycombinator.com/item?id=40662164</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40662164</guid></item></channel></rss>