<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: ot</title><link>https://news.ycombinator.com/user?id=ot</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 11 Apr 2026 08:08:38 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=ot" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by ot in "Meta’s renewed commitment to jemalloc"]]></title><description><![CDATA[
<p>That's a false dichotomy: you optimize both the application <i>and</i> the allocator.<p>A 0.5% improvement may not be a lot to you, but at hyperscaler scale it's well worth staffing a team to work on it, with the added benefit of having people on hand who can investigate subtle bugs and pathological perf behaviors.</p>
]]></description><pubDate>Tue, 17 Mar 2026 06:48:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=47409434</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=47409434</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47409434</guid></item><item><title><![CDATA[New comment by ot in "Meta’s renewed commitment to jemalloc"]]></title><description><![CDATA[
<p>It's not just that zeroing got cheaper; we're also doing a lot less of it, because jemalloc got much better.<p>If the allocator returns a page to the kernel and then immediately asks for one back, it's not doing its job well: the main purpose of the allocator is to cache allocations from the kernel. Those patches predate decay and the background purging thread, changes that significantly improved how jemalloc holds on to memory that might be needed soon; the zeroing patches instead optimize for the pathological behavior.<p>Also, the kernel has since exposed better ways to optimize memory reclamation, like MADV_FREE, which is a "lazy reclaim": the page stays mapped in the process until the kernel actually needs it, so if we use it again before that happens, the whole unmapping/mapping cycle is avoided, which saves not only the zeroing cost but also the TLB shootdown and other overheads, without changing any security boundary. jemalloc can take advantage of this by enabling "muzzy decay".<p>However, the drawback is that system-level memory accounting becomes even fuzzier.<p>(hi Alex!)</p>
]]></description><pubDate>Tue, 17 Mar 2026 06:46:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47409412</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=47409412</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47409412</guid></item><item><title><![CDATA[New comment by ot in "The “JVG algorithm” only wins on tiny numbers"]]></title><description><![CDATA[
<p>RSA was also not given that name by its authors; the name came later, as is usually the case.<p>In the original paper they do not give it any name: <a href="https://people.csail.mit.edu/rivest/Rsapaper.pdf" rel="nofollow">https://people.csail.mit.edu/rivest/Rsapaper.pdf</a></p>
]]></description><pubDate>Tue, 10 Mar 2026 02:11:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=47318353</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=47318353</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47318353</guid></item><item><title><![CDATA[New comment by ot in "RE#: how we built the fastest regex engine in F#"]]></title><description><![CDATA[
<p>Here RE2 does not fall back to the NFA, it just resets the Lazy DFA cache and starts growing it again. The latency spikes I was mentioning are due to the cost of destroying the cache (involving deallocations, pointer chasing, ...)</p>
]]></description><pubDate>Mon, 09 Mar 2026 21:20:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47315703</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=47315703</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47315703</guid></item><item><title><![CDATA[New comment by ot in "RE#: how we built the fastest regex engine in F#"]]></title><description><![CDATA[
<p>> are there eviction techniques to guard against this?<p>RE2 resets the cache when it reaches a (configurable) size limit. I found this out the hard way when I had to debug almost-periodic latency spikes in a service I managed: a very inefficient regex caused linear growth in the Lazy DFA until it hit the limit, then all threads had to wait a few hundred milliseconds for the reset, and then it all started again.<p>I'm not sure if dropping the whole cache is the only feasible mitigation, or whether some gradual pruning would also be possible.<p>Either way, if you cannot assume that your cache grows monotonically, synchronization becomes more complicated: the trick mentioned in the other comment about only locking the slow path may no longer be applicable. RE2 uses RW-locking for this.</p>
]]></description><pubDate>Mon, 09 Mar 2026 01:39:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=47303831</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=47303831</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47303831</guid></item><item><title><![CDATA[Generate evolving textures by blending images]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/apresta/undula">https://github.com/apresta/undula</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47276180">https://news.ycombinator.com/item?id=47276180</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 06 Mar 2026 15:31:02 +0000</pubDate><link>https://github.com/apresta/undula</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=47276180</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47276180</guid></item><item><title><![CDATA[Apache Iggy's migration to thread-per-core architecture powered by io_uring]]></title><description><![CDATA[
<p>Article URL: <a href="https://iggy.apache.org/blogs/2026/02/27/thread-per-core-io_uring/">https://iggy.apache.org/blogs/2026/02/27/thread-per-core-io_uring/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47215597">https://news.ycombinator.com/item?id=47215597</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 02 Mar 2026 09:25:38 +0000</pubDate><link>https://iggy.apache.org/blogs/2026/02/27/thread-per-core-io_uring/</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=47215597</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47215597</guid></item><item><title><![CDATA[New comment by ot in "Read Locks Are Not Your Friends"]]></title><description><![CDATA[
<p>This is drawing broad conclusions from one specific RW mutex implementation. Other implementations adopt techniques, such as per-core state, that make readers scale linearly in the read-mostly case (the drawback is that write locks need to scan that state).<p>One example is folly::SharedMutex, which is very battle-tested: <a href="https://uvdn7.github.io/shared-mutex/" rel="nofollow">https://uvdn7.github.io/shared-mutex/</a><p>There are more sophisticated techniques, such as RCU or hazard pointers, that make synchronization overhead almost negligible for readers, but they generally require designing the algorithms around them and are not drop-in replacements for a simple mutex, so a good RW mutex implementation is a reasonable default.</p>
]]></description><pubDate>Wed, 25 Feb 2026 12:39:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47150745</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=47150745</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47150745</guid></item><item><title><![CDATA[New comment by ot in "Every book recommended on the Odd Lots Discord"]]></title><description><![CDATA[
<p>Glad that Moby Dick is in there.</p>
]]></description><pubDate>Mon, 09 Feb 2026 13:15:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=46944872</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46944872</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46944872</guid></item><item><title><![CDATA[The Anthropic Hive Mind]]></title><description><![CDATA[
<p>Article URL: <a href="https://steve-yegge.medium.com/the-anthropic-hive-mind-d01f768f3d7b">https://steve-yegge.medium.com/the-anthropic-hive-mind-d01f768f3d7b</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46934590">https://news.ycombinator.com/item?id=46934590</a></p>
<p>Points: 4</p>
<p># Comments: 1</p>
]]></description><pubDate>Sun, 08 Feb 2026 14:41:30 +0000</pubDate><link>https://steve-yegge.medium.com/the-anthropic-hive-mind-d01f768f3d7b</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46934590</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46934590</guid></item><item><title><![CDATA[Designing AI resistant technical evaluations]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.anthropic.com/engineering/AI-resistant-technical-evaluations">https://www.anthropic.com/engineering/AI-resistant-technical-evaluations</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46733759">https://news.ycombinator.com/item?id=46733759</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 23 Jan 2026 15:37:32 +0000</pubDate><link>https://www.anthropic.com/engineering/AI-resistant-technical-evaluations</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46733759</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46733759</guid></item><item><title><![CDATA[New comment by ot in "A 40-line fix eliminated a 400x performance gap"]]></title><description><![CDATA[
<p>> Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?<p>Yes, that's exactly what a seqlock (reader) is.</p>
]]></description><pubDate>Thu, 15 Jan 2026 17:08:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=46635673</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46635673</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46635673</guid></item><item><title><![CDATA[New comment by ot in "A 40-line fix eliminated a 400x performance gap"]]></title><description><![CDATA[
<p>Yes, you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)</p>
]]></description><pubDate>Wed, 14 Jan 2026 01:05:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=46610998</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46610998</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46610998</guid></item><item><title><![CDATA[New comment by ot in "A 40-line fix eliminated a 400x performance gap"]]></title><description><![CDATA[
<p>You can go even faster, down to about 8ns (almost another 10x improvement), by using software perf events: PERF_COUNT_SW_TASK_CLOCK is thread CPU time, it can be read through a shared page (so no syscall, see perf_event_mmap_page), and then you add the delta since the last context switch with a single rdtsc inside a seqlock.<p>This is not well documented, unfortunately, and I'm not aware of open-source implementations of it.<p>EDIT: Or maybe not; I'm not sure if PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. It definitely works for overall thread CPU time, though.</p>
]]></description><pubDate>Wed, 14 Jan 2026 00:53:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=46610886</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46610886</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46610886</guid></item><item><title><![CDATA[New comment by ot in "A 40-line fix eliminated a 400x performance gap"]]></title><description><![CDATA[
<p>If you look below the vDSO frame, there is still a syscall. I think that the vDSO implementation is missing a fast path for this particular clock id (it could be implemented though).</p>
]]></description><pubDate>Wed, 14 Jan 2026 00:48:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=46610846</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46610846</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46610846</guid></item><item><title><![CDATA[New comment by ot in "Swapping two blocks of memory inside a larger block, in constant memory"]]></title><description><![CDATA[
<p>That's probably true for small primitive types, but if your objects are expensive to move (like a large struct) it might be beneficial to minimize swaps.</p>
]]></description><pubDate>Tue, 06 Jan 2026 17:01:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=46514936</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46514936</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46514936</guid></item><item><title><![CDATA[New comment by ot in "Over 600 job openings at Apple for Vision Pro"]]></title><description><![CDATA[
<p>Yeah, was just about to edit the comment :)</p>
]]></description><pubDate>Mon, 05 Jan 2026 16:42:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=46501048</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46501048</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46501048</guid></item><item><title><![CDATA[New comment by ot in "Over 600 job openings at Apple for Vision Pro"]]></title><description><![CDATA[
<p>The query is incorrect: it will return any posting that contains the words "vision" and "pro", not necessarily consecutively.<p>It looks like phrasal search is supported; searching "vision pro" in quotes returns only 212 results worldwide:<p><a href="https://jobs.apple.com/en-us/search?search=%22vision+pro%22&sort=relevance" rel="nofollow">https://jobs.apple.com/en-us/search?search=%22vision+pro%22&...</a><p>Spot-checked a few and they all seem to be Vision Pro related.<p>--<p>EDIT: Actually even this is not accurate, as it matches postings containing sentences like<p>> Fundamental to the success of iPhone, iPad, Apple Watch, Apple TV, Vision Pro, and Mac ...<p>that are not specific to Vision Pro.<p>However, we can filter on products and services for Vision Pro and visionOS, which gives 106 results:<p><a href="https://jobs.apple.com/en-us/search?search=%22vision+pro%22&sort=relevance&product=apple-vision-pro-AVPRO+visionos-VISOS" rel="nofollow">https://jobs.apple.com/en-us/search?search=%22vision+pro%22&...</a></p>
]]></description><pubDate>Mon, 05 Jan 2026 16:38:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=46500982</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46500982</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46500982</guid></item><item><title><![CDATA[New comment by ot in "Trying Out C++26 Executors"]]></title><description><![CDATA[
<p>> can avoid or defer a lot of the expected memory allocations of async operations<p>Is this true in realistic use cases, or only in minimal demos? From what I've seen, as soon as your code is complex enough to need two compilation units, you need some higher-level async abstraction, like coroutines.<p>And as soon as you have coroutines, you need to type-erase both the senders and the scheduler, so you have at least a couple of allocations per continuation.</p>
]]></description><pubDate>Wed, 03 Dec 2025 11:16:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=46133156</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46133156</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46133156</guid></item><item><title><![CDATA[New comment by ot in "Mistral 3 family of models released"]]></title><description><![CDATA[
<p>Is it so hard for people to understand that Europe is a continent, the EU is a union of European countries, and the two are not the same?</p>
]]></description><pubDate>Tue, 02 Dec 2025 15:49:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=46122427</link><dc:creator>ot</dc:creator><comments>https://news.ycombinator.com/item?id=46122427</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46122427</guid></item></channel></rss>