<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: EffCompute</title><link>https://news.ycombinator.com/user?id=EffCompute</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 18 Apr 2026 09:14:32 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=EffCompute" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by EffCompute in "Show HN: Oberon System 3 runs natively on Raspberry Pi 3 (with ready SD card)"]]></title><description><![CDATA[
<p>Rochus, your point about LLVM and the 'upper bound' of C optimization is a bitter pill for systems engineers. In my own work, I often hit that wall where I'm trying to express high-level data intent (like vector similarity semantics) but end up fighting the optimizer because it can't prove enough about memory aliasing or data alignment to generate efficient code.<p>I agree with guenthert that higher-level intent should theoretically allow for better optimization, but as you said, without the decades of investment that went into the C backends, it's a David vs. Goliath situation.<p>The 'spiraling complexity' of LLVM you mentioned is exactly why some of us are looking back at leaner designs. For high-density data tasks (like the 5.2M documents in 240MB I'm handling), I'd almost prefer a language that gives me predictable, transparent control over the machine to one that relies on a million-line optimizer to 'guess' what I'm trying to do. It feels like we are at a crossroads between 'massive compilers' and 'predictable languages' again.</p>
]]></description><pubDate>Mon, 13 Apr 2026 16:48:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=47754747</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47754747</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47754747</guid></item><item><title><![CDATA[New comment by EffCompute in "Show HN: Oberon System 3 runs natively on Raspberry Pi 3 (with ready SD card)"]]></title><description><![CDATA[
<p>That benchmark is a great data point, thanks for sharing. The performance parity with unoptimized GCC makes sense, given how much heavy lifting modern LLVM/GCC backends do for C++.<p>Your approach with Micron and the 'language levels' is particularly interesting. One of the biggest hurdles I face in C++ with these high-density vector tasks is exactly that: balancing the raw 'unsafe' pointer arithmetic needed for SIMD and custom memory layouts with the safety needed for the rest of the application.<p>Having those features controlled at the module level (like your Micron levels) sounds like a much cleaner architectural 'contract' than the scattered unsafe blocks or reinterpret_cast mess we often deal with in systems programming. I'll definitely keep an eye on the Micron repository—bridging that gap between Wirth-style safety and C-level performance is something the industry is still clearly struggling with (even with Rust's rise).</p>
]]></description><pubDate>Mon, 13 Apr 2026 04:01:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=47747448</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47747448</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47747448</guid></item><item><title><![CDATA[New comment by EffCompute in "Show HN: Oberon System 3 runs natively on Raspberry Pi 3 (with ready SD card)"]]></title><description><![CDATA[
<p>It's refreshing to see Oberon getting some love on the Pi. There’s a certain 'engineering elegance' in the Wirthian school of thought that we’ve largely lost in modern systems.<p>While working on a C++ vector engine optimized for 5M+ documents in very tight RAM (240MB), I often find myself looking back at how Oberon handled resource management. In an era where a 'hello world' app can pull in 100MB of dependencies, the idea of a full OS that is both human-readable and fits into a few megabytes is more relevant than ever.<p>Rochus, since you’ve worked on the IDE and the kernel: do you think the strictness of Oberon’s type system and its lean philosophy still offers a performance advantage for modern high-density data tasks, or is it primarily an educational 'ideal' at this point?</p>
]]></description><pubDate>Sun, 12 Apr 2026 17:46:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47742391</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47742391</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47742391</guid></item><item><title><![CDATA[New comment by EffCompute in "Pijul a FOSS distributed version control system"]]></title><description><![CDATA[
<p>I think EnPissant has a point regarding the overhead. Mapping semantic dependencies at the patch layer sounds great in theory, but the computational cost of resolving those graphs in a repository with thousands of changes is non-trivial.<p>In my work with high-performance engines, 'on-the-fly' graph resolution is usually the first thing to hit a performance wall compared to simple snapshot-based lookups. Pijul is a brilliant experiment in category theory applied to VCS, but until it can demonstrate that its operations don't degrade with history size, Git's 'dumb' but fast snapshots will likely win the network effect battle.</p>
]]></description><pubDate>Sun, 12 Apr 2026 12:10:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=47738752</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47738752</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47738752</guid></item><item><title><![CDATA[New comment by EffCompute in "Surelock: Deadlock-Free Mutexes for Rust"]]></title><description><![CDATA[
<p>I really agree with jandrewrogers' point about the insularity of the database domain. While working on a custom C++ engine to handle 10M vectors in minimal RAM, I’ve noticed that many 'mainstream' concurrency patterns simply don't scale when cache-locality is your primary bottleneck.<p>In the DB world, we often trade complex locking for deterministic ordering or latch-free structures, but translating those to general-purpose app code (like what this Rust crate tries to do) is where the friction happens. It’s great to see more 'DB-style' rigour (like total ordering for locks) making its way into library design.</p>
]]></description><pubDate>Sat, 11 Apr 2026 18:02:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=47732652</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47732652</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47732652</guid></item><item><title><![CDATA[New comment by EffCompute in "70M vectors searched in 48ms on a single consumer GPU –results you won't believe"]]></title><description><![CDATA[
<p>One thing I'm trying to better understand is where the real limits are.<p>At this point it feels like the bottleneck is less about raw compute and more about how efficiently data is represented and accessed on the GPU.<p>Curious if others have seen similar behavior when pushing large-scale vector search on consumer hardware.</p>
]]></description><pubDate>Wed, 18 Mar 2026 11:37:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=47424388</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47424388</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47424388</guid></item><item><title><![CDATA[New comment by EffCompute in "70M vectors searched in 48ms on a single consumer GPU –results you won't believe"]]></title><description><![CDATA[
<p>Not yet — it's still a personal prototype and I'm actively experimenting with different approaches and optimizations.<p>I’m trying to better understand the limits of what’s possible on consumer hardware before deciding how to package or share it.<p>Happy to share more high-level insights though.</p>
]]></description><pubDate>Tue, 17 Mar 2026 19:15:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47416923</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47416923</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47416923</guid></item><item><title><![CDATA[New comment by EffCompute in "70M vectors searched in 48ms on a single consumer GPU –results you won't believe"]]></title><description><![CDATA[
<p>Quick update:<p>I've been iterating on the approach and managed to push the coarse search further.<p>Currently seeing ~100M vectors scanned in ~10ms on a single RTX 3090 (binary stage only).<p>Still experimenting with trade-offs between speed and recall, but it's interesting how far this can go on consumer hardware.<p>Curious what kind of numbers others are seeing for large-scale vector search on GPUs.</p>
]]></description><pubDate>Tue, 17 Mar 2026 16:27:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=47414887</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47414887</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47414887</guid></item><item><title><![CDATA[70M vectors searched in 48ms on a single consumer GPU –results you won't believe]]></title><description><![CDATA[
<p>I built a prototype GPU-based vector search system that runs locally on a consumer PC.<p>Hardware:<p>RTX 3090<br>consumer CPU<br>NVMe SSD<p>Dataset:<p>~70 million vectors (384 dimensions)<p>Performance:<p>~48 ms search latency for top-k results.<p>This corresponds to roughly ~1.45 billion vector comparisons per second on a single GPU.<p>The system uses a custom GPU kernel and a two-stage search pipeline (binary filtering + floating-point reranking).<p>My goal was to explore whether large-scale vector search could run efficiently on consumer hardware instead of large datacenter clusters.<p>After thousands of hours of work and many failed attempts, the results finally became stable enough to benchmark.<p>I'm currently exploring how far this approach can scale.<p>I'd be very interested to hear how others approach large-scale vector search on consumer hardware.<p>Happy to answer questions.</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47400954">https://news.ycombinator.com/item?id=47400954</a></p>
<p>Points: 1</p>
<p># Comments: 4</p>
]]></description><pubDate>Mon, 16 Mar 2026 16:12:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=47400954</link><dc:creator>EffCompute</dc:creator><comments>https://news.ycombinator.com/item?id=47400954</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47400954</guid></item></channel></rss>