<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: 1932812267</title><link>https://news.ycombinator.com/user?id=1932812267</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 02 May 2026 20:51:44 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=1932812267" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by 1932812267 in "The average college student today"]]></title><description><![CDATA[
<p>One thing that's changed in the past decade is that college professors are now competing against YouTube. There are really bad lecturers in college (and also really good ones!). But now, when you encounter a bad one, that's okay--you can watch lectures online.</p>
]]></description><pubDate>Sun, 30 Mar 2025 20:42:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=43527446</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43527446</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43527446</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>I've seen the talk! The issue with using a global lock on a global work queue is that, unless the work items have drastically different compute times, there _will_ be high contention on the lock.<p>I ran a benchmark [1], which bears this out.<p>Results on a quad-core Intel Linux box:<pre><code>$ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
Benchmark 1: target/release/testit
  Time (mean ± σ):      2.526 s ±  0.139 s    [User: 4.709 s, System: 11.425 s]
  Range (min … max):    2.391 s …  2.730 s    10 runs

Benchmark 2: env USE_RAYON=1 target/release/testit
  Time (mean ± σ):     174.1 ms ±   0.9 ms    [User: 212.1 ms, System: 121.1 ms]
  Range (min … max):   173.1 ms … 175.4 ms    16 runs

Summary
  env USE_RAYON=1 target/release/testit ran
   14.51 ± 0.80 times faster than target/release/testit
</code></pre><p>Results on an M1 Pro:<pre><code>$ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
Benchmark 1: target/release/testit
  Time (mean ± σ):     692.2 ms ±   8.3 ms    [User: 491.4 ms, System: 5693.6 ms]
  Range (min … max):   683.2 ms … 704.5 ms    10 runs

Benchmark 2: env USE_RAYON=1 target/release/testit
  Time (mean ± σ):      63.0 ms ±   2.1 ms    [User: 97.7 ms, System: 47.0 ms]
  Range (min … max):    61.0 ms …  71.2 ms    44 runs

Summary
  env USE_RAYON=1 target/release/testit ran
   10.99 ± 0.39 times faster than target/release/testit
</code></pre><p>[1]: <a href="https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=b9a84b823c023f13f65b4e23ddf38e3b" rel="nofollow">https://play.rust-lang.org/?version=stable&mode=debug&editio...</a> (I'm just using the Rust playground as a pastebin; the actual benchmarks were run locally)</p>
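<p>The gist linked above is the real source, but a rough sketch of the shape of the comparison (USE_RAYON toggling between one Mutex-guarded global queue and rayon's work-stealing; the sizes and names here are hypothetical, not the gist's actual code) looks like:<pre><code>use std::sync::Mutex;
use rayon::prelude::*; // work-stealing path

// A deliberately tiny work item, so queue overhead dominates.
fn work(x: u64) -> u64 {
    x.wrapping_mul(2654435761)
}

fn main() {
    let items: Vec<u64> = (0..10_000_000).collect();

    let total: u64 = if std::env::var("USE_RAYON").is_ok() {
        // Per-thread deques; threads only interact when stealing.
        items.par_iter().map(|&x| work(x)).sum()
    } else {
        // One global queue behind one global lock: every pop contends.
        let threads = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(4);
        let queue = Mutex::new(items);
        let sum = Mutex::new(0u64);
        std::thread::scope(|s| {
            for _ in 0..threads {
                s.spawn(|| {
                    let mut local = 0u64;
                    loop {
                        let item = queue.lock().unwrap().pop();
                        match item {
                            Some(x) => local = local.wrapping_add(work(x)),
                            None => break,
                        }
                    }
                    *sum.lock().unwrap() += local;
                });
            }
        });
        sum.into_inner().unwrap()
    };
    println!("{total}");
}
</code></pre>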
]]></description><pubDate>Sun, 30 Mar 2025 20:18:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=43527255</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43527255</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43527255</guid></item><item><title><![CDATA[New comment by 1932812267 in "Tail Call Recursion in Java with ASM (2023)"]]></title><description><![CDATA[
<p>Clojure's loop/recur is specifically tail recursion, like Scala's @tailrec or the optimization described in the blog post. It doesn't use trampolines to enable tail calls that aren't tail recursion.</p>
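<p>To make the distinction concrete, here's a minimal Rust sketch (an illustration of the general idea, not Clojure's actual compiler output) of how a self tail call gets rewritten as a loop:<pre><code>// Tail-recursive form: the recursive call is the last thing that happens.
fn gcd_rec(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd_rec(b, a % b) } // self tail call
}

// The rewrite loop/recur-style constructs perform: reassign the
// parameters and jump back to the top. No stack growth.
fn gcd_loop(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

fn main() {
    assert_eq!(gcd_rec(48, 18), gcd_loop(48, 18)); // both 6
}
</code></pre>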
]]></description><pubDate>Sun, 30 Mar 2025 19:49:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=43526996</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43526996</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43526996</guid></item><item><title><![CDATA[New comment by 1932812267 in "Tail Call Recursion in Java with ASM (2023)"]]></title><description><![CDATA[
<p>This isn't a _general_ tail call optimization--just tail recursion. The issue is that this won't support mutual tail recursion. For example:<pre><code>(defun func-a (x) (func-b (- x 34)))
(defun func-b (x) (cond ((<= 0 x) x)
                        (t (func-a (- x 3)))))
</code></pre><p>Because func-a and func-b are different (JVM) functions, you'd need an inter-procedural goto (i.e., a tail call) to implement this natively.<p>As an alternative, some implementations use a trampoline: func-a and func-b return a _value_ that says which function to call (and with what arguments) for the next step of the computation. The trampoline then calls the appropriate function. Because func-a and func-b _return_ instead of actually calling their sibling, the stack depth stays constant, and the trampoline takes care of the dispatch.</p>
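<p>A minimal sketch of that trampoline shape in Rust (hypothetical illustration, with is_even/is_odd standing in for func-a/func-b; not the blog post's code):<pre><code>// Each "tail call" is returned as a value instead of being made directly.
enum Step {
    Done(bool),
    Call(Box<dyn FnOnce() -> Step>),
}

fn is_even(n: u64) -> Step {
    if n == 0 { Step::Done(true) } else { Step::Call(Box::new(move || is_odd(n - 1))) }
}

fn is_odd(n: u64) -> Step {
    if n == 0 { Step::Done(false) } else { Step::Call(Box::new(move || is_even(n - 1))) }
}

// The trampoline does the dispatch: each bounce returns here first,
// so stack depth stays constant no matter how many mutual "calls" occur.
fn trampoline(mut step: Step) -> bool {
    loop {
        match step {
            Step::Done(v) => return v,
            Step::Call(f) => step = f(),
        }
    }
}

fn main() {
    // A million and one bounces without overflowing the stack.
    println!("{}", trampoline(is_even(1_000_001)));
}
</code></pre>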
]]></description><pubDate>Sun, 30 Mar 2025 16:25:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=43525333</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43525333</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43525333</guid></item><item><title><![CDATA[New comment by 1932812267 in "Tail Call Recursion in Java with ASM (2023)"]]></title><description><![CDATA[
<p>Scala has used this technique for years via scala.annotation.tailrec. Regardless, it's cool to see it implemented as a bytecode pass.</p>
]]></description><pubDate>Sun, 30 Mar 2025 16:17:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43525259</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43525259</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43525259</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>Sure! However, the work-stealing queue in rayon [1] uses three atomic operations where a mutex needs two. The difference is that the three atomic operations on the thread-local queue should be uncontended, whereas a global lock on a global work queue would see contention from every thread polling it for jobs.<p>Between "a single work-sharing queue with a big mutex on it that all threads access for work" and "per-thread work-stealing queues that are uncontended for the cost of one extra atomic," in what situations would the work-sharing queue with the global mutex outperform? Perhaps when there's a small number of jobs, and there isn't enough time for the work-stealing algorithm to distribute jobs to the worker threads before the work-sharing algorithm has already finished.<p>[1]: <a href="https://github.com/crossbeam-rs/crossbeam/blob/423e46fe204718785af99a2d68a52092463d0167/crossbeam-deque/src/deque.rs#L444-L481" rel="nofollow">https://github.com/crossbeam-rs/crossbeam/blob/423e46fe20471...</a></p>
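<p>A minimal sketch of the deque API linked above (assuming the crossbeam-deque crate; the jobs are arbitrary placeholders), showing the owner working its own queue while a thief steals from the opposite end:<pre><code>use crossbeam_deque::{Steal, Worker};

fn main() {
    let local = Worker::new_lifo(); // this thread's own queue
    let stealer = local.stealer();  // handle other threads use to steal

    for job in 0..4 {
        local.push(job);
    }

    // The owner pops from its own end; uncontended in the common case.
    assert_eq!(local.pop(), Some(3));

    // A thief steals from the opposite (FIFO) end.
    match stealer.steal() {
        Steal::Success(job) => println!("stole job {job}"),
        _ => println!("nothing to steal"),
    }
}
</code></pre>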
]]></description><pubDate>Sun, 30 Mar 2025 16:15:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=43525243</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43525243</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43525243</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>While it's true that par_iter() uses a concurrent data structure under the hood, it's specifically designed around work-stealing so that threads rarely need to communicate.<p>Why would putting a lock over a global work queue be faster than per-thread work queues that don't require inter-thread communication (except when work-stealing is actually needed)?</p>
]]></description><pubDate>Sun, 30 Mar 2025 14:41:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=43524568</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43524568</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43524568</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>I love your username, btw :)</p>
]]></description><pubDate>Sun, 30 Mar 2025 04:52:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=43521446</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43521446</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43521446</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>I've written a fair bit of SIMD code in Rust, and it definitely had lots of sore spots.<p>The main advantage was that, because Rust doesn't use TBAA (type-based alias analysis), it's completely legal (and safe, if you use bytemuck) to freely cast pointers and values around. TBAA in C++ makes it much easier to hit undefined behavior.<p>But also, because of various miscompilations, Rust refuses (or at least refused) to pass SIMD arguments in registers, so every non-inlined function call passes arguments via the stack. There were also miscompilations when enabling a target_feature for just one function, so we ended up passing `-C target-cpu=...` globally, and if we wanted to support a different microarchitecture, we just recompiled the whole program. On top of that, there's no good way to check which microarchitecture you're compiling for, so we had to resort to specifying the target CPU in multiple places, with comments reminding us to keep them in sync.</p>
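<p>A tiny sketch of the bytemuck-style casting mentioned above (assuming the bytemuck crate; the data is arbitrary):<pre><code>use bytemuck::cast_slice;

fn main() {
    let words: &[u32] = &[1, 2, 3, 4];

    // Reinterpret the same 16 bytes as u8s: no copy, and because both
    // types are Pod, no TBAA-style undefined behavior to worry about.
    let bytes: &[u8] = cast_slice(words);
    assert_eq!(bytes.len(), 16);

    // On a little-endian target, the first word's bytes are [1, 0, 0, 0].
    println!("{:?}", &bytes[..4]);
}
</code></pre>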
]]></description><pubDate>Sun, 30 Mar 2025 04:44:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=43521425</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43521425</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43521425</guid></item></channel></rss>