<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: 1932812267</title><link>https://news.ycombinator.com/user?id=1932812267</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 02 May 2026 20:51:44 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=1932812267" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by 1932812267 in "The average college student today"]]></title><description><![CDATA[
<p>One thing that's changed in the past decade is that college professors are now competing against YouTube. There are really bad lecturers in college (and also really good ones!). But now, when you encounter a bad one, that's okay--you can watch lectures online.</p>
]]></description><pubDate>Sun, 30 Mar 2025 20:42:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=43527446</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43527446</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43527446</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>I've seen the talk! The issue with using a global lock on a global work queue is that, unless the work items have drastically different compute times, there _will_ be high contention on the lock.<p>I ran a benchmark [1], which bears this out.<p>Results on a quad-core Intel Linux box:<pre><code>$ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
Benchmark 1: target/release/testit
  Time (mean ± σ):      2.526 s ±  0.139 s    [User: 4.709 s, System: 11.425 s]
  Range (min … max):    2.391 s …  2.730 s    10 runs

Benchmark 2: env USE_RAYON=1 target/release/testit
  Time (mean ± σ):     174.1 ms ±   0.9 ms    [User: 212.1 ms, System: 121.1 ms]
  Range (min … max):   173.1 ms … 175.4 ms    16 runs

Summary
  env USE_RAYON=1 target/release/testit ran
   14.51 ± 0.80 times faster than target/release/testit
</code></pre><p>Results on an M1 Pro:<pre><code>$ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
Benchmark 1: target/release/testit
  Time (mean ± σ):     692.2 ms ±   8.3 ms    [User: 491.4 ms, System: 5693.6 ms]
  Range (min … max):   683.2 ms … 704.5 ms    10 runs

Benchmark 2: env USE_RAYON=1 target/release/testit
  Time (mean ± σ):      63.0 ms ±   2.1 ms    [User: 97.7 ms, System: 47.0 ms]
  Range (min … max):    61.0 ms …  71.2 ms    44 runs

Summary
  env USE_RAYON=1 target/release/testit ran
   10.99 ± 0.39 times faster than target/release/testit
</code></pre><p>[1]: <a href="https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=b9a84b823c023f13f65b4e23ddf38e3b" rel="nofollow">https://play.rust-lang.org/?version=stable&mode=debug&editio...</a> (I'm just using the Rust playground as a pastebin; the actual benchmarks were run locally)</p>
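<p>The gist linked above is the real source, but a rough sketch of the shape of the comparison (USE_RAYON toggling between one Mutex-guarded global queue and rayon's work-stealing; the sizes and names here are hypothetical, not the gist's actual code) looks like:<pre><code>use std::sync::Mutex;
use rayon::prelude::*; // work-stealing path

// A deliberately tiny work item, so queue overhead dominates.
fn work(x: u64) -> u64 {
    x.wrapping_mul(2654435761)
}

fn main() {
    let items: Vec<u64> = (0..10_000_000).collect();

    let total: u64 = if std::env::var("USE_RAYON").is_ok() {
        // Per-thread deques; threads only interact when stealing.
        items.par_iter().map(|&x| work(x)).sum()
    } else {
        // One global queue behind one global lock: every pop contends.
        let threads = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(4);
        let queue = Mutex::new(items);
        let sum = Mutex::new(0u64);
        std::thread::scope(|s| {
            for _ in 0..threads {
                s.spawn(|| {
                    let mut local = 0u64;
                    loop {
                        let item = queue.lock().unwrap().pop();
                        match item {
                            Some(x) => local = local.wrapping_add(work(x)),
                            None => break,
                        }
                    }
                    *sum.lock().unwrap() += local;
                });
            }
        });
        sum.into_inner().unwrap()
    };
    println!("{total}");
}
</code></pre>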
]]></description><pubDate>Sun, 30 Mar 2025 20:18:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=43527255</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43527255</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43527255</guid></item><item><title><![CDATA[New comment by 1932812267 in "Tail Call Recursion in Java with ASM (2023)"]]></title><description><![CDATA[
<p>Clojure's loop/recur is specifically tail recursion, like Scala's @tailrec or the optimization described in the blog post. It doesn't use trampolines to enable tail calls that aren't tail recursion.</p>
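<p>To make the distinction concrete, here's a minimal Rust sketch (an illustration of the general idea, not Clojure's actual compiler output) of how a self tail call gets rewritten as a loop:<pre><code>// Tail-recursive form: the recursive call is the last thing that happens.
fn gcd_rec(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd_rec(b, a % b) } // self tail call
}

// The rewrite loop/recur-style constructs perform: reassign the
// parameters and jump back to the top. No stack growth.
fn gcd_loop(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

fn main() {
    assert_eq!(gcd_rec(48, 18), gcd_loop(48, 18)); // both 6
}
</code></pre>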
]]></description><pubDate>Sun, 30 Mar 2025 19:49:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=43526996</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43526996</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43526996</guid></item><item><title><![CDATA[New comment by 1932812267 in "Tail Call Recursion in Java with ASM (2023)"]]></title><description><![CDATA[
<p>This isn't a _general_ tail call optimization--just tail recursion. The issue is that this won't support mutual tail recursion. For example:<pre><code>(defun func-a (x) (func-b (- x 34)))
(defun func-b (x) (cond ((<= 0 x) x)
                        (t (func-a (- x 3)))))
</code></pre><p>Because func-a and func-b are different (JVM) functions, you'd need an inter-procedural goto (i.e., a tail call) to implement this natively.<p>As an alternative, some implementations use a trampoline: func-a and func-b return a _value_ that says which function to call (and with what arguments) for the next step of the computation. The trampoline then calls the appropriate function. Because func-a and func-b _return_ instead of actually calling their sibling, the stack depth stays constant, and the trampoline takes care of the dispatch.</p>
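<p>A minimal sketch of that trampoline shape in Rust (hypothetical illustration, with is_even/is_odd standing in for func-a/func-b; not the blog post's code):<pre><code>// Each "tail call" is returned as a value instead of being made directly.
enum Step {
    Done(bool),
    Call(Box<dyn FnOnce() -> Step>),
}

fn is_even(n: u64) -> Step {
    if n == 0 { Step::Done(true) } else { Step::Call(Box::new(move || is_odd(n - 1))) }
}

fn is_odd(n: u64) -> Step {
    if n == 0 { Step::Done(false) } else { Step::Call(Box::new(move || is_even(n - 1))) }
}

// The trampoline does the dispatch: each bounce returns here first,
// so stack depth stays constant no matter how many mutual "calls" occur.
fn trampoline(mut step: Step) -> bool {
    loop {
        match step {
            Step::Done(v) => return v,
            Step::Call(f) => step = f(),
        }
    }
}

fn main() {
    // A million and one bounces without overflowing the stack.
    println!("{}", trampoline(is_even(1_000_001)));
}
</code></pre>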
]]></description><pubDate>Sun, 30 Mar 2025 16:25:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=43525333</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43525333</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43525333</guid></item><item><title><![CDATA[New comment by 1932812267 in "Tail Call Recursion in Java with ASM (2023)"]]></title><description><![CDATA[
<p>Scala has used this technique for years via scala.annotation.tailrec. Regardless, it's cool to see it implemented as a bytecode pass.</p>
]]></description><pubDate>Sun, 30 Mar 2025 16:17:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43525259</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43525259</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43525259</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>Sure! However, the work-stealing queue in rayon [1] uses three atomic operations where a mutex needs two. The difference is that the three atomic operations on the thread-local queue should be uncontended, whereas a global lock on a global work queue would see contention from every thread polling it for jobs.<p>Between "a single work-sharing queue with a big mutex on it that all threads access for work" and "per-thread work-stealing queues that are uncontended for the cost of one extra atomic," in what situations would the work-sharing queue with the global mutex outperform? Perhaps when there's a small number of jobs, and there isn't enough time for the work-stealing algorithm to distribute jobs to the worker threads before the work-sharing algorithm has already finished.<p>[1]: <a href="https://github.com/crossbeam-rs/crossbeam/blob/423e46fe204718785af99a2d68a52092463d0167/crossbeam-deque/src/deque.rs#L444-L481" rel="nofollow">https://github.com/crossbeam-rs/crossbeam/blob/423e46fe20471...</a></p>
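<p>A minimal sketch of the deque API linked above (assuming the crossbeam-deque crate; the jobs are arbitrary placeholders), showing the owner working its own queue while a thief steals from the opposite end:<pre><code>use crossbeam_deque::{Steal, Worker};

fn main() {
    let local = Worker::new_lifo(); // this thread's own queue
    let stealer = local.stealer();  // handle other threads use to steal

    for job in 0..4 {
        local.push(job);
    }

    // The owner pops from its own end; uncontended in the common case.
    assert_eq!(local.pop(), Some(3));

    // A thief steals from the opposite (FIFO) end.
    match stealer.steal() {
        Steal::Success(job) => println!("stole job {job}"),
        _ => println!("nothing to steal"),
    }
}
</code></pre>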
]]></description><pubDate>Sun, 30 Mar 2025 16:15:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=43525243</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43525243</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43525243</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>While it's true that par_iter() uses a concurrent data structure under the hood, it's specifically designed around work-stealing so that threads rarely need to communicate.<p>Why would putting a lock over a global work queue be faster than per-thread work queues that don't require inter-thread communication (except when work-stealing is actually needed)?</p>
]]></description><pubDate>Sun, 30 Mar 2025 14:41:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=43524568</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43524568</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43524568</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>I love your username, btw :)</p>
]]></description><pubDate>Sun, 30 Mar 2025 04:52:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=43521446</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43521446</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43521446</guid></item><item><title><![CDATA[New comment by 1932812267 in "Towards fearless SIMD, 7 years later"]]></title><description><![CDATA[
<p>I've written a fair bit of SIMD code in Rust, and it definitely had lots of sore spots.<p>The main advantage was that, because Rust doesn't use TBAA (type-based alias analysis), it's completely legal (and safe, if you use bytemuck) to freely cast pointers and values around. TBAA in C++ makes it much easier to hit undefined behavior.<p>But also, because of various miscompilations, Rust refuses (or at least refused) to pass SIMD arguments in registers, so every non-inlined function call passes arguments via the stack. There were also miscompilations when enabling a target_feature for just one function, so we ended up passing `-C target-cpu=...` globally, and if we wanted to support a different microarchitecture, we just recompiled the whole program. On top of that, there's no good way to check which microarchitecture you're compiling for, so we had to resort to specifying the target CPU in multiple places, with comments reminding us to keep them in sync.</p>
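<p>A tiny sketch of the bytemuck-style casting mentioned above (assuming the bytemuck crate; the data is arbitrary):<pre><code>use bytemuck::cast_slice;

fn main() {
    let words: &[u32] = &[1, 2, 3, 4];

    // Reinterpret the same 16 bytes as u8s: no copy, and because both
    // types are Pod, no TBAA-style undefined behavior to worry about.
    let bytes: &[u8] = cast_slice(words);
    assert_eq!(bytes.len(), 16);

    // On a little-endian target, the first word's bytes are [1, 0, 0, 0].
    println!("{:?}", &bytes[..4]);
}
</code></pre>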
]]></description><pubDate>Sun, 30 Mar 2025 04:44:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=43521425</link><dc:creator>1932812267</dc:creator><comments>https://news.ycombinator.com/item?id=43521425</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43521425</guid></item></channel></rss>