<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: shwestrick</title><link>https://news.ycombinator.com/user?id=shwestrick</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 30 May 2026 00:19:07 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=shwestrick" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by shwestrick in "We shouldn't have needed lockfiles"]]></title><description><![CDATA[
<p>I like this example.<p>The client who didn't notice a difference would probably call it a bugfix.<p>The client whose software got ever-so-slightly more reliable probably would call it a minor update.<p>The client whose software previously was loading large files (luckily) without issue would call it major, because now their software just doesn't work anymore.</p>
]]></description><pubDate>Wed, 06 Aug 2025 22:54:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=44818805</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=44818805</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44818805</guid></item><item><title><![CDATA[New comment by shwestrick in "Enum of Arrays"]]></title><description><![CDATA[
<p>Worth mentioning that you can always safely switch between AoS and SoA. Either can represent the other; all you've done is transpose the data. The same is not true of AoE/EoA. The AoE [Spam1, Egg1, Spam2, Spam3, Egg2] has no corresponding EoA that can represent it.<p>What they're actually doing is an AoE => AoEoA transformation: find batches elements with the same tag and reorder the elements so that redundant tags can be eliminated. Essentially, a kind of run-length encoding. It's a nice idea.</p>
]]></description><pubDate>Sun, 22 Dec 2024 01:31:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=42483769</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=42483769</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42483769</guid></item><item><title><![CDATA[New comment by shwestrick in "Spice: Fine-grained parallelism with sub-nanosecond overhead in Zig"]]></title><description><![CDATA[
<p>For those curious, this implementation is based on a recent line of research called "heartbeat scheduling" which amortizes the overheads of creating parallelism, essentially accomplishing a kind of dynamic automatic granularity control.<p>Related papers:<p>(2018) Heartbeat Scheduling:
Provable Efficiency for Nested Parallelism. <a href="https://www.andrew.cmu.edu/user/mrainey/papers/heartbeat.pdf" rel="nofollow">https://www.andrew.cmu.edu/user/mrainey/papers/heartbeat.pdf</a><p>(2021) Task Parallel Assembly Language for
Uncompromising Parallelism. <a href="https://users.cs.northwestern.edu/~simonec/files/Research/papers/HELIX_PLDI_2021.pdf" rel="nofollow">https://users.cs.northwestern.edu/~simonec/files/Research/pa...</a><p>(2024) Compiling Loop-Based Nested Parallelism for Irregular Workloads. <a href="https://users.cs.northwestern.edu/~simonec/files/Research/papers/HELIX_ASPLOS_2024.pdf" rel="nofollow">https://users.cs.northwestern.edu/~simonec/files/Research/pa...</a><p>(2024) Automatic Parallelism Management. <a href="https://www.cs.cmu.edu/~swestric/24/popl24-par-manage.pdf" rel="nofollow">https://www.cs.cmu.edu/~swestric/24/popl24-par-manage.pdf</a></p>
]]></description><pubDate>Tue, 13 Aug 2024 13:43:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=41235389</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=41235389</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41235389</guid></item><item><title><![CDATA[New comment by shwestrick in "Bend: a high-level language that runs on GPUs (via HVM2)"]]></title><description><![CDATA[
<p>Nowadays 210 is actually parallel! You can run 210-style code using MaPLe (<a href="https://github.com/MPLLang/mpl">https://github.com/MPLLang/mpl</a>) and get competitive performance with respect to C/C++.<p>If you liked 210, you might also like <a href="https://futhark-lang.org/" rel="nofollow">https://futhark-lang.org/</a> which is an ML-family language that compiles to GPU with good performance.</p>
]]></description><pubDate>Fri, 17 May 2024 19:50:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=40393603</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=40393603</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40393603</guid></item><item><title><![CDATA[New comment by shwestrick in "Garbage collection for systems programmers (2023)"]]></title><description><![CDATA[
<p>The two systems have very different tradeoffs.<p>A few things in particular:<p><pre><code>  * Separate compilation vs whole-program compilation. OCaml uses separate compilation and therefore has a very constrained heap object model which makes it possible to get polymorphism across separately compiled modules. In contrast, MaPLe uses whole-program compilation and therefore is able to monomorphize and optimize much more aggressively. Whole-program compilation can be slow for large projects.

  * The multicore OCaml effort was driven by backwards compatibility, especially in terms of performance -- they wanted to ensure that the performance of existing sequential OCaml code would be completely unaffected by the new run-time system.  In contrast, MaPLe focuses on efficiency and scalability for parallel code.

  * Multicore OCaml will let you implement your own scheduler, as a library, on top of coarse-grained threads. In contrast, MaPLe comes with a built-in scheduler, and it's not easy to change it.
</code></pre>
We did a comparison with multicore OCaml in <a href="https://dl.acm.org/doi/10.1145/3591284" rel="nofollow">https://dl.acm.org/doi/10.1145/3591284</a>, and found that MaPLe can be significantly faster, but that comes with all of the tradeoffs above. And, it's a cross-language comparison, so take it with a grain of salt. Our comparison in particular emphasized similar source code, but typically, fast code in OCaml just looks different from fast code in MaPLe. For example, in OCaml, you often need to manually unbox certain data structures to get better memory efficiency (but MaPLe will often do this for you, automatically).<p>By the way, there's a recent effort---led by the compiler team at Jane Street---to make it possible to get automatic data unboxing in OCaml! Some more info here: <a href="https://www.janestreet.com/tech-talks/unboxed-types-for-ocaml/" rel="nofollow">https://www.janestreet.com/tech-talks/unboxed-types-for-ocam...</a></p>
]]></description><pubDate>Sun, 31 Mar 2024 16:37:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=39885729</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=39885729</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39885729</guid></item><item><title><![CDATA[New comment by shwestrick in "Garbage collection for systems programmers (2023)"]]></title><description><![CDATA[
<p>I'm one of the authors of this work -- I can explain a little.<p>"Provably efficient" means that the language provides worst-case performance guarantees.<p>For example in the "Automatic Parallelism Management" paper (<a href="https://dl.acm.org/doi/10.1145/3632880" rel="nofollow">https://dl.acm.org/doi/10.1145/3632880</a>), we develop a compiler and run-time system that can execute extremely fine-grained parallel code without losing performance. (Concretely, imagine tiny tasks of around only 10-100 instructions each.)<p>The key idea is to make sure that any task which is *too tiny* is executed sequentially instead of in parallel. To make this happen, we use a scheduler that runs in the background during execution. It is the scheduler's job to decide on-the-fly which tasks should be sequentialized and which tasks should be "promoted" into actual threads that can run in parallel. Intuitively, each promotion incurs a cost, but also exposes parallelism.<p>In the paper, we present our scheduler and prove a worst-case performance bound. We specifically show that the total overhead of promotion will be at most a small constant factor (e.g., 1% overhead), and also that the theoretical amount of parallelism is unaffected, asymptotically.<p>All of this is implemented in MaPLe (<a href="https://github.com/mpllang/mpl">https://github.com/mpllang/mpl</a>) and you can go play with it now!</p>
]]></description><pubDate>Sun, 31 Mar 2024 16:16:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=39885519</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=39885519</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39885519</guid></item><item><title><![CDATA[New comment by shwestrick in "Recursion Viewer"]]></title><description><![CDATA[
<p>Enjoyed playing with this!<p>N-queens search is another nice recursive example. E.g. call this with nqueens(0, 5, [])<p><pre><code>    function nqueens(i:number, n:number, queens: number[][]) {
      if (i >= n) return 1;

      let count = 0;
      for (let j = 0; j < n; j++) {

        let is_threatened = false;
        for (let k = 0; k < queens.length; k++) {
          let x = queens[k][0];
          let y = queens[k][1];

          if (i == x || j == y || i - j == x - y || i + j == x + y) {
            is_threatened = true;
            break;
          }
        }

        if (is_threatened) continue;

        count += nqueens(i+1, n, Array(Array(i,j)).concat(queens))
      }

      return count;
    }</code></pre></p>
]]></description><pubDate>Tue, 05 Mar 2024 05:27:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=39599772</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=39599772</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39599772</guid></item><item><title><![CDATA[New comment by shwestrick in "Efficient parallelization of an ubiquitous sequential computation"]]></title><description><![CDATA[
<p>On modern multicore hardware this will be memory-bound; the amount of computation per byte is pretty small (just a few arithmetic instructions on average). My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.<p>And yes, definitely, the numerical accuracy thing could be a problem. I suspect it wouldn't be too difficult to work around, but I can't say for sure off the top of my head.</p>
]]></description><pubDate>Fri, 08 Dec 2023 17:16:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=38571438</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=38571438</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38571438</guid></item><item><title><![CDATA[New comment by shwestrick in "Efficient parallelization of an ubiquitous sequential computation"]]></title><description><![CDATA[
<p>It's worth noting that you can solve these linear recurrences, `x(t) = a(t)x(t-1) + b(t)`, using a single parallel prefix sum where the elements are the input tuples `(a(t), b(t))`.<p>The following operator called `combine` works. It's associative over these tuples; the final result will be the final value of the second component. Altogether, this gives you O(n) work and O(log n) span, but using just a single parallel prefix kernel, which may be more efficient in practice.<p><pre><code>    combine ((a1, b1), (a2, b2))  =  (a1 * a2, b1 * a2 + b2)
</code></pre>
For example, here's a C++ implementation: <a href="https://github.com/MPLLang/parallel-ml-bench/blob/main/cpp/linearrec.h">https://github.com/MPLLang/parallel-ml-bench/blob/main/cpp/l...</a>  And, here's an implementation in a functional language: <a href="https://github.com/MPLLang/parallel-ml-bench/blob/main/mpl/bench/linearrec/MkLinearRec.sml">https://github.com/MPLLang/parallel-ml-bench/blob/main/mpl/b...</a><p>I'm pretty sure this generalizes, too, to abstract multiplications and additions in any field (at first glance, seems like it should but I haven't done the formal proof yet).<p>Anyway, it would be interesting to compare this against the solution in the arxiv paper.<p>----<p>EDIT: ah, good, this is already being discussed here: <a href="https://github.com/glassroom/heinsen_sequence/issues/1">https://github.com/glassroom/heinsen_sequence/issues/1</a><p>And, for reference, I learned this algorithm from Guy Blelloch; see Sec 1.4.1 of <a href="https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf" rel="nofollow noreferrer">https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf</a></p>
]]></description><pubDate>Thu, 07 Dec 2023 18:37:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=38560001</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=38560001</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38560001</guid></item><item><title><![CDATA[New comment by shwestrick in "Optimization Techniques for GPU Programming [pdf]"]]></title><description><![CDATA[
<p>Tons and tons of parallel algorithms use prefix sums. Typically the most common use is to compute a collection of offsets in parallel. Some examples:<p>- compact a hash table (i.e., remove the empty slots)<p>- flatten a jagged 2D array<p>- rewrite a dense matrix in compressed-sparse-row (CSR) format</p>
]]></description><pubDate>Thu, 10 Aug 2023 02:50:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=37071347</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=37071347</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37071347</guid></item><item><title><![CDATA[New comment by shwestrick in "Is parallel programming hard, and, if so, what can you do about it?"]]></title><description><![CDATA[
<p>Parallelism is only about performance, that's it. If you need something to go faster, parallelism is an option.<p>Looking into the future, parallelism is one of the only remaining techniques for scaling classical computing. Processors have stopped getting faster. Instead, they're getting fatter (more cores).</p>
]]></description><pubDate>Wed, 14 Jun 2023 13:00:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=36325229</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=36325229</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36325229</guid></item><item><title><![CDATA[New comment by shwestrick in "Everything you always wanted to know about mathematics (2013) [pdf]"]]></title><description><![CDATA[
<p>It's a doctoral degree with a heavy emphasis on pedagogy and teaching. From <a href="https://www.cmu.edu/math/grad/phd/index.html" rel="nofollow">https://www.cmu.edu/math/grad/phd/index.html</a>:<p>> The Doctor of Arts degree shares all requirements and standards with the Ph.D., except with regard to the thesis. The D.A. thesis is not expected to display the sort of original research required for a Ph.D. thesis, but rather to demonstrate an ability to organize, understand, and present mathematical ideas in a scholarly way, usually with sufficient innovation and worth to produce a publishable work. Whenever practical, the department provides D.A. candidates with the opportunity to use materials developed to teach a course. While a typical Ph.D. recipient will seek a position that has a substantial research component, the D.A. recipient will usually seek a position where research is not central.</p>
]]></description><pubDate>Thu, 25 May 2023 13:58:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=36070785</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=36070785</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36070785</guid></item><item><title><![CDATA[New comment by shwestrick in "The Garbage Collection Handbook, 2nd Edition"]]></title><description><![CDATA[
<p>More and more people nowadays are programming at high levels of abstraction. If you're designing the frontend of a website, or making a mobile game, or developing a stock trading algorithm, or whatever else, then you probably don't want to constantly worry about details of memory management...<p>On top of this, GC is necessary for some algorithms. Any data structure with partial sharing (e.g. binary search tree with versioning via path-copying) needs GC to be space-efficient. You could either rely on a built-in GC, or write your own. If you write your own, I think you'll find that it is tedious and error-prone due to memory safety issues.</p>
]]></description><pubDate>Sat, 08 Apr 2023 16:58:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=35495224</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=35495224</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35495224</guid></item><item><title><![CDATA[New comment by shwestrick in "Typst, a new markup-based typesetting system, is now open source"]]></title><description><![CDATA[
<p>in latex, when typesetting math, by default, parentheses (and other brackets) are always a constant height. So if you put something which is taller than one line in between parentheses, it will look strange, with the content sticking out past the parentheses at the top and/or bottom.<p>The fix for this (in latex) it to mark pairs of matching parentheses with the macros `\left` and `\right`, (for example: `\left( \frac{x}{y} \right)`) which informs latex that it should figure out how tall the content between the brackets is, and then resize the parentheses appropriately.</p>
]]></description><pubDate>Tue, 21 Mar 2023 19:43:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=35251483</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=35251483</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35251483</guid></item><item><title><![CDATA[New comment by shwestrick in "Comparing Objective Caml and Standard ML"]]></title><description><![CDATA[
<p>Some of us are still using SML for research and teaching, e.g. <a href="https://github.com/mpllang/mpl">https://github.com/mpllang/mpl</a></p>
]]></description><pubDate>Thu, 16 Feb 2023 05:27:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=34815399</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=34815399</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=34815399</guid></item><item><title><![CDATA[New comment by shwestrick in "Mandatory helmet laws make cyclists less safe"]]></title><description><![CDATA[
<p>That's not a useful perspective. It doesn't matter what the goal of the law is; it matters what effect the law has.</p>
]]></description><pubDate>Mon, 05 Dec 2022 18:32:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=33869672</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=33869672</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=33869672</guid></item><item><title><![CDATA[New comment by shwestrick in "Mandatory helmet laws make cyclists less safe"]]></title><description><![CDATA[
<p>You seem to be interested in this question: "If I bike without a helmet, how much more likely am I to be injured than if I bike with a helmet?". And of course, the answer is that you are safer with a helmet.<p>But the article is interested in a different question: "If I bike, how likely am I to be injured?".<p>This question is very heavily influenced by the ratio of bikes to cars on the road. More bikes leads to lower chance of injury for bicyclists.</p>
]]></description><pubDate>Mon, 05 Dec 2022 18:05:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=33869216</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=33869216</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=33869216</guid></item><item><title><![CDATA[New comment by shwestrick in "On “correct and efficient work-stealing for weak memory models”"]]></title><description><![CDATA[
<p>Private deques can still be very effective for work-stealing, both in theory and practice! This paper comes to mind: <a href="https://hal.inria.fr/hal-00863028/document" rel="nofollow">https://hal.inria.fr/hal-00863028/document</a></p>
]]></description><pubDate>Sat, 08 Oct 2022 14:34:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=33132610</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=33132610</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=33132610</guid></item><item><title><![CDATA[New comment by shwestrick in "Memories: Edinburgh ML to Standard ML"]]></title><description><![CDATA[
<p>MLton also has perhaps the most impressive performance of any existing functional language implementation. It generates code that easily competes with hand-optimized low-level C/C++/whatever.</p>
]]></description><pubDate>Wed, 05 Oct 2022 17:01:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=33098168</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=33098168</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=33098168</guid></item><item><title><![CDATA[New comment by shwestrick in "Race Conditions Can Be Useful for Parallelism"]]></title><description><![CDATA[
<p>When visiting vertices in parallel, there might be multiple potential parents that all attempt to visit the same vertex simultaneously. So, we need a way of picking which parent "wins".</p>
]]></description><pubDate>Sun, 02 Oct 2022 17:04:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=33057843</link><dc:creator>shwestrick</dc:creator><comments>https://news.ycombinator.com/item?id=33057843</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=33057843</guid></item></channel></rss>