Hacker News: shwestrick

New comment by shwestrick in "We shouldn't have needed lockfiles"

shwestrick — Wed, 06 Aug 2025 22:54:43 +0000

I like this example.

The client who didn't notice a difference would probably call it a bugfix.

The client whose software got ever-so-slightly more reliable probably would call it a minor update.

The client whose software previously was loading large files (luckily) without issue would call it major, because now their software just doesn't work anymore.

New comment by shwestrick in "Enum of Arrays"

shwestrick — Sun, 22 Dec 2024 01:31:05 +0000

Worth mentioning that you can always safely switch between AoS and SoA. Either can represent the other; all you've done is transpose the data. The same is not true of AoE/EoA. The AoE [Spam1, Egg1, Spam2, Spam3, Egg2] has no corresponding EoA that can represent it.

What they're actually doing is an AoE => AoEoA transformation: find batches elements with the same tag and reorder the elements so that redundant tags can be eliminated. Essentially, a kind of run-length encoding. It's a nice idea.

New comment by shwestrick in "Spice: Fine-grained parallelism with sub-nanosecond overhead in Zig"

shwestrick — Tue, 13 Aug 2024 13:43:38 +0000

For those curious, this implementation is based on a recent line of research called "heartbeat scheduling" which amortizes the overheads of creating parallelism, essentially accomplishing a kind of dynamic automatic granularity control.

New comment by shwestrick in "Bend: a high-level language that runs on GPUs (via HVM2)"

shwestrick — Fri, 17 May 2024 19:50:56 +0000

Nowadays 210 is actually parallel! You can run 210-style code using MaPLe (https://github.com/MPLLang/mpl) and get competitive performance with respect to C/C++.

If you liked 210, you might also like https://futhark-lang.org/ which is an ML-family language that compiles to GPU with good performance.

New comment by shwestrick in "Garbage collection for systems programmers (2023)"

shwestrick — Sun, 31 Mar 2024 16:37:45 +0000

The two systems have very different tradeoffs.

A few things in particular:

  * Separate compilation vs whole-program compilation. OCaml uses separate compilation and therefore has a very constrained heap object model which makes it possible to get polymorphism across separately compiled modules. In contrast, MaPLe uses whole-program compilation and therefore is able to monomorphize and optimize much more aggressively. Whole-program compilation can be slow for large projects.

  * The multicore OCaml effort was driven by backwards compatibility, especially in terms of performance -- they wanted to ensure that the performance of existing sequential OCaml code would be completely unaffected by the new run-time system.  In contrast, MaPLe focuses on efficiency and scalability for parallel code.

  * Multicore OCaml will let you implement your own scheduler, as a library, on top of coarse-grained threads. In contrast, MaPLe comes with a built-in scheduler, and it's not easy to change it.

We did a comparison with multicore OCaml in https://dl.acm.org/doi/10.1145/3591284, and found that MaPLe can be significantly faster, but that comes with all of the tradeoffs above. And, it's a cross-language comparison, so take it with a grain of salt. Our comparison in particular emphasized similar source code, but typically, fast code in OCaml just looks different from fast code in MaPLe. For example, in OCaml, you often need to manually unbox certain data structures to get better memory efficiency (but MaPLe will often do this for you, automatically).

By the way, there's a recent effort---led by the compiler team at Jane Street---to make it possible to get automatic data unboxing in OCaml! Some more info here: https://www.janestreet.com/tech-talks/unboxed-types-for-ocam...

New comment by shwestrick in "Garbage collection for systems programmers (2023)"

shwestrick — Sun, 31 Mar 2024 16:16:06 +0000

I'm one of the authors of this work -- I can explain a little.

"Provably efficient" means that the language provides worst-case performance guarantees.

For example in the "Automatic Parallelism Management" paper (https://dl.acm.org/doi/10.1145/3632880), we develop a compiler and run-time system that can execute extremely fine-grained parallel code without losing performance. (Concretely, imagine tiny tasks of around only 10-100 instructions each.)

The key idea is to make sure that any task which is *too tiny* is executed sequentially instead of in parallel. To make this happen, we use a scheduler that runs in the background during execution. It is the scheduler's job to decide on-the-fly which tasks should be sequentialized and which tasks should be "promoted" into actual threads that can run in parallel. Intuitively, each promotion incurs a cost, but also exposes parallelism.

In the paper, we present our scheduler and prove a worst-case performance bound. We specifically show that the total overhead of promotion will be at most a small constant factor (e.g., 1% overhead), and also that the theoretical amount of parallelism is unaffected, asymptotically.

All of this is implemented in MaPLe (https://github.com/mpllang/mpl) and you can go play with it now!

New comment by shwestrick in "Recursion Viewer"

shwestrick — Tue, 05 Mar 2024 05:27:20 +0000

Enjoyed playing with this!

N-queens search is another nice recursive example. E.g. call this with nqueens(0, 5, [])

    function nqueens(i:number, n:number, queens: number[][]) {
      if (i >= n) return 1;

      let count = 0;
      for (let j = 0; j < n; j++) {

        let is_threatened = false;
        for (let k = 0; k < queens.length; k++) {
          let x = queens[k][0];
          let y = queens[k][1];

          if (i == x || j == y || i - j == x - y || i + j == x + y) {
            is_threatened = true;
            break;
          }
        }

        if (is_threatened) continue;

        count += nqueens(i+1, n, Array(Array(i,j)).concat(queens))
      }

      return count;
    }

New comment by shwestrick in "Efficient parallelization of an ubiquitous sequential computation"

shwestrick — Fri, 08 Dec 2023 17:16:35 +0000

On modern multicore hardware this will be memory-bound; the amount of computation per byte is pretty small (just a few arithmetic instructions on average). My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

And yes, definitely, the numerical accuracy thing could be a problem. I suspect it wouldn't be too difficult to work around, but I can't say for sure off the top of my head.

New comment by shwestrick in "Efficient parallelization of an ubiquitous sequential computation"

shwestrick — Thu, 07 Dec 2023 18:37:02 +0000

It's worth noting that you can solve these linear recurrences, `x(t) = a(t)x(t-1) + b(t)`, using a single parallel prefix sum where the elements are the input tuples `(a(t), b(t))`.

The following operator called `combine` works. It's associative over these tuples; the final result will be the final value of the second component. Altogether, this gives you O(n) work and O(log n) span, but using just a single parallel prefix kernel, which may be more efficient in practice.

    combine ((a1, b1), (a2, b2))  =  (a1 * a2, b1 * a2 + b2)

For example, here's a C++ implementation: https://github.com/MPLLang/parallel-ml-bench/blob/main/cpp/l... And, here's an implementation in a functional language: https://github.com/MPLLang/parallel-ml-bench/blob/main/mpl/b...

I'm pretty sure this generalizes, too, to abstract multiplications and additions in any field (at first glance, seems like it should but I haven't done the formal proof yet).

Anyway, it would be interesting to compare this against the solution in the arxiv paper.

----

EDIT: ah, good, this is already being discussed here: https://github.com/glassroom/heinsen_sequence/issues/1

And, for reference, I learned this algorithm from Guy Blelloch; see Sec 1.4.1 of https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf

New comment by shwestrick in "Optimization Techniques for GPU Programming [pdf]"

shwestrick — Thu, 10 Aug 2023 02:50:08 +0000

Tons and tons of parallel algorithms use prefix sums. Typically the most common use is to compute a collection of offsets in parallel. Some examples:

- compact a hash table (i.e., remove the empty slots)

- flatten a jagged 2D array

- rewrite a dense matrix in compressed-sparse-row (CSR) format

New comment by shwestrick in "Is parallel programming hard, and, if so, what can you do about it?"

shwestrick — Wed, 14 Jun 2023 13:00:46 +0000

Parallelism is only about performance, that's it. If you need something to go faster, parallelism is an option.

Looking into the future, parallelism is one of the only remaining techniques for scaling classical computing. Processors have stopped getting faster. Instead, they're getting fatter (more cores).

New comment by shwestrick in "Everything you always wanted to know about mathematics (2013) [pdf]"

shwestrick — Thu, 25 May 2023 13:58:05 +0000

It's a doctoral degree with a heavy emphasis on pedagogy and teaching. From https://www.cmu.edu/math/grad/phd/index.html:

> The Doctor of Arts degree shares all requirements and standards with the Ph.D., except with regard to the thesis. The D.A. thesis is not expected to display the sort of original research required for a Ph.D. thesis, but rather to demonstrate an ability to organize, understand, and present mathematical ideas in a scholarly way, usually with sufficient innovation and worth to produce a publishable work. Whenever practical, the department provides D.A. candidates with the opportunity to use materials developed to teach a course. While a typical Ph.D. recipient will seek a position that has a substantial research component, the D.A. recipient will usually seek a position where research is not central.

New comment by shwestrick in "The Garbage Collection Handbook, 2nd Edition"

shwestrick — Sat, 08 Apr 2023 16:58:43 +0000

More and more people nowadays are programming at high levels of abstraction. If you're designing the frontend of a website, or making a mobile game, or developing a stock trading algorithm, or whatever else, then you probably don't want to constantly worry about details of memory management...

On top of this, GC is necessary for some algorithms. Any data structure with partial sharing (e.g. binary search tree with versioning via path-copying) needs GC to be space-efficient. You could either rely on a built-in GC, or write your own. If you write your own, I think you'll find that it is tedious and error-prone due to memory safety issues.

New comment by shwestrick in "Typst, a new markup-based typesetting system, is now open source"

shwestrick — Tue, 21 Mar 2023 19:43:45 +0000

in latex, when typesetting math, by default, parentheses (and other brackets) are always a constant height. So if you put something which is taller than one line in between parentheses, it will look strange, with the content sticking out past the parentheses at the top and/or bottom.

The fix for this (in latex) it to mark pairs of matching parentheses with the macros `\left` and `\right`, (for example: `\left( \frac{x}{y} \right)`) which informs latex that it should figure out how tall the content between the brackets is, and then resize the parentheses appropriately.

New comment by shwestrick in "Comparing Objective Caml and Standard ML"

shwestrick — Thu, 16 Feb 2023 05:27:22 +0000

Some of us are still using SML for research and teaching, e.g. https://github.com/mpllang/mpl

New comment by shwestrick in "Mandatory helmet laws make cyclists less safe"

shwestrick — Mon, 05 Dec 2022 18:32:15 +0000

That's not a useful perspective. It doesn't matter what the goal of the law is; it matters what effect the law has.

New comment by shwestrick in "Mandatory helmet laws make cyclists less safe"

shwestrick — Mon, 05 Dec 2022 18:05:06 +0000

You seem to be interested in this question: "If I bike without a helmet, how much more likely am I to be injured than if I bike with a helmet?". And of course, the answer is that you are safer with a helmet.

But the article is interested in a different question: "If I bike, how likely am I to be injured?".

This question is very heavily influenced by the ratio of bikes to cars on the road. More bikes leads to lower chance of injury for bicyclists.

New comment by shwestrick in "On “correct and efficient work-stealing for weak memory models”"

shwestrick — Sat, 08 Oct 2022 14:34:10 +0000

Private deques can still be very effective for work-stealing, both in theory and practice! This paper comes to mind: https://hal.inria.fr/hal-00863028/document

New comment by shwestrick in "Memories: Edinburgh ML to Standard ML"

shwestrick — Wed, 05 Oct 2022 17:01:36 +0000

MLton also has perhaps the most impressive performance of any existing functional language implementation. It generates code that easily competes with hand-optimized low-level C/C++/whatever.

New comment by shwestrick in "Race Conditions Can Be Useful for Parallelism"

shwestrick — Sun, 02 Oct 2022 17:04:02 +0000

When visiting vertices in parallel, there might be multiple potential parents that all attempt to visit the same vertex simultaneously. So, we need a way of picking which parent "wins".