<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: stephencanon</title><link>https://news.ycombinator.com/user?id=stephencanon</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 08 Apr 2026 04:39:38 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=stephencanon" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by stephencanon in "The Oxford Comma – Why and Why Not (2024)"]]></title><description><![CDATA[
<p>"I'd like to thank my mother, Ayn Rand, and God" is the usual example.<p>Yes, you can reorder the list to remove the ambiguity, but sometimes the order of the list matters. The serial comma should be used when necessary to remove ambiguity, and not used when it introduces ambiguity. Rewrite the sentence when necessary. Worth noting that this is the Oxford University Press's own style rule!</p>
]]></description><pubDate>Thu, 26 Mar 2026 19:37:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47534707</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47534707</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47534707</guid></item><item><title><![CDATA[New comment by stephencanon in "How many branches can your CPU predict?"]]></title><description><![CDATA[
<p>That would fall under "more constrained", due to process limits.</p>
]]></description><pubDate>Thu, 19 Mar 2026 21:22:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=47446366</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47446366</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47446366</guid></item><item><title><![CDATA[New comment by stephencanon in "How many branches can your CPU predict?"]]></title><description><![CDATA[
<p>Enlarging a branch predictor requires area and timing tradeoffs. CPU designers have to balance branch predictor improvements against other improvements they could make with the same area and timing resources. What this tells you is that either Intel is more constrained for one reason or another, or Intel's designers think that they net larger wins by deploying those resources elsewhere in the CPU (which might be because they have identified larger opportunities for improvement, or because they are basing their decision making on a different sample of software, or both).</p>
]]></description><pubDate>Thu, 19 Mar 2026 14:10:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=47439789</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47439789</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47439789</guid></item><item><title><![CDATA[New comment by stephencanon in "What every computer scientist should know about floating-point arithmetic (1991) [pdf]"]]></title><description><![CDATA[
<p>Your students should be able to figure out if a computation is exact or not, because they should understand binary representation of numbers.</p>
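<p>A quick way to check exactness (Python sketch; <code>Fraction</code> gives the exact rational value of both the decimal literal and the stored float):</p>

```python
from fractions import Fraction

def is_exact_in_binary(x: float, decimal_literal: str) -> bool:
    # A decimal literal is exactly representable iff its rational value
    # equals the rational value of the float it rounds to.
    return Fraction(decimal_literal) == Fraction(x)

# Dyadic rationals like 0.5 = 2^-1 and 0.375 = 3 * 2^-3 are exact;
# 0.1 = 1/10 has no finite binary expansion, so it gets rounded.
print(is_exact_in_binary(0.5, "0.5"))      # True
print(is_exact_in_binary(0.375, "0.375"))  # True
print(is_exact_in_binary(0.1, "0.1"))      # False
```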
]]></description><pubDate>Mon, 16 Mar 2026 15:49:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47400611</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47400611</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47400611</guid></item><item><title><![CDATA[New comment by stephencanon in "Faster asin() was hiding in plain sight"]]></title><description><![CDATA[
<p>You can often eke something out for order-four, depending on uArch details. But basically yeah.</p>
]]></description><pubDate>Thu, 12 Mar 2026 00:01:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=47344339</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47344339</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47344339</guid></item><item><title><![CDATA[New comment by stephencanon in "Faster asin() was hiding in plain sight"]]></title><description><![CDATA[
<p>For throughput-dominated contexts, evaluation via Horner's rule does very well because it minimizes register pressure and the number of operations required. But the latency can be relatively high, as you note.<p>There are a few good general options to extract more ILP for latency-dominated contexts, though all of them cost additional register pressure and usually some additional operations; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but factoring can also let you synthesize selected extra-precise coefficients naturally without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.<p>There's no single "best" option for optimizing latency; when I wrote math library functions day-to-day, we just built a table of the optimal evaluation sequence for each polynomial order up to 8 or so on each microarchitecture, and grabbed the one we needed unless special constraints required a different choice.</p>
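<p>A minimal sketch of the two schemes (Python, illustrative only; production code uses FMAs and a fixed unrolled sequence per order):</p>

```python
def horner(coeffs, x):
    # coeffs[i] is the coefficient of x**i. One serial dependency chain,
    # one multiply-add per coefficient, minimal register pressure.
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def estrin(coeffs, x):
    # Pair adjacent coefficients, then combine pairs with x**2, x**4, ...
    # Halves the dependency-chain length each level at the cost of extra
    # operations and more live partial results.
    terms = list(coeffs)
    p = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)
        terms = [terms[i] + terms[i + 1] * p for i in range(0, len(terms), 2)]
        p = p * p
    return terms[0]

c = [1.0, -0.5, 0.25, -0.125, 0.0625, -0.03125, 0.015625, -0.0078125]
print(horner(c, 0.5), estrin(c, 0.5))  # same value (up to rounding)
```

<p>Same polynomial, same result; the difference is whether the critical path is n dependent operations (Horner) or about log2(n) levels (Estrin), which is exactly the latency/throughput trade described above.</p>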
]]></description><pubDate>Wed, 11 Mar 2026 19:52:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=47340404</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47340404</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47340404</guid></item><item><title><![CDATA[New comment by stephencanon in "Faster asin() was hiding in plain sight"]]></title><description><![CDATA[
<p>When Intel specced the rsqrt[ps]s and rcp[ps]s instructions ~30 years ago, they didn't fully specify their behavior. They just said their relative error is "smaller than 1.5 * 2⁻¹²," which someone thought was very clever because it gave them leeway to use tables or piecewise linear approximations or digit-by-digit computation or whatever was best suited to future processors. Since these are not IEEE 754 correctly-rounded operations, and there was (by definition) no existing software that used them, this was "fine".<p>And mostly it has been OK, except for some cases like games or simulations that want to get bitwise identical results across HW, which (if they're lucky) just don't use these operations or (if they're unlucky) use them and have to handle mismatches somehow. Compilers never generate these operations implicitly unless you're compiling with some sort of fast-math flag, so you mostly only get to them by explicitly using an intrinsic, and in theory you know what you're signing up for if you do that.<p>However, this did make them unusable for some scenarios where you would otherwise like to use them, so a bunch of graphics and scientific computing and math library developers said "please fully specify these operations next time" and now NEON/SVE and AVX512 have fully-specified reciprocal estimates,¹ which solves the problem unless you have to interoperate between x86 and ARM.<p>¹ e.g. Intel "specifies" theirs here: <a href="https://www.intel.com/content/www/us/en/developer/articles/code-sample/reference-implementations-for-ia-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.html" rel="nofollow">https://www.intel.com/content/www/us/en/developer/articles/c...</a><p>ARM's is a little more readable: <a href="https://developer.arm.com/documentation/ddi0596/2021-03/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.RecipSqrtEstimate.2" rel="nofollow">https://developer.arm.com/documentation/ddi0596/2021-03/Shar...</a></p>
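<p>A rough software model of that contract (Python; the 12-bit rounding is a stand-in for whatever table or piecewise fit the hardware actually uses, and the Newton-Raphson step shows why a loosely specified estimate is still useful):</p>

```python
import struct

def rsqrt_estimate(x: float) -> float:
    # Stand-in for a hardware estimate: round 1/sqrt(x) to 12 significant
    # bits, which keeps the relative error below the 1.5 * 2**-12 bound.
    y = x ** -0.5
    bits = struct.unpack("<Q", struct.pack("<d", y))[0]
    # Round the double's 52-bit significand to its top 11 explicit bits.
    bits = (bits + (1 << 40)) & ~((1 << 41) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def refined(x: float) -> float:
    # One Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y), roughly
    # doubles the number of correct bits of the estimate.
    y = rsqrt_estimate(x)
    return y * (1.5 - 0.5 * x * y * y)

x = 2.0
exact = x ** -0.5
print(abs(rsqrt_estimate(x) - exact) / exact)  # on the order of 2**-13
print(abs(refined(x) - exact) / exact)         # on the order of 2**-25
```

<p>The point of the loose spec is that any implementation meeting the error bound works equally well as a seed for the refinement step; the cost is that the exact low bits differ between implementations.</p>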
]]></description><pubDate>Wed, 11 Mar 2026 18:02:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=47338970</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47338970</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47338970</guid></item><item><title><![CDATA[New comment by stephencanon in "Faster asin() was hiding in plain sight"]]></title><description><![CDATA[
<p>For the asinf libcall on macOS/x86, my former colleague Eric Postpischil invented the novel (at least at the time, I believe) technique of using a Remez-optimized refinement polynomial following rsqrtss instead of the standard Newton-Raphson iteration coefficients, which allowed him to squeeze out just enough extra precision to make the function achieve sub-ulp accuracy. One of my favorite tricks.<p>We didn't carry that algorithm forward to arm64, sadly, because Apple's architects made fsqrt fast enough that it wasn't worth it in scalar contexts.</p>
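<p>The flavor of the trick, as a toy sketch (Python; these are not the production coefficients, just an illustration that fitting the refinement line over the estimate's actual error interval beats the textbook Newton-Raphson coefficients):</p>

```python
import math

# After an rsqrt estimate y with relative error |e| <= 2**-12, the quantity
# r = x*y*y lands in a narrow interval around 1, and the exact result is
# y / sqrt(r). Newton-Raphson refines with the Taylor line 1.5 - 0.5*r; a
# minimax (Remez-style) line fitted over the actual interval does better.
E = 2.0 ** -12
a, b = (1 - E) ** 2, (1 + E) ** 2
f = lambda r: 1.0 / math.sqrt(r)

# Best linear fit of a convex f on [a, b]: secant slope, with the offset
# chosen so the error equioscillates (endpoints vs interior tangent point).
m = (f(b) - f(a)) / (b - a)
t = (-2.0 * m) ** (-2.0 / 3.0)          # where f'(t) = m; f'(r) = -0.5*r**-1.5
c0 = (f(a) + f(t) - m * (a + t)) / 2.0

def max_err(p):
    worst = 0.0
    for i in range(10001):
        r = a + (b - a) * i / 10000.0
        worst = max(worst, abs(p(r) - f(r)))
    return worst

taylor = lambda r: 1.5 - 0.5 * r        # standard Newton-Raphson step
minimax = lambda r: c0 + m * r          # fitted over the estimate's range
print(max_err(taylor), max_err(minimax))
```

<p>For a linear correction the win is only about a factor of two in max error; the production version applied the same idea to a higher-order refinement polynomial, where those fractions of a bit were exactly what sub-ulp accuracy needed.</p>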
]]></description><pubDate>Wed, 11 Mar 2026 16:19:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47337582</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47337582</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47337582</guid></item><item><title><![CDATA[New comment by stephencanon in "Faster asin() was hiding in plain sight"]]></title><description><![CDATA[
<p>Ideally either one is just a library call to generate the coefficients. Remez can get into trouble near the endpoints of the interval for asin and require a little bit of manual intervention, however.</p>
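<p>For reference, a pure-Python sketch of the "library call" route, using Chebyshev interpolation as a near-minimax stand-in for a full Remez exchange (fitting asin on [-0.5, 0.5] sidesteps the endpoint trouble, since its derivative blows up at ±1):</p>

```python
import math

def cheb_fit(f, n, lo, hi):
    # Interpolate f at n Chebyshev nodes mapped to [lo, hi]; the result is
    # close to (within a small factor of) the true minimax polynomial.
    nodes = [math.cos(math.pi * (j + 0.5) / n) for j in range(n)]
    fvals = [f(0.5 * (hi - lo) * t + 0.5 * (hi + lo)) for t in nodes]
    coeffs = [(2.0 / n) * sum(fvals[j] * math.cos(k * math.pi * (j + 0.5) / n)
                              for j in range(n)) for k in range(n)]
    coeffs[0] *= 0.5
    def p(x):
        # Clenshaw recurrence on the Chebyshev series.
        t = (2.0 * x - (hi + lo)) / (hi - lo)
        b1 = b2 = 0.0
        for c in reversed(coeffs[1:]):
            b1, b2 = 2.0 * t * b1 - b2 + c, b1
        return t * b1 - b2 + coeffs[0]
    return p

p = cheb_fit(math.asin, 10, -0.5, 0.5)
err = max(abs(p(x / 1000.0) - math.asin(x / 1000.0))
          for x in range(-500, 501))
print(err)  # small max error on the restricted interval
```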
]]></description><pubDate>Wed, 11 Mar 2026 16:12:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47337506</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47337506</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47337506</guid></item><item><title><![CDATA[New comment by stephencanon in "Faster asin() was hiding in plain sight"]]></title><description><![CDATA[
<p>Newer rsqrt approximations (ARM NEON and SVE, and the AVX512F approximations on x86) make the behavior architectural so this is somewhat less of a problem (it still varies between _architectures_, however).</p>
]]></description><pubDate>Wed, 11 Mar 2026 16:07:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=47337427</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47337427</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47337427</guid></item><item><title><![CDATA[New comment by stephencanon in "Faster asin() was hiding in plain sight"]]></title><description><![CDATA[
<p>These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as seen by the fact that Apple's asin is only a couple percent slower and sub-ulp accurate (<a href="https://members.loria.fr/PZimmermann/papers/accuracy.pdf" rel="nofollow">https://members.loria.fr/PZimmermann/papers/accuracy.pdf</a>). I would expect to get similar performance on non-Apple x86 using Intel's math library, which does not seem to have been measured, and significantly better performance while preserving accuracy using a vectorized library call.<p>The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.<p>Also worth noting that it's possible to go faster without further loss of accuracy--the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.</p>
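<p>For anyone wanting to reproduce this kind of measurement, a sketch of ulp-error accounting (Python; conventions vary, this one measures error in units of the float32 ulp at the reference value):</p>

```python
import math

def ulp32(x: float) -> float:
    # Spacing of float32 values in the binade containing x.
    _, e = math.frexp(abs(x) if x else 1.0)
    return math.ldexp(1.0, e - 24)     # float32 has a 24-bit significand

def ulp_error(approx, reference, x):
    return abs(approx(x) - reference(x)) / ulp32(reference(x))

# Toy approximation asin(x) ~ x: about an ulp near zero, hopeless by 0.5.
worst = max(ulp_error(lambda v: v, math.asin, i / 1000.0)
            for i in range(1, 500))
print(worst)  # hundreds of thousands of ulps at the top of the range
```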
]]></description><pubDate>Wed, 11 Mar 2026 15:38:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=47337022</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47337022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47337022</guid></item><item><title><![CDATA[New comment by stephencanon in "Yann LeCun's AI startup raises $1B in Europe's largest ever seed round"]]></title><description><![CDATA[
<p>Yann is definitely better known outside of academia. Inside academia, it's going to depend a lot on your specific background and how old you are.</p>
]]></description><pubDate>Tue, 10 Mar 2026 13:08:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=47322787</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47322787</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47322787</guid></item><item><title><![CDATA[New comment by stephencanon in "BMW Group to deploy humanoid robots in production in Germany for the first time"]]></title><description><![CDATA[
<p>Mechanically, you're probably right, but the screen-centric controls of the newer generation are _awful_ by comparison to the F generation's physical buttons and dials (this isn't BMW though, it's the whole industry).<p>My wife and I both have F31s, which we will drive until we can no longer source replacement parts unless the industry comes to its senses first (unlikely). Any time we've ever looked at plausible replacements, the screen-based controls are an immediate hard no.</p>
]]></description><pubDate>Thu, 05 Mar 2026 13:37:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=47261419</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47261419</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47261419</guid></item><item><title><![CDATA[New comment by stephencanon in "BMW Group to deploy humanoid robots in production in Germany for the first time"]]></title><description><![CDATA[
<p>Our (~2015) 3-series controls are just about perfect. Where they differ from Honda/Toyota's controls that I am also very familiar with, they're noticeably better now that I'm familiar with them. Everything is really well thought-out.<p>Of course, now they (and almost every other manufacturer) have followed Tesla off the cliff and made everything a screen, so the current generation cars have abysmal controls.</p>
]]></description><pubDate>Thu, 05 Mar 2026 13:32:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47261381</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=47261381</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47261381</guid></item><item><title><![CDATA[New comment by stephencanon in "Schubfach: The smallest floating point double-to-string impleme"]]></title><description><![CDATA[
<p>Schubfach's table is quite large compared to some alternatives with similar performance characteristics. swiftDtoa's code and tables combined are smaller than just Schubfach's table in the linked implementation. Ryu and Dragonbox are larger than swiftDtoa, but also use smaller tables than Schubfach, IIRC.<p>If I$ is all you care about, then table size may not matter, but for constrained systems, other algorithms in general, and swiftDtoa in particular, may be better choices.</p>
]]></description><pubDate>Thu, 04 Dec 2025 13:33:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46147484</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=46147484</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46147484</guid></item><item><title><![CDATA[New comment by stephencanon in "DEC64: Decimal Floating Point (2020)"]]></title><description><![CDATA[
<p>IEEE 754 is a floating point standard. It has a few warts that would be nice to fix if we had tabula rasa, but on the whole is one of the most successful standards anywhere. It defines a set of binary and decimal types and operations that make defensible engineering tradeoffs and are used across all sorts of software and hardware with great effect. In the places where better choices might be made knowing what we know today, there are historical reasons why different choices were made in the past.<p>DEC64 is just some bullshit one dude made up, and has nothing to do with “floating-point standards.”</p>
]]></description><pubDate>Mon, 10 Nov 2025 11:52:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=45875036</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=45875036</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45875036</guid></item><item><title><![CDATA[New comment by stephencanon in "Implicit ODE solvers are not universally more robust than explicit ODE solvers"]]></title><description><![CDATA[
<p>The orbital example where BDF loses momentum is really about the difference between a second-order method (BDF2) and a fourth-order method (RK), rather than explicit vs implicit (but: no method with order > 2 can be A-stable; since the whole point of implicit methods is to achieve stability, the higher-order BDF formulas are relatively niche).<p>There are whole families of _symplectic_ integrators that conserve physical quantities and are much more suitable for this sort of problem than either option discussed. Even a low-order symplectic method will conserve momentum on an example like this.</p>
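<p>A tiny demonstration (Python; symplectic Euler is only first order, yet on a Kepler orbit it conserves angular momentum exactly up to roundoff, while explicit Euler of the same order drifts badly):</p>

```python
def accel(q):
    # Inverse-square central force with unit GM: a = -q / |q|^3.
    r3 = (q[0] ** 2 + q[1] ** 2) ** 1.5
    return (-q[0] / r3, -q[1] / r3)

def explicit_euler(q, v, h):
    a = accel(q)
    return (q[0] + h * v[0], q[1] + h * v[1]), \
           (v[0] + h * a[0], v[1] + h * a[1])

def symplectic_euler(q, v, h):
    # Update v from the old position, then q from the *new* velocity.
    a = accel(q)
    v = (v[0] + h * a[0], v[1] + h * a[1])
    return (q[0] + h * v[0], q[1] + h * v[1]), v

def angular_momentum(q, v):
    return q[0] * v[1] - q[1] * v[0]

results = {}
for stepper in (explicit_euler, symplectic_euler):
    q, v = (1.0, 0.0), (0.0, 1.0)   # circular orbit, L = 1
    for _ in range(10000):
        q, v = stepper(q, v, 0.01)
    results[stepper.__name__] = angular_momentum(q, v)
    print(stepper.__name__, results[stepper.__name__])
```

<p>The conservation falls out of the structure of the update, not its accuracy: with v updated from the old q, the change in q × v is h · q × a(q), which vanishes for any central force.</p>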
]]></description><pubDate>Tue, 16 Sep 2025 15:58:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=45264007</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=45264007</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45264007</guid></item><item><title><![CDATA[New comment by stephencanon in "Pontevedra, Spain declares its entire urban area a "reduced traffic zone""]]></title><description><![CDATA[
<p>> Where you are not under any circumstances can be robbed by a random person on a street.<p>I will be very surprised if there's anywhere in the world where the expected loss from being robbed on the street while walking exceeds the expected loss from being in a car accident while driving.<p>Getting in a car is by far the most dangerous thing most people do routinely.</p>
]]></description><pubDate>Wed, 10 Sep 2025 13:39:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=45197491</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=45197491</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45197491</guid></item><item><title><![CDATA[New comment by stephencanon in "Who Invented Backpropagation?"]]></title><description><![CDATA[
<p>I don't think most people think to do either direction by hand; it's all just matrix multiplication, you can multiply them in whatever order makes it easier.</p>
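<p>Concretely (Python sketch; for a chain of linear maps y = C(B(Ax)), the gradient dy/dx is the product CBA, and "forward vs reverse" is just which way you associate it):</p>

```python
def matmul(X, Y):
    # Plain dense matrix product, list-of-rows representation.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [2.0, 0.0]]
C = [[1.0, 1.0], [0.0, 2.0]]

forward = matmul(C, matmul(B, A))   # C(BA): combine right to left
reverse = matmul(matmul(C, B), A)   # (CB)A: combine left to right
print(forward == reverse)           # same Jacobian either way
```

<p>The association changes the operation count (which is why reverse mode wins when the output is scalar), not the answer.</p>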
]]></description><pubDate>Mon, 18 Aug 2025 17:05:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=44942901</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=44942901</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44942901</guid></item><item><title><![CDATA[New comment by stephencanon in "White Mountain Direttissima"]]></title><description><![CDATA[
<p>All 48 peaks on the AMC White Mountains 4000-footers¹ list in one continuous trek (no driving/shuttling/etc between trailheads).<p>¹ this list is outdated vis-a-vis modern mapping and includes at least one peak shorter than 4000 feet (Tecumseh) and omits at least one peak that should qualify per the rules (Guyot), but if the list were updated they would still have completed the direttissima, since they passed over Guyot on the way to the Bonds (dropping Tecumseh could only make the direttissima easier, but I'm not sure it makes much of a difference; it's been a decade or so since I hiked that section of the Whites).<p>As an aside, that day 5 from Wildcat to Cabot is absolutely brutal even if you're fresh, to say nothing of having already covered 180 miles in the previous four days.</p>
]]></description><pubDate>Tue, 12 Aug 2025 00:07:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=44870897</link><dc:creator>stephencanon</dc:creator><comments>https://news.ycombinator.com/item?id=44870897</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44870897</guid></item></channel></rss>