Hacker News: aengelke

New comment by aengelke in "Better JIT for Postgres"

aengelke — Sun, 08 Mar 2026 19:29:44 +0000

I'm a bit late, but: Umbra hasn't used AsmJIT for many years; it was too slow.

New comment by aengelke in "Better JIT for Postgres"

aengelke — Wed, 04 Mar 2026 13:08:39 +0000

That's not generally correct. Compile-time is a concern for several databases.

New comment by aengelke in "Better JIT for Postgres"

aengelke — Wed, 04 Mar 2026 13:05:55 +0000

> It's very difficult to do low-latency queries if you cannot cache the compiled code

This is not too difficult; it just requires a different execution style. Salesforce's Hyper, for example, relies heavily on JIT compilation, as does Umbra [1], which some people regard as one of the fastest databases right now. Umbra doesn't cache any IR or compiled code and still has an extremely low start-up latency; an interpreter exists but is practically never used.

Postgres is very robust and very powerful, but simply not designed for fast execution of queries.

Disclosure: I work in the group that develops Umbra.

[1]: https://umbra-db.com/

New comment by aengelke in "Apple has locked my Apple ID, and I have no recourse. A plea for help"

aengelke — Sat, 13 Dec 2025 07:21:32 +0000

> What's the rationale?

Gift cards are used by phishers. In our institution, we routinely get personalized spam mails (in the name of the recipient's group lead, sent via GMail -- this is not low-effort) that ask whether the recipient is available and, when someone (accidentally) responds, ask for Apple gift cards.

New comment by aengelke in "Addressing the adding situation"

aengelke — Tue, 02 Dec 2025 16:26:22 +0000

I fully agree, but:

> these are the string instructions like REP MOVSB

AArch64 nowadays has somewhat similar CPY* and SET* instructions. Does that make AArch64 CISC? :-) (Maybe REP SCASB/CMPSB/LODSB (the latter being particularly useless) is a better example.)

New comment by aengelke in "Addressing the adding situation"

aengelke — Tue, 02 Dec 2025 16:20:31 +0000

> LEA happens to be the unique instruction where the memory operand is not dereferenced

Not quite unique: the now-deprecated Intel MPX instructions had similar semantics, e.g. BNDCU or BNDMK. BNDLDX/BNDSTX are even weirder as they don't compute the address as specified but treat the index part of the memory operand separately.

New comment by aengelke in "FEX-emu – Run x86 applications on ARM64 Linux devices"

aengelke — Fri, 21 Nov 2025 12:41:56 +0000

Been there, done that during my PhD (code: [1]). Works reasonably well, except for compile times (for which I implemented a caching strategy). However, due to calling conventions, using LLVM isn't going to give the best possible performance. Some features like signal handling are extremely hard to implement with LLVM (so I didn't). Although the overall performance results were good, it's not an approach that I could strongly recommend.

[1]: https://github.com/aengelke/instrew

New comment by aengelke in "Encoding x86 Instructions"

aengelke — Thu, 30 Oct 2025 15:11:39 +0000

Actually, Arm nowadays describes the ISA as a load-store architecture. The RISC vs. CISC debate is, in my opinion, pretty pointless by now, and I'd prefer if we just stopped using these words to describe ISAs.

New comment by aengelke in "Encoding x86 Instructions"

aengelke — Wed, 29 Oct 2025 20:53:14 +0000

The same site hosts [1], but that's not nearly as nice as the 32-bit version. It's also a bit outdated.

[1]: https://www-user.tu-chemnitz.de/~heha/hs/chm/x86.chm/x64.htm

New comment by aengelke in "Encoding x86 Instructions"

aengelke — Wed, 29 Oct 2025 19:10:54 +0000

> I’d suggest starting with arm

I agree: AArch64 is a nice instruction set to learn. (Source: I taught ARMv7, AArch64, x86-64 to first-year students in the past.)

> how simple instruction encoding is on arm64

Having written encoders, decoders, and compilers for AArch64 and x86-64, I disagree. While AArch64 is, in my opinion, very well designed (also better than RISC-V), it's certainly not simple. Here are some of my favorite complexities:

- Many instructions have (sometimes very) different encodings. While x86 has a more complex encoding structure, most encodings follow the same structure and are therefore remarkably similar.

- A huge number of instruction operand types: memory + register, memory + unsigned scaled offset, memory + signed offset, optionally with pre/post-increment, but every instruction supports a different subset; vector, vector element, vector table, vector table element; sometimes a general-purpose register field encodes the stack pointer, sometimes the zero register; various immediate encodings; ...

- Logical immediate encoding. Clever, but also very complex. (To be sure that I implemented the decoding correctly, I brute-force test all inputs...)

- Register constraints: MUL (by element) with 16-bit integers has a register constraint on the lowest 16 registers. CASP requires an even-numbered register. LD64B requires an even-numbered register less than 24 (it writes Xt..Xt+7).

- Many more instructions: AArch64 SIMD (even excluding SVE) has more instructions than x86 including up to AVX-512. SVE/SME takes this to another level.
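
To make the logical-immediate bullet above concrete, here is a minimal sketch of the decoding in Python. It follows the DecodeBitMasks pseudocode from the Arm Architecture Reference Manual as best understood here. The function name decode_logical_imm and its return convention (None for reserved encodings) are made up for illustration. This is not the code from the encoders/decoders mentioned above.

```python
from typing import Optional


def decode_logical_imm(n: int, immr: int, imms: int, datasize: int = 64) -> Optional[int]:
    """Decode an AArch64 logical immediate (N:immr:imms) into a bit mask.

    Returns None for reserved encodings. Hypothetical helper, for illustration.
    """
    if datasize not in (32, 64):
        raise ValueError("datasize must be 32 or 64")
    if datasize == 32 and n != 0:
        return None  # N=1 selects a 64-bit element, reserved for 32-bit operations

    # The element size is given by the highest set bit of N:NOT(imms).
    combined = (n << 6) | (~imms & 0x3F)
    length = combined.bit_length() - 1
    if length < 1:
        return None  # no element size smaller than 2 bits exists

    esize = 1 << length          # element size in bits: 2, 4, ..., 64
    levels = esize - 1           # mask for the bits of imms/immr that matter
    s = imms & levels            # number of set bits in the element, minus one
    r = immr & levels            # right-rotation amount within the element
    if s == levels:
        return None  # an all-ones element is reserved

    elem = (1 << (s + 1)) - 1    # s+1 contiguous ones at the bottom
    elem = ((elem >> r) | (elem << (esize - r))) & ((1 << esize) - 1)

    # Replicate the element across the full register width.
    mask = 0
    for shift in range(0, datasize, esize):
        mask |= elem << shift
    return mask


if __name__ == "__main__":
    # Brute-force over all 2^13 bit patterns, in the spirit of the comment above.
    values = set()
    for bits in range(1 << 13):
        v = decode_logical_imm(bits >> 12, (bits >> 6) & 0x3F, bits & 0x3F)
        if v is not None:
            values.add(v)
    print(len(values), "distinct 64-bit masks")
```

The input space is only 13 bits wide, so an exhaustive check like the loop at the bottom is cheap. That is what makes brute-force testing practical here.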

New comment by aengelke in "Using the TPDE codegen back end in LLVM ORC"

aengelke — Tue, 30 Sep 2025 16:58:04 +0000

TPDE co-author here. Nice work, this was easier than expected; so we'll have better upstream ORC support soon [1].

The benchmark is suboptimal in multiple ways:

- Multi-threading just makes things slower. When enabling multi-threading, LLJIT clones every module into a new context before compilation, which is much more expensive than compilation. There's also no way to disable this. This causes a ~1.5x (LLVM)/~6.5x (TPDE) slowdown (very rough measurement on my laptop).

- The benchmark compares against the optimizing LLVM back-end, not the unoptimizing back-end, which would be a fairer comparison (code: JTMB.setCodeGenOptLevel(CodeGenOptLevel::None);). Additionally, enabling FastISel helps (command line -fast-isel; setting the TargetOption EnableFastISel seems to have no effect). This gives LLVM a 1.6x speedup.

- The benchmark is not really representative, as it causes FastISel fallbacks to SelectionDAG in some very large basic blocks -- i24 occurs rather rarely in real-world code. This is the reason why the speedup from the unoptimizing LLVM back-end is so low. Replacing i24 with i16 gives LLVM another 2.2x speedup. (Hint: to get information on FastISel fallbacks, enable FastISel and pass the command line options "-fast-isel-report-on-fallback -pass-remarks-missed=sdagisel" to LLVM. This is really valuable when optimizing for compile times.)

So we get ~140ms (TPDE) vs. ~730ms (LLVM -O0), i.e., a 5.2x improvement. This is nowhere near the 10-20x speedup that TPDE typically achieves. Why? The new bottleneck is JITLink, which is featureful but slow -- profiling indicates that it consumes ~55% of the TPDE "compile time" (so the net compile time speedup is ~10x). TPDE therefore ships its own JIT mapper, which has fewer features but is much faster.
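
One way to reproduce the numbers in the paragraph above is a rough back-of-the-envelope calculation. It assumes (this is not stated in the comment) that JITLink costs about the same absolute time in both configurations, since it links objects of similar size either way. The variable names are made up for illustration.

```python
# Figures quoted in the comment above (all approximate).
tpde_total_ms = 140.0
llvm_total_ms = 730.0
jitlink_share_of_tpde = 0.55  # JITLink's share of the TPDE "compile time"

# End-to-end speedup, including linking.
end_to_end_speedup = llvm_total_ms / tpde_total_ms

# Assumption: JITLink takes the same absolute time with either back-end.
jitlink_ms = tpde_total_ms * jitlink_share_of_tpde
tpde_codegen_ms = tpde_total_ms - jitlink_ms
llvm_codegen_ms = llvm_total_ms - jitlink_ms

# Speedup of code generation alone, with linking factored out.
net_speedup = llvm_codegen_ms / tpde_codegen_ms

print(f"end-to-end: {end_to_end_speedup:.1f}x, net of JITLink: {net_speedup:.1f}x")
```

Under that assumption, linking accounts for roughly 77ms on both sides, which leaves about 63ms (TPDE) against about 653ms (LLVM) for code generation proper.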

LLVM is really powerful, but it is not particularly fast to begin with, and the JIT API makes it extremely difficult to avoid making it extra slow, even for LLVM experts.

[1]: https://github.com/tpde2/tpde/commit/29bcf1841c572fcdc75dd61...

New comment by aengelke in "IRHash: Efficient Multi-Language Compiler Caching by IR-Level Hashing"

aengelke — Sun, 07 Sep 2025 20:07:06 +0000

> but typically a change to the preprocessed output implies a change to the IR (e.g., it's a functional change and not just a variable name change or something). Otherwise, why would I recompile it?

For C++, this could happen more often, e.g. when changing the implementation of an inline function or a non-instantiated template in a header that is not used in the compilation unit.

New comment by aengelke in "IRHash: Efficient Multi-Language Compiler Caching by IR-Level Hashing"

aengelke — Sun, 07 Sep 2025 17:26:45 +0000

Or rather: There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

(source: https://martinfowler.com/bliki/TwoHardThings.html)

New comment by aengelke in "IRHash: Efficient Multi-Language Compiler Caching by IR-Level Hashing"

aengelke — Sun, 07 Sep 2025 17:25:30 +0000

Template instantiation caching is likely to help -- in an unoptimized LLVM build, I found that 40-50% of the compiled code at object file level is discarded at link-time as redundant.

Another thing I'd consider interesting is parse caching from token to AST. Most headers don't change, so even when a TU needs to be recompiled, most parts of the AST could be reused. (Some kind of more clever and transparent precompiled headers.) This is likely to need some changes in the AST data structures for fast serialization and loading/inserting. And that makes me think that maybe the textbook approach of generating an AST is a bad idea if we care about fast compilation.

Tangentially, I'm astonished that they claim correctness while a large part of the IR is inadequately (if at all) captured in the hash (comdat, symbol visibility, aliases, constant exprs, block address, calling convention/attributes for indirect calls, phi nodes, fast math flags, GEP type, ...). I'm also a bit annoyed, because this is the type of research that is very sloppily implemented, only evaluates projects where compile time is not a big problem and then only achieves small absolute savings, and papers over inherent difficulties (here: capturing the IR, parse time) that make this unlikely to be used in practice.

New comment by aengelke in "TPDE-LLVM: Faster LLVM -O0 Back-End"

aengelke — Wed, 03 Sep 2025 20:05:51 +0000

In AoT compilation, unoptimized code is primarily useful for debugging and short compile-test round trips. Your point on C++ is correct, but test workloads are typically small, so the cost is often tolerable. TPDE also supports -O1 IR -- nothing precludes using an -O0 back-end with optimized IR, so if performance is relevant for debugging/testing, there's still a measurable compile-time improvement. (Obviously, with -O1 IR, the TPDE-generated code is ~2-3x slower than the code from the LLVM-O1-back-end; but it's still better than using unoptimized IR. It might also be possible to cut down the -O1 pass pipeline to passes that are actually important for performance.)

In JIT compilation, a fast baseline is always useful. LLVM is obviously not a great fit (the IR is slow to generate and inspect), but for projects that don't want to roll their own IR and use LLVM for optimized builds anyway, this is an easy way to drastically reduce the startup latency. (There is a JIT case study showing the overhead of LLVM-IR in Section 7/Fig. 10 in the paper.)

> And if a project is not large one then build times should not be that much of a problem.

I disagree -- I'm always annoyed when my builds take longer than a few seconds, and my code changes typically involve fewer compilation units than I have CPU cores (even when working on LLVM). There's also this study [1] from Google, which claims that even modest improvements in build times improve productivity.

[1]: https://www.computer.org/csdl/magazine/so/2023/04/10176199/1...

New comment by aengelke in "TPDE-LLVM: Faster LLVM -O0 Back-End"

aengelke — Wed, 03 Sep 2025 14:37:28 +0000

The paper is rather selective about the benchmarks and baselines used. They do two comparisons (3 microbenchmarks and a re-implementation of a few (rather simple) database queries) against LLVM -- and have written all benchmarks themselves through their own framework. These benchmarks start from their custom AST data structures and they have their own way of generating LLVM-IR. For the non-optimizing LLVM back-end, the performance obviously strongly depends on the way the IR is generated -- they might not have put a lot of effort into generating "good IR" (=IR similar to what Clang generates).

The fact that they don't do a comparison against LLVM on larger benchmarks/functions or any other code they haven't written themselves makes that single number rather questionable for a general claim of being faster than LLVM -O0.

New comment by aengelke in "TPDE-LLVM: Faster LLVM -O0 Back-End"

aengelke — Wed, 03 Sep 2025 12:34:12 +0000

There's a longer paragraph on that topic in Section 8. We also previously built an LLVM back-end using that approach [1]. While that approach leads to even faster compilation, run-time performance is much worse (2.5x slower than LLVM -O0) due to more-or-less impossible register allocation for the snippets.

[1]: https://home.cit.tum.de/~engelke/pubs/2403-cc.pdf

New comment by aengelke in "TPDE-LLVM: Faster LLVM -O0 Back-End"

aengelke — Wed, 03 Sep 2025 09:05:03 +0000

In terms of runtime performance, the TPDE-generated code is comparable with and sometimes a bit faster than LLVM -O0.

I agree that front-ends are a big performance problem and both rustc and Clang (especially in C++ mode) are quite slow. For Clang with LLVM -O0, 50-80% is front-end time, with TPDE it's >98%. More work on front-end performance is definitely needed; maybe some things can be learned from Carbon. With mold or lld, I don't think linking is that much of a problem.

We now support most LLVM-IR constructs that are frequently generated by rustc (most notably, vectors). I just haven't gotten around to actually integrating it into rustc and collecting performance data.

> The 10-20x improvement described here doesn’t work yet for clang

Not sure what you mean here, TPDE can compile C/C++ programs with Clang-generated LLVM-IR (95% of llvm-test-suite SingleSource/MultiSource, large parts of the LLVM monorepo).

New comment by aengelke in "TPDE-LLVM: Faster LLVM -O0 Back-End"

aengelke — Mon, 01 Sep 2025 19:11:56 +0000

The documentation has a list of currently unsupported features: https://docs.tpde.org/tpde-llvm-main.html

New comment by aengelke in "Constrained languages are easier to optimize"

aengelke — Sun, 27 Jul 2025 16:50:36 +0000

Storing the string length explicitly as an 8-byte integer does have a measurable cost. Consider llvm::Twine as an example: it supports storing a null-terminated string and a ptr+len string (among other options). I once changed the implementation to store string literals (length known at compile-time) as ptr+len instead of a pointer to a C string, with the intention of avoiding the strlen in the callee on constant strings. However, this change made things slower, because of the cost of storing the length everywhere. (That's why I never proposed such a change upstream.)

The critical (data) path of the null-terminated loop, however, does not include the load -- the actually loaded value is not a loop-carried dependency in your example. That said, the re-steering of the branch at the end of the loop might happen much later.

Vectorization with null-terminated strings is possible and done, but requires alignment checking, which adds some cost.