Hacker News: mikemike

New comment by mikemike in "How to make a fast dynamic language interpreter"

mikemike — Tue, 21 Apr 2026 15:50:00 +0000

To illustrate this, here's the contorted Lua code from https://news.ycombinator.com/item?id=11327201

    local t = setmetatable({}, {
      __index = pcall, __newindex = rawset,
      __call = function(t, i) t[i] = 42 end,
    })
    for i=1,100 do assert(t[i] == true and rawget(t, i) == 42) end

Arguably this exercises only the slow paths of the VM.

A more nuanced take is that Lua has many happy fast paths, whereas Python has some unfortunate semantic baggage that complicates those. Another key issue is the over-reliance on C modules with bindings that expose way too many internals.

New comment by mikemike in "LÖVE: 2D Game Framework for Lua"

mikemike — Tue, 07 Apr 2026 07:24:26 +0000

As the open source author in question, I'd politely ask everyone to not draw overly-generic conclusions from an ancient discussion in some third-party forum, which links to a (now) resolved bug report.

Open source is not a one-way street. By publicly disparaging open source projects, you're actually harming the ecosystem you rely on.

New comment by mikemike in "A Walk with LuaJIT"

mikemike — Wed, 13 Nov 2024 19:59:18 +0000

A good read if you want to learn (more than you ever wanted) about stack frame unwinding in conjunction with a JIT compiler.

The only correction I have: LuaJIT _does_ have 64 bit integers, e.g. 0x0123456789abcdefLL.

A Walk with LuaJIT

mikemike — Wed, 13 Nov 2024 19:54:27 +0000

Article URL: https://www.polarsignals.com/blog/posts/2024/11/13/lua-unwinding

Comments URL: https://news.ycombinator.com/item?id=42129277

Points: 20

# Comments: 2

New comment by mikemike in "Modernizing compiler design for Carbon's toolchain [video]"

mikemike — Sun, 27 Aug 2023 10:38:54 +0000

This is conjecture. OTOH I measured while I designed the LuaJIT IR.

1. An array index is just as suitable as a pointer for dereferencing. 2. What matters is how many dereferences are needed and their locality. 3. Data structure density is important to get high cache utilization.

References show a lot of locality: 40% of all IR operands reference the previous node. 70% reference the previous 10 nodes. A linear IR is the best cache-optimized data structure for this.

That said, dereferencing of an operand happens less often than one might think. Most of the time, one really needs the operand index itself, e.g. for hashes or comparisons. Again, indexes have many advantages over pointers here.

What paid off the most was to use a fixed size IR instruction format (only 64 bit!) with 2 operands and 16 bit indexes. The restriction to 2 operands is actually beneficial, since it helps with commoning and makes you think about IR design. The 16 bit index range is not a limitation in practice (split IR chunks, if you need to). The high orthogonality of the IR avoids many iterations and unpredictable branches in the compiler itself.

The 16 bit indexes also enable the use of tagged references in the compiler code (not in the IR). The tag caches node properties: type, flags, constness. This avoids even more dereferences. LuaJIT uses this in the front pipeline for fast type checks and on-the-fly folding.

New comment by mikemike in "The Solid-State Register Allocator"

mikemike — Wed, 05 Oct 2022 13:35:03 +0000

I had already changed the title after your reply. The objection is about the naming, which implies an invention claim without further explanation. It's not about the code.

New comment by mikemike in "The Solid-State Register Allocator"

mikemike — Wed, 05 Oct 2022 12:22:21 +0000

You may want to clarify that in the GitHub repo, too. See my issue there.

If you want to go the didactic route, then consider documenting the improvements over the naive implementation: register hinting, register priorities (PHI), two-headed register picking, fixed register picking, optimized register picking for 2-operand instructions (x86/x64), register pair picking, ABI calling-conventions, weak allocations, cost heuristics, eviction heuristics, lazy/eager spill/restore, rematerialization, register shuffling (PHI) with cycle breaking, register renaming, etc. That's all in ~2000 lines of lj_asm.c.

New comment by mikemike in "The Solid-State Register Allocator"

mikemike — Wed, 05 Oct 2022 09:59:37 +0000

Uh? This *is* the LuaJIT register allocator. Period.

Code published 2009. Description published here: https://lua-users.org/lists/lua-l/2009-11/msg00089.html (ignore the TLS cert error).

Coming up with a silly markting name, writing a naive implementation and then claiming it's their invention is impertinent. Especially since they mention LuaJIT itself in the text ...

New comment by mikemike in "An unexpected Redis sandbox escape affecting Debian-based distros"

mikemike — Wed, 09 Mar 2022 21:47:08 +0000

That's what I'm wondering, too, right now.

It's trivial to DoS-hang redis with the script feature (and SCRIPT KILL won't help).

And I found at least 3 DoS-crash, because it hasn't backported fixes to its copy of Lua 5.1.5 (but Debian's liblua 5.1 might -- I haven't checked).

And that's without even exploring the really problematic builtins it still has available.

Maybe they should instead clarify their security guarantee for redis scripting (e.g. "none").

New comment by mikemike in "An unexpected Redis sandbox escape affecting Debian-based distros"

mikemike — Wed, 09 Mar 2022 18:49:34 +0000

Yes, of course it's vulnerable, verified with Docker debian:sid. That was my first reaction when I read this, but I wanted to verify it first. You beat me with this post.

Since you've already let the cat out of the hat (which is not ideal), please file the bugs at Debian and Ubuntu.

Test command:

    redis-cli eval 'return select(2, loadstring("\027")):match("binary") and "VULNERABLE" or "OK"' 0

While we're at it, redis has ignored the advice at: http://lua-users.org/wiki/SandBoxes Almost all of the critical functions (loadstring, load, getmetatable, getfenv, ...) are present and unprotected in the redis 'SandBox' (which isn't).

Which means, disable scripting or shut down your redis instances NOW, which do not run with the same privileges as any client which has access to this. Scripting can be disabled by renaming the EVAL and EVALSHA commands to unguessable names.

New comment by mikemike in "Malloc broke Serenity's JPGLoader, or: how to win the lottery"

mikemike — Thu, 03 Jun 2021 11:22:08 +0000

One year ago I hardened LuaJIT's VM against these kind of attacks. Since then, there has been a constant influx of complaints and issues filed. All bitterly complaining their code, which mistakenly assumed a fixed hash table iteration order, is now broken.

Even when told that the Lua manual clearly states the undefined order since 20 years, they do not cease to complain. They do not realize this change helped them to discover a serious bug in their code (the order could differ even before that change). Sigh.

You can now have a guess, what one of the lesser enlightened forks of LuaJIT did ...

New comment by mikemike in "Compiler Optimizations are Awesome"

mikemike — Thu, 01 Jun 2017 13:26:42 +0000

Actually, LuaJIT 1.x is just that: a translator from a register-based bytecode to machine code using templates (small assembler snippets) with fixed register assignment. There's only a little bit more magic to that, like template variants depending on the inferred type etc.

You can compare the performance of LuaJIT 1.x and 2.0 yourself on the benchmark page (for x86). The LuaJIT 1.x JIT-compiled code is only slightly faster than the heavily tuned LuaJIT 2.x VM plus the 2.x interpreter written in assembly language by hand. Sometimes the 2.x interpreter even beats the 1.x compiler.

A lot of this is due to the better design of the 2.x VM (object layout, stack layout, calling conventions, builtins etc.). But from the perspective of the CPU, a heavily optimized interpreter does not look that different from simplistic, template-generated code. The interpreter dispatch overhead can be moved to independent dependency-chains by the CPU, if you're doing this right.

Of course, the LuaJIT 2.x JIT compiler handily beats both the 2.x interpreter and the 1.x compiler.

New comment by mikemike in "DynASM"

mikemike — Sat, 03 Dec 2016 21:13:44 +0000

Notable recent use of DynASM: Zend is using it to write a JIT compiler for PHP 8.0. http://externals.io/thread/268#email-12706-body

New comment by mikemike in "My love-hate relationship with LuaJIT (2015)"

mikemike — Mon, 26 Sep 2016 06:16:55 +0000

No. You DO need a good understanding of a computer language and of JIT compilers to understand the code base for any just-in-time compiler for that computer language.

LuaJIT is not a toy compiler from a textbook. There's a lot of inherent complexity in a production compiler that employs advanced optimizations and needs to work on various CPU architectures and operating systems. This reflects in the code.

New comment by mikemike in "My love-hate relationship with LuaJIT (2015)"

mikemike — Mon, 26 Sep 2016 05:47:46 +0000

This is a wrong perception. There is/was no shortage of sponsorships. I had to turn down most of these offers, due to time constraints.

New comment by mikemike in "LuaJIT 2.0 intellectual property disclosure (2009)"

mikemike — Mon, 21 Mar 2016 10:44:18 +0000

Just in case, anyone has somehow gotten to the conclusion that Lua's semantics are 'simple', they should closely inspect this example and try to figure out through which contortions the VM has to go to make this work:

    local t = setmetatable({}, {
      __index = pcall, __newindex = rawset,
      __call = function(t, i) t[i] = 42 end,
    })
    for i=1,100 do assert(t[i] == true and rawget(t, i) == 42) end

[LuaJIT has no problems with this code and turns it into 8 machine code instructions for the actual loop.]

Anyway ...

This permanent excuse of JavaScript proponents that it has more complex semantics, which somehow prevents it from being made fast, is getting old. There are no insurmountable obstacles to make JavaScript fast -- it just takes more effort!

And they dug this hole themselves, by not cleaning up the language and allowing new complicated features into the language. Well ...

New comment by mikemike in "Tracing JITs and modern CPUs part 3: A bad case"

mikemike — Mon, 10 Aug 2015 07:37:27 +0000

By definition, a trace doesn't have internal branches.

The solution is to use Hyperblock Scheduling. This is an extra pass that merges multiple traces, e.g. the described root trace and its side trace. The result is a single trace with a predicated IR. This is amenable to most linear optimizations, with only minor limitations.

A predicated IR is the ideal representation to apply branch-free optimizations, using bit operations or SIMD tricks. If there are any predicates left in the IR, the compiler backend will either turn it into predicated machine code (on CPUs which support that to some extent, e.g. ARM32) or generate machine code with internal branches.

New comment by mikemike in "The death of optimizing compilers [pdf]"

mikemike — Sat, 18 Apr 2015 12:31:33 +0000

Thank you for taking the time to perform these tests!

One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case. Which is not what you want for an interpreter that has many, many code paths and is supposed to run a wide variety of code.

You won't get a 30% FDO speedup in any practical scenario. It does little for most other benchmarks and it'll pessimize quite a few of them, for sure.

Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.

Anyway, my point still stands: a factor of 1.1x - 1.3x is doable. Fine. But we're talking about a 3x speedup for my hand-written machine vs. what the C compiler produces. And that's only a comparatively tiny speedup you get from applying domain-specific knowledge. Just ask the people writing video codecs about their opinion on C vector intrinsics sometime.

I write machine code, so you don't have to. The fact that I have to do it at all is disappointing. Especially from my perspective as a compiler writer.

But DJB is of course right: the key problem is not the compiler. We don't have a source language that's at the right level to express our domain-specific knowledge while leaving the implementation details to the compiler (or the hardware).

And I'd like to add: we probably don't have the CPU architectures that would fit that hypothetical language.

See my ramblings about preserving programmer intent, I made in the past: http://www.freelists.org/post/luajit/Ramblings-on-languages-...

New comment by mikemike in "How does LuaJIT's trace compiler work?"

mikemike — Tue, 03 Dec 2013 21:43:04 +0000

Oh, great! Thank you very much!

New comment by mikemike in "How does LuaJIT's trace compiler work?"

mikemike — Fri, 29 Nov 2013 21:00:13 +0000

Oh, well ... pasting my standard rant on this:

This is a common misinterpretation of the Dynamo paper: they compiled their C code at the lowest optimization level and then ran the (suboptimal) machine code through Dynamo. So there was actually something left to optimize.

Think about it this way: a 20% difference isn't unrealistic if you compare -O1 vs. -O3.

But it's completely unrealistic to expect a 20% improvement if you'd try this with the machine code generated by a modern C compiler at the highest optimization level.

Claiming that JIT compilers outperform static compilers, solely based on this paper, is an untenable position.

But, yes, JIT compilers can outperform static compilers under specific circumstances. This has more to do with e.g. better profiling feedback or extra specialization opportunities. But this is not what this paper demonstrates.

Many compiler optimizations have strong non-linear costs in terms of the number of control flow edges. A static compiler has to punt at a certain complexity. OTOH a JIT compiler is free to ignore many edges, since it may fall back to an interpreter for cold paths or attach new code anytime later if some edges become hot.

One interesting example is auto-vectorization (SIMDization) where static compilers have to generate code for all possible combinations of vector alignments in case the underlying alignment of the participating vectors is not statically known. This quickly gets very expensive in terms of code space. OTOH a JIT compiler can simply specialize to the observed vector alignment(s) at runtime, which show almost no variation in practice.