<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: diamondlovesyou</title><link>https://news.ycombinator.com/user?id=diamondlovesyou</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 13 May 2026 14:43:46 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=diamondlovesyou" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by diamondlovesyou in "Deterministic Fully-Static Whole-Binary Translation Without Heuristics"]]></title><description><![CDATA[
<p>That won't be located on the stack either. The underlying buffer will be a TU-local, i.e. static, and not rx</p>
]]></description><pubDate>Wed, 13 May 2026 05:54:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=48118336</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=48118336</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48118336</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "The acyclic e-graph: Cranelift's mid-end optimizer"]]></title><description><![CDATA[
<p>> This post makes it seem like the pass ordering problem is bigger than it really is and then overestimates the extent to which egraphs solve it.<p>It isn't so much of a problem for state-of-the-art implementations like LLVM, but it is for high-level IRs like those present in MLIR. For LLVM, you're basically always in the same representation and every pass operates on that shared representation. But even then, this is not quite true: for example, SLP in LLVM is one of the last passes, because running SLP before most "latency-sensitive cleanups" would break most of them.<p>High-level-to-low-level lowering pipelines in particular suffer very heavily from these ordering concerns.</p>
]]></description><pubDate>Tue, 14 Apr 2026 20:44:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47771247</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=47771247</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47771247</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Why are credit card rates so high?"]]></title><description><![CDATA[
<p>I don't use credit cards for the credit; in fact, mine are paid in full every statement. I use them for the customer protections and the other "free" benefits they provide. If some scummy or outright scammy thing is charged to my Amex, I know I will have Amex on my side. If my card is stolen, Amex will refund any fraudulent charges and overnight me a new card; I probably won't get my debit card overnighted, though the bank will probably refund the fraud. Then there are credit card points, which are essentially a benefit paid for by the processing fees charged to businesses. Many cards also offer access to "private" airport lounges, plus other benefits I'm forgetting off the top of my head.<p>Additionally, having high credit limits, low utilization, and older accounts improves credit scores for loans/etc.<p>No interest is charged if no balance is carried statement-to-statement, so why bother with silly debit PINs and such?<p>That's how it becomes the default way to pay; it's not really "credit".</p>
]]></description><pubDate>Tue, 01 Apr 2025 23:11:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=43552270</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=43552270</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43552270</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Xfinity XB3 hardware mod: Disable WiFi and save 2 watts"]]></title><description><![CDATA[
<p>Less area means fewer sources of interference for others (and the same holds in the other direction). So the attenuation shrinks the signal's coverage area, and stronger attenuation lets the transmitter be "strong" inside the house without the downsides in congested areas.</p>
]]></description><pubDate>Mon, 31 Mar 2025 03:37:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=43530653</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=43530653</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43530653</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Constant-Time Code: The Pessimist Case [pdf]"]]></title><description><![CDATA[
<p>> Why is cooperation unlikely? AFAIK it’s not too hard to make a compiler support a function attribute that says “do not optimize this function at all”<p>Compilers like Clang actually generate terrible code; it's expected that a sufficiently smart optimizer (of which LLVM is a member) will clean it up anyway, so Clang makes no attempt to generate good code. Rust is similar. For example, a simple for-loop's induction variable is stored/loaded to an alloca (i.e. the stack) on every use; it isn't an SSA variable. So one of the first things in the optimization pipeline is to promote those to SSA registers/variables. Disabling that would cost a ton of perf right there, never mind the impact on instruction combining/value tracking/scalar evolution, and crypto is pretty perf-sensitive right after security.<p>BTW, Clang/LLVM already has such a function-level attribute, `optnone`, which was actually added to support LTO. But it's all or nothing; LLVM IR/Clang doesn't have the info needed to know which instructions are timing-sensitive.</p>
]]></description><pubDate>Thu, 13 Mar 2025 02:33:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=43349861</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=43349861</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43349861</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "High-speed 10Gbps full-mesh network based on USB4 for just $47.98"]]></title><description><![CDATA[
<p>GB6 will use Zen 4's AVX512, which Zen 2 doesn't support.</p>
]]></description><pubDate>Mon, 15 Jan 2024 18:47:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=39004486</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=39004486</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39004486</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Rust std fs slower than Python? No, it's hardware"]]></title><description><![CDATA[
<p>Fast is relative here. These are microcoded instructions, which are generally terrible for latency: microcoded instructions don't get branch prediction benefits, nor OoO benefits (they lock the FE/scheduler while running). Small memcpys/moves are always latency-bound, hence even if the HW supports "fast" rep store, you're better off not using it. L2 is wicked fast, and these copies are linear, so prediction will be good.<p>Note that for rep store to be better, it must overcome the cost of the initial latency and then catch up to the 32-byte vector copies, which yes, don't quite reach DRAM speed, but aren't that bad either. Thus for small copies... just don't use string stores.<p>All this without even considering non-temporal loads/stores; many larger copies would see better perf by not trashing the L2 cache, since the destination or source is often not inspected right afterward. String stores don't have a non-temporal option, so this has to be done with vectors.</p>
]]></description><pubDate>Wed, 29 Nov 2023 18:37:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=38463286</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=38463286</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38463286</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Rust std fs slower than Python? No, it's hardware"]]></title><description><![CDATA[
<p>AMD's string store is not like Intel's. Generally, you don't want to use it until the copy size is past the CPU's L2 size (L3 is a victim cache), making ~2 KB WAY too small. Past that point it's profitable to use string store, which should run at "DRAM speed". But it has a high startup cost, hence 256-bit vector loads/stores should be used until that threshold is met.</p>
]]></description><pubDate>Wed, 29 Nov 2023 16:56:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=38461882</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=38461882</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38461882</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Steam Deck OLED"]]></title><description><![CDATA[
<p>I have been very happy with my Minisforum Venus UM790, though I use it as a mobile computer since I can just throw it into my backpack. It's been great to have access to AVX512 on the go.</p>
]]></description><pubDate>Thu, 09 Nov 2023 19:33:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=38209842</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=38209842</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38209842</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Speed Up C++ Compilation"]]></title><description><![CDATA[
<p>> It is not a language flaw. C++ requires types to be complete when defining them because it needs to have access to their internal structure and layout to be in a position to apply all the optimizations that C++ is renowned for. Knowing this, at most it's a design tradeoff, and one where C++ came out winning.<p>This statement is incorrect. "Definition resolution" (my made-up term for FE Stuff(TM), not what I work on) happens during the frontend compilation phase. Optimization is a backend phase, and we don't use source-level info about type layout there. The FE does all that layout work and gives the BE an IR which uses explicit offsets.<p>C++ doesn't allow two-phase lookup (at least originally); that's why definitions must precede uses.</p>
]]></description><pubDate>Sun, 06 Aug 2023 01:24:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=37018036</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=37018036</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37018036</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Speed Up C++ Compilation"]]></title><description><![CDATA[
<p>The power of the optimizations available to C++ is what makes it so fast (see how slow debug mode is vs -O2/etc), and what allows C++ to be fast in the face of common, easy-to-understand, but technically perf-hostile patterns: bit-counting loops vs popcnt, auto-vectorization, DCE, RCE, CSE, CFG simplification, LTCG/LTO, and so on. These things let you write "high level" code/algos (to a point: there are some "high level" paradigms that absolutely eviscerate the compiler's ability to optimize) and still get great hardware-level performance. This is far more important overall than the time it takes to compile your program, even more so once you consider that such programs are often shipped once and then enter maintenance mode.<p>It doesn't really have much to do with compatibility (not entirely; the biggest fixable obstacles to good optimization quality need a system-level rethinking of how hardware exceptions happen). It just isn't reasonable to expect developers to know how to optimize by hand, and doing so doesn't scale.</p>
]]></description><pubDate>Sun, 06 Aug 2023 01:14:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=37017965</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=37017965</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37017965</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "The World Might Be Better Off Without College for Everyone (2018)"]]></title><description><![CDATA[
<p><a href="https://archive.li/ZN5MJ" rel="nofollow noreferrer">https://archive.li/ZN5MJ</a></p>
]]></description><pubDate>Fri, 07 Jul 2023 04:18:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=36627046</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=36627046</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36627046</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "The RISC Wars Part 1: The Cambrian Explosion"]]></title><description><![CDATA[
<p>CISC vs RISC doesn't matter. An ISA should ideally be a healthy mixture of both (citation needed). Arm64 allows memory operands, "just" like x86, but it still has code size issues. Memory operands (i.e. having a bit of address calculation in the load that's fused into its use) are very useful for reducing register pressure, an issue that every call ABI must contend with. This is something that the RISC ISA totally misses (and ARM64... isn't really RISC).<p>The issue with this "debate" is that it misses the forest for the trees. Instead we should be talking about binary encoding (i.e. how much "variability" is required), and you're right on that bit; memory isn't the issue it once was.</p>
]]></description><pubDate>Mon, 01 May 2023 03:50:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=35768918</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=35768918</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35768918</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "JDK 20 G1/Parallel/Serial GC Changes"]]></title><description><![CDATA[
<p>Sadly, nobody can run from memory management.</p>
]]></description><pubDate>Sat, 18 Mar 2023 05:37:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=35206441</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=35206441</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35206441</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Project Orion"]]></title><description><![CDATA[
<p>> Like almost any thorny military problem of the 1950s, the solution was the application of nuclear bombs.<p>Magnificent.</p>
]]></description><pubDate>Fri, 17 Mar 2023 03:56:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=35192873</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=35192873</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35192873</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Trimming spaces from strings faster with SVE on an Amazon Graviton 3 processor"]]></title><description><![CDATA[
<p>I don't think scalable vectors are a particularly useful feature, especially compared to what compilers have to go through to support them. It's much more useful to be able to do "more powerful" things with existing vector widths at hardware speeds (or perhaps just make the existing stuff faster) than to be able to go wider. Scalable vectors also don't solve the ISA problem: don't break existing processors.</p>
]]></description><pubDate>Mon, 13 Mar 2023 23:56:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=35145094</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=35145094</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35145094</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Bing: “I will not harm you unless you harm me first”"]]></title><description><![CDATA[
<p>Add a period to the end of the sentence and the aberration is gone.<p>"My god, it seems even absolute pacifists sometimes take actions that harm others, even if existence itself is suffering." (original Korean: "맙소사, 절대평화주의자들도 가끔 존재 자체가 고통이라 해도 남에게 해를 끼치는 행동을 하는 것 같아요.")</p>
]]></description><pubDate>Thu, 16 Feb 2023 18:02:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=34822640</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=34822640</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=34822640</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "GPU Caching Compared Among AMD, Intel UHD, Apple M1"]]></title><description><![CDATA[
<p>See AMD "Smart Memory" a.k.a. PCIe Large Bar. This expands the amount of GPU memory that the CPU can directly access, usually to the GPU's entire memory range (ordinarily only ~256Mb is accessible). GPU->CPU Reads have very high latencies, but that's not an issue for CPU->GPU writes.<p>GPUs have been able to access "host" memory for a long time now, with a few restrictions: you have to setup the GPU mappings first and pin the pages in memory.</p>
]]></description><pubDate>Tue, 17 Jan 2023 05:17:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=34409397</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=34409397</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=34409397</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Why did the F-14 Tomcat retire decades before its peers? (2021)"]]></title><description><![CDATA[
<p>> By the time of the second gulf war, the F-14D cost 20% more per unit than the F-18E, and some 80-100% more to maintain. I'm not sure how you're concluding that the taxpayers were ripped off by that.<p>They aren't concluding that; they're repeating what they were told at the time.</p>
]]></description><pubDate>Sun, 30 Oct 2022 23:15:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=33400056</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=33400056</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=33400056</guid></item><item><title><![CDATA[New comment by diamondlovesyou in "Ushering out strlcpy()"]]></title><description><![CDATA[
<p>Naw, just use {pointer, length} tuples. Crisis averted.</p>
]]></description><pubDate>Sat, 27 Aug 2022 11:19:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=32617814</link><dc:creator>diamondlovesyou</dc:creator><comments>https://news.ycombinator.com/item?id=32617814</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=32617814</guid></item></channel></rss>