<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: jhj</title><link>https://news.ycombinator.com/user?id=jhj</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 03 May 2026 03:42:32 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=jhj" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by jhj in "North Korean's 100k fake IT workers net $500M a year for Kim"]]></title><description><![CDATA[
<p>This might include people working in lumber camps in places like Siberia, "mercenaries" in Ukraine, workers in NK-managed restaurants in China, Laos, etc., or similar efforts that have been reported on, where the average revenue per worker is likely a lot lower.</p>
]]></description><pubDate>Wed, 18 Mar 2026 18:30:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47429528</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=47429528</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47429528</guid></item><item><title><![CDATA[New comment by jhj in "The Cray-1 Computer System (1977) [pdf]"]]></title><description><![CDATA[
<p>These flops are not the same. The 2013 phone flops are fp32, and the A13 flops look to be fp32 as well (not entirely sure), while the Cray numbers (like the rest of the HPC industry) are fp64 (the Cray-1 predates what would become IEEE 754 binary64, so not the exact same arithmetic, but similar in dynamic range and precision).<p>A modern Nvidia GB200 only does about 40 tflop/s in fp64, for instance. You can emulate higher precision/dynamic range arithmetic with multiple passes and manipulations of lower precision/dynamic range arithmetic, but without an insane number of instructions it won't meet all the IEEE 754 guarantees.<p>Certainly, if Nvidia wanted to dedicate much more chip area to fp64 they could get a lot higher, but fp64 FMA units alone would likely be >30 times larger than their fp16 cousins, and probably hundreds of times larger than fp4 versions.</p>
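<p>To make the emulation point concrete, here is a minimal numpy sketch of the classic Dekker two-product, which represents a float32 product as an unevaluated (hi, lo) pair carrying roughly twice the precision. This illustrates the flavor of multi-word emulation only, not a full IEEE 754 binary64 replacement; all names here are mine:</p>
<pre><code>import numpy as np

f32 = np.float32

def split(a):
    # Dekker split for float32 (24-bit significand): factor is 2**12 + 1
    c = f32(4097.0) * a
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    # hi + lo carries a*b to roughly twice float32 precision (no FMA needed)
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    err = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, err

a = f32(1.0) + f32(2.0) ** f32(-20)
b = f32(1.0) + f32(2.0) ** f32(-18)
hi, lo = two_prod(a, b)
# the pair recovers bits a bare float32 multiply throws away
print(float(hi) + float(lo) == float(a) * float(b))  # True
print(float(hi) == float(a) * float(b))              # False
</code></pre>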
]]></description><pubDate>Wed, 14 Jan 2026 02:23:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=46611521</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=46611521</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46611521</guid></item><item><title><![CDATA[New comment by jhj in "Lossless LLM compression for efficient GPU inference via dynamic-length float"]]></title><description><![CDATA[
<p>Unlike quantization, dimensionality reduction/low-rank approximation, distillation, etc., lossless compression is an always-correct addition to any ML system, since you are computing the same thing you did before; the only questions are whether it is fast enough to avoid substantial bottlenecks and whether the achievable compression ratio is high enough to be useful.<p>Floating point is just an inefficient use of bits (due to excessive dynamic range), especially during training, so lossless compression will always be welcome there. Extreme quantization techniques (some of the <= 4-bit methods, say) also tend to increase entropy in the weights, limiting the applicability of lossless compression, so lossless and lossy compression (e.g., quantization) sometimes work against each other.<p>If you have billions of dollars in inference devices, even reducing the number of devices you need for a given workload by 5% is very useful.</p>
]]></description><pubDate>Fri, 25 Apr 2025 23:33:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=43799466</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=43799466</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43799466</guid></item><item><title><![CDATA[New comment by jhj in "Lossless LLM compression for efficient GPU inference via dynamic-length float"]]></title><description><![CDATA[
<p>Not really; it's just adding some data transposition (coalescing individual bytes from the data words together) and an option to use an LZ/dictionary-type compressor on redundant data. But an LZ-type compressor doesn't make much sense on NN weights, I think, since they are not as redundant as most text data with its many repeats; unless the data is highly sparse, the space of possible dictionary matches is small, and there may not be enough repetition to overcome the dictionary overhead.<p>If you add an LZ-type compressor and put it in the critical path for inference, decompression will be a lot slower. It would be best to fuse decompression with the compute kernels (e.g., a GEMM that decompresses each tile before the arithmetic), and the simpler the decompression routine, the easier this will be.</p>
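<p>A minimal numpy sketch of the byte transposition idea (my illustration, not the paper's code): regroup byte 0 of every word together, then byte 1, and so on, so that a downstream compressor sees more homogeneous streams:</p>
<pre><code>import numpy as np

def transpose_bytes(words: np.ndarray) -> bytes:
    """Split an array of fixed-width words into per-byte-position planes."""
    raw = words.view(np.uint8).reshape(words.size, words.itemsize)
    return raw.T.copy().tobytes()  # plane 0 = low bytes, plane 1 = next, ...

x = np.random.randn(1024).astype(np.float16)
planes = transpose_bytes(x)
# The high-byte plane (sign + exponent bits) is far more compressible than
# the low-byte plane (mantissa noise); try zlib on each half to see.
</code></pre>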
]]></description><pubDate>Fri, 25 Apr 2025 23:28:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43799439</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=43799439</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43799439</guid></item><item><title><![CDATA[New comment by jhj in "Lossless LLM compression for efficient GPU inference via dynamic-length float"]]></title><description><![CDATA[
<p>This is just a consequence of the fact that bfloat16 has a very high dynamic range which is not all used. People like hyperparameters that look like 0.01, not 10^10, even though the same fractional precision is available at each exponent; if you multiplied everything in a network (hyperparameters, initialized weights, training data, etc.) by 10^6, things would still work more or less the same, since the upper range is hardly used (with the possible exception of a small number of special functions).<p>The typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the value range is used in practice). Sign and mantissa bits tend to be incompressible noise.<p>This has been exploited several times before in the context of both classical HPC and AI, with lossless compression work from Martin Burtscher's lab (<a href="https://userweb.cs.txstate.edu/~burtscher/" rel="nofollow">https://userweb.cs.txstate.edu/~burtscher/</a>), fpzip from LLNL (<a href="https://computing.llnl.gov/projects/fpzip" rel="nofollow">https://computing.llnl.gov/projects/fpzip</a>) and my library dietgpu from 2021 (<a href="https://github.com/facebookresearch/dietgpu">https://github.com/facebookresearch/dietgpu</a>), which we used to speed up training on a large GPU cluster by about 10% in overall wall clock time by losslessly compressing all data prior to send and decompressing upon receive (e.g., gradients, weights from backup, etc.); it still computes the same thing as before, since it is lossless.<p>Also, rANS is more efficient and easier to implement in SIMD-like instruction sets than Huffman coding. It would also reduce the latency/throughput penalties of DFloat11 (since we have to decompress before we do the arithmetic).</p>
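<p>A quick sketch of measuring that entropy empirically; the Gaussian "weights" here are a stand-in for a real model, so treat the printed numbers as illustrative only:</p>
<pre><code>import numpy as np

w = np.random.randn(1_000_000).astype(np.float32) * 0.02  # stand-in "weights"
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)        # truncate to bfloat16

def entropy_bits(vals):
    _, counts = np.unique(vals, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

exponent = (bf16 >> 7) & 0xFF
mantissa = bf16 & 0x7F
print("exponent entropy:", entropy_bits(exponent), "of 8 bits")  # small
print("mantissa entropy:", entropy_bits(mantissa), "of 7 bits")  # ~7: noise
</code></pre>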
]]></description><pubDate>Fri, 25 Apr 2025 20:45:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=43798339</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=43798339</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43798339</guid></item><item><title><![CDATA[New comment by jhj in "Ask HN: Former employees' RSUs at risk after startup's IPO"]]></title><description><![CDATA[
<p>Re #3, if your RSU windfall is substantially large, you might be eligible for the 100%/110% safe harbor that won't penalize you for tax underpayments (assuming you are a US taxpayer).<p>E.g., you make $200K in 2024 and $5 million in 2025 (which includes the RSU windfall). Assuming you pay at least 110% of your 2024 tax during 2025, you need not pay estimated tax or anything beyond statutory withholding amounts on the RSU windfall, and can just make up the 6 or 7 figures of tax owed at tax settlement time (e.g., by April 15/16 after the tax year in question). This is the optimal strategy; you can park the money owed in as close to a risk-free investment as possible in the meantime.<p>Statutory withholding rates might be higher; e.g., at my employer, if your RSU earnings are below $1 million, you can set your federal withholding as low as 22%. If your earnings are above $1 million, you are stuck with the 37% mandatory federal withholding rate (both done via sell-to-cover). This does not include per-state withholding minima, which can vary widely.</p>
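<p>A sketch of the safe harbor arithmetic with made-up numbers (illustrative only, not tax advice):</p>
<pre><code># hypothetical 2024 total federal tax on ~$200K of income
prior_year_tax = 60_000
safe_harbor = 1.10 * prior_year_tax  # 110% tier applies above $150K prior AGI

this_year_tax = 1_800_000            # hypothetical 2025 liability incl. RSUs
paid_during_year = 70_000            # salary withholding alone, say

# No underpayment penalty as long as timely payments reach the safe harbor;
# the (large) remaining balance is simply due at filing in April.
penalty_free = paid_during_year >= safe_harbor
balance_due_at_filing = this_year_tax - paid_during_year
print(penalty_free, balance_due_at_filing)
</code></pre>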
]]></description><pubDate>Thu, 13 Feb 2025 00:15:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=43031292</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=43031292</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43031292</guid></item><item><title><![CDATA[New comment by jhj in "Mission Accomplished? Heat pump adoption has a long way to go"]]></title><description><![CDATA[
<p>I have some of (possibly the?) cheapest residential electric power in the US, at 5.58 cents per kWh all-in here in Wyoming, 90%+ hydropower.<p>The absolute lowest cold here each year will be around -30 F / -34 C (there will be several nights in the winter where it gets below -20 F / -29 C), and the absolute hottest around 85 F / 29 C, but the average annual temperature is about 35 F / 2 C. It can snow any month of the year here, with snow on the ground usually between November and mid-May.<p>My house was built in 1968 and has primarily resistive baseboard heating, with a large Mitsubishi mini-split installed by the home's previous owner mainly to air-condition the major rooms for a couple of weeks in the summer. I live at 6500 ft / 2000 m altitude, so even on the hottest summer days it gets quite chilly once the sun goes down and can come close to freezing, so it's really just for a few hours of a/c in the afternoon. I otherwise use the heat pumps as baseline heat in the winter.<p>I'd like to put more trust in heat pumps because they are obviously more efficient (as also seen in my already low power bill), but a lack of heat on certain days in the winter has serious implications here for home integrity, and while this might just be this one Mitsubishi model (though they are less than 5 years old), I haven't been left with a good opinion of heat pump design and repairability in general and am not much tempted to explore heat pumps further.<p>The heat pumps are rated in the manual to work down to -5 F / -21 C, but in practice it's more like 15 F / -9 C; below that they spend a large part of their time defrosting. The models I have don't seem well engineered for reliability or maintenance either; there are important fuses hard-soldered to the main board that are not individually replaceable, and true enough, in the middle of winter my HVAC technician and I had to bypass the blown fuses with an automotive fuse we had on hand (same ratings) attached with alligator clips, as it would take weeks or months to obtain a new $1500 (!) main circuit board from who knows where. Resistive heating, on the other hand, usually just works assuming you have power, and I also have two fireplaces as emergency backup if there's no power (though power lines are almost all buried here due to snow/ice anyway).<p>I really would like to see more emphasis on reliability and repairability rather than, like, SEER, HSPF, or COP ratings or whatever.</p>
]]></description><pubDate>Sun, 09 Feb 2025 19:14:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=42992857</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=42992857</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42992857</guid></item><item><title><![CDATA[New comment by jhj in "Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search"]]></title><description><![CDATA[
<p>Brute-force indices are usually arithmetic bound (e.g., GEMM).
Cell-probe based indices are usually memory bandwidth bound (IVF, LSH bucketing, etc).
Graph-based indices are usually memory latency bound (traversing linked lists / graph data structures).<p>(I wrote the GPU half of Faiss and work with the people who wrote this paper).</p>
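<p>For concreteness, the three families in Faiss itself (parameters arbitrary; a sketch rather than tuned usage):</p>
<pre><code>import numpy as np
import faiss  # pip install faiss-cpu

d, n = 64, 100_000
xb = np.random.randn(n, d).astype(np.float32)

flat = faiss.IndexFlatL2(d)                  # brute force: GEMM/arithmetic bound
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024) # cell probe: bandwidth bound
hnsw = faiss.IndexHNSWFlat(d, 32)            # graph: memory latency bound

ivf.train(xb)                                # IVF needs coarse centroids first
for index in (flat, ivf, hnsw):
    index.add(xb)
D, I = flat.search(xb[:5], 10)               # exact top-10 as a reference
</code></pre>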
]]></description><pubDate>Thu, 23 Jan 2025 17:18:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=42805874</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=42805874</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42805874</guid></item><item><title><![CDATA[New comment by jhj in "U.S. chip revival plan chooses sites"]]></title><description><![CDATA[
<p>If you have a limited number of long-range ICBMs, you will likely prefer directly military targets over a manufacturing facility that would only start to matter months into a conflict; a drawn-out conventional war is itself a scenario likely precluded by the exchange of nuclear weapons in the first place.</p>
]]></description><pubDate>Tue, 05 Nov 2024 23:03:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=42055968</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=42055968</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42055968</guid></item><item><title><![CDATA[New comment by jhj in "AI engineers claim new algorithm reduces AI power consumption by 95%"]]></title><description><![CDATA[
<p>As someone who has worked in this space (approximate compute) on both GPUs and in silicon in my research, the power consumption claims are completely bogus, as are the accuracy claims:<p>> In this section, we show that L-Mul is more precise than fp8 e4m3 multiplications<p>> To be concise, we do not consider the rounding to nearest even mode in both error analysis and complexity estimation for both Mul and L-Mul<p>These two statements together are nonsensical. Sure, if you analyze accuracy while ignoring the part of the algorithm that gives the baseline its accuracy, you can derive whatever cherry-picked result you want.<p>The product of two floating point values, rounded to nearest even, is the correctly rounded result of multiplying the original values at infinite precision; this is how floating point rounding usually works, and it is what IEEE 754 mandates for fundamental operations (e.g., multiplication here) if you choose to follow those guidelines. Not rounding to nearest even results in a lot more quantization noise, and biased noise at that.<p>> applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products<p>A good chunk of the energy cost is simply moving data between memories (especially external DRAM/HBM/whatever) and along wires, and buffering values in SRAMs, flip-flops, and the like. Combinational logic cost is usually not a big deal. While having a ton of fixed-function matrix multipliers does raise the cost of combinational logic quite a bit, at most what they have will probably cut the power of an overall accelerator by 10-20% or so.<p>> In this section, we demonstrate that L-Mul can replace tensor multiplications in the attention mechanism without any loss of performance, whereas using fp8 multiplications for the same purpose degrades inference accuracy<p>I may have missed it in the paper, but they provide no details on (re)scaling and/or using higher precision accumulation for intermediate results, as one would get on an H100 for instance. Without this information, I don't trust these evaluation results either.</p>
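<p>To see the rounding point concretely, here is a small numpy sketch (mine, not the paper's) comparing truncation against round-to-nearest-even when quantizing a float32 product down to bfloat16 width; truncation's error is one-sided, i.e., biased:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, 1_000_000).astype(np.float32)
b = rng.uniform(0.5, 2.0, 1_000_000).astype(np.float32)
ref = a.astype(np.float64) * b.astype(np.float64)  # high-precision reference

p = (a * b).view(np.uint32)
trunc = (p & np.uint32(0xFFFF0000)).view(np.float32)  # drop low mantissa bits
lsb = (p >> np.uint32(16)) & np.uint32(1)             # round-to-nearest-even:
rne = ((p + np.uint32(0x7FFF) + lsb) & np.uint32(0xFFFF0000)).view(np.float32)

print((trunc - ref).mean())  # systematically negative: biased toward zero
print((rne - ref).mean())    # ~zero-mean noise
</code></pre>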
]]></description><pubDate>Sun, 20 Oct 2024 14:17:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=41895521</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=41895521</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41895521</guid></item><item><title><![CDATA[New comment by jhj in "Initial CUDA Performance Lessons"]]></title><description><![CDATA[
<p>> The first thing to consider is the register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted<p>Kernels should almost never use local memory (except in arcane cases where you are using recursion, and thus a call stack that will spill, where a non-recursive formulation would not really work).<p>> Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources<p>> while low-occupancy optimizations can be effective for specific workloads (e.g, memory-bound kernels)<p>I think this is almost exactly backwards: performant, high-compute-intensity kernels (on a (fl)op per byte of memory traffic basis) tend uniformly to have low occupancy; look at an ncu trace of many kernels in cuBLAS or cuDNN, for instance. You need a large working set of arguments in registers or in smem to feed the scalar arithmetic or especially the MMA units quickly enough, as gmem/L2 bandwidth alone is not sufficient to achieve peak performance in many cases. The only thing you need to do is ensure that you are using all SMs (and thus all available scalar arithmetic or MMA units), which does not by itself imply high occupancy (e.g., a kernel with 1 CTA per SM).<p>The simplest way to write a memory-bound kernel is to spawn a bunch of threads and perform loads/stores from them, and it isn't too hard to achieve close to peak this way. But even then, depending upon the warp scheduler to rotate other warps in to issue more loads/stores is inferior to unrolling loops, and with such unrolling you can get close to peak memory bandwidth using not too many SMs, so even these kernels need not have high occupancy.<p>(I've been programming Nvidia GPUs for around 11 years and wrote the original pytorch GPU backend/tensor library, the Faiss GPU library, and contributed some stuff to cuDNN in its early days such as FFT convolution.)</p>
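<p>Back-of-the-envelope occupancy arithmetic (a sketch; the register file and thread ceiling figures are typical of recent Nvidia SMs, but check your architecture's limits):</p>
<pre><code>REGFILE = 65536      # 32-bit registers per SM (typical recent architectures)
MAX_THREADS = 2048   # resident-thread ceiling per SM

def occupancy(regs_per_thread, threads_per_cta, ctas_per_sm):
    threads = threads_per_cta * ctas_per_sm
    assert regs_per_thread * threads <= REGFILE, "register file exhausted"
    return threads / MAX_THREADS

# Register-hungry GEMM-style kernel: one 256-thread CTA at 255 regs/thread
print(occupancy(255, 256, 1))  # 0.125 -- low occupancy, yet near-peak math
# Simple memory-bound kernel: many light threads
print(occupancy(32, 256, 8))   # 1.0
</code></pre>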
]]></description><pubDate>Sat, 12 Oct 2024 01:23:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=41815552</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=41815552</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41815552</guid></item><item><title><![CDATA[New comment by jhj in "Initial CUDA Performance Lessons"]]></title><description><![CDATA[
<p>Aiming for higher occupancy is not always the desired solution; what frequently matters more is avoiding global memory latency by retaining more data in registers and/or shared memory. This was first noted in 2010 and is still true today:<p><a href="https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf" rel="nofollow">https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pd...</a><p>I would also think in terms of latency hiding rather than just work parallelism (though latency hiding on GPUs is achieved largely through parallelism). This is the reason GPUs have massive register files: unlike modern multi-core CPUs, they omit latency-reducing hardware (speculative execution, large caches, out-of-order execution/register renaming, etc.), and to fill the pipelines we need many instructions outstanding, which means the operands for those pending instructions must stay around a lot longer, hence the massive register file.</p>
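<p>The sizing intuition follows from Little's law; a napkin sketch with illustrative numbers:</p>
<pre><code># Little's law: to sustain bandwidth B at latency L you need B*L bytes
# in flight, and every outstanding load's operands live in registers.
bandwidth = 2.0e12   # say, 2 TB/s of HBM bandwidth
latency = 500e-9     # ~500 ns global memory latency (illustrative)
bytes_in_flight = bandwidth * latency
print(bytes_in_flight)  # 1e6: ~a megabyte of requests outstanding GPU-wide
</code></pre>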
]]></description><pubDate>Fri, 11 Oct 2024 14:21:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=41809713</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=41809713</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41809713</guid></item><item><title><![CDATA[New comment by jhj in "Mazda's $10 Subscription for Remote Start Sparks Backlash"]]></title><description><![CDATA[
<p>Remote start is an accidental carbon monoxide poisoning waiting to happen if your garage is directly connected to your residence. I live in an area of Wyoming with brutal winters and just bought a new Ford Bronco; I wish I could fully disable it (there's a button on the key fob as well).<p><a href="https://www.nytimes.com/2018/05/13/business/deadly-convenience-keyless-cars-and-their-carbon-monoxide-toll.html" rel="nofollow">https://www.nytimes.com/2018/05/13/business/deadly-convenien...</a></p>
]]></description><pubDate>Mon, 30 Sep 2024 21:14:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=41702195</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=41702195</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41702195</guid></item><item><title><![CDATA[New comment by jhj in "Someone's been messing with Python's floating point subnormals"]]></title><description><![CDATA[
<p>The original sin here is that the original 1980s designs carry over: the processor retains FP unit state, rather than each instruction indicating which subnormal flush mode (or rounding mode, or whatever) it wishes to use, with no retained FP unit state at all. See also: the IEEE FP exception design (e.g., signaling NaNs) causing havoc with SIMD, deep pipelining, out-of-order execution, etc.</p>
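<p>A tiny demonstration of what's at stake (a sketch; pure Python/numpy cannot actually toggle the hardware FTZ/DAZ state being discussed):</p>
<pre><code>import numpy as np

x = np.float32(1.2e-38)                         # just above the smallest normal
print(np.finfo(np.float32).tiny)                # ~1.1754944e-38, smallest normal
print(x / np.float32(2.0))                      # ~6e-39: a nonzero subnormal
print(np.finfo(np.float32).smallest_subnormal)  # ~1.4e-45
# With flush-to-zero set in the FP unit's retained state (e.g., x86 MXCSR),
# that division would instead silently produce 0.0 for every later
# instruction on the thread -- which is exactly the retained-state problem.
</code></pre>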
]]></description><pubDate>Wed, 14 Aug 2024 00:29:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=41241332</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=41241332</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41241332</guid></item><item><title><![CDATA[New comment by jhj in "Meta Large Language Model Compiler: Foundation Models of Compiler Optimization"]]></title><description><![CDATA[
<p>A less risky use is having the model choose compilation flags and pass orderings, replacing the auto-tuners that do the same today to optimize production binaries. Many (but not all) such flags and orderings should, in theory, always produce correct code, and if one yields incorrect output that's more a problem of the compiler itself than of the model. Large companies already use such auto-tuners for widely used public or internal binaries. E.g., code size is a huge problem for mobile application builds delivered to a billion people, and the pieces of code you target for size compression are chosen explicitly because they are not of serious performance concern (you don't need unrolled loops or whatever); it's not the entire binary that gets this treatment. Such flag/pass-ordering options for binary optimization already form an exponentially huge search space.<p>Using the model for IR rewriting directly is certainly more risky: it's difficult to guarantee post-rewrite that you would be computing the same thing. At least for compiler passes and optimization options, many of which also perform IR rewrites, a compiler such as LLVM/gcc should already have a huge suite of test coverage.<p>(I'm not a compiler person, but I'm a researcher on the same team at Meta FAIR as the authors.)</p>
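<p>A minimal sketch of such a flag auto-tuner as a random search (the flag list, file names, and code-size objective are hypothetical stand-ins, not the paper's system; a model would replace the random proposal step):</p>
<pre><code>import os
import random
import subprocess

FLAGS = ["-O2", "-O3", "-Os", "-funroll-loops", "-fno-vectorize"]

def build_and_measure(flags):
    # compile a hypothetical prog.c and score by binary size
    subprocess.run(["clang", *flags, "-o", "prog", "prog.c"], check=True)
    return os.path.getsize("prog")

best = None
for _ in range(20):
    cand = random.sample(FLAGS, k=2)     # random proposal; swap in a model here
    size = build_and_measure(cand)
    if best is None or size < best[0]:
        best = (size, cand)
print(best)
</code></pre>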
]]></description><pubDate>Thu, 27 Jun 2024 20:11:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=40814572</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=40814572</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40814572</guid></item><item><title><![CDATA[New comment by jhj in "Silicon Valley's best kept secret: Founder liquidity"]]></title><description><![CDATA[
<p>While the median is much, much lower, there are a couple of thousand individual-contributor SWEs (non-managers) between Google, Meta, and a few other big-ish tech companies who make >$1 million/year (steady state, not dependent upon recent run-ups in stock prices), with a couple of hundred of those above $4 million/year even. The risk/reward for joining a startup is very skewed toward risk in these cases.</p>
]]></description><pubDate>Thu, 13 Jun 2024 21:11:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=40674821</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=40674821</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40674821</guid></item><item><title><![CDATA[New comment by jhj in "State Farm announces major change affecting tens of thousands households in CA"]]></title><description><![CDATA[
<p>Construction costs here (Teton County, WY) are significantly higher than in CA or most places in the US due to labor constraints (we have the highest average per capita income in the US, yet an ~80 : 1 median house price : median yearly income ratio; contractors have to commute in from 50+ miles away because unless you bought your home 20+ years ago, it's hard for a tradesperson to afford one now, etc.). It's hard to construct SFHs here for less than $800/sq ft. The same, I would imagine, is true of other expensive resort places like Aspen or Park City.<p>My house is insured for more than I bought both the land and house for, as suggested by State Farm themselves, due to ludicrous construction costs.</p>
]]></description><pubDate>Sat, 04 May 2024 21:49:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=40260568</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=40260568</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40260568</guid></item><item><title><![CDATA[New comment by jhj in "State Farm announces major change affecting tens of thousands households in CA"]]></title><description><![CDATA[
<p>It’s probably more California regulations than the wildfire risk per se?<p>The direct backyard of my house in Wyoming is Bridger-Teton National Forest: wooded, mountainous wilderness for miles, with its trees abutting my property. A wildfire in the forest in 2012 came within 1.3 miles of me. I’m insured by State Farm, pay a substantially lower percentage than most places in the country for home insurance, and my rate went down this year by about $1K, go figure.</p>
]]></description><pubDate>Sat, 04 May 2024 21:26:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=40260413</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=40260413</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40260413</guid></item><item><title><![CDATA[New comment by jhj in "The simple beauty of XOR floating point compression"]]></title><description><![CDATA[
<p>Not just MPI over a network. We can compress floats, send them over NVLink or PCIe to another GPU in the same host, and decompress, and it can be faster than sending the raw data between GPUs. That's the premise behind dietgpu: cheap compression with a modest ratio (like 0.6-0.9x of original size) but extremely fast, at 100s of GB/s of throughput, with the idea that you're trying to race something that is similarly fast. (General floating point data can be quite incompressible or highly compressible; it really just depends upon what is being passed around.)<p>Interconnects are in general improving at a slower rate than CPU/GPU compute, and this gap can be exploited.</p>
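<p>The napkin math for when compression wins the race against the link (illustrative numbers):</p>
<pre><code># Effective bandwidth = link_bw / ratio, provided the codec itself runs
# faster than that; otherwise the codec is the bottleneck.
link_bw = 64e9    # e.g., ~64 GB/s per direction on a PCIe gen5 x16 link
ratio = 0.75      # compressed size / original size (dietgpu-ish)
codec_bw = 300e9  # compression/decompression throughput, bytes/s

effective = link_bw / ratio if codec_bw > link_bw / ratio else link_bw
print(effective / 1e9, "GB/s effective")  # ~85 GB/s out of a 64 GB/s link
</code></pre>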
]]></description><pubDate>Fri, 12 Apr 2024 01:25:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=40008589</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=40008589</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40008589</guid></item><item><title><![CDATA[New comment by jhj in "The simple beauty of XOR floating point compression"]]></title><description><![CDATA[
<p>People in the HPC/classical supercomputing space have done this sort of thing for a while. There's a fair amount of literature on lossless floating point compression, such as Martin Burtscher's work or stuff out of LLNL (fpzip):<p><a href="https://userweb.cs.txstate.edu/~burtscher/" rel="nofollow">https://userweb.cs.txstate.edu/~burtscher/</a>
<a href="https://computing.llnl.gov/projects/floating-point-compression" rel="nofollow">https://computing.llnl.gov/projects/floating-point-compressi...</a><p>but it tends to be very application specific, working where there is high correlation / small deltas between neighboring values in a 2d/3d/4d/etc. floating point array (e.g., compressing neighboring temperature grid points in a PDE weather simulation model, where temperatures in adjacent cells won't differ by much).<p>In a lot of other cases (e.g., machine learning), the floating point significand bits (and sometimes the sign bit) tend to be incompressible noise. The exponent is the only thing that is really compressible, and the XOR trick does not help as much because neighboring values can still vary somewhat in exponent. An entropy encoder works well there instead (encoding close to the actual underlying data distribution/entropy), and it also does not depend upon neighboring floats having similar exponents.<p>In 2022, I created dietgpu, a library to losslessly compress/decompress floating point data at up to 400 GB/s on an A100. It uses a general-purpose asymmetric numeral system encoder/decoder on the GPU (the first implementation of general ANS on a GPU, predating nvCOMP) for exponent compression.<p>We have used this to losslessly compress floating point data between GPUs (e.g., over Infiniband/NVLink/ethernet/etc.) when training massive ML models, to speed up overall wall clock training time across 100s/1000s of GPUs without changing anything about how the training works (it's lossless compression; it computes the same thing as before).<p><a href="https://github.com/facebookresearch/dietgpu">https://github.com/facebookresearch/dietgpu</a></p>
]]></description><pubDate>Thu, 11 Apr 2024 21:00:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=40006752</link><dc:creator>jhj</dc:creator><comments>https://news.ycombinator.com/item?id=40006752</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40006752</guid></item></channel></rss>