<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: suresk</title><link>https://news.ycombinator.com/user?id=suresk</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 18 Apr 2026 10:35:49 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=suresk" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by suresk in "Report: Tim Cook could step down as Apple CEO 'as soon as next year'"]]></title><description><![CDATA[
<p>The opposite problem can happen: the CEO uses the product all the time and becomes blind to problems. “It has always worked that way” or “Who would want to do that!?” are much more common than pure apathy.</p>
]]></description><pubDate>Sun, 16 Nov 2025 07:37:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=45943418</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=45943418</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45943418</guid></item><item><title><![CDATA[New comment by suresk in "Electric bill may be paying for big data centers' energy use"]]></title><description><![CDATA[
<p>They also get massive subsidies and tax breaks for building these data centers. They require the negotiations be done in secret and often fight to keep the agreements secret to make it so people don’t flip out when they see how bad they are.</p>
]]></description><pubDate>Sun, 07 Sep 2025 22:00:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=45162573</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=45162573</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45162573</guid></item><item><title><![CDATA[New comment by suresk in "Sorting algorithms with CUDA"]]></title><description><![CDATA[
<p>Kind of a fun toy problem to play around with. I noticed you had thread coarsening as an option - there is often some gain to be had there. This is also a fun thing to explore with Nsight - the things hurting your performance aren't always obvious, and it is a pretty good profiler. (I wrote about a fun interaction between thread coarsening and automatic loop unrolling that Nsight helped me track down here: <a href="https://www.spenceruresk.com/loop-unrolling-gone-bad-e81f66f03ed1" rel="nofollow">https://www.spenceruresk.com/loop-unrolling-gone-bad-e81f66f...</a>)<p>You may also want to look at other sorting algorithms - common CPU sorting algorithms are hard to map efficiently onto GPU hardware. A sorting network like bitonic sort does more total work (and you have to pad the input to a power of 2), but it often runs much faster on parallel hardware.<p>I had a fairly naive implementation that would sort 10M elements in around 10ms on an H100. I'm sure it could get quite a bit faster with more work, but sorts need to be fairly big to make up for the kernel launch overhead.</p>
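For anyone curious what the bitonic pattern actually looks like, here is a minimal pure-Python sketch of the network (not my CUDA code), including the power-of-2 padding. The point to notice is that every compare-and-swap within a stage touches a disjoint pair of elements, so on a GPU each stage becomes one kernel launch with all the swaps running in parallel:

```python
# A pure-Python sketch of a bitonic sorting network. On a GPU, each pass
# of the inner loop over `i` would be one kernel launch: every
# compare-and-swap touches a disjoint pair, so they all run in parallel.
def bitonic_sort(data, sentinel=float("inf")):
    if not data:
        return []
    n = 1
    while n < len(data):          # pad the input up to a power of two
        n *= 2
    a = list(data) + [sentinel] * (n - len(data))
    k = 2
    while k <= n:                 # size of the bitonic runs being merged
        j = k // 2
        while j >= 1:             # stride between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a[:len(data)]          # drop the sentinel padding
```

Total work is O(n log² n) vs O(n log n) for a good comparison sort - that's the "more work" tradeoff, paid back by the parallelism.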
]]></description><pubDate>Wed, 12 Mar 2025 00:51:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=43338830</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=43338830</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43338830</guid></item><item><title><![CDATA[New comment by suresk in "Show HN: We built a Plug-in Home Battery for the 99.7% of us without Powerwalls"]]></title><description><![CDATA[
<p>> I'm surprised you're not touting the "save on your power bill" benefits.<p>At ~$600/kWh for capacity, the ROI isn't great. I have a pretty big differential on my rates because I have an EV, and even then I'd need over a decade to make the $1,000 back assuming I fully discharged it every day.</p>
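The arithmetic, as a quick sketch - the ~1.6 kWh of usable capacity follows from $1,000 at ~$600/kWh, but the $0.15/kWh spread below is a made-up illustrative number, not my actual rate:

```python
def payback_years(price_usd, usable_kwh, rate_spread_usd_per_kwh,
                  cycles_per_year=365):
    """Years to recoup `price_usd` by charging cheap and discharging
    during expensive hours.

    Assumes one full cycle per day and ignores round-trip losses and
    battery degradation, so this is an optimistic lower bound.
    """
    savings_per_year = usable_kwh * rate_spread_usd_per_kwh * cycles_per_year
    return price_usd / savings_per_year

# A $1,000 battery with ~1.6 kWh usable and a $0.15/kWh spread:
# payback_years(1000, 1.6, 0.15) comes out to roughly 11 years.
```

Even before losses and degradation, that's over a decade - hence the weak ROI.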
]]></description><pubDate>Wed, 12 Mar 2025 00:10:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=43338559</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=43338559</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43338559</guid></item><item><title><![CDATA[New comment by suresk in "Introduction to CUDA programming for Python developers"]]></title><description><![CDATA[
<p>Are you sure they ditched CUDA? I keep hearing this, but it seems odd, because entirely ditching it would be a ton of extra work vs selectively employing some PTX in CUDA kernels, which is fairly straightforward.<p>Their paper [1] only mentions using PTX in a few areas to optimize data transfer operations so they don't blow up the L2 cache. This makes intuitive sense to me, since the main limitation of the H800 vs the H100 is reduced NVLink bandwidth, which would necessitate doing stuff like this that may not be a common thing for others who have access to H100s.<p>1. <a href="https://arxiv.org/abs/2412.19437" rel="nofollow">https://arxiv.org/abs/2412.19437</a></p>
]]></description><pubDate>Fri, 21 Feb 2025 06:24:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43124583</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=43124583</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43124583</guid></item><item><title><![CDATA[New comment by suresk in "Introduction to CUDA programming for Python developers"]]></title><description><![CDATA[
<p>I've only done the CUDA side (and not professionally), so I've wondered myself how much those skills transfer in either direction. I imagine some of the specific techniques are fairly different, but a lot of it is just your mental model for parallel programming, which can be a bit of a shift if you're not used to it.<p>I'd think things like optimizing for occupancy and memory throughput, ensuring coalesced memory accesses, tuning block sizes, using fast-math alternatives, writing parallel algorithms, and working with profiling tools like Nsight are fairly transferable?</p>
]]></description><pubDate>Fri, 21 Feb 2025 05:45:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=43124382</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=43124382</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43124382</guid></item><item><title><![CDATA[New comment by suresk in "Railroad Tycoon II"]]></title><description><![CDATA[
<p>So many fond memories of this game - it was a really fun blend of railroad sim and economic sim that I haven't really found since. I'll never forget the "ding ding ding" sound that goes off when a train pulls into a station and earns you a bit of cash!</p>
]]></description><pubDate>Mon, 13 Jan 2025 19:59:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=42688221</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=42688221</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42688221</guid></item><item><title><![CDATA[New comment by suresk in "Zen5's AVX512 Teardown and More"]]></title><description><![CDATA[
<p>I was having trouble finding an E Core die shot, but that helps put it into perspective a bit anyway. Thanks!</p>
]]></description><pubDate>Wed, 07 Aug 2024 21:08:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=41185539</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=41185539</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41185539</guid></item><item><title><![CDATA[New comment by suresk in "Zen5's AVX512 Teardown and More"]]></title><description><![CDATA[
<p>My non-expert brain immediately jumped to double-pumping + maybe working with their thread director to have tasks using a lot of AVX512 instructions prefer P cores more. It feels like such an obvious solution to a really dumb problem that I assumed there was something simple I was missing.<p>The register file size makes sense, I didn't think they were that much of the die on those processors but I guess they had to be pretty aggressive to meet power goals?</p>
]]></description><pubDate>Wed, 07 Aug 2024 20:34:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=41185166</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=41185166</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41185166</guid></item><item><title><![CDATA[New comment by suresk in "Run CUDA, unmodified, on AMD GPUs"]]></title><description><![CDATA[
<p>Having dabbled in CUDA, but not worked with it professionally, it feels like a lot of the complexity isn't really in CUDA/C++ itself, but in the algorithms you have to come up with to really take advantage of the hardware.<p>Optimizing something for SIMD execution often isn't straightforward, and it isn't something most developers encounter outside a few small areas. There are also a lot of hardware architecture considerations to work with (memory transfer speed is a big one) to even come close to saturating the compute units.</p>
]]></description><pubDate>Tue, 16 Jul 2024 21:56:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=40980694</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=40980694</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40980694</guid></item><item><title><![CDATA[New comment by suresk in "Show HN: A web debugger an ex-Cloudflare team has been working on for 4 years"]]></title><description><![CDATA[
<p>This would have been so useful this past week while we debugged something that ended up being a weird combo of Comcast/Cloudflare/http3 - only some people could reliably reproduce it and it was a lot to coach them through all the steps.<p>Being able to redact a few values would be really important for us (I just wrote Python scripts to clean the har files up), but I’m going to play around with it this weekend.</p>
]]></description><pubDate>Sat, 11 May 2024 02:51:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=40325672</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=40325672</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40325672</guid></item><item><title><![CDATA[New comment by suresk in "Show HN: A web debugger an ex-Cloudflare team has been working on for 4 years"]]></title><description><![CDATA[
<p>Kind of funny, I was just looking at adding similar functionality to an internal Chrome plugin I built because we struggle to get enough useful info in bug reports (being able to look at the HAR, in particular, is useful, but difficult to get users to do correctly).<p>Two questions -<p>1) Any way to customize which headers/cookies get scrubbed?<p>2) Is there a way to get something like the lower-level export you can get by going to chrome://net-export/?</p>
]]></description><pubDate>Fri, 10 May 2024 22:10:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=40324256</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=40324256</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40324256</guid></item><item><title><![CDATA[New comment by suresk in "RAG is more than just embedding search"]]></title><description><![CDATA[
<p>GAR - Generation-Augmented Retrieval?<p>I've actually had some success with getting ChatGPT to create Redshift queries based on user text and then I can run them and render results, which has some interesting use-cases.<p>Max calls out the biggest problem with using something like ChatGPT in a search flow - it is way too slow. I've talked to a lot of people wondering if we can just shove a catalog at ChatGPT and have it magically do a really good job of search, and token limits + latency are two pretty hard stops there (plus I think it would be generally a worse experience in many cases).<p>What I'm trying to look at now is how LLMs can be used to make documents better suited for search by pulling out useful metadata, summarizing related content, etc. Things that can be done at index time instead of search time, so the latency requirements are less of an issue.</p>
]]></description><pubDate>Thu, 21 Sep 2023 23:59:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=37606103</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=37606103</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37606103</guid></item><item><title><![CDATA[New comment by suresk in "Jacobin: A more than minimal JVM written in Go"]]></title><description><![CDATA[
<p>Sorry, I used the wrong terminology. They are called “spans” in Go’s GC. There are different sizes of spans that allocations end up in, which helps avoid fragmentation.</p>
]]></description><pubDate>Fri, 25 Aug 2023 01:25:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=37256730</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=37256730</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37256730</guid></item><item><title><![CDATA[New comment by suresk in "Jacobin: A more than minimal JVM written in Go"]]></title><description><![CDATA[
<p>Yeah, I think any bytecode interpreter ends up with a giant switch in the critical path at some point :)<p>Around the time that change was made to Go, Andrew and I were looking at this and wondering how big of a performance hit it was and if there were a better way to structure that. I had a hunch that the compiler should be smart enough to not compile that as a switch/giant if block, and a quick trip to a disassembler showed it using binary search. This commit: <a href="https://github.com/golang/go/commit/1ba96d8c0909eca59e28c048150c3834982f79fb">https://github.com/golang/go/commit/1ba96d8c0909eca59e28c048...</a> added the jump table and has some nice analysis on where it makes sense to do binary search vs jump tables.<p>As far as I can tell, with certain restrictions (that are fine in this case), it is pretty reliable at optimizing giant if/else blocks and switches.</p>
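The Go commit is about what the compiler emits, but the idea transfers to any interpreter loop. Here's a toy Python sketch (the opcode set is made up for illustration) where a list indexed by opcode plays the role of the jump table - one O(1) indexed load per instruction instead of a chain of comparisons or a binary search over case values:

```python
# A toy stack machine: dispatch via a table indexed by opcode, the
# hand-written equivalent of the jump table a compiler emits for a
# dense switch (vs. binary-searching over the case values).
PUSH, ADD, MUL, HALT = range(4)

def run(program):
    stack = []
    pc = 0

    def op_push():
        nonlocal pc
        stack.append(program[pc]); pc += 1   # operand follows the opcode

    def op_add():
        b, a = stack.pop(), stack.pop(); stack.append(a + b)

    def op_mul():
        b, a = stack.pop(), stack.pop(); stack.append(a * b)

    table = [op_push, op_add, op_mul]        # index == opcode: O(1) dispatch
    while True:
        op = program[pc]; pc += 1
        if op == HALT:
            return stack.pop()
        table[op]()

# run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]) evaluates (2 + 3) * 4
```

In compiled languages you'd hope the compiler builds this table for you from a dense switch, which is exactly what that commit added.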
]]></description><pubDate>Thu, 24 Aug 2023 22:47:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=37255550</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=37255550</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37255550</guid></item><item><title><![CDATA[New comment by suresk in "Jacobin: A more than minimal JVM written in Go"]]></title><description><![CDATA[
<p>Can you elaborate more on this?<p>> E.g. Haskell has an awesome concurrent GC that’d work like crap for Java, because it assumes tons of really short-lived, really small garbage and almost no mutation. The other way around is also bound to be problematic—I don’t know how the Scala people do it<p>I don't know a ton about Haskell's GC, but at surface level it seems very similar to several of the JVM GC implementations - a generational GC with a concept of a nursery. Java GC is very heavily designed around the weak generational hypothesis (ie, most objects don't live long) and very much optimizes for short-lived object lifecycles, so most GC implementations have at least a few nursery-type areas before anything gets to the main heap where GC is incredibly cheap, plus some stuff ends up getting allocated on the stack in some cases.<p>The only big difference is that in Haskell there are probably some optimizations you can do if most of your structures are immutable since nothing in an older generation can refer to something in the nursery. But it isn't super clear to me that alone makes a big enough difference?</p>
]]></description><pubDate>Thu, 24 Aug 2023 22:37:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=37255459</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=37255459</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37255459</guid></item><item><title><![CDATA[New comment by suresk in "Jacobin: A more than minimal JVM written in Go"]]></title><description><![CDATA[
<p>It is kind of interesting to look at the differences in GC approach between the JVM and Go - the different goals, tradeoffs, and designs. Go's is definitely simpler: there is a single implementation, it doesn't have nearly as many tuning knobs, and it is focused on one thing, whereas the JVM GC implementations give you a lot of control (whether that is good or not..) via tuning knobs, and it is a pretty explicit goal to support the different GC-related use cases (ie, low latency vs long-running jobs where you only care about throughput).<p>One of the things I really like about Go is that a lot of the designs and decisions, along with their rationales, are well documented. Here are the GC docs, for example - <a href="https://go.dev/doc/gc-guide" rel="nofollow noreferrer">https://go.dev/doc/gc-guide</a>.<p>For example, Go doesn't move data around on the heap, so to combat fragmentation it breaks the heap up into a bunch of different arenas based on fixed sizes. So a 900 byte object goes into the 1K arena, etc. This wastes some heap space, but avoids the overhead and complexity of moving data around.</p>
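A tiny Python sketch of that rounding step - the class list here is a made-up set of round numbers, not Go's actual table (the runtime generates several dozen tuned classes), but the logic is the same idea:

```python
import bisect

# Hypothetical size classes; Go's runtime uses a longer, tuned list,
# but the round-up-to-the-smallest-fitting-class logic is the same.
SIZE_CLASSES = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]

def size_class(nbytes):
    """Round an allocation up to the smallest class that fits it."""
    i = bisect.bisect_left(SIZE_CLASSES, nbytes)
    if i == len(SIZE_CLASSES):
        raise ValueError("large allocation: gets its own dedicated space")
    return SIZE_CLASSES[i]

# size_class(900) -> 1024: a 900-byte object lands in the 1K class,
# wasting 124 bytes but keeping same-sized objects packed together,
# which is what keeps fragmentation down without moving anything.
```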
]]></description><pubDate>Thu, 24 Aug 2023 20:25:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=37254091</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=37254091</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37254091</guid></item><item><title><![CDATA[New comment by suresk in "Inside the JVM: Arrays and how they differ from other objects"]]></title><description><![CDATA[
<p>I think it depends a lot on the other details, especially how expensive the extra GC will be vs the wasted space. Hard to give a rule that will work in all contexts.<p>In our case, it wasn't a single hard-coded number - the input data gave us the upper bound, and the difference between the upper bound and the median case was so small that going with the upper bound worked out best.</p>
]]></description><pubDate>Thu, 24 Aug 2023 16:54:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=37251332</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=37251332</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37251332</guid></item><item><title><![CDATA[New comment by suresk in "Inside the JVM: Arrays and how they differ from other objects"]]></title><description><![CDATA[
<p>It's one of those things where you usually have to let profiling and other observations guide your approach. 99.9% of the time it doesn't really matter and the default behavior is fine. But I can think of a few times where this has been a big deal.<p>One in particular - I was profiling an application with low-latency needs, and GC was taking up a ton of time. Mission Control showed tons of array allocations - at one point the code was creating a bunch of lists in a loop and adding items to them, which kept triggering new underlying arrays. We found that a) many of the arrays were just over the first resizing threshold, and b) there was a good heuristic we could use to give them an initial size that would never need expanding and wouldn't waste huge amounts of space.<p>This had a pretty dramatic effect on our GC times and overall latency. I think this is where the JVM really shines - tons of tooling to help you profile and observe these kinds of details, so you can figure out when you actually need to care about stuff like initial array capacity.</p>
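To make the resizing cost concrete, here's a little Python simulation of ArrayList-style growth (java.util.ArrayList grows its backing array by ~1.5x; the initial capacity of 10 matches the Java default - everything else is illustrative):

```python
def copies_when_growing(n_items, initial_capacity=10):
    """Count element copies caused by ArrayList-style growth.

    Each resize copies the entire old backing array into a new one
    ~1.5x as large - that copying plus the abandoned old arrays is
    the allocation churn the profiler surfaced. Pre-sizing to a known
    upper bound drives this to zero.
    """
    capacity, copies = initial_capacity, 0
    for size in range(n_items):
        if size == capacity:
            copies += capacity          # old array copied into the new one
            capacity += capacity // 2   # grow by ~1.5x, like ArrayList
    return copies

# copies_when_growing(1000) counts the redundant copies a default-sized
# list pays; copies_when_growing(1000, initial_capacity=1000) is 0.
```

Multiply that by lists created in a hot loop and it adds up to exactly the kind of garbage that shows up in GC pauses.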
]]></description><pubDate>Thu, 24 Aug 2023 16:18:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=37250724</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=37250724</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37250724</guid></item><item><title><![CDATA[New comment by suresk in "Jacobin: A more than minimal JVM written in Go"]]></title><description><![CDATA[
<p>I'm not the author, but I have contributed a few things to the project over the past year. The performance isn't anywhere close to the vanilla JVM - several times slower at best. It is entirely interpreted and there aren't any of the optimizations that have made their way into the JVM over the decades it has been around.<p>It has been a fun project to play around with for someone like me who thinks this kind of stuff is fun and interesting but will probably never get a chance to work on it full time. Cool to see it noticed, though!</p>
]]></description><pubDate>Thu, 24 Aug 2023 16:01:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=37250480</link><dc:creator>suresk</dc:creator><comments>https://news.ycombinator.com/item?id=37250480</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37250480</guid></item></channel></rss>