Hacker News: vimarsh6739

New comment by vimarsh6739 in "CS 6120: Advanced Compilers: The Self-Guided Online Course (2020)"

vimarsh6739 — Fri, 19 Jun 2026 06:28:22 +0000

I disagree with several parts here. But hopefully, this leads to a fun discussion!

> no need for a lexer with a recursive descent parser

I'd argue that teaching how to write a lexer + recursive descent parser is more relevant in the context of production compilers: many major production compilers use hand-written recursive descent parsers (C++, javac, Rust, Go, JavaScript...). Recursive descent parsers are also really nice for emitting error messages.
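To make the lexer + recursive descent point concrete, here is a minimal sketch in Python for a toy arithmetic grammar. The grammar, the token shapes and every name (`lex`, `Parser`, `evaluate`, `ParseError`) are made up for illustration and are not taken from any production compiler. The thing to notice is that each grammar rule is one method, so the parser always knows what it expected at the point of failure and can report it with a column number.

```python
class ParseError(Exception):
    """Raised with a column number and a description of what was expected."""


def lex(src):
    """Split the source into (kind, text, column) tokens."""
    tokens = []
    pos = 0
    while pos < len(src):
        ch = src[pos]
        if ch.isspace():
            pos += 1
        elif ch.isdigit():
            start = pos
            while pos < len(src) and src[pos].isdigit():
                pos += 1
            tokens.append(("num", src[start:pos], start))
        elif ch in "+-*()":
            tokens.append(("op", ch, pos))
            pos += 1
        else:
            raise ParseError(f"col {pos}: unexpected character {ch!r}")
    tokens.append(("eof", "", len(src)))
    return tokens


class Parser:
    def __init__(self, src):
        self.tokens = lex(src)
        self.i = 0

    def peek(self):
        return self.tokens[self.i]

    def advance(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def expect(self, text):
        _, got, col = self.peek()
        if got != text:
            shown = got or "end of input"
            raise ParseError(f"col {col}: expected {text!r}, got {shown!r}")
        return self.advance()

    # expr := term (('+' | '-') term)*
    def expr(self):
        value = self.term()
        while self.peek()[1] in ("+", "-"):
            op = self.advance()[1]
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    # term := factor ('*' factor)*
    def term(self):
        value = self.factor()
        while self.peek()[1] == "*":
            self.advance()
            value = value * self.factor()
        return value

    # factor := NUMBER | '(' expr ')'
    def factor(self):
        kind, text, col = self.peek()
        if kind == "num":
            self.advance()
            return int(text)
        if text == "(":
            self.advance()
            value = self.expr()
            self.expect(")")
            return value
        shown = text or "end of input"
        raise ParseError(f"col {col}: expected a number or '(', got {shown!r}")


def evaluate(src):
    """Parse and evaluate a whole expression, rejecting trailing tokens."""
    parser = Parser(src)
    value = parser.expr()
    kind, text, col = parser.peek()
    if kind != "eof":
        raise ParseError(f"col {col}: unexpected {text!r}")
    return value
```

Calling `evaluate("2 * (3 + ")` raises a `ParseError` that points at the column where the input ended and names what the `factor` rule was looking for.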

> It is better for a first compiler to compile to a higher level language in which neither register assignment nor memory management are necessary.

Compiling to a high-level target can be a reasonable first project (e.g., you can emit LLVM IR), but imo it's a different objective from learning the full stack. Emitting actual ISA instructions (even sub-optimally; after all, it's a university course) forces you to learn calling conventions, isel, register pressure, stack layouts, etc. Building a compiler, at least for me, is probably one of the easiest ways to understand how all of it works together.

> optimize a specific function by source level rewrite

I don't think replacing optimizations with a per-function source-level rewrite works as a general model. Many optimizations are not local to a single function (for example, inlining function calls can lead to new constant-propagation opportunities). If your argument rests on the fact that not all functions are hot, a lot of general-purpose JIT compilers out there already use runtime info to decide when to optimize hot functions, so part of what you're proposing already exists.
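As a rough illustration of the inlining point, here is a toy sketch in Python. The tuple-based expression IR and the helpers `fold`, `substitute` and `inline` are all hypothetical and assume non-recursive functions without side effects. Constant folding on its own has to treat the call as opaque; once the callee body is inlined, the whole expression folds to a constant.

```python
# Toy expression IR (hypothetical):
#   ("const", n), ("var", name), ("add", a, b), ("mul", a, b),
#   ("call", function_name, [args])


def fold(e):
    """Constant-fold bottom-up; calls are opaque and are left in place."""
    kind = e[0]
    if kind in ("const", "var"):
        return e
    if kind == "call":
        return ("call", e[1], [fold(a) for a in e[2]])
    a, b = fold(e[1]), fold(e[2])
    if a[0] == "const" and b[0] == "const":
        return ("const", a[1] + b[1] if kind == "add" else a[1] * b[1])
    return (kind, a, b)


def substitute(e, env):
    """Replace parameter variables in a callee body with argument expressions."""
    kind = e[0]
    if kind == "const":
        return e
    if kind == "var":
        return env.get(e[1], e)
    if kind == "call":
        return ("call", e[1], [substitute(a, env) for a in e[2]])
    return (kind, substitute(e[1], env), substitute(e[2], env))


def inline(e, funcs):
    """Inline every call (assumes the call graph has no recursion)."""
    kind = e[0]
    if kind in ("const", "var"):
        return e
    if kind == "call":
        params, body = funcs[e[1]]
        args = [inline(a, funcs) for a in e[2]]
        return inline(substitute(body, dict(zip(params, args))), funcs)
    return (kind, inline(e[1], funcs), inline(e[2], funcs))


# scale(x, k) = x * (k + 1)
funcs = {
    "scale": (["x", "k"],
              ("mul", ("var", "x"), ("add", ("var", "k"), ("const", 1)))),
}
# caller computes scale(4, 2) + 10
caller = ("add", ("call", "scale", [("const", 4), ("const", 2)]), ("const", 10))

without_inlining = fold(caller)             # the call blocks folding
with_inlining = fold(inline(caller, funcs)) # folds all the way to a constant
```

A per-function source rewrite of `scale` alone cannot find this, because the constants only show up at the call site.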

> implementation is human readable and not buried in a binary

Is this really a requirement for your program? In most cases, I think the optimization story is more like: "code you want to write" != "code you want to run"

> Moreover, and perhaps even more importantly, by not doing optimizations in the compiler, compilation times can be much faster, easily 100-1000x than state of the art optimizing compiler, while generating equivalent or even better runtime performance

I think the actual answer here is "it depends". For long-running programs, one tradeoff is build time vs. future execution time. Also, many optimizations cannot be expressed in source code itself. For example, in C++, you can do stuff like whole-program devirtualization only at link time, which is why LTO exists.

Aside: I personally work on source-to-source automatic differentiation inside compilers, and I can give examples of missed optimizations in generated derivative code if you don't run existing optimization passes like LICM/CSE before differentiating a function.
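A small stand-in for the LICM case, sketched in Python. This uses operator-overloading forward mode (a hypothetical `Dual` class) rather than real source-to-source AD, and the `calls` counter is only there to make the cost visible, so treat it as an analogy for what generated derivative code looks like and not as how any actual AD tool works. When the loop-invariant `sin(a)` stays inside the loop, its derivative `cos(a)` gets recomputed on every iteration as well.

```python
import math


class Dual:
    """Forward-mode pair: a value and its derivative with respect to the input."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)


calls = {"sin": 0, "cos": 0}  # instrumentation for this sketch only


def dsin(x):
    """sin together with its derivative; counts the transcendental calls."""
    calls["sin"] += 1
    calls["cos"] += 1
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)


def f_naive(a, xs):
    """sum(sin(a) * x for x in xs) with the invariant left inside the loop."""
    total = Dual(0.0)
    for x in xs:
        total = total + dsin(a) * Dual(x)
    return total


def f_hoisted(a, xs):
    """Same function after hoisting sin(a), as LICM would do before AD."""
    s = dsin(a)
    total = Dual(0.0)
    for x in xs:
        total = total + s * Dual(x)
    return total


def count_trig_calls(fn, a, xs):
    """Differentiate fn with respect to a and report sin + cos evaluations."""
    calls["sin"] = calls["cos"] = 0
    result = fn(Dual(a, 1.0), xs)
    return result, calls["sin"] + calls["cos"]
```

For a list of n inputs the naive version does 2n transcendental evaluations while the hoisted one does 2, and both return the same derivative.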

New comment by vimarsh6739 in "CUDA-oxide: Nvidia's official Rust to CUDA compiler"

vimarsh6739 — Mon, 11 May 2026 17:35:50 +0000

Really hard to find alternatives to Julia for AD as a first class citizen

TileIR Internals

vimarsh6739 — Fri, 30 Jan 2026 17:20:25 +0000

Article URL: https://maknee.github.io/blog/2026/NVIDIA-TileIR-Internals-from-CuTile-to-MLIR-LLVM-to-SASS/

Comments URL: https://news.ycombinator.com/item?id=46827090

Points: 10

# Comments: 1

Updated Practice for Review Articles and Position Papers in ArXiv CS Category

vimarsh6739 — Sat, 01 Nov 2025 05:57:34 +0000

Article URL: https://blog.arxiv.org/2025/10/31/attention-authors-updated-practice-for-review-articles-and-position-papers-in-arxiv-cs-category/

Comments URL: https://news.ycombinator.com/item?id=45779541

Points: 2

# Comments: 0

New comment by vimarsh6739 in "Carefully but Purposefully Oxidising Ubuntu"

vimarsh6739 — Thu, 13 Mar 2025 18:19:57 +0000

To me, this feels less about Rust and more about moving away from copyleft.

New comment by vimarsh6739 in "Finite Field Assembly: A Language for Emulating GPUs on CPU"

vimarsh6739 — Sat, 18 Jan 2025 05:34:55 +0000

Yes, but in practice, I believe people spam __syncthreads() in GPU kernels just to ensure correctness. There is value in statically proving that you don't need a synchronization instruction at a certain point. Doubly so in the transpilation case, where a naive __syncthreads() ends up being called multiple times because it is present in the CUDA code (or MLIR in this case).

An interesting add-on to me would be the handling of conditionals. Because newer GPUs have independent thread scheduling, which is not present in the older ones, you have to wonder what the desired behaviour is if you are using CPU execution as a debugger of sorts (or are just GPU poor). It'd be super cool to expose those semantics as a compiler flag for your transpiler, allowing me to debug some code as if it ran on an ancient GPU like a K80 for some fast local debugging.

But the ambitious question here is this: if you take existing GPU code, run it through a transpiler and generate better code than handwritten OpenMP, do you need to maintain an OpenMP backend for the CPU in the first place? It'd be better to express everything in a richer parallel model with support for nested synchronization, right? And let the compiler handle the job of converting between parallelism models. It's like saying that if PyTorch 2.0 generates good Triton code, we could just transpile that to CPUs and get rid of the CPU backend. (Of course, Triton doesn't support all patterns, so you would fall back to ATen, and this kind of goes for a toss.)

New comment by vimarsh6739 in "Finite Field Assembly: A Language for Emulating GPUs on CPU"

vimarsh6739 — Sat, 18 Jan 2025 01:19:11 +0000

One of the more subtle aspects of retargeting GPU code to run on the CPU is that GPUs provide fine-grained (read: block-level and warp-level) explicit synchronization mechanisms. CPUs have no direct equivalent, so additional care has to be taken to handle this. One example of work which tries this is https://arxiv.org/pdf/2207.00257 .
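To show why the barrier needs care, here is a minimal Python sketch of one block of GPU threads emulated serially on a CPU. The kernel, the `shared` buffer and both function names are made up for illustration, and splitting the per-thread loop at the barrier (loop fission) is just one common way to handle it; this is not a claim about how the linked work implements things.

```python
def kernel_naive_serial(block_dim):
    """Run each thread's whole body in turn, ignoring the barrier. Incorrect."""
    shared = [0] * block_dim
    out = [0] * block_dim
    for tid in range(block_dim):
        shared[tid] = tid * 10                    # phase 1: write own slot
        # __syncthreads() sits here in the GPU kernel; it is dropped here.
        out[tid] = shared[(tid + 1) % block_dim]  # phase 2: read neighbour
    return out


def kernel_fissioned(block_dim):
    """Split the thread loop at the barrier so phase 1 finishes for all threads."""
    shared = [0] * block_dim
    out = [0] * block_dim
    for tid in range(block_dim):
        shared[tid] = tid * 10
    # The end of the first loop plays the role of __syncthreads().
    for tid in range(block_dim):
        out[tid] = shared[(tid + 1) % block_dim]
    return out
```

With `block_dim = 4`, the naive version returns `[0, 0, 0, 0]` because every thread reads its neighbour's slot before the neighbour has written it, while the fissioned version returns `[10, 20, 30, 0]`, which is what the GPU kernel would produce.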

Interestingly, in the same work, contrary to what you'd expect, transpiling GPU code to run on CPU gives ~76% speedups in HPC workloads compared to a hand-optimized multi-core CPU implementation on Fugaku (a CPU-only supercomputer), after accounting for these differences in synchronization.

Perspectives on Floating Point

vimarsh6739 — Tue, 15 Oct 2024 07:37:37 +0000

Article URL: https://www.eigentales.com/Floating-Point/

Comments URL: https://news.ycombinator.com/item?id=41845884

Points: 54

# Comments: 19

New comment by vimarsh6739 in "Groq runs Mixtral 8x7B-32k with 500 T/s"

vimarsh6739 — Mon, 19 Feb 2024 19:29:30 +0000

Thanks for the quick reply! About hardware support, I was wondering if the LPU has a hardware instruction to compute the attention matrix similar to the MatrixMultiply/Convolve instruction in the TPU ISA. (Maybe a hardware instruction which fuses a softmax on the matmul epilogue?)

New comment by vimarsh6739 in "Groq runs Mixtral 8x7B-32k with 500 T/s"

vimarsh6739 — Mon, 19 Feb 2024 19:07:03 +0000

Might be a bit out of context, but isn't the TPU also optimized for low latency inference? (Judging by reading the original TPU architecture paper here - https://arxiv.org/abs/1704.04760). If so, does Groq actually provide hardware support for LLM inference?

New comment by vimarsh6739 in "Apple pulls plug on Goldman credit-card partnership"

vimarsh6739 — Wed, 29 Nov 2023 00:51:43 +0000

The card is pretty useful to me as a first card since it has no foreign transaction fee.

Debugging a Bit-Flip Error (2023)

vimarsh6739 — Fri, 29 Sep 2023 04:39:33 +0000

Article URL: https://sillycross.github.io/2023/06/11/2023-06-11/

Comments URL: https://news.ycombinator.com/item?id=37699331

Points: 1

# Comments: 0

New comment by vimarsh6739 in "Ask HN: What's the Situation with YouTube-Dl?"

vimarsh6739 — Sat, 26 Aug 2023 12:21:18 +0000

it isn't a real download, because you don't have access to the raw file.

New comment by vimarsh6739 in "AI Boyfriend [video]"

vimarsh6739 — Tue, 08 Aug 2023 21:29:16 +0000

KRAZAM is really underrated. It is one of those few YouTube channels which produce really, really good content and stories relative to their size.

New comment by vimarsh6739 in "Intel Arc A580 could be the next great affordable GPU"

vimarsh6739 — Sat, 05 Aug 2023 08:53:02 +0000

How does oneAPI/SYCL compare to CUDA? We certainly need an alternative to OpenCL, but every day, I can't help but notice the widening gulf between CUDA and any other GPGPU API out there.

LLeaves: A LLVM-based compiler for LightGBM decision trees

vimarsh6739 — Sat, 08 Jul 2023 16:40:14 +0000

Article URL: https://github.com/siboehm/lleaves

Comments URL: https://news.ycombinator.com/item?id=36646087

Points: 1

# Comments: 0

Screen Sizes and Breakpoints for Responsive Design

vimarsh6739 — Sat, 25 Mar 2023 05:04:29 +0000

Article URL: https://learn.microsoft.com/en-us/windows/apps/design/layout/screen-sizes-and-breakpoints-for-responsive-design

Comments URL: https://news.ycombinator.com/item?id=35299622

Points: 1

# Comments: 0

New comment by vimarsh6739 in "Ask HN: If I get locked out of everything, please try to help me"

vimarsh6739 — Tue, 13 Dec 2022 09:33:45 +0000

I really don't have anything to say to the OP, but I wonder whether, in a similar situation, SMS-based 2FA will become more problematic with the recent push towards eSIM.

If a phone with an eSIM dies, and you need some kind of OTP, I wonder how you'll receive it. You can't exactly 'transplant' the SIM into another phone.