Hacker News: nirw4nna

New comment by nirw4nna in "Ask HN: What are you working on? (October 2025)"

nirw4nna — Mon, 13 Oct 2025 09:00:24 +0000

I'm currently chipping away at DSC [1], a tensor library I wrote from scratch to play with large language models. Last week I rewrote flash attention from scratch in CUDA and was able to get good perf [2].

[1]: https://github.com/nirw4nna/dsc

[2]: https://x.com/nirw4nna/status/1968812772944126329

Why I Ditched Malloc for AI Inference

nirw4nna — Thu, 28 Aug 2025 20:01:58 +0000

Article URL: https://gilli.dev/programming/2025/08/28/why-i-ditched-malloc.html

Comments URL: https://news.ycombinator.com/item?id=45056408

Points: 4

# Comments: 0

New comment by nirw4nna in "Ask HN: What are you working on? (July 2025)"

nirw4nna — Sun, 27 Jul 2025 21:33:43 +0000

I'm currently working on DSC, a tensor library I wrote from scratch in C++ with a PyTorch-like API.

Right now it works on both CPU and GPU (both AMD and NVIDIA) and is capable of running LLMs like Qwen. I'm currently implementing a native profiler to trace CPU and GPU kernels, and then I'll work on speed. The goal is to be competitive with PyTorch eager by the end of the year.

Source code: https://github.com/nirw4nna/dsc

My original HN post: https://news.ycombinator.com/item?id=44310678

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Thu, 19 Jun 2025 06:24:40 +0000

Because I happen to know C++ and I just wanted to build something rather than learn a new language. Zig looks very interesting though, and there are already other projects in this space that use it with great success (see: https://github.com/zml/zml).

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Thu, 19 Jun 2025 06:15:31 +0000

You just need a foundation of C/C++. If you already have that, then just start programming: it's way better than reading books/guides/blogs (at least until you're stuck!). You can also read the source code of similar projects on GitHub and get ideas from them, which is what I did at the beginning.

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Thu, 19 Jun 2025 06:07:17 +0000

Yes! This was actually one of my initial goals! I like to work in a sort of C-style C++, where I turn off the C++ features I don't need and use only the ones I actually need, like templates, objects, etc. I find this style easy to reason about when it comes to performance.

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Thu, 19 Jun 2025 05:59:22 +0000

Thanks for pointing this out! I'll definitely have to investigate other approaches. nanobind looks interesting, but I don't need to expose complex C++ objects, I just need the 'fastest' way of calling into a C API. I guess the go-to for this is CFFI?

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Wed, 18 Jun 2025 22:22:22 +0000

I developed this on an HP Omen 15 with an i7-8750H, a GTX 1050 Ti and 32 GB of RAM, with Linux Mint as my OS.

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Wed, 18 Jun 2025 22:20:15 +0000

Right now I can load tensors directly from a safetensors file or from a NumPy array, so I don't plan to add my own custom format, but I do plan to support GGUF files.

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Wed, 18 Jun 2025 22:14:23 +0000

You are absolutely correct! I started working on a sort of compiler a while back but decided to get the basics down first. The templates and switches are not really the issue; the real cost is going back and forth between C and Python. This is an experiment I did a few months ago: https://x.com/nirw4nna/status/1904114563672354822. As you can see, there is a ~20% perf gain just from generating a naive C++ kernel instead of calling 5 separate kernels in the case of softmax.
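
To make the softmax case concrete, here is a minimal C++ sketch of the two shapes of computation. It is not DSC's code: the function names are hypothetical, and the split into max, subtract, exp, sum and divide is an assumption about what the five kernels are. Each loop in the unfused version stands in for one separate kernel call, with its own intermediate buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: five passes, each standing in for one separate kernel call.
// (Assumed split: max, subtract, exp, sum, divide.)
std::vector<float> softmax_unfused(const std::vector<float> &x) {
    assert(!x.empty());
    const std::size_t n = x.size();

    // Pass 1: max reduction (kept for numerical stability).
    float max_val = x[0];
    for (std::size_t i = 1; i < n; ++i) max_val = std::max(max_val, x[i]);

    // Pass 2: shifted = x - max, written to an intermediate buffer.
    std::vector<float> shifted(n);
    for (std::size_t i = 0; i < n; ++i) shifted[i] = x[i] - max_val;

    // Pass 3: exps = exp(shifted), another intermediate buffer.
    std::vector<float> exps(n);
    for (std::size_t i = 0; i < n; ++i) exps[i] = std::exp(shifted[i]);

    // Pass 4: sum reduction.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += exps[i];

    // Pass 5: out = exps / sum.
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = exps[i] / sum;
    return out;
}

// Fused: the same math in a single function call with no intermediate buffers.
std::vector<float> softmax_fused(const std::vector<float> &x) {
    assert(!x.empty());
    const std::size_t n = x.size();
    const float max_val = *std::max_element(x.begin(), x.end());

    std::vector<float> out(n);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(x[i] - max_val);  // subtract and exp in one step
        sum += out[i];                      // accumulate the sum in the same loop
    }
    for (std::size_t i = 0; i < n; ++i) out[i] /= sum;
    return out;
}
```

In a Python-driven library the unfused version would also pay one Python-to-C round trip per pass, which is the overhead described above. This sketch only shows the difference in passes and buffers, not that call overhead.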

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Wed, 18 Jun 2025 22:05:20 +0000

Thanks! To be honest, it started purely as a learning project. I was really inspired when llama.cpp first came out and tried to build something similar in pure C++ (https://github.com/nirw4nna/YAMI), mostly for fun and to practice low-level coding. The idea for DSC came when I realized how hard it was to port new models to that C++ engine, especially since I don't have a deep ML background. I wanted something that felt more like PyTorch, where I could experiment with new architectures easily. As for llama.cpp, it's definitely faster! They have hand-optimized kernels for a whole bunch of architectures, models and data types. DSC is more of a general-purpose toolkit. I'm excited to work on performance later on, but for now, I'm focused on getting the API and core features right.

New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"

nirw4nna — Wed, 18 Jun 2025 21:45:25 +0000

Yes, when I designed the API I wanted to keep a clear distinction between Python and C. At some point I had two APIs, one in Python and the other in high-level C++, and they both shared the same low-level C API. I find this design quite clean and easy to work with when multiple languages are involved. When I get to perf, I plan to experiment a bit with nanobind (https://github.com/wjakob/nanobind) and see if there's a noticeable difference compared to ctypes.
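
As a rough illustration of that layering, the sketch below puts a tiny `extern "C"` API at the bottom and a thin C++ wrapper on top. Every name here (`toy_tensor`, `toy_tensor_new`, `Tensor`, and so on) is hypothetical and chosen only for the example; none of it is DSC's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// ---- Low-level C API (hypothetical names, not DSC's real interface) ----
extern "C" {

struct toy_tensor {
    float *data;
    std::size_t n;
};

// Allocates a zero-initialized tensor of n elements; returns nullptr on failure.
toy_tensor *toy_tensor_new(std::size_t n) {
    if (n == 0) return nullptr;
    toy_tensor *t = static_cast<toy_tensor *>(std::malloc(sizeof(toy_tensor)));
    if (t == nullptr) return nullptr;
    t->data = static_cast<float *>(std::calloc(n, sizeof(float)));
    if (t->data == nullptr) {
        std::free(t);
        return nullptr;
    }
    t->n = n;
    return t;
}

// Releases a tensor; passing nullptr is a no-op.
void toy_tensor_free(toy_tensor *t) {
    if (t == nullptr) return;
    std::free(t->data);
    std::free(t);
}

void toy_tensor_fill(toy_tensor *t, float v) {
    for (std::size_t i = 0; i < t->n; ++i) t->data[i] = v;
}

float toy_tensor_sum(const toy_tensor *t) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < t->n; ++i) acc += t->data[i];
    return acc;
}

}  // extern "C"

// ---- High-level C++ API: a thin RAII wrapper over the same C functions ----
class Tensor {
public:
    explicit Tensor(std::size_t n) : handle_(toy_tensor_new(n)) {
        assert(handle_ != nullptr);
    }
    ~Tensor() { toy_tensor_free(handle_); }

    // The wrapper owns the handle, so copying is disabled and moving transfers it.
    Tensor(const Tensor &) = delete;
    Tensor &operator=(const Tensor &) = delete;
    Tensor(Tensor &&other) noexcept
        : handle_(std::exchange(other.handle_, nullptr)) {}

    Tensor &fill(float v) {
        toy_tensor_fill(handle_, v);
        return *this;
    }
    float sum() const { return toy_tensor_sum(handle_); }

private:
    toy_tensor *handle_;
};
```

With this shape, a Python layer would bind to the same `toy_tensor_*` symbols (for example through ctypes) while C++ callers use `Tensor`, so both front ends share a single implementation.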

Show HN: I built a tensor library from scratch in C++/CUDA

nirw4nna — Wed, 18 Jun 2025 15:20:05 +0000

Hi HN,

Over the past few months, I've been building `dsc`, a tensor library from scratch in C++/CUDA. My main focus has been on getting the basics right, prioritizing a clean API, simplicity, and clear observability for running small LLMs locally.

The key features are:
- C++ core with CUDA support written from scratch.
- A familiar, PyTorch-like Python API.
- Runs real models: it's complete enough to load a model like Qwen from HuggingFace and run inference on both CUDA and CPU with a single line change[1].
- Simple, built-in observability for both Python and C++.

Next on the roadmap is adding BF16 support and then I'll be working on visualization for GPU workloads.

The project is still early and I would be incredibly grateful for any feedback, code reviews, or questions from the HN community!

GitHub Repo: https://github.com/nirw4nna/dsc

[1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...

Comments URL: https://news.ycombinator.com/item?id=44310678

Points: 119

# Comments: 28