<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: nirw4nna</title><link>https://news.ycombinator.com/user?id=nirw4nna</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 03 Jul 2026 09:26:21 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=nirw4nna" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by nirw4nna in "Ask HN: What are you working on? (October 2025)"]]></title><description><![CDATA[
<p>I'm currently chipping away at DSC, a tensor library I wrote from scratch to play with large language models. Last week I re-wrote flash attention from scratch in CUDA and was able to get good perf.<p>[1]: <a href="https://github.com/nirw4nna/dsc" rel="nofollow">https://github.com/nirw4nna/dsc</a><p>[2]: <a href="https://x.com/nirw4nna/status/1968812772944126329" rel="nofollow">https://x.com/nirw4nna/status/1968812772944126329</a></p>
]]></description><pubDate>Mon, 13 Oct 2025 09:00:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=45566296</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=45566296</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45566296</guid></item><item><title><![CDATA[Why I Ditched Malloc for AI Inference]]></title><description><![CDATA[
<p>Article URL: <a href="https://gilli.dev/programming/2025/08/28/why-i-ditched-malloc.html">https://gilli.dev/programming/2025/08/28/why-i-ditched-malloc.html</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45056408">https://news.ycombinator.com/item?id=45056408</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 28 Aug 2025 20:01:58 +0000</pubDate><link>https://gilli.dev/programming/2025/08/28/why-i-ditched-malloc.html</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=45056408</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45056408</guid></item><item><title><![CDATA[New comment by nirw4nna in "Ask HN: What are you working on? (July 2025)"]]></title><description><![CDATA[
<p>I'm currently working on DSC, a tensor library I wrote from scratch in C++ with a PyTorch-like API.<p>Right now it works on both CPU and GPU (both AMD and NVIDIA) and is capable of running LLMs like Qwen. I'm currently implementing a native profiler to trace CPU and GPU kernels, and then I'll work on speed. The goal is to be competitive with PyTorch eager by the end of the year.<p>Source code: <a href="https://github.com/nirw4nna/dsc">https://github.com/nirw4nna/dsc</a><p>My original HN post: <a href="https://news.ycombinator.com/item?id=44310678">https://news.ycombinator.com/item?id=44310678</a></p>
]]></description><pubDate>Sun, 27 Jul 2025 21:33:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=44704901</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44704901</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44704901</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>Because I happen to know C++ and I just wanted to build something rather than learn a new language. Zig looks very interesting, though; there are already other projects in this space that use it with great success (see: <a href="https://github.com/zml/zml">https://github.com/zml/zml</a>).</p>
]]></description><pubDate>Thu, 19 Jun 2025 06:24:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=44315993</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44315993</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44315993</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>You just need a foundation of C/C++. If you already have that, just start programming; it's way better than reading books/guides/blogs (at least until you're stuck!). Also, you can read the source code of other similar projects on GitHub and get ideas from them; this is what I did at the beginning.</p>
]]></description><pubDate>Thu, 19 Jun 2025 06:15:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=44315937</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44315937</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44315937</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>Yes! This was actually one of my initial goals! I like to work in a C-style C++, so to speak, where I turn off the C++ features I don't need and use only the ones I do, like templates, objects, etc.
I find this style to be easy to reason about when it comes to performance.</p>
]]></description><pubDate>Thu, 19 Jun 2025 06:07:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=44315905</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44315905</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44315905</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>Thanks for pointing this out! I'll definitely have to investigate other approaches. nanobind looks interesting, but I don't need to expose complex C++ objects; I just need the 'fastest' way of calling into a C API. I guess the go-to for this is CFFI?</p>
]]></description><pubDate>Thu, 19 Jun 2025 05:59:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=44315873</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44315873</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44315873</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>I developed this on an HP Omen 15 with an i7-8750H, a GTX 1050 Ti and 32GB of RAM, with Linux Mint as my OS.</p>
]]></description><pubDate>Wed, 18 Jun 2025 22:22:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=44313774</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44313774</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44313774</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>Right now I can load tensors directly from a safetensors file or from a NumPy array, so I'm not really planning to add my own custom format, but I do plan to support GGUF files.</p>
]]></description><pubDate>Wed, 18 Jun 2025 22:20:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=44313757</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44313757</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44313757</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>You are absolutely correct! I started working on a sort of compiler a while back but decided to get the basics down first. The issue isn't really the templates and switch statements but rather going back and forth between C and Python. This is an experiment I did a few months ago: <a href="https://x.com/nirw4nna/status/1904114563672354822" rel="nofollow">https://x.com/nirw4nna/status/1904114563672354822</a>. As you can see, there is a ~20% perf gain just by generating a naive C++ kernel instead of calling 5 separate kernels in the case of softmax.</p>
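<p>A rough sketch of the fusion idea, in plain Python rather than generated C++, and not taken from DSC: softmax written as five separate passes (max, subtract, exp, sum, divide), each producing a temporary, next to a single function that does the same work in fewer passes. All function names here are hypothetical and only for illustration.</p>

```python
import math

# Unfused: five separate "kernels". Each one is a full pass over the data
# and, except for the reductions, allocates a temporary list.
def k_max(x):
    return max(x)

def k_sub(x, m):
    return [v - m for v in x]

def k_exp(x):
    return [math.exp(v) for v in x]

def k_sum(x):
    return sum(x)

def k_div(x, s):
    return [v / s for v in x]

def softmax_unfused(x):
    m = k_max(x)
    shifted = k_sub(x, m)
    e = k_exp(shifted)
    s = k_sum(e)
    return k_div(e, s)

# Fused: one function, one output buffer, subtract/exp/sum done in a single loop.
def softmax_fused(x):
    m = max(x)
    out = [0.0] * len(x)
    s = 0.0
    for i, v in enumerate(x):
        e = math.exp(v - m)
        out[i] = e
        s += e
    for i in range(len(out)):
        out[i] /= s
    return out
```

<p>Both versions should agree up to floating-point rounding; the fused one simply avoids the intermediate buffers and the per-call overhead of dispatching five operations.</p>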
]]></description><pubDate>Wed, 18 Jun 2025 22:14:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=44313724</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44313724</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44313724</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>Thanks!
To be honest, it started purely as a learning project. I was really inspired when llama.cpp first came out and tried to build something similar in pure C++ (<a href="https://github.com/nirw4nna/YAMI">https://github.com/nirw4nna/YAMI</a>), mostly for fun and to practice low-level coding.
The idea for DSC came when I realized how hard it was to port new models to that C++ engine, especially since I don't have a deep ML background. I wanted something that felt more like PyTorch, where I could experiment with new architectures easily.
As for llama.cpp, it's definitely faster! They have hand-optimized kernels for a whole bunch of architectures, models and data types. DSC is more of a general-purpose toolkit. I'm excited to work on performance later on, but for now, I'm focused on getting the API and core features right.</p>
]]></description><pubDate>Wed, 18 Jun 2025 22:05:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=44313673</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44313673</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44313673</guid></item><item><title><![CDATA[New comment by nirw4nna in "Show HN: I built a tensor library from scratch in C++/CUDA"]]></title><description><![CDATA[
<p>Yes, when I designed the API I wanted to keep a clear distinction between Python and C. At some point I had two APIs, one in Python and the other in high-level C++, and they both shared the same low-level C API. I find this design quite clean and easy to work with if multiple languages are involved. When I get to perf I plan to experiment a bit with nanobind (<a href="https://github.com/wjakob/nanobind">https://github.com/wjakob/nanobind</a>) and see if there's a noticeable difference compared to ctypes.</p>
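<p>A minimal sketch of that layering with ctypes, assuming a POSIX system: a thin Python wrapper declares the argument and return types of a plain C function and calls it. The C standard library's strlen stands in for a library's low-level C API here; none of this is DSC's actual binding code, and c_strlen is a hypothetical name.</p>

```python
import ctypes
import ctypes.util

# Load the C standard library as a stand-in for a shared library that exposes
# a plain C API. If find_library returns None, CDLL(None) falls back to the
# symbols already loaded into the current process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature once: size_t strlen(const char *s);
strlen = libc.strlen
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

def c_strlen(s: str) -> int:
    """Python-side wrapper: convert to bytes, then call into C."""
    return strlen(s.encode("utf-8"))
```

<p>Every call goes through ctypes' argument conversion, which is the per-call overhead a comparison against nanobind would be measuring.</p>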
]]></description><pubDate>Wed, 18 Jun 2025 21:45:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=44313545</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44313545</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44313545</guid></item><item><title><![CDATA[Show HN: I built a tensor library from scratch in C++/CUDA]]></title><description><![CDATA[
<p>Hi HN,<p>Over the past few months, I've been building `dsc`, a tensor library from scratch in C++/CUDA. My main focus has been on getting the basics right, prioritizing a clean API, simplicity, and clear observability for running small LLMs locally.<p>The key features are:
- C++ core with CUDA support written from scratch.
- A familiar, PyTorch-like Python API.
- Runs real models: it's complete enough to load a model like Qwen from HuggingFace and run inference on both CUDA and CPU with a single line change[1].
- Simple, built-in observability for both Python and C++.<p>Next on the roadmap is adding BF16 support and then I'll be working on visualization for GPU workloads.<p>The project is still early and I would be incredibly grateful for any feedback, code reviews, or questions from the HN community!<p>GitHub Repo: <a href="https://github.com/nirw4nna/dsc">https://github.com/nirw4nna/dsc</a><p>[1]: <a href="https://github.com/nirw4nna/dsc/blob/main/examples/models/qwen2_5.py">https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...</a></p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44310678">https://news.ycombinator.com/item?id=44310678</a></p>
<p>Points: 119</p>
<p># Comments: 28</p>
]]></description><pubDate>Wed, 18 Jun 2025 15:20:05 +0000</pubDate><link>https://github.com/nirw4nna/dsc</link><dc:creator>nirw4nna</dc:creator><comments>https://news.ycombinator.com/item?id=44310678</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44310678</guid></item></channel></rss>