<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: praetor22</title><link>https://news.ycombinator.com/user?id=praetor22</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 17 Apr 2026 18:25:23 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=praetor22" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by praetor22 in "Bend: a high-level language that runs on GPUs (via HVM2)"]]></title><description><![CDATA[
<p>Look, I understand the value proposition and how cool it is from a theoretical standpoint, but I honestly don't think this will ever become relevant.<p>Here are some notes from my first impressions after skimming through the paper. And yes, I am aware that this is very, very early software.<p>1. Bend looks like an extremely limited DSL. No FFI. No way of interacting with raw buffers. Weird 24-bit floating point format.<p>2. There's a reason why interaction combinators have never taken off: performance is, and will always be, terrible. There is no other way to put it; graph traversal simply doesn't map well onto hardware.<p>3. The premise of optimal reduction is valid. However, you still need to write the kernels in a way that can be parallelized (i.e. no data dependencies, use of recursion instead of loops).<p>4. There are no serious examples that directly compare Bend/HVM code with its equivalent OMP/CUDA program. How am I supposed to evaluate the reduction in implementation complexity, or what to expect in terms of performance? So many claims, so few actual comparisons.<p>5. In the real world of high-performance parallel computing, tree-like structures are practically non-existent. Arrays are king, because of the physical nature of how memory works at the hardware level. And do you know what works best on mutable contiguous memory buffers? Loops. We'll see if and when HVM implements those.<p>In the end, what we currently have is a half-baked language that is (almost) fully isolated from external data, extremely slow, and a massive abstraction over the underlying hardware (unutilised features: multilevel caches, tensor cores, SIMD, atomics).<p>I apologize if this comes off as harsh; I still find the technical implementation and the theoretical background very interesting. I'm simply not (yet) convinced of its usefulness in the real world.</p>
]]></description><pubDate>Fri, 17 May 2024 22:39:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=40394814</link><dc:creator>praetor22</dc:creator><comments>https://news.ycombinator.com/item?id=40394814</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40394814</guid></item></channel></rss>