<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: xoranth</title><link>https://news.ycombinator.com/user?id=xoranth</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 30 Apr 2026 10:10:12 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=xoranth" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by xoranth in "DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon"]]></title><description><![CDATA[
<p>> Crappy Pixel Fold 2022 mid-range Android CPU<p>Can you share which LLMs you run on such small devices, and what use case they address?<p>(Not a rhetorical question; I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me, so I'm curious about other people's use cases.)</p>
]]></description><pubDate>Thu, 06 Mar 2025 07:56:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=43277618</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=43277618</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43277618</guid></item><item><title><![CDATA[New comment by xoranth in "S6 – Skarnet's small supervision suite"]]></title><description><![CDATA[
<p>Are there any good how-tos for setting up a non-trivial container with s6 and s6-rc? Last time I looked, the documentation was pretty sparse: more of a reference and design document than a set of how-tos.</p>
]]></description><pubDate>Mon, 16 Sep 2024 21:24:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=41561102</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41561102</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41561102</guid></item><item><title><![CDATA[New comment by xoranth in "Async hazard: MMAP is blocking IO"]]></title><description><![CDATA[
<p>I believe they mean that since it bypasses the (Tokio) scheduler, if you use it in async code you lose the main benefit of async code (namely, that the scheduler can switch to some other task while waiting for IO to complete).
Basically the same behavior you'd get if you called a blocking syscall directly.</p>
]]></description><pubDate>Sat, 24 Aug 2024 17:51:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=41339965</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41339965</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41339965</guid></item><item><title><![CDATA[New comment by xoranth in "Async hazard: MMAP is blocking IO"]]></title><description><![CDATA[
<p>On Linux, you might be able to use userfaultfd to make it async...</p>
]]></description><pubDate>Sat, 24 Aug 2024 17:32:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=41339810</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41339810</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41339810</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD Matters: Graph Coloring"]]></title><description><![CDATA[
<p>Thank you for your reply!<p>> GPUs are also just generally a lot more limiting than SIMD in many other ways.<p>What do you mean? (besides things like CUDA being available only on Nvidia/fragmentation issues.)</p>
]]></description><pubDate>Thu, 22 Aug 2024 13:22:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=41319980</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41319980</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41319980</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD Matters: Graph Coloring"]]></title><description><![CDATA[
<p>Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?<p>Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.</p>
]]></description><pubDate>Thu, 22 Aug 2024 13:10:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=41319851</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41319851</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41319851</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD Matters: Graph Coloring"]]></title><description><![CDATA[
<p>General question for the gamedevs here:
how useful is SIMD now that we have compute shaders on the GPU? What workloads still require SIMD, and why would you choose one over the other?</p>
]]></description><pubDate>Thu, 22 Aug 2024 09:33:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=41318339</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41318339</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41318339</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD Matters: Graph Coloring"]]></title><description><![CDATA[
<p>On x86-64, compilers use SIMD instructions and registers to implement floating-point math; they just use the single-lane instructions. E.g. (<a href="https://godbolt.org/z/94b3r8dMn" rel="nofollow">https://godbolt.org/z/94b3r8dMn</a>):<p><pre><code>    float my_func(float lhs, float rhs) {
        return 2.0f * lhs - 3.0f * rhs;
    }
</code></pre>
Becomes:<p><pre><code>    my_func(float, float):
        addss   xmm0, xmm0
        mulss   xmm1, DWORD PTR .LC0[rip]
        subss   xmm0, xmm1
        ret
</code></pre>
(addss, mulss and subss are scalar SSE instructions.)</p>
]]></description><pubDate>Thu, 22 Aug 2024 09:30:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=41318325</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41318325</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41318325</guid></item><item><title><![CDATA[New comment by xoranth in ""Doors" in Solaris: Lightweight RPC Using File Descriptors (1996)"]]></title><description><![CDATA[
<p>Sounds a bit like Google's proposal for a `switchto_switch` syscall [1] that would allow cooperative multithreading that bypasses the scheduler.<p>(The descendant of that proposal is `sched_ext`, so maybe it is possible to implement doors in eBPF + sched_ext?)<p>[1]: <a href="https://youtu.be/KXuZi9aeGTw?t=900" rel="nofollow">https://youtu.be/KXuZi9aeGTw?t=900</a></p>
]]></description><pubDate>Wed, 24 Jul 2024 11:25:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=41055803</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41055803</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41055803</guid></item><item><title><![CDATA[New comment by xoranth in "Codestral Mamba"]]></title><description><![CDATA[
<p>Thanks!</p>
]]></description><pubDate>Wed, 17 Jul 2024 06:54:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=40983223</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40983223</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40983223</guid></item><item><title><![CDATA[New comment by xoranth in "Codestral Mamba"]]></title><description><![CDATA[
<p>Is the extension you wrote public?</p>
]]></description><pubDate>Wed, 17 Jul 2024 06:10:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=40983047</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40983047</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40983047</guid></item><item><title><![CDATA[New comment by xoranth in "KUtrace: Low-overhead Linux kernel tracing facility"]]></title><description><![CDATA[
<p>How would this interact with `io_uring`, especially the polling modes (IO_SETUP_SQPOLL, IO_SETUP_IOPOLL)?</p>
]]></description><pubDate>Tue, 16 Jul 2024 06:40:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=40974106</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40974106</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40974106</guid></item><item><title><![CDATA[New comment by xoranth in "AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x"]]></title><description><![CDATA[
<p>I believe the reason they offer no details about how they tuned the kernels is that the tuning is done by a tool provided by AMD. See here:<p><a href="https://rocm.docs.amd.com/projects/rocBLAS/en/develop/how-to/Programmers_Guide.html#rocblas-gemm-tune" rel="nofollow">https://rocm.docs.amd.com/projects/rocBLAS/en/develop/how-to...</a></p>
]]></description><pubDate>Sat, 29 Jun 2024 12:37:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=40830014</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40830014</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40830014</guid></item><item><title><![CDATA[New comment by xoranth in "ExectOS – brand new operating system which derives from NT architecture"]]></title><description><![CDATA[
<p>Is there any good article on NT internals (that isn't Russinovich's book) that highlights where/how it is better than Linux and the *BSDs?<p>When asked, people point to IOCP vs. epoll, but I'm not sure how relevant that is now that Linux has io_uring.<p>(They also point to stable ABIs for drivers, but I am more interested in internals.)</p>
]]></description><pubDate>Thu, 20 Jun 2024 08:30:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=40736330</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40736330</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40736330</guid></item><item><title><![CDATA[New comment by xoranth in "Scan HTML faster with SIMD instructions – Chrome edition"]]></title><description><![CDATA[
<p>> expression templates<p>That's one of the cases where you can't "mechanically" translate C++ to Rust. To obtain the same result, a good choice would be a proc macro, which is a pain to implement but also gives you more flexibility.</p>
]]></description><pubDate>Fri, 14 Jun 2024 15:15:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=40681660</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40681660</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40681660</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>Thanks!</p>
]]></description><pubDate>Tue, 11 Jun 2024 22:31:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=40652400</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40652400</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40652400</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>> That allows things like individual threads to take locks, which is a pretty big leap.<p>Does anyone know how those get translated into SIMD instructions? Like, how do you do a CAS loop for each lane where each lane can individually succeed or fail?
What happens if the lanes point to the same location?</p>
]]></description><pubDate>Tue, 11 Jun 2024 21:10:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=40651532</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40651532</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40651532</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>Do you know any good tutorial for ISPC? Documentation is a bit sparse.</p>
]]></description><pubDate>Tue, 11 Jun 2024 21:03:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=40651445</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40651445</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40651445</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>It is the same reason you sometimes batch operations in software:<p>When you add two numbers, the GPU needs to do a lot more work besides the addition itself.<p>If you implemented SIMT with multiple cores, you would have to do that extra work once per core, so you wouldn't save power (and you have a fixed power budget).
With SIMD, you get $NUM_LANES additions but do the extra work only once, saving power.<p>(See this article by OP, which goes into more detail: <a href="https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.html#parallelization-throwing-more-hardware-at-the-job" rel="nofollow">https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht...</a>
)</p>
]]></description><pubDate>Tue, 11 Jun 2024 21:03:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=40651438</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40651438</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40651438</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>I believe the author is referring to how many logical threads/hyperthreads a core can run (for AMD and Intel, two; I believe POWER can do 8, SPARC 4).<p>The extra physical registers are there for superscalar execution, not for SMT/hyperthreading.</p>
]]></description><pubDate>Tue, 11 Jun 2024 20:32:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=40651145</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40651145</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40651145</guid></item></channel></rss>