<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: xoranth</title><link>https://news.ycombinator.com/user?id=xoranth</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 30 Apr 2026 10:10:12 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=xoranth" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by xoranth in "DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon"]]></title><description><![CDATA[
<p>> Crappy Pixel Fold 2022 mid-range Android CPU<p>Can you share which LLMs you run on such small devices, and what use case they address?<p>(Not a rhetorical question; I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me, so I'm curious about other people's use cases.)</p>
]]></description><pubDate>Thu, 06 Mar 2025 07:56:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=43277618</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=43277618</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43277618</guid></item><item><title><![CDATA[New comment by xoranth in "S6 – Skarnet's small supervision suite"]]></title><description><![CDATA[
<p>Are there any good how-tos for setting up a non-trivial container with s6 and s6-rc? Last time I looked, the documentation was pretty sparse: more of a reference and design document than a set of how-tos.</p>
]]></description><pubDate>Mon, 16 Sep 2024 21:24:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=41561102</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41561102</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41561102</guid></item><item><title><![CDATA[New comment by xoranth in "Async hazard: MMAP is blocking IO"]]></title><description><![CDATA[
<p>I believe they mean that since it bypasses the (Tokio) scheduler, if you use it in async code you lose the main benefit of async code (namely, that the scheduler can switch to some other task while waiting for IO to complete).
Basically the same behavior you'd get if you called a blocking syscall directly.</p>
]]></description><pubDate>Sat, 24 Aug 2024 17:51:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=41339965</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41339965</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41339965</guid></item><item><title><![CDATA[New comment by xoranth in "Async hazard: MMAP is blocking IO"]]></title><description><![CDATA[
<p>On Linux, you might be able to use userfaultfd to make it async...</p>
]]></description><pubDate>Sat, 24 Aug 2024 17:32:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=41339810</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41339810</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41339810</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD Matters: Graph Coloring"]]></title><description><![CDATA[
<p>Thank you for your reply!<p>> GPUs are also just generally a lot more limiting than SIMD in many other ways.<p>What do you mean? (besides things like CUDA being available only on Nvidia/fragmentation issues.)</p>
]]></description><pubDate>Thu, 22 Aug 2024 13:22:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=41319980</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41319980</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41319980</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD Matters: Graph Coloring"]]></title><description><![CDATA[
<p>Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?<p>Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.</p>
]]></description><pubDate>Thu, 22 Aug 2024 13:10:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=41319851</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41319851</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41319851</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD Matters: Graph Coloring"]]></title><description><![CDATA[
<p>General question for the gamedevs here:
how useful is SIMD now that we have compute shaders on the GPU? What workloads still require SIMD, and why would you choose one over the other?</p>
]]></description><pubDate>Thu, 22 Aug 2024 09:33:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=41318339</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41318339</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41318339</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD Matters: Graph Coloring"]]></title><description><![CDATA[
<p>On x86-64, compilers use SIMD instructions and registers to implement floating-point math; they just use the single-lane instructions. E.g. (<a href="https://godbolt.org/z/94b3r8dMn" rel="nofollow">https://godbolt.org/z/94b3r8dMn</a>):<p><pre><code>    float my_func(float lhs, float rhs) {
        return 2.0f * lhs - 3.0f * rhs;
    }
</code></pre>
Becomes:<p><pre><code>    my_func(float, float):
        addss   xmm0, xmm0
        mulss   xmm1, DWORD PTR .LC0[rip]
        subss   xmm0, xmm1
        ret
</code></pre>
(addss, mulss and subss are scalar SSE instructions.)</p>
]]></description><pubDate>Thu, 22 Aug 2024 09:30:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=41318325</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41318325</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41318325</guid></item><item><title><![CDATA[New comment by xoranth in ""Doors" in Solaris: Lightweight RPC Using File Descriptors (1996)"]]></title><description><![CDATA[
<p>Sounds a bit like Google's proposal for a `switchto_switch` syscall [1] that would allow cooperative multithreading that bypasses the scheduler.<p>(The descendant of that proposal is `sched_ext`, so maybe it is possible to implement doors in eBPF + sched_ext?)<p>[1]: <a href="https://youtu.be/KXuZi9aeGTw?t=900" rel="nofollow">https://youtu.be/KXuZi9aeGTw?t=900</a></p>
]]></description><pubDate>Wed, 24 Jul 2024 11:25:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=41055803</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=41055803</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41055803</guid></item><item><title><![CDATA[New comment by xoranth in "Codestral Mamba"]]></title><description><![CDATA[
<p>Thanks!</p>
]]></description><pubDate>Wed, 17 Jul 2024 06:54:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=40983223</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40983223</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40983223</guid></item><item><title><![CDATA[New comment by xoranth in "Codestral Mamba"]]></title><description><![CDATA[
<p>Is the extension you wrote public?</p>
]]></description><pubDate>Wed, 17 Jul 2024 06:10:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=40983047</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40983047</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40983047</guid></item><item><title><![CDATA[New comment by xoranth in "KUtrace: Low-overhead Linux kernel tracing facility"]]></title><description><![CDATA[
<p>How would this interact with `io_uring`, especially the polling modes (IO_SETUP_SQPOLL, IO_SETUP_IOPOLL)?</p>
]]></description><pubDate>Tue, 16 Jul 2024 06:40:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=40974106</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40974106</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40974106</guid></item><item><title><![CDATA[New comment by xoranth in "AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x"]]></title><description><![CDATA[
<p>I believe the reason they offer no details about how they tuned the kernels is that the tuning is done by a tool provided by AMD. See here:<p><a href="https://rocm.docs.amd.com/projects/rocBLAS/en/develop/how-to/Programmers_Guide.html#rocblas-gemm-tune" rel="nofollow">https://rocm.docs.amd.com/projects/rocBLAS/en/develop/how-to...</a></p>
]]></description><pubDate>Sat, 29 Jun 2024 12:37:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=40830014</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40830014</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40830014</guid></item><item><title><![CDATA[New comment by xoranth in "ExectOS – brand new operating system which derives from NT architecture"]]></title><description><![CDATA[
<p>Is there any good article on NT internals (that isn't Russinovich's book) that highlights where/how it is better than Linux and the *BSDs?<p>When asked, people point to IOCP vs. epoll, but I'm not sure how relevant that is now that Linux has io_uring.<p>(They also point to stable ABIs for drivers, but I am more interested in internals.)</p>
]]></description><pubDate>Thu, 20 Jun 2024 08:30:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=40736330</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40736330</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40736330</guid></item><item><title><![CDATA[New comment by xoranth in "Scan HTML faster with SIMD instructions – Chrome edition"]]></title><description><![CDATA[
<p>> expression templates<p>That's one of the cases where you can't "mechanically" translate C++ to Rust. To obtain the same result, a good choice would be a proc macro, which is a pain to implement but also gives you more flexibility.</p>
]]></description><pubDate>Fri, 14 Jun 2024 15:15:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=40681660</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40681660</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40681660</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>Thanks!</p>
]]></description><pubDate>Tue, 11 Jun 2024 22:31:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=40652400</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40652400</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40652400</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>> That allows things like individual threads to take locks, which is a pretty big leap.<p>Does anyone know how those get translated into SIMD instructions? Like, how do you do a CAS loop for each lane where each lane can individually succeed or fail?
What happens if the lanes point to the same location?</p>
]]></description><pubDate>Tue, 11 Jun 2024 21:10:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=40651532</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40651532</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40651532</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>Do you know any good tutorial for ISPC? Documentation is a bit sparse.</p>
]]></description><pubDate>Tue, 11 Jun 2024 21:03:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=40651445</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40651445</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40651445</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>It is the same reason you sometimes batch operations in software:<p>When you add two numbers, the GPU needs to do a lot more work besides the addition itself.<p>If you implemented SIMT with multiple cores, you would have to do that extra work once per core, so you wouldn't save power (and you have a fixed power budget).
With SIMD, you get $NUM_LANES additions but do the extra work only once, saving power.<p>(See this article by OP, which goes into more detail: <a href="https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.html#parallelization-throwing-more-hardware-at-the-job" rel="nofollow">https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht...</a>
)</p>
]]></description><pubDate>Tue, 11 Jun 2024 21:03:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=40651438</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40651438</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40651438</guid></item><item><title><![CDATA[New comment by xoranth in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"]]></title><description><![CDATA[
<p>I believe the author is referring to how many logical threads/hyperthreads a core can run (for AMD and Intel, two; I believe POWER can do 8, SPARC 4).<p>The extra physical registers are there for superscalar execution, not for SMT/hyperthreading.</p>
]]></description><pubDate>Tue, 11 Jun 2024 20:32:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=40651145</link><dc:creator>xoranth</dc:creator><comments>https://news.ycombinator.com/item?id=40651145</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40651145</guid></item></channel></rss>