Hacker News: ArcVRArthur

Lessons learned scaling LLM training and inference with RDMA (2024)

ArcVRArthur — Mon, 11 Nov 2024 18:14:35 +0000

Article URL: https://vgpu.io/blog/LLM-Training-And-Inference-With-Direct-Memory-Access-DMA/

Comments URL: https://news.ycombinator.com/item?id=42109203

Points: 1

# Comments: 0

New comment by ArcVRArthur in "[dead]"

ArcVRArthur — Wed, 30 Oct 2024 02:48:44 +0000

Hey YC, I helped Cohere scale to a total of 10k+ GPUs for hybrid training and inference. If you’re curious about anything I wrote in the article, leave a comment and I’ll do my best to read and reply. This article was written for fun.

New comment by ArcVRArthur in "[dead]"

ArcVRArthur — Wed, 30 Oct 2024 02:36:02 +0000

Hey YC, I helped Cohere scale to a total of 10k+ GPUs for hybrid training/inference compute. If you’re curious about anything I wrote in the article, I’ll do my best to read and reply to your comment here. Please leave a comment!

New comment by ArcVRArthur in "Scaling Transformers at Cohere: What I Learned"

ArcVRArthur — Tue, 29 Oct 2024 15:36:51 +0000

I had the opportunity to help Cohere work on scaling transformers over the last year. If you have any questions about the article, leave them below and I’ll do my best to answer openly. :)

Scaling Transformers at Cohere: What I Learned

ArcVRArthur — Tue, 29 Oct 2024 15:36:51 +0000

Article URL: https://vgpu.io/blog/transformer-scaling-at-cohere/

Comments URL: https://news.ycombinator.com/item?id=41985090

Points: 1

# Comments: 1

New comment by ArcVRArthur in "GVM Server: A complete virtualization solution based on GPU Virtual Machine(GVM)"

ArcVRArthur — Fri, 04 Nov 2022 18:11:09 +0000

Hey all,

I'm the co-author of the GPU Virtual Machine (GVM) project and LibVF.IO. We just announced our enterprise product based on GVM, called GVM Server. I'd love to hear what you all think of the work we've done, and any suggestions on where we can improve in the future!

GVM Server: A complete virtualization solution based on GPU Virtual Machine(GVM)

ArcVRArthur — Fri, 04 Nov 2022 18:06:45 +0000

Article URL: https://www.youtube.com/watch?v=LjpwFI2E1ms

Comments URL: https://news.ycombinator.com/item?id=33471266

Points: 1

# Comments: 1

GPU Virtual Machine (GVM) at QubesOS Summit

ArcVRArthur — Fri, 30 Sep 2022 15:20:43 +0000

Article URL: https://www.youtube.com/watch?v=YllX-ud70Nk

Comments URL: https://news.ycombinator.com/item?id=33036074

Points: 18

# Comments: 2

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Mon, 29 Aug 2022 17:43:40 +0000

I've been thinking this over for the past couple of days and I think the words 'that they are aware' are really key here.

Ideally, if GPU virtualization support were as widespread as support is today for Intel VT-x and AMD-V (hardware-assisted CPU virtualization), along with their IOMMU counterparts VT-d and AMD-Vi, then software could make use of these functions without the user being aware of it. We're in a situation similar to that of CPU virtualization without hardware assistance, back in the days of the early Xenoservers project from Cambridge (which would later become the Xen hypervisor and the XenSource company). At that time most CPUs had no virtualization assistance, so Xen used methods like ring de-privileging to move the entire guest (userspace and kernel) out of ring 0 while the hypervisor ran in ring 0, in order to virtualize on any ordinary CPU model - my understanding is these were known as PV guests (paravirtualized guests).

Over time, however, CPU companies introduced VT-x and AMD-V across nearly all of their CPU models, which enabled VM-exits/context save-restore with the use of shadow registers rather than ring de-privileging, while Intel added new 'virtualization enhancements' through feature suites like vPro (SGX2 for example) which were only available on certain models of CPU (for example Xeon devices). Xen adopted these extensions for HVM guests (hardware-assisted virtualization) as they became common on ubiquitous hardware, and at the same time commercial forks of Xen took advantage of these vPro features (like SGX2) for enterprise and high-security government use-cases:

https://wiki.xenproject.org/wiki/Xen_Project_Software_Overvi...

As was the case then (around the time of the Xenoservers project), today we can effectively virtualize the GPU without hardware assistance mechanisms:

https://openmdev.io/index.php/GPU_Support

https://openmdev.io/index.php/Virtual_I/O_Internals#Mdev_Mod...

Since it's now practical to virtualize any GPU device (just as early Xen could virtualize CPUs for various use-cases regardless of whether or not the hardware provided assistance mechanisms), it might be time to start moving to a new paradigm of 'enterprise' vs. 'consumer'. In other words, new 'virtualization enhancements' (similar to vPro on Intel's Xeons, etc.) would be developed for enterprise GPUs (for example shadow page deduplication in VRAM, import/export of redundant objects between IO Virtual Address buffers, IOMMU-protected balloon/deballoon, etc.), while basic hardware assistance mechanisms like SR-IOV & SIOV would be enabled by default, across the board:

https://openmdev.io/index.php/Virtual_I/O_Internals#SR-IOV_M...

https://openmdev.io/index.php/Virtual_I/O_Internals#SIOV_Mod...

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Thu, 25 Aug 2022 16:02:01 +0000

Thanks!! We do our best to keep the code as clean/readable as possible. The first version was a bit of a mess, so we rewrote it from a clean slate to improve on the first implementation. :)

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Thu, 25 Aug 2022 15:59:17 +0000

That depends on your use-case. In general I would recommend Nvidia's GPUs for the best price/performance and GVM support. Intel's Xe architecture is improving, but the performance isn't quite there yet for a number of use-cases; some do appear to work quite well, and I expect that will improve with time. The 2080Ti works well with current software. If you are a developer and would like to help us improve device support, you can purchase a 3090Ti (support in GVM for this device is under active development).

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Wed, 24 Aug 2022 23:15:00 +0000

We have Ampere support working on some devices (and it is in development on other Ampere devices) as well as 11th & 12th generation Intel Xe. :)

Here's the GPU Support page if you'd like to take a look:

https://openmdev.io/index.php/GPU_Support

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Wed, 24 Aug 2022 23:10:39 +0000

Ya, that's accurate. The precise driver implementation matters a lot. Having said that there are some good 'best practices' that seem to make a difference. In my opinion 'IOMMU Aware Mediated Device' could also make some much needed improvements here as it would allow for more granular IOMMU allocations - perhaps this mode could help further support the 'App VMs' use-case using shared work queues without breaking IO virtual address translation:

https://lwn.net/ml/linux-kernel/20190222021927.13132-1-baolu...

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Wed, 24 Aug 2022 23:04:01 +0000

GVM uses the IOMMU for compartmentalization.

This page has a comparison of the various IO assistance modes GVM can make use of (see comparison of assistance modes, the Mdev Mode section, and the SR-IOV Mode section):

https://openmdev.io/index.php/Virtual_IO_Internals

This will probably also play a role in future developments like SIOV (Scalable IO Virtualization):

https://lwn.net/ml/linux-kernel/20190222021927.13132-1-baolu...
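
For anyone who wants to see how a Linux host groups devices behind the IOMMU, here is a minimal sketch. It is not part of GVM or LibVF.IO and says nothing about how they use the IOMMU internally; it only assumes the standard sysfs layout `/sys/kernel/iommu_groups/<group>/devices/<PCI address>`. The function name `list_iommu_groups` is made up for illustration.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted device addresses it contains."""
    base = Path(root)
    groups = {}
    if not base.is_dir():
        # No groups exposed: the IOMMU is disabled or this is not a Linux host.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        # Group directories are named by number; skip anything else.
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


if __name__ == "__main__":
    for number, devices in sorted(list_iommu_groups().items()):
        print(f"group {number}: {', '.join(devices)}")
```

Devices that share a group cannot be isolated from one another by the IOMMU, so a GPU that sits alone in its group is generally the easier case for compartmentalization.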

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Wed, 24 Aug 2022 23:01:31 +0000

Ya! You can use VMs that run X11 in the guest without issue. X11 also works on the host. Wayland is also working on the host - I haven't tested with Wayland guests yet, so that's something to try.

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Wed, 24 Aug 2022 22:49:05 +0000

Ya, LibVF.IO & GVM are built for things like this! For example I have a friend who uses it for various Adobe programs which also don't work well on Linux.

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Wed, 24 Aug 2022 22:38:55 +0000

For sure! They can reach me at arthur@arccompute.io

I'll also be attending KVM Forum this year so I'd love to chat with folks there as well! :)

New comment by ArcVRArthur in "LibVF.IO: Add support for GPU Virtual Machine (GVM)"

ArcVRArthur — Wed, 24 Aug 2022 22:34:52 +0000

Thanks!! We'll do our best to keep improving things for everyone. Hopefully security by compartmentalization folks benefit from our work as well. I'll be going to QubesOS Summit so hopefully there will be more good conversations there. :)