Hacker News: jono_irwin

Reduce gVisor Cold Starts with GPU Snapshotting

jono_irwin — Wed, 01 Jul 2026 16:19:47 +0000

Article URL: https://cerebrium.ai/blog/reducing-gpu-cold-starts-with-memory-snapshots-restoring-cuda-workloads-in-second

Comments URL: https://news.ycombinator.com/item?id=48749313

Points: 48

# Comments: 15

New comment by jono_irwin in "The 1979 Design Choice Breaking AI Workloads"

jono_irwin — Mon, 09 Mar 2026 19:45:43 +0000

Yeah that’s fair. For weights specifically there often isn’t a huge dedupe win across versions since retraining tends to change most of them. That said, we generally don’t advocate including model weights in container images anyway. The main benefit for us is avoiding the need to pull the full image up front and only fetching the data actually touched during startup. On the latency side, reads happen over a local network with caching and prefetching, so the impact on request latency is typically minimal.

New comment by jono_irwin in "The 1979 Design Choice Breaking AI Workloads"

jono_irwin — Mon, 09 Mar 2026 19:25:03 +0000

That approach works really well when you have a stable shared base image.

Where it starts to get harder is when you have multiple base stacks (different CUDA versions, frameworks, etc.) or when you need to update them frequently. You end up with lots of slightly different multi-GB bases.

Chunked images keep the benefit you mentioned (we still cache heavily on the nodes) but the caching happens at a finer granularity. That makes it much more tolerant of small differences between images and of frequent updates, since unchanged chunks can still be reused.
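
To make the chunk reuse idea concrete, here is a rough sketch rather than how any particular system does it. It assumes fixed-size chunks identified by their SHA-256 digest; the names `chunk_digests`, `chunks_to_fetch` and the tiny `CHUNK_SIZE` are made up for illustration. Real systems may use different chunk sizes or content-defined chunking.

```python
import hashlib

# Hypothetical chunk size, kept tiny so the example is easy to follow.
CHUNK_SIZE = 4


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return each chunk's SHA-256 digest."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_fetch(image: bytes, cached: set[str],
                    chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return the digests of chunks in `image` that are not already cached.

    Each missing digest is listed once, even if the chunk repeats in the image.
    """
    missing: list[str] = []
    seen: set[str] = set()
    for digest in chunk_digests(image, chunk_size):
        if digest not in cached and digest not in seen:
            missing.append(digest)
            seen.add(digest)
    return missing


# Two "images" that differ in a single chunk.
base_image = b"AAAABBBBCCCCDDDD"
updated_image = b"AAAABBBBXXXXDDDD"

# A node that has already pulled the base image holds all of its chunks.
node_cache = set(chunk_digests(base_image))

# Only the changed chunk needs to be fetched for the updated image.
to_pull = chunks_to_fetch(updated_image, node_cache)
```

With whole-layer caching, a change like this would invalidate the entire layer. With chunk-level caching, only the chunk that differs has to be pulled. One caveat of the fixed-size assumption: inserting bytes shifts every later chunk boundary, which is why some systems prefer content-defined chunking.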

New comment by jono_irwin in "The 1979 Design Choice Breaking AI Workloads"

jono_irwin — Mon, 09 Mar 2026 19:18:07 +0000

Good point, network dependency is a valid concern.

In practice these systems typically fetch data over a local, highly available network and aggressively cache anything that gets read. If that network path becomes unavailable, it usually indicates a much larger infrastructure issue since many other parts of the system rely on the same storage or registry endpoints.

So while it does introduce a different failure mode, in most production environments it ends up being a low practical risk compared to the startup latency improvements.

For us and our customers, the trade-off is worth it.

New comment by jono_irwin in "The 1979 Design Choice Breaking AI Workloads"

jono_irwin — Mon, 09 Mar 2026 19:07:53 +0000

hey cosmotic, we're not really advocating for storing model weights in the container image.

even the smaller NVIDIA images (like nvidia/cuda:13.1.1-cudnn-runtime-ubuntu24.04) are about 2 GB before adding any Python deps, and that is a problem.

if you split the image into chunks and pull on-demand, your container will start much faster.
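
a minimal sketch of the on-demand idea, assuming fixed-size chunks and a caller-supplied `fetch_chunk` function that stands in for the network read. `LazyImage` and all of its parameters are hypothetical names for illustration, not a real API:

```python
from typing import Callable


class LazyImage:
    """Serve byte-range reads from an image, fetching chunks only when touched."""

    def __init__(self, size: int, chunk_size: int,
                 fetch_chunk: Callable[[int], bytes]) -> None:
        self.size = size
        self.chunk_size = chunk_size
        self._fetch_chunk = fetch_chunk      # stand-in for a network fetch
        self._cache: dict[int, bytes] = {}   # chunk index -> chunk bytes
        self.fetch_count = 0                 # number of remote fetches so far

    def read(self, offset: int, length: int) -> bytes:
        """Return up to `length` bytes starting at `offset`."""
        if length <= 0 or offset < 0 or offset >= self.size:
            return b""
        end = min(offset + length, self.size)
        first = offset // self.chunk_size
        last = (end - 1) // self.chunk_size

        parts = []
        for index in range(first, last + 1):
            if index not in self._cache:
                # Cache miss: pull this chunk once and keep it locally.
                self._cache[index] = self._fetch_chunk(index)
                self.fetch_count += 1
            parts.append(self._cache[index])

        blob = b"".join(parts)
        start = offset - first * self.chunk_size
        return blob[start:start + (end - offset)]


# Pretend remote image: 1024 bytes split into 64-byte chunks.
backing = bytes(range(256)) * 4
image = LazyImage(
    size=len(backing),
    chunk_size=64,
    fetch_chunk=lambda i: backing[i * 64:(i + 1) * 64],
)

# A small read during "startup" touches one chunk, so only that chunk is fetched.
header = image.read(70, 10)
```

the container can start as soon as the chunks it actually reads are available, instead of waiting for the full image to download. repeated reads of the same range are served from the local cache.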

New comment by jono_irwin in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

jono_irwin — Thu, 19 Sep 2024 03:39:31 +0000

Thanks for the feedback! I like the sound of all of those:

- clearer messaging
- more tutorials
- one-click deploys
- clear & upfront costing

We have plans to add other runtimes (like TypeScript) in the future, but Python is our focus for now.

New comment by jono_irwin in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

jono_irwin — Wed, 18 Sep 2024 21:23:19 +0000

There are definitely some parallels between Cerebrium and Paperspace, but I don't think they are a direct competitor. The biggest difference is that Paperspace doesn't have a serverless offering, afaik.

Cerebrium abstracts some functionality, like streaming and batching endpoints. I think you would need to build that yourself on Paperspace.