Hacker News: za_mike157

New comment by za_mike157 in "Reduce GVisor Cold Starts with GPU Snapshotting"

za_mike157 — Wed, 01 Jul 2026 19:27:51 +0000

Interesting! I didn't see that they had released this. Do you know what their benchmarks are? I know that on Cloud Run they are pretty slow.

New comment by za_mike157 in "Reduce GVisor Cold Starts with GPU Snapshotting"

za_mike157 — Wed, 01 Jul 2026 18:33:58 +0000

We and the team from Modal have been upstreaming changes to the gVisor repo (https://github.com/google/gvisor/pulls) to make it compatible with cuda-checkpoint and other parts of our system. While we are both contributing fixes and performance improvements, we are unfortunately keeping some secret sauce to ourselves, but hopefully what is upstream should get most folks to a successful implementation as is.

New comment by za_mike157 in "Reduce GVisor Cold Starts with GPU Snapshotting"

za_mike157 — Wed, 01 Jul 2026 18:30:11 +0000

haha you are right that the title is a bit strange - should just be "Reduce GPU cold starts with snapshotting"

I can't read good ;)

New comment by za_mike157 in "Reduce GVisor Cold Starts with GPU Snapshotting"

za_mike157 — Wed, 01 Jul 2026 17:25:45 +0000

No, we don't use it. CRIU is used for normal checkpoint/restore of Linux processes. Since we run gVisor for container isolation, we use its checkpoint/restore support for the sandboxed process state.

Both approaches still need NVIDIA’s cuda-checkpoint for the GPU side, because CUDA/GPU memory and driver state are not something a normal process checkpointing tool can handle on its own.

New comment by za_mike157 in "Reduce GVisor Cold Starts with GPU Snapshotting"

za_mike157 — Wed, 01 Jul 2026 17:19:35 +0000

There are a lot of similarities.

They run their snapshot agent as a Kubernetes DaemonSet, whereas our implementation runs as part of the Cerebrium container runtime path. Under the hood, both approaches rely on cuda-checkpoint, since cuda-checkpoint is currently the main primitive NVIDIA exposes for interacting with GPU memory during checkpoint/restore.

One difference is how KV cache handling is exposed. NVIDIA’s approach appears to automatically handle KV cache allocation/deallocation, whereas today we expose that choice to users (vLLM and SGLang expose primitives to do this). In some cases, users may want to discard the KV cache to reduce checkpoint size and restore time; in others, preserving it may be useful.

Their DaemonSet approach is also nice because it can be more portable across Kubernetes environments and clouds. Our approach is more deeply integrated into the node/runtime layer, which gives us tighter control over the serverless startup path, but also means it depends on custom node VM images, which not every provider supports equally.

The optimizations they mention around parallel memfd restore and Linux native AIO for anonymous memory could also be applied to our architecture if we find them stable and beneficial. That said, our current results are already pretty close. For example, they report restoring Qwen3-8B in 4.7s with those changes, while we currently restore it in 6.49s.
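To put those two Qwen3-8B restore times side by side, here is a quick back-of-the-envelope calculation. It only uses the two figures quoted above; the variable names are illustrative, and it says nothing about hardware or setup differences between the two measurements.

```python
# Restore times for Qwen3-8B as quoted in the comment above, in seconds.
nvidia_restore_s = 4.7      # reported with parallel memfd restore + native AIO
cerebrium_restore_s = 6.49  # current result without those changes

# Absolute and relative gap between the two restore times.
gap_s = cerebrium_restore_s - nvidia_restore_s
ratio = cerebrium_restore_s / nvidia_restore_s
longer_pct = (ratio - 1) * 100

print(f"absolute gap: {gap_s:.2f}s")
print(f"relative: {ratio:.2f}x ({longer_pct:.0f}% longer)")
```

So the gap is a little under two seconds, or roughly 38% longer on the quoted numbers.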

The biggest thing we are excited for is multi-GPU restore, which is not supported yet. That would unlock a much broader set of workloads.

New comment by za_mike157 in "Reduce GVisor Cold Starts with GPU Snapshotting"

za_mike157 — Wed, 01 Jul 2026 16:50:45 +0000

Hey! Yes, you are correct! We have both been upstreaming changes to the main gVisor repo. However, in order to work within our own infrastructure we had to make various changes that we explain throughout the article (open TCP connections, multiprocessing, Unix sockets, etc.).

Also, in our benchmarks we seem to perform better than Modal by ~20% in 4 of the 6 workloads we tested, with a lower spread, meaning you get more consistent results. However, the same fundamental question still applies: how can you move storage into memory as quickly as possible?

Why Kubernetes Serving Breaks Down for Real-Time AI

za_mike157 — Tue, 24 Mar 2026 16:11:14 +0000

Article URL: https://www.cerebrium.ai/blog/why-kubernetes-serving-breaks-down-for-realtime-ai

Comments URL: https://news.ycombinator.com/item?id=47504872

Points: 5

# Comments: 0

New comment by za_mike157 in "The 1979 Design Choice Breaking AI Workloads"

za_mike157 — Mon, 09 Mar 2026 19:09:11 +0000

Glad you liked it!

New comment by za_mike157 in "The 1979 Design Choice Breaking AI Workloads"

za_mike157 — Mon, 09 Mar 2026 19:09:01 +0000

You are correct! From our tests, storing model weights in the image isn't a preferred approach for weights larger than ~1GB. We run a distributed, multi-layer cache system to combat this, and we can load roughly 6-7GB of files with a p99 of <2.5s.
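As a rough illustration of what that figure implies, the sketch below turns "6-7GB with a p99 of <2.5s" into a lower bound on effective load throughput. The helper name is hypothetical, and the result is just arithmetic on the quoted numbers, not a measured bandwidth.

```python
def min_throughput_gb_per_s(size_gb: float, max_seconds: float) -> float:
    """Lower bound on throughput if size_gb loads in at most max_seconds."""
    return size_gb / max_seconds

# Figures quoted above: roughly 6-7GB loaded with a p99 under 2.5s.
p99_seconds = 2.5
low = min_throughput_gb_per_s(6, p99_seconds)
high = min_throughput_gb_per_s(7, p99_seconds)

print(f"implied throughput at p99: at least {low:.1f}-{high:.1f} GB/s")
```

Because the p99 is an upper bound on load time, the real throughput for most requests would be higher than this.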

New comment by za_mike157 in "The 1979 Design Choice Breaking AI Workloads"

za_mike157 — Mon, 09 Mar 2026 19:06:38 +0000

A lot of AI workloads require GPUs, which are expensive, so customers would waste money running idle machines 24/7 with low utilisation, which kills gross margins. Loading containers quickly means we can scale up quickly as requests come in, and you only need to pay for usage.

This is successful for CPU workloads (AWS Lambda), but AI models and images are 50x the size.

The 1979 Design Choice Breaking AI Workloads

za_mike157 — Mon, 09 Mar 2026 16:59:05 +0000

Article URL: https://www.cerebrium.ai/blog/rethinking-container-image-distribution-to-eliminate-cold-starts

Comments URL: https://news.ycombinator.com/item?id=47311745

Points: 25

# Comments: 20

AI Companies need to partner with Serverless compute platforms vs. K8s

za_mike157 — Mon, 02 Mar 2026 19:16:28 +0000

Article URL: https://www.cerebrium.ai/blog/why-serverless-compute-partners-are-now-more-important-than-ever

Comments URL: https://news.ycombinator.com/item?id=47222679

Points: 2

# Comments: 0

New comment by za_mike157 in "How to Migrate from OpenAI to Cerebrium for Cost-Predictable AI Inference"

za_mike157 — Tue, 22 Jul 2025 14:19:47 +0000

Hey! Founder of Cerebrium here.

- Runpod is one of the cheapest, but that comes at the price of reliability (critical for businesses)
- We have better cold start performance, with something special launching soon here
- Iterating on your application using CPUs/GPUs in the cloud takes just 2–10 seconds, compared to several minutes with Runpod due to Docker push/pull
- We allow you to deploy in multiple regions globally for lower latency and data residency compliance
- We provide a lot of software abstractions (fire-and-forget jobs, websockets, batching, etc.), whereas Runpod just deploys your Docker image
- SOC 2 and GDPR compliant

With all that being said, we are working on optimisations to bring down pricing.

New comment by za_mike157 in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

za_mike157 — Thu, 19 Sep 2024 11:56:19 +0000

I haven't used SkyPilot so I am unfamiliar with the experience and performance.

However, some situations where you would want to use Cerebrium over SkyPilot are:
- You don't want to manage your own hardware
- Reduced costs: with a serverless runtime and low cold starts (unclear if SkyPilot offers this and what the performance is like if they do)
- Rapid iteration: unclear what the deployment process on SkyPilot is and how long projects take to go live
- Observability: it looks like you would just have k8s metrics at your disposal

New comment by za_mike157 in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

za_mike157 — Thu, 19 Sep 2024 11:11:05 +0000

I think we used this UI kit: https://minimals.cc/

New comment by za_mike157 in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

za_mike157 — Thu, 19 Sep 2024 11:10:21 +0000

I guess the next question would then be: how quickly can they start executing your container from a cold start when a workload comes in? Typically we see companies at around 30-60s.

New comment by za_mike157 in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

za_mike157 — Thu, 19 Sep 2024 00:51:29 +0000

Do you mean why the individual file names aren't quoted?

You can see an example config file at the bottom of that link you attached - agreed we should probably make it more obvious

New comment by za_mike157 in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

za_mike157 — Wed, 18 Sep 2024 19:11:57 +0000

Thanks for confirming! Our cold start, excluding model load, is typically 2-4 seconds for HF models.

The only time it gets much longer is when companies have done a lot with very specific CUDA implementations.

New comment by za_mike157 in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

za_mike157 — Wed, 18 Sep 2024 18:27:12 +0000

Thanks Tom! Excited to support you and the team as you grow

New comment by za_mike157 in "Launch HN: Cerebrium (YC W22) – Serverless Infrastructure Platform for ML/AI"

za_mike157 — Wed, 18 Sep 2024 18:03:56 +0000

Ah I see they recently cut their pricing by 40% so you are correct - sorry about that. It seems we are more expensive compared to their new pricing