From 6ca794024787379616ee9e04e776977920682ceb Mon Sep 17 00:00:00 2001
From: Yaseen Hamdulay
Date: Fri, 22 May 2026 10:43:22 +0200
Subject: [PATCH 1/6] feat: add checkpointing docs but dont link to it

---
 other-topics/checkpointing.mdx | 51 ++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)
 create mode 100644 other-topics/checkpointing.mdx

diff --git a/other-topics/checkpointing.mdx b/other-topics/checkpointing.mdx
new file mode 100644
index 0000000..c25adbf
--- /dev/null
+++ b/other-topics/checkpointing.mdx
@@ -0,0 +1,51 @@
+---
+title: "Memory Checkpointing"
+description: "Faster cold starts with memory checkpointing"
+---
+
+## Introduction
+
+Memory checkpointing takes a snapshot of a containers GPU and CPU memory and uses that to speed up the startup of all future containers. Applications that perform a large amount of work at container start time benefit the most form this process.
+
+Machine Learning and LLM frameworks load model weights and compile various CUDA kernels at container start time taking many minutes. Loading a checkpoint that contains the already compiled CUDA kernels saves time.
+
+Cerebrium has checkpointing and restore built in to the platform.
+
+## How To Use
+
+Checkpointing is available on our v2 runtime environment. Add the following to your `cerebrium.toml` to upgrade.
+
+```
+[cerebrium.runtime]
+container_runtime = "v2"
+```
+
+To create a checkpoint your application has to send a trigger to our runtime when it is ready. When the trigger is sent our runtime checks to see if a checkpoint should be created. If there is already an existing checkpoint at this build version or another container is already checkpointing it will not attempt to create another one.
+
+If a checkpoint should occur your container will be frozen for the duration of the process. GPU memory will be copied to CPU memory and then all of the container memory will be written to storage. This saved checkpoint will then be distributed to be able to run throughout the region.
+
+Send a POST request to http://169.254.169.253:8234/checkpoint from inside your container when the container is ready to checkpoint.
+
+If successful subsequent containers will be restored from this created checkpoint. You can tell that a container was restored from a checkpoint if it has `CEREBRIUM_RESTORED: container restored from checkpoint` as the first log line in the container.
+
+A checkpoint is tied to a single deployment. To disable restoring from checkpoints simply remove the POST request and redeploy your application.
+
+## Limitations
+
+Your memory request needs to be large enough to contain the GPU memory dump in addition to your regular memory use.
+
+When a container is restored from a checkpoint execution continues from the point where the http request is sent. If environment variables were read before this point they will remain the same as they were from the time of the checkpoint.
+
+Any TCP connections that were made before the checkpoint will have disconnected. For example if you connected to a database before the checkpoint you will have to reestablish that connection after restore.
+
+Any files written to disk before the checkpoint will not be copied to the restored container. Only memory is checkpointed.
+
+Not all Cerebrium providers are supported during this alpha release.
+
+## Platform specific recommendations
+
+### vLLM
+
+vLLM checkpointing support is not complete but still possible. See https://github.com/vllm-project/vllm/issues/34303 and other issues.
+
+If you are getting an EngineCoreDead exception add `async_scheduling=False` to your AsyncEngineArgs and it should succeed.

From ff47311f7f8054e74e7c5eb898743dbd8006ab3a Mon Sep 17 00:00:00 2001
From: Yaseen Hamdulay
Date: Fri, 22 May 2026 10:49:35 +0200
Subject: [PATCH 2/6] nits

---
 other-topics/checkpointing.mdx | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/other-topics/checkpointing.mdx b/other-topics/checkpointing.mdx
index c25adbf..c931652 100644
--- a/other-topics/checkpointing.mdx
+++ b/other-topics/checkpointing.mdx
@@ -5,11 +5,11 @@ description: "Faster cold starts with memory checkpointing"
 
 ## Introduction
 
-Memory checkpointing takes a snapshot of a containers GPU and CPU memory and uses that to speed up the startup of all future containers. Applications that perform a large amount of work at container start time benefit the most form this process.
+Memory checkpointing takes a snapshot of a containers GPU and CPU memory and uses that to speed up the startup of all future containers. Applications that perform a large amount of work at container start time benefit the most from this process.
 
-Machine Learning and LLM frameworks load model weights and compile various CUDA kernels at container start time taking many minutes. Loading a checkpoint that contains the already compiled CUDA kernels saves time.
+For example, Machine Learning and LLM frameworks load massive model weights and compile various CUDA kernels at container start time, taking many minutes. Loading a checkpoint that already contains the already compiled CUDA kernels can skip this delay entirely.
 
-Cerebrium has checkpointing and restore built in to the platform.
+Cerebrium has native checkpointing and restore functionality built in to the platform.
 
 ## How To Use
 
@@ -20,7 +20,10 @@ Checkpointing is available on our v2 runtime environment. Add the following to y
 container_runtime = "v2"
 ```
 
-To create a checkpoint your application has to send a trigger to our runtime when it is ready. When the trigger is sent our runtime checks to see if a checkpoint should be created. If there is already an existing checkpoint at this build version or another container is already checkpointing it will not attempt to create another one.
+To create a checkpoint your application has to send a trigger to our runtime after it has performed its initialization and is ready. When this trigger is received, the runtime verifies if a new checkpoint is required. To save resources, the system will not create a new checkpoint if:
+
+ 1. A checkpoint already exists for the current build version.
+ 2. Another container instance is already undergoing the checkpointing process.
 
 If a checkpoint should occur your container will be frozen for the duration of the process. GPU memory will be copied to CPU memory and then all of the container memory will be written to storage. This saved checkpoint will then be distributed to be able to run throughout the region.
 
@@ -28,19 +31,19 @@ Send a POST request to http://169.254.169.253:8234/checkpoint from inside your c
 
 If successful subsequent containers will be restored from this created checkpoint. You can tell that a container was restored from a checkpoint if it has `CEREBRIUM_RESTORED: container restored from checkpoint` as the first log line in the container.
 
-A checkpoint is tied to a single deployment. To disable restoring from checkpoints simply remove the POST request and redeploy your application.
+A checkpoint is tightly coupled to a single deployment. To disable restoring from checkpoints simply remove the POST request and redeploy your application.
 
 ## Limitations
 
-Your memory request needs to be large enough to contain the GPU memory dump in addition to your regular memory use.
+**Memory Overhead:** Your memory request needs to be large enough to contain the GPU memory dump in addition to your regular memory use.
 
-When a container is restored from a checkpoint execution continues from the point where the http request is sent. If environment variables were read before this point they will remain the same as they were from the time of the checkpoint.
+**Execution Lifecycle:** When a container is restored from a checkpoint execution continues from the point where the http request is sent. If environment variables were read before this point they will remain the same as they were from the time of the checkpoint.
 
-Any TCP connections that were made before the checkpoint will have disconnected. For example if you connected to a database before the checkpoint you will have to reestablish that connection after restore.
+**Network Connections:** Any TCP connections that were made before the checkpoint will have disconnected. For example if you connected to a database before the checkpoint you will have to reestablish that connection after restore.
 
-Any files written to disk before the checkpoint will not be copied to the restored container. Only memory is checkpointed.
+**Ephemeral Filesystem:** Any files written to disk before the checkpoint will not be copied to the restored container. Only memory is checkpointed.
 
-Not all Cerebrium providers are supported during this alpha release.
+**Provider Availablility:** Not all Cerebrium providers are supported during this alpha release.
 
 ## Platform specific recommendations
 

From 05e3ac0588e6449e52cac1516cba67e1cdca68b0 Mon Sep 17 00:00:00 2001
From: Yaseen Hamdulay
Date: Fri, 22 May 2026 11:04:51 +0200
Subject: [PATCH 3/6] add a code snippet

---
 other-topics/checkpointing.mdx | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/other-topics/checkpointing.mdx b/other-topics/checkpointing.mdx
index c931652..6636e12 100644
--- a/other-topics/checkpointing.mdx
+++ b/other-topics/checkpointing.mdx
@@ -33,6 +33,24 @@ If successful subsequent containers will be restored from this created checkpoin
 
 A checkpoint is tightly coupled to a single deployment. To disable restoring from checkpoints simply remove the POST request and redeploy your application.
 
+### Example
+
+```python
+    from vllm import AsyncLLMEngine
+    from vllm.engine.arg_utils import AsyncEngineArgs
+    # Init vLLM engine
+    engine_args = AsyncEngineArgs(
+        model="Qwen/Qwen2.5-0.5B-Instruct",
+        async_scheduling=False
+    )
+    AsyncLLMEngine.from_engine_args(engine_args)
+    
+    # Trigger checkpoint
+    urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/", method="POST")
+    # Wait for it to complete
+    urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/wait")
+```
+
 ## Limitations
 
 **Memory Overhead:** Your memory request needs to be large enough to contain the GPU memory dump in addition to your regular memory use.

From 409b0f0551cc4b162e94d88caeeaa48c371da7de Mon Sep 17 00:00:00 2001
From: Yaseen Hamdulay
Date: Fri, 22 May 2026 11:05:54 +0200
Subject: [PATCH 4/6] remove indentation

---
 other-topics/checkpointing.mdx | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/other-topics/checkpointing.mdx b/other-topics/checkpointing.mdx
index 6636e12..2524a38 100644
--- a/other-topics/checkpointing.mdx
+++ b/other-topics/checkpointing.mdx
@@ -36,19 +36,19 @@ A checkpoint is tightly coupled to a single deployment. To disable restoring fro
 ### Example
 
 ```python
-    from vllm import AsyncLLMEngine
-    from vllm.engine.arg_utils import AsyncEngineArgs
-    # Init vLLM engine
-    engine_args = AsyncEngineArgs(
-        model="Qwen/Qwen2.5-0.5B-Instruct",
-        async_scheduling=False
-    )
-    AsyncLLMEngine.from_engine_args(engine_args)
-    
-    # Trigger checkpoint
-    urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/", method="POST")
-    # Wait for it to complete
-    urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/wait")
+from vllm import AsyncLLMEngine
+from vllm.engine.arg_utils import AsyncEngineArgs
+# Init vLLM engine
+engine_args = AsyncEngineArgs(
+    model="Qwen/Qwen2.5-0.5B-Instruct",
+    async_scheduling=False
+)
+AsyncLLMEngine.from_engine_args(engine_args)
+
+# Trigger checkpoint
+urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/", method="POST")
+# Wait for it to complete
+urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/wait")
 ```
 
 ## Limitations

From 8503884289d6ddc0413b3fba81ab164c70a2052c Mon Sep 17 00:00:00 2001
From: Yaseen Hamdulay
Date: Fri, 22 May 2026 11:26:54 +0200
Subject: [PATCH 5/6] nits

---
 other-topics/checkpointing.mdx | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/other-topics/checkpointing.mdx b/other-topics/checkpointing.mdx
index 2524a38..aae140b 100644
--- a/other-topics/checkpointing.mdx
+++ b/other-topics/checkpointing.mdx
@@ -5,9 +5,9 @@ description: "Faster cold starts with memory checkpointing"
 
 ## Introduction
 
-Memory checkpointing takes a snapshot of a containers GPU and CPU memory and uses that to speed up the startup of all future containers. Applications that perform a large amount of work at container start time benefit the most from this process.
+Memory checkpointing takes a snapshot of a container's GPU and CPU memory and uses that to speed up the startup of all future containers. Applications that perform a large amount of work at container start time benefit the most from this process.
 
-For example, Machine Learning and LLM frameworks load massive model weights and compile various CUDA kernels at container start time, taking many minutes. Loading a checkpoint that already contains the already compiled CUDA kernels can skip this delay entirely.
+For example, Machine Learning and LLM frameworks load massive model weights and compile various CUDA kernels at container start time, taking many minutes. Loading a checkpoint that already contains the compiled CUDA kernels can skip this delay entirely.
 
 Cerebrium has native checkpointing and restore functionality built in to the platform.
 
@@ -53,7 +53,7 @@ urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/wait")
 
 ## Limitations
 
-**Memory Overhead:** Your memory request needs to be large enough to contain the GPU memory dump in addition to your regular memory use.
+**Memory Overhead:** The container memory allocation needs to be large enough to contain the GPU memory dump in addition to your regular memory use.
 
 **Execution Lifecycle:** When a container is restored from a checkpoint execution continues from the point where the http request is sent. If environment variables were read before this point they will remain the same as they were from the time of the checkpoint.
 
@@ -61,7 +61,7 @@ urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/wait")
 
 **Ephemeral Filesystem:** Any files written to disk before the checkpoint will not be copied to the restored container. Only memory is checkpointed.
 
-**Provider Availablility:** Not all Cerebrium providers are supported during this alpha release.
+**Provider Availablity:** Not all Cerebrium providers are supported during this alpha release.
 
 ## Platform specific recommendations
 

From 41d4286a5c37770296e3db8e66d355dd93370e17 Mon Sep 17 00:00:00 2001
From: Yaseen Hamdulay
Date: Fri, 22 May 2026 11:27:48 +0200
Subject: [PATCH 6/6] provider availability

---
 other-topics/checkpointing.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/other-topics/checkpointing.mdx b/other-topics/checkpointing.mdx
index aae140b..aa0e9d8 100644
--- a/other-topics/checkpointing.mdx
+++ b/other-topics/checkpointing.mdx
@@ -61,7 +61,7 @@ urllib.request.urlopen("http://169.254.169.253:8234/checkpoint/wait")
 
 **Ephemeral Filesystem:** Any files written to disk before the checkpoint will not be copied to the restored container. Only memory is checkpointed.
 
-**Provider Availablity:** Not all Cerebrium providers are supported during this alpha release.
+**Provider Availablity:** Checkpointing is only available on the AWS provider. More coming soon.
 
 ## Platform specific recommendations
 
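
For readers adapting the `### Example` snippet added in patch 3/6, here is a minimal standard-library sketch of the trigger-and-wait step. It assumes the two endpoints that appear in the patches (`/checkpoint` for the POST trigger, `/checkpoint/wait` for waiting). The HTTP method is set on a `urllib.request.Request` object, since `urlopen` itself takes no `method` argument. The helper names (`build_checkpoint_request`, `trigger_checkpoint`, `wait_for_checkpoint`) and the timeout values are hypothetical. The patches do not describe the runtime's response format, so the helpers only return the HTTP status code.

```python
import urllib.request

# Endpoints as written in the docs added by this patch series. They are
# only reachable from inside a container on the v2 runtime.
CHECKPOINT_URL = "http://169.254.169.253:8234/checkpoint"
CHECKPOINT_WAIT_URL = "http://169.254.169.253:8234/checkpoint/wait"


def build_checkpoint_request(url: str = CHECKPOINT_URL) -> urllib.request.Request:
    """Build the POST request that asks the runtime to consider a checkpoint."""
    # The method is set on the Request; an empty body is sent with the POST.
    return urllib.request.Request(url, data=b"", method="POST")


def trigger_checkpoint(timeout: float = 60.0) -> int:
    """Send the trigger and return the HTTP status code (timeout is an assumption)."""
    with urllib.request.urlopen(build_checkpoint_request(), timeout=timeout) as resp:
        return resp.status


def wait_for_checkpoint(timeout: float = 600.0) -> int:
    """Call the wait endpoint used in the example and return the status code."""
    with urllib.request.urlopen(CHECKPOINT_WAIT_URL, timeout=timeout) as resp:
        return resp.status
```

Call `trigger_checkpoint()` once initialization (for example, engine construction) has finished, then `wait_for_checkpoint()`. Per the limitations in the docs, execution in a restored container continues from the point where the request was sent, so anything that must be fresh after restore (database connections, environment reads) belongs after these calls.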