@rst0git rst0git commented May 11, 2025

The container checkpointing procedure in Kubernetes freezes running containers to create a consistent snapshot of both the runtime state and the rootfs of the container. However, when checkpointing a GPU container, the container must be unfrozen before invoking cuda-checkpoint. This is achieved in prepare_freezer_for_interrupt_only_mode(), which needs to be called before the PAUSE_DEVICES hook.

The patch that introduced this function (#2514) handles containers with multiple processes. For a container with a single process, however, prepare_freezer_for_interrupt_only_mode() must be invoked immediately before the PAUSE_DEVICES hook, which is what this change does.

Fixes: #2514
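
The ordering constraint described above can be illustrated with a small toy model. This is only a sketch in Python, not CRIU's C code: `ToyContainer`, `DevicePauseError`, and `pause_gpu_container` are hypothetical names, and the two functions named after `prepare_freezer_for_interrupt_only_mode()` and the PAUSE_DEVICES hook model nothing beyond "unfreeze" and "fail if still frozen".

```python
from dataclasses import dataclass, field


class DevicePauseError(RuntimeError):
    """Raised when the toy device hook runs against a frozen container."""


@dataclass
class ToyContainer:
    # Hypothetical stand-in for a container; not a CRIU data structure.
    pids: list
    frozen: bool = True  # the container is frozen before checkpointing starts
    log: list = field(default_factory=list)


def prepare_freezer_for_interrupt_only_mode(container):
    # Toy version: the only effect modelled here is unfreezing the container.
    container.frozen = False
    container.log.append("prepare_freezer")


def pause_devices_hook(container):
    # Toy stand-in for the PAUSE_DEVICES hook, which invokes cuda-checkpoint.
    # Assumption of this model: the hook fails on a frozen container.
    if container.frozen:
        raise DevicePauseError("container is still frozen")
    container.log.append("pause_devices")


def pause_gpu_container(container):
    # The same order applies whether the container has one process or many:
    # unfreeze first, then run the device hook immediately afterwards.
    prepare_freezer_for_interrupt_only_mode(container)
    pause_devices_hook(container)
    return container.log
```

In this model, calling `pause_devices_hook()` on a freshly created `ToyContainer(pids=[1])` raises `DevicePauseError`, while `pause_gpu_container()` succeeds for both single-process and multi-process containers.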

@rst0git rst0git requested a review from avagin May 11, 2025 10:45
@rst0git rst0git force-pushed the seize-fix branch 2 times, most recently from 4702b8f to 8239745 May 11, 2025 11:05
avagin commented May 14, 2025

LGTM, thanks!

The container checkpointing procedure in Kubernetes freezes running
containers to create a consistent snapshot of both the runtime state
and the rootfs of the container. However, when checkpointing a GPU
container, the container must be unfrozen before invoking the
cuda-checkpoint tool.

This is achieved in prepare_freezer_for_interrupt_only_mode(), which
needs to be called before the PAUSE_DEVICES hook. The patch introducing
this functionality fixes this problem for containers with multiple
processes. However, if the container has a single process,
prepare_freezer_for_interrupt_only_mode() must be invoked immediately
before the PAUSE_DEVICES hook.

Fixes: checkpoint-restore#2514

Signed-off-by: Radostin Stoyanov <[email protected]>
@avagin avagin merged commit c61329b into checkpoint-restore:criu-dev May 15, 2025
37 of 44 checks passed
@rst0git rst0git deleted the seize-fix branch May 15, 2025 18:13