Add configurable GPU VRAM limit for shared GPU environments #440

@MwSpaceLLC

Description

Thank you for the excellent work on Docling and docling-serve!

I would like to request the addition of a GPU VRAM (video memory) limit configuration option for the Docker container. This would allow users to cap the maximum amount of GPU memory that docling-serve can allocate.

Use Case

In production environments where a single GPU must be shared across multiple services (e.g., docling-serve + embedding models + other ML workloads), there's currently no way to enforce hard VRAM limits per container.

Docker's --gpus flag allows selecting which GPU to use, but does not provide a mechanism to limit VRAM usage. This can lead to:

  • One service consuming all available VRAM and starving others
  • Unpredictable OOM errors across services
  • Inability to reliably co-locate multiple GPU workloads

Proposed Solution

Add a configuration option (environment variable or config file) to enforce a hard VRAM limit:

services:
  docling-serve:
    image: ds4sd/docling-serve:latest
    environment:
      - GPU_MEMORY_LIMIT=4GB  # Hard limit on VRAM usage
      # or
      - GPU_MEMORY_FRACTION=0.5  # Use max 50% of available VRAM

This could be implemented using:

  • torch.cuda.set_per_process_memory_fraction() (PyTorch; a sketch follows this list)
  • tf.config.set_logical_device_configuration() (TensorFlow)
  • Custom memory pool management
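
As a rough illustration of the PyTorch option, the sketch below shows how a container entrypoint could translate the proposed GPU_MEMORY_FRACTION / GPU_MEMORY_LIMIT variables into a per-process cap. This is only a sketch, not existing docling-serve code; apply_vram_limit is a hypothetical helper, and its naive size parsing only understands values such as "4GB".

import os

import torch


def apply_vram_limit(device: int = 0) -> None:
    """Cap this process's CUDA allocations based on the proposed env vars."""
    if not torch.cuda.is_available():
        return

    fraction_env = os.getenv("GPU_MEMORY_FRACTION")
    limit_env = os.getenv("GPU_MEMORY_LIMIT")

    if fraction_env:
        fraction = float(fraction_env)
    elif limit_env:
        # Naive parse: accepts values such as "4GB" or "4.5GB".
        total = torch.cuda.get_device_properties(device).total_memory
        limit_bytes = float(limit_env.upper().rstrip("GB")) * 1024**3
        fraction = min(limit_bytes / total, 1.0)
    else:
        return

    # Enforced by PyTorch's caching allocator: allocations beyond this
    # fraction of total device VRAM raise an out-of-memory error instead of
    # spilling into memory needed by other services.
    torch.cuda.set_per_process_memory_fraction(fraction, device=device)


apply_vram_limit()

Note that set_per_process_memory_fraction() only bounds PyTorch's own caching allocator; the CUDA context and any non-PyTorch allocations sit outside the cap, so the limit is a ceiling on model weights and activations rather than an exact total for the container.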

Current Workarounds

Existing approaches are insufficient:

  • PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 - tunes allocator block splitting to reduce fragmentation; it is not a hard cap and can be exceeded (see the check below)
  • Reducing MAX_WORKERS - indirect control, not VRAM-specific
  • NVIDIA MPS - doesn’t enforce VRAM limits
  • NVIDIA MIG - requires MIG-capable data-center GPUs (A100/H100 class)
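
To make the contrast with the allocator-tuning workaround concrete, a quick check (illustrative only; assumes PyTorch on a machine with a single CUDA device) shows that the per-process fraction behaves as a hard cap:

import torch

# Allow this process to use at most 25% of device 0's VRAM.
torch.cuda.set_per_process_memory_fraction(0.25, device=0)

total = torch.cuda.get_device_properties(0).total_memory
try:
    # Attempt to allocate roughly 50% of total VRAM in one tensor; with the
    # cap in place this raises an out-of-memory error instead of silently
    # taking memory away from co-located services.
    x = torch.empty(int(total * 0.5), dtype=torch.uint8, device="cuda:0")
except RuntimeError as err:  # torch.cuda.OutOfMemoryError subclasses RuntimeError
    print(f"Allocation rejected as expected: {err}")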

Benefits

  • ✅ Enables predictable multi-tenant GPU deployments
  • ✅ Prevents one service from starving others of VRAM
  • ✅ Allows efficient resource utilization on shared hardware
  • ✅ Reduces OOM crashes in production

Environment

  • GPU: NVIDIA CUDA-enabled GPUs
  • Deployment: Docker/Docker Compose
  • Scenario: Single GPU shared across multiple containerized ML services

Would this feature be feasible to implement? Happy to provide more details or contribute if helpful!
