Description
Thank you for the excellent work on Docling and docling-serve!
I would like to request the addition of a GPU VRAM (video memory) limit configuration option for the Docker container. This would allow users to cap the maximum amount of GPU memory that docling-serve can allocate.
Use Case
In production environments where a single GPU must be shared across multiple services (e.g., docling-serve + embedding models + other ML workloads), there's currently no way to enforce hard VRAM limits per container.
Docker's `--gpus` flag allows selecting which GPU to use, but does not provide a mechanism to limit VRAM usage. This can lead to:
- One service consuming all available VRAM and starving others
- Unpredictable OOM errors across services
- Inability to reliably co-locate multiple GPU workloads
Proposed Solution
Add a configuration option (environment variable or config file) to enforce a hard VRAM limit:
```yaml
services:
  docling-serve:
    image: ds4sd/docling-serve:latest
    environment:
      - GPU_MEMORY_LIMIT=4GB      # Hard limit on VRAM usage
      # or
      - GPU_MEMORY_FRACTION=0.5   # Use at most 50% of available VRAM
```
This could be implemented using (a sketch follows the list):
- `torch.cuda.set_per_process_memory_fraction()` (PyTorch)
- `tf.config.set_logical_device_configuration()` (TensorFlow)
- Custom memory pool management
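
For the PyTorch route, a minimal sketch of what applying such a cap at startup could look like. The `GPU_MEMORY_LIMIT` / `GPU_MEMORY_FRACTION` variable names and the `apply_vram_limit` helper are the ones proposed above, not existing docling-serve settings:

```python
import os

import torch


def apply_vram_limit() -> None:
    """Cap this process's VRAM usage based on the proposed env vars.

    GPU_MEMORY_LIMIT / GPU_MEMORY_FRACTION are the variable names
    suggested in this issue, not settings docling-serve reads today.
    """
    if not torch.cuda.is_available():
        return

    device = torch.cuda.current_device()
    fraction_env = os.environ.get("GPU_MEMORY_FRACTION")
    limit_env = os.environ.get("GPU_MEMORY_LIMIT")  # e.g. "4GB" (GB only in this sketch)

    if fraction_env is not None:
        fraction = float(fraction_env)
    elif limit_env is not None:
        gigabytes = float(limit_env.upper().removesuffix("GB"))
        total = torch.cuda.get_device_properties(device).total_memory
        fraction = (gigabytes * 1024**3) / total
    else:
        return  # no limit requested

    # The caching allocator now rejects allocations beyond this fraction,
    # raising torch.cuda.OutOfMemoryError instead of taking the whole GPU.
    torch.cuda.set_per_process_memory_fraction(fraction, device=device)
```

One caveat: `torch.cuda.set_per_process_memory_fraction()` only bounds PyTorch's caching allocator; CUDA context overhead and allocations made by other libraries are not counted, so keeping a small safety margin below the intended cap may be sensible.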
Current Workarounds
Existing approaches are insufficient:
- `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` - a soft limit that allocations can still exceed (contrast with the hard cap demonstrated below)
- Reducing `MAX_WORKERS` - indirect control, not VRAM-specific
- NVIDIA MPS - doesn't enforce VRAM limits
- NVIDIA MIG - requires A100/H100-class hardware
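
For comparison, the hard-cap behavior this issue asks for can already be demonstrated in plain PyTorch (illustrative snippet, assuming a single CUDA device):

```python
import torch

# Cap this process at 50% of device 0's VRAM.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)

total = torch.cuda.get_device_properties(0).total_memory
try:
    # Attempt a single allocation of ~90% of total VRAM; with the cap in
    # place this is rejected instead of starving co-located services.
    big = torch.empty(int(total * 0.9) // 4, dtype=torch.float32, device="cuda:0")
except torch.cuda.OutOfMemoryError:
    print("Allocation rejected: per-process VRAM cap enforced")
```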
Benefits
- ✅ Enables predictable multi-tenant GPU deployments
- ✅ Prevents one service from starving others of VRAM
- ✅ Allows efficient resource utilization on shared hardware
- ✅ Reduces OOM crashes in production
Environment
- GPU: NVIDIA CUDA-enabled GPUs
- Deployment: Docker / Docker Compose
- Scenario: a single GPU shared across multiple containerized ML services
Would this feature be feasible to implement? Happy to provide more details or contribute if helpful!