Add configurable GPU VRAM limit for shared GPU environments #440

@MwSpaceLLC

Description

Thank you for the excellent work on Docling and docling-serve!

I would like to request the addition of a GPU VRAM (video memory) limit configuration option for the Docker container. This would allow users to cap the maximum amount of GPU memory that docling-serve can allocate.

Use Case

In production environments where a single GPU must be shared across multiple services (e.g., docling-serve + embedding models + other ML workloads), there's currently no way to enforce hard VRAM limits per container.

Docker's --gpus flag allows selecting which GPU to use, but does not provide a mechanism to limit VRAM usage. This can lead to:

  • One service consuming all available VRAM and starving others
  • Unpredictable OOM errors across services
  • Inability to reliably co-locate multiple GPU workloads

Proposed Solution

Add a configuration option (environment variable or config file) to enforce a hard VRAM limit:

services:
  docling-serve:
    image: ds4sd/docling-serve:latest
    environment:
      - GPU_MEMORY_LIMIT=4GB  # Hard limit on VRAM usage
      # or
      - GPU_MEMORY_FRACTION=0.5  # Use max 50% of available VRAM

This could be implemented using:

  • torch.cuda.set_per_process_memory_fraction() (PyTorch; a sketch follows this list)
  • tf.config.set_logical_device_configuration() (TensorFlow)
  • Custom memory pool management
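
As a rough illustration of the PyTorch option, the sketch below shows how a container entrypoint could translate the proposed GPU_MEMORY_FRACTION / GPU_MEMORY_LIMIT variables into a per-process cap. This is only a sketch, not existing docling-serve code; apply_vram_limit is a hypothetical helper, and its naive size parsing only understands values such as "4GB".

import os

import torch


def apply_vram_limit(device: int = 0) -> None:
    """Cap this process's CUDA allocations based on the proposed env vars."""
    if not torch.cuda.is_available():
        return

    fraction_env = os.getenv("GPU_MEMORY_FRACTION")
    limit_env = os.getenv("GPU_MEMORY_LIMIT")

    if fraction_env:
        fraction = float(fraction_env)
    elif limit_env:
        # Naive parse: accepts values such as "4GB" or "4.5GB".
        total = torch.cuda.get_device_properties(device).total_memory
        limit_bytes = float(limit_env.upper().rstrip("GB")) * 1024**3
        fraction = min(limit_bytes / total, 1.0)
    else:
        return

    # Enforced by PyTorch's caching allocator: allocations beyond this
    # fraction of total device VRAM raise an out-of-memory error instead of
    # spilling into memory needed by other services.
    torch.cuda.set_per_process_memory_fraction(fraction, device=device)


apply_vram_limit()

Note that set_per_process_memory_fraction() only bounds PyTorch's own caching allocator; the CUDA context and any non-PyTorch allocations sit outside the cap, so the limit is a ceiling on model weights and activations rather than an exact total for the container.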

Current Workarounds

Existing approaches are insufficient:

  • PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 - tunes allocator block splitting to reduce fragmentation; it is not a hard cap and can be exceeded (see the check below)
  • Reducing MAX_WORKERS - indirect control, not VRAM-specific
  • NVIDIA MPS - doesn’t enforce VRAM limits
  • NVIDIA MIG - requires MIG-capable data-center GPUs (A100/H100 class)
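
To make the contrast with the allocator-tuning workaround concrete, a quick check (illustrative only; assumes PyTorch on a machine with a single CUDA device) shows that the per-process fraction behaves as a hard cap:

import torch

# Allow this process to use at most 25% of device 0's VRAM.
torch.cuda.set_per_process_memory_fraction(0.25, device=0)

total = torch.cuda.get_device_properties(0).total_memory
try:
    # Attempt to allocate roughly 50% of total VRAM in one tensor; with the
    # cap in place this raises an out-of-memory error instead of silently
    # taking memory away from co-located services.
    x = torch.empty(int(total * 0.5), dtype=torch.uint8, device="cuda:0")
except RuntimeError as err:  # torch.cuda.OutOfMemoryError subclasses RuntimeError
    print(f"Allocation rejected as expected: {err}")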

Benefits

  • ✅ Enables predictable multi-tenant GPU deployments
  • ✅ Prevents one service from starving others of VRAM
  • ✅ Allows efficient resource utilization on shared hardware
  • ✅ Reduces OOM crashes in production

Environment

  • GPU: NVIDIA CUDA-enabled GPUs
  • Deployment: Docker/Docker Compose
  • Scenario: Single GPU shared across multiple containerized ML services

Would this feature be feasible to implement? Happy to provide more details or contribute if helpful!
