This repo hosts a Kubernetes operator that is responsible for creating and managing Llama Stack servers.
- Automated deployment of Llama Stack servers
- Support for multiple distributions (including Ollama, vLLM, and others)
- Customizable server configurations
- Volume management for model storage
- Kubernetes-native resource management
You can install the operator directly from a released version or the latest main branch using `kubectl apply -f`.
To install the latest version from the main branch:
```
kubectl apply -f https://raw.githubusercontent.com/llamastack/llama-stack-k8s-operator/main/release/operator.yaml
```

To install a specific released version (e.g., v1.0.0), replace `main` with the desired tag:
```
kubectl apply -f https://raw.githubusercontent.com/llamastack/llama-stack-k8s-operator/v1.0.0/release/operator.yaml
```
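After applying the manifest, you can optionally confirm that the operator came up. This is a minimal check, not part of the repo docs; the exact pod and namespace names depend on what `operator.yaml` creates, so the command below simply searches all namespaces:

```
# Confirm the operator controller pod is Running (searches all namespaces,
# since the namespace created by operator.yaml may vary between releases).
kubectl get pods -A | grep llama-stack-k8s-operator
```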
- Deploy the inference provider server (ollama, vllm)

Ollama Examples:
Deploy Ollama with the default model (llama3.2:1b):
```
./hack/deploy-quickstart.sh
```

Deploy Ollama with another model:
```
./hack/deploy-quickstart.sh --provider ollama --model llama3.2:7b
```
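Optionally, confirm the quickstart Ollama server is up before pointing a distribution at it. The namespace and service names below are inferred from the `OLLAMA_URL` used in the sample CR later in this README (`ollama-server-service.ollama-dist`), so adjust them if your deployment differs:

```
# Check the Ollama server deployed by the quickstart script.
# Names are inferred from the OLLAMA_URL used in the sample CR below
# (service ollama-server-service in namespace ollama-dist).
kubectl get pods -n ollama-dist
kubectl get svc ollama-server-service -n ollama-dist
```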
vLLM Examples:

This requires a secret named "hf-token-secret", containing a HuggingFace token (needed for downloading models), to be created in advance in the "vllm-dist" namespace.
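A minimal sketch of creating that secret is shown below; the secret data key (`token`) is an assumption, so check `hack/deploy-quickstart.sh` for the key it actually reads:

```
# Create the namespace and the HuggingFace token secret expected by the vLLM quickstart.
# The data key ("token") is an assumption; verify it against hack/deploy-quickstart.sh.
kubectl create namespace vllm-dist --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic hf-token-secret -n vllm-dist \
  --from-literal=token=<your-huggingface-token>
```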
Deploy vLLM with the default model (meta-llama/Llama-3.2-1B):
```
./hack/deploy-quickstart.sh --provider vllm
```

Deploy vLLM with GPU support:
```
./hack/deploy-quickstart.sh --provider vllm --runtime-env "VLLM_TARGET_DEVICE=gpu,CUDA_VISIBLE_DEVICES=0"
```

- Create LlamaStackDistribution CR to get the server running. Example:
```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastackdistribution-sample
spec:
  replicas: 1
  server:
    distribution:
      name: ollama
    containerSpec:
      port: 8321
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_URL
          value: "http://ollama-server-service.ollama-dist.svc.cluster.local:11434"
    storage:
      size: "20Gi"
      mountPath: "/home/lls/.lls"
```
- Verify the server pod is running in the user-defined namespace.
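For example, after saving the sample CR above to a file and applying it, the following commands should show the distribution and its server pod (the file name is illustrative; add `-n <namespace>` if you applied the CR outside your current context's namespace):

```
# Apply the sample CR (file name is illustrative) and check the resulting resources.
kubectl apply -f llamastackdistribution-sample.yaml
kubectl get LlamaStackDistribution
kubectl get pods
```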
A ConfigMap can be used to store the `run.yaml` configuration for each LlamaStackDistribution. Updates to the ConfigMap will restart the Pod so it loads the new data.
Example that creates a `run.yaml` ConfigMap and a LlamaStackDistribution that references it:
```
kubectl apply -f config/samples/example-with-configmap.yaml
```
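To exercise the reload behavior, edit the ConfigMap and watch the server pod restart. The ConfigMap name below is a placeholder; use the name defined in `config/samples/example-with-configmap.yaml`:

```
# Edit the run.yaml ConfigMap (placeholder name; take the real one from the sample),
# then watch the operator restart the server pod to pick up the change.
kubectl edit configmap <run-yaml-configmap-name>
kubectl get pods -w
```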
- Kubernetes cluster (v1.20 or later)
- Go 1.24
- operator-sdk v1.39.2 (v4 layout) or newer
- kubectl configured to access your cluster
- A running inference server:
  - For local development, you can use the provided script: `./hack/deploy-quickstart.sh`
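A quick, optional sanity check of the toolchain prerequisites listed above:

```
# Verify the developer prerequisites.
go version            # expect go1.24
operator-sdk version  # expect v1.39.2 or newer
kubectl cluster-info  # confirms kubectl can reach the cluster
```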
- Prepare release files with specific versions:
```
make release VERSION=0.2.1 LLAMASTACK_VERSION=0.2.12
```

This command updates distribution configurations and generates release manifests with the specified versions.
- A custom operator image can be built using your local repository:
```
make image IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
```
The default image, `quay.io/llamastack/llama-stack-k8s-operator:latest`, is used when no argument is supplied to `make image`. A local file `local.mk` can be created with environment variables to override the default values set in the `Makefile` (see the sketch after this list).
- Once the image is created, the operator can be deployed directly. For each deployment method, a kubeconfig should be exported first:
```
export KUBECONFIG=<path to kubeconfig>
```
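As noted above, a `local.mk` file next to the `Makefile` can override default variables. A minimal sketch, assuming `IMG` is the variable you want to pin (it is the one passed to `make image` and `make deploy` in this guide; any other overrides should be checked against the `Makefile` itself):

```
# Create a local.mk so IMG does not have to be passed on every make invocation.
# IMG is used by `make image` and `make deploy`; other variables are not assumed here.
cat > local.mk <<'EOF'
IMG ?= quay.io/<username>/llama-stack-k8s-operator:dev
EOF
```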
Deploying operator locally
- Deploy the created image in your cluster using the following command:
```
make deploy IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
```
- To remove the resources created during installation, use:
```
make undeploy
```
The repo includes end-to-end (E2E) tests that verify the complete functionality of the operator. To run the E2E tests:
- Ensure you have a running Kubernetes cluster
- Run the E2E tests using one of the following commands:
- If you want to deploy the operator and run tests:
```
make deploy test-e2e
```

- If the operator is already deployed:
```
make test-e2e
```
The make target will handle prerequisites, including deploying the Ollama server.
Please refer to the API documentation.