This tutorial covers three major approaches for deploying Microsoft's Phi-4-mini-instruct model in containerized environments: vLLM, Ollama, and SLM Engine with ONNX Runtime. At 3.8B parameters, the model delivers strong reasoning performance while remaining efficient enough for edge deployment.
- Introduction to Phi-4-mini Container Deployment
- Learning Objectives
- Understanding Phi-4-mini Classification
- vLLM Container Deployment
- Ollama Container Deployment
- SLM Engine with ONNX Runtime
- Comparison Framework
- Best Practices
Small Language Models (SLMs) represent a crucial advancement in EdgeAI, enabling sophisticated natural language processing capabilities on resource-constrained devices. This tutorial focuses on containerized deployment strategies for Microsoft's Phi-4-mini-instruct, a state-of-the-art reasoning model that balances capability with efficiency.
Phi-4-mini-instruct (3.8B parameters): Microsoft's lightweight instruction-tuned model, designed for memory- and compute-constrained environments, with strong capabilities in:
- Mathematical reasoning and complex calculations
- Code generation, debugging, and analysis
- Logical problem solving and step-by-step reasoning
- Educational applications requiring detailed explanations
- Function calling and tool integration
Part of the "Small SLMs" category (1.5B - 13.9B parameters), Phi-4-mini strikes an optimal balance between reasoning capability and resource efficiency.
- Operational Efficiency: Fast inference for reasoning tasks with lower computational requirements
- Deployment Flexibility: On-device AI capabilities with enhanced privacy through local processing
- Cost Effectiveness: Reduced operational costs compared to larger models while maintaining quality
- Isolation: Clean separation between model instances and secure execution environments
- Scalability: Easy horizontal scaling for increased reasoning throughput
By the end of this tutorial, you will be able to:
- Deploy and optimize Phi-4-mini-instruct in various containerized environments
- Implement advanced quantization and compression strategies for different deployment scenarios
- Configure production-ready container orchestration for reasoning workloads
- Evaluate and select appropriate deployment frameworks based on specific use case requirements
- Apply security, monitoring, and scaling best practices for containerized SLM deployments
Technical Details:
- Parameters: 3.8 billion (Small SLM category)
- Architecture: Dense decoder-only Transformer with grouped-query attention
- Context Length: 128K tokens (32K recommended for optimal performance)
- Vocabulary: 200K tokens with multilingual support
- Training Data: 5T tokens of high-quality reasoning-dense content
| Deployment Type | Min RAM | Recommended RAM | VRAM (GPU) | Storage | Typical Use Cases |
|---|---|---|---|---|---|
| Development | 6GB | 8GB | - | 8GB | Local testing, prototyping |
| Production CPU | 8GB | 12GB | - | 10GB | Edge servers, cost-optimized deployment |
| Production GPU | 6GB | 8GB | 4-6GB | 8GB | High-throughput reasoning services |
| Edge Optimized | 4GB | 6GB | - | 6GB | Quantized deployment, IoT gateways |
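The RAM figures in the table can be sanity-checked with back-of-the-envelope arithmetic: the weight footprint is roughly parameter count × bytes per parameter, with activations, KV cache, and runtime overhead on top. A minimal sketch (the 3.8e9 figure comes from the model card above; the function name is illustrative):

```python
def weight_footprint_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate model weight size in GiB at a given numeric precision."""
    return num_params * bits_per_param / 8 / (1024 ** 3)

PHI4_MINI_PARAMS = 3.8e9  # 3.8B parameters

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_footprint_gib(PHI4_MINI_PARAMS, bits):.1f} GiB")
# FP16 comes out near 7.1 GiB, INT8 near 3.5 GiB, INT4 near 1.8 GiB --
# consistent with the GPU and edge-optimized rows in the table.
```

This also explains why the edge-optimized row assumes quantization: only sub-8-bit weights fit comfortably in a 4GB budget.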
- Mathematical Excellence: Advanced arithmetic, algebra, and calculus problem solving
- Code Intelligence: Python, JavaScript, and multi-language code generation with debugging
- Logical Reasoning: Step-by-step problem decomposition and solution construction
- Educational Support: Detailed explanations suitable for learning and teaching scenarios
- Function Calling: Native support for tool integration and API interactions
vLLM provides excellent support for Phi-4-mini-instruct with optimized inference performance and OpenAI-compatible APIs, making it ideal for production reasoning services.
# CPU-optimized deployment for development and testing
docker run --name phi4-mini-dev \
-e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
-p 8000:8000 \
--memory="8g" --cpus="4" \
vllm/vllm-openai:latest \
--model microsoft/Phi-4-mini-instruct \
--max-model-len 4096 \
--max-num-seqs 4 \
--trust-remote-code

# GPU deployment for high-performance reasoning
docker run --runtime nvidia --gpus all \
--name phi4-mini-prod \
-e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model microsoft/Phi-4-mini-instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.8 \
--enable-auto-tool-choice \
--trust-remote-code

version: '3.8'
services:
phi4-mini-reasoning:
image: vllm/vllm-openai:latest
container_name: phi4-mini-production
ports:
- "8000:8000"
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
- ./logs:/app/logs
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
- CUDA_VISIBLE_DEVICES=0
command: >
--model microsoft/Phi-4-mini-instruct
--host 0.0.0.0
--port 8000
--max-model-len 4096
--max-num-seqs 8
--gpu-memory-utilization 0.8
--trust-remote-code
--enable-auto-tool-choice
--quantization awq
deploy:
resources:
limits:
memory: 12G
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
      retries: 3

# Test mathematical reasoning
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-4-mini-instruct",
"messages": [
{"role": "system", "content": "You are a mathematical reasoning assistant. Show your work step by step."},
{"role": "user", "content": "A train travels 240 km in 3 hours. If it increases its speed by 20 km/h, how long would the same journey take?"}
],
"max_tokens": 200,
"temperature": 0.3
}'
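Because vLLM exposes an OpenAI-compatible endpoint, the same request can be issued from Python. A stdlib-only sketch (the helper names are illustrative; the endpoint and model ID match the curl example above):

```python
import json
import urllib.request

def build_chat_payload(user_msg, system_msg=None, max_tokens=200, temperature=0.3):
    """Assemble an OpenAI-style chat-completions request body."""
    messages = []
    if system_msg:
        messages.append({"role": "system", "content": system_msg})
    messages.append({"role": "user", "content": user_msg})
    return {
        "model": "microsoft/Phi-4-mini-instruct",
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, system=None, base_url="http://localhost:8000"):
    """POST to the vLLM server and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt, system)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the container from above running:
#   print(chat("A train travels 240 km in 3 hours. What is its average speed?",
#              system="Show your work step by step."))
```

The same payload structure works with the official `openai` client by pointing its `base_url` at the container.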
# Test code generation
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-4-mini-instruct",
"messages": [
{"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence using dynamic programming. Include comments explaining the approach."}
],
"max_tokens": 300,
"temperature": 0.5
}'
# Test function calling capability
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-4-mini-instruct",
"messages": [
{"role": "user", "content": "Calculate the area of a circle with radius 5 units"}
],
"tools": [
{
"type": "function",
"function": {
"name": "calculate_circle_area",
"description": "Calculate the area of a circle given its radius",
"parameters": {
"type": "object",
"properties": {
"radius": {"type": "number", "description": "The radius of the circle"}
},
"required": ["radius"]
}
}
}
],
"tool_choice": "auto"
}'

Ollama streamlines deployment and management of Phi-4-mini-instruct, making it a good fit for development workflows and balanced production deployments.
# Deploy Ollama container with GPU support
docker run -d \
--name ollama-phi4 \
--gpus all \
-v ollama-data:/root/.ollama \
-p 11434:11434 \
--restart unless-stopped \
ollama/ollama:latest
# Pull Phi-4-mini-instruct model
docker exec ollama-phi4 ollama pull phi4-mini
# Test mathematical reasoning
docker exec ollama-phi4 ollama run phi4-mini \
"Solve this step by step: If compound interest on $5000 at 6% annually for 3 years, what is the final amount?"
# Test code generation
docker exec ollama-phi4 ollama run phi4-mini \
"Write a Python function to implement binary search with detailed comments"version: '3.8'
services:
ollama-phi4:
image: ollama/ollama:latest
container_name: ollama-phi4-production
ports:
- "11434:11434"
volumes:
- ollama-data:/root/.ollama
- ./modelfiles:/modelfiles
environment:
- OLLAMA_HOST=0.0.0.0
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_MAX_LOADED_MODELS=1
- OLLAMA_FLASH_ATTENTION=1
deploy:
resources:
limits:
memory: 12G
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 30s
timeout: 10s
retries: 3
# Web UI for interactive reasoning tasks
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: phi4-webui
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama-phi4:11434
- DEFAULT_MODELS=phi4-mini
depends_on:
- ollama-phi4
volumes:
- open-webui-data:/app/backend/data
volumes:
ollama-data:
  open-webui-data:

# Create reasoning-optimized variant
cat > /tmp/phi4-reasoning << EOF
FROM phi4-mini
PARAMETER temperature 0.3
PARAMETER top_p 0.8
SYSTEM """You are an expert reasoning assistant specialized in mathematics, logic, and code analysis.
Always think step by step and show your work clearly.
For mathematical problems, break down each calculation.
For coding problems, explain your approach and include comments."""
EOF
# Copy the Modelfile into the container, then build the variant
docker cp /tmp/phi4-reasoning ollama-phi4:/tmp/phi4-reasoning
docker exec ollama-phi4 ollama create phi4-mini-reasoning -f /tmp/phi4-reasoning
# Create code-focused variant
cat > /tmp/phi4-coder << EOF
FROM phi4-mini
PARAMETER temperature 0.5
PARAMETER top_p 0.9
SYSTEM """You are a coding assistant specialized in writing clean, efficient, and well-documented code.
Always include detailed comments explaining your approach.
Follow best practices for the target programming language.
Provide examples and test cases when helpful."""
EOF
docker cp /tmp/phi4-coder ollama-phi4:/tmp/phi4-coder
docker exec ollama-phi4 ollama create phi4-mini-coder -f /tmp/phi4-coder

# Mathematical reasoning via API
curl http://localhost:11434/api/generate -d '{
"model": "phi4-mini-reasoning",
"prompt": "A rectangle has length 15cm and width 8cm. If we increase both dimensions by 20%, what is the percentage increase in area?",
"stream": false,
"options": {
"temperature": 0.3,
"top_p": 0.8,
"num_ctx": 4096
}
}'
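The generate endpoint can also be consumed from Python. With `"stream": true`, Ollama returns one JSON object per line, each carrying a fragment of the response. A stdlib-only sketch (helper names are illustrative):

```python
import json
import urllib.request

def collect_stream(lines):
    """Concatenate the 'response' fragments from Ollama's NDJSON stream."""
    chunks = []
    for line in lines:
        part = json.loads(line)
        chunks.append(part.get("response", ""))
        if part.get("done"):  # final object signals completion
            break
    return "".join(chunks)

def ollama_generate(prompt, model="phi4-mini-reasoning",
                    host="http://localhost:11434"):
    """Stream a completion from the Ollama API and return the full text."""
    payload = {"model": model, "prompt": prompt, "stream": True,
               "options": {"temperature": 0.3, "num_ctx": 4096}}
    req = urllib.request.Request(f"{host}/api/generate",
                                 data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return collect_stream(resp)  # the response is iterable line by line
```

Streaming keeps long step-by-step answers responsive instead of blocking until the full completion is ready.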
# Code generation via API
curl http://localhost:11434/api/generate -d '{
"model": "phi4-mini-coder",
"prompt": "Create a Python class for a binary tree with methods for insertion, deletion, and in-order traversal. Include comprehensive docstrings.",
"stream": false,
"options": {
"temperature": 0.5,
"top_p": 0.9,
"num_ctx": 4096
}
}'

ONNX Runtime targets edge deployment of Phi-4-mini-instruct, combining graph optimization and quantization with cross-platform compatibility.
# Dockerfile for ONNX-optimized Phi-4-mini
FROM python:3.11-slim
RUN pip install --no-cache-dir \
onnxruntime-gpu \
optimum[onnxruntime] \
transformers \
fastapi \
uvicorn
COPY app/ /app/
WORKDIR /app
EXPOSE 8080
CMD ["python", "server.py"]# app/server.py - Optimized for Phi-4-mini reasoning tasks
import os
import time
import onnxruntime as ort
from transformers import AutoTokenizer
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="Phi-4-mini ONNX Engine")
class ReasoningRequest(BaseModel):
prompt: str
task_type: str = "reasoning" # reasoning, coding, math
max_length: int = 200
temperature: float = 0.3
class Phi4MiniEngine:
def __init__(self):
self.model = None
self.tokenizer = None
self.load_model()
def load_model(self):
model_path = "/app/models/phi4-mini-instruct.onnx"
if os.path.exists(model_path):
# Optimized providers for reasoning tasks
providers = [
('CUDAExecutionProvider', {
'arena_extend_strategy': 'kSameAsRequested',
'cudnn_conv_algo_search': 'HEURISTIC',
}),
('CPUExecutionProvider', {
'intra_op_num_threads': 4,
'inter_op_num_threads': 2,
})
]
self.model = ort.InferenceSession(model_path, providers=providers)
self.tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")
print("✓ Phi-4-mini model loaded successfully")
else:
print("✗ Model file not found. Please convert the model first.")
def generate_reasoning(self, request: ReasoningRequest):
if not self.model:
raise ValueError("Model not loaded")
# Task-specific prompting for better reasoning
task_prompts = {
"reasoning": "Think step by step and show your reasoning clearly:",
"math": "Solve this mathematical problem step by step:",
"coding": "Write clean, well-commented code for this task:"
}
system_prompt = task_prompts.get(request.task_type, "")
full_prompt = f"{system_prompt}\n{request.prompt}"
        # Tokenize, then run a simple greedy decode loop. This is a sketch:
        # it assumes the exported graph takes "input_ids" and returns
        # next-token logits without a KV cache; adapt the feed/fetch names
        # to your actual export.
        import numpy as np  # in practice, import at the top of the file
        input_ids = self.tokenizer.encode(full_prompt, return_tensors="np", max_length=2048, truncation=True)
        start_time = time.time()
        generated = input_ids
        for _ in range(request.max_length):
            logits = self.model.run(None, {"input_ids": generated})[0]
            next_token = int(np.argmax(logits[0, -1]))
            if next_token == self.tokenizer.eos_token_id:
                break
            generated = np.concatenate([generated, [[next_token]]], axis=1)
        inference_time = time.time() - start_time
        # Decode the full sequence (prompt plus generated tokens)
        generated_text = self.tokenizer.decode(generated[0], skip_special_tokens=True)
return {
"generated_text": generated_text,
"task_type": request.task_type,
"inference_time": inference_time,
"model": "phi4-mini-instruct-onnx"
}
# Initialize engine
engine = Phi4MiniEngine()
@app.post("/reasoning")
async def generate_reasoning(request: ReasoningRequest):
try:
return engine.generate_reasoning(request)
except Exception as e:
return {"error": str(e)}
@app.get("/health")
async def health():
return {
"status": "healthy" if engine.model else "model_not_loaded",
"model": "phi4-mini-instruct",
"capabilities": ["reasoning", "math", "coding"]
}
@app.get("/")
async def root():
return {
"name": "Phi-4-mini ONNX Engine",
"model": "microsoft/Phi-4-mini-instruct",
"endpoints": ["/reasoning", "/health"],
"capabilities": ["mathematical reasoning", "code generation", "logical problem solving"]
    }

# convert_phi4_mini.py - Convert Phi-4-mini to optimized ONNX
import os
from pathlib import Path
from optimum.onnxruntime import ORTModelForCausalLM, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoOptimizationConfig
from transformers import AutoTokenizer
def convert_phi4_mini():
print("Converting Phi-4-mini-instruct to optimized ONNX...")
model_name = "microsoft/Phi-4-mini-instruct"
output_dir = Path("./models/phi4-mini-onnx")
output_dir.mkdir(parents=True, exist_ok=True)
# Step 1: Convert to ONNX
print("Step 1: Converting to ONNX format...")
model = ORTModelForCausalLM.from_pretrained(
model_name,
export=True,
provider="CPUExecutionProvider",
use_cache=True
)
# Step 2: Apply optimizations for reasoning tasks
print("Step 2: Applying reasoning-specific optimizations...")
optimization_config = AutoOptimizationConfig.with_optimization_level(
optimization_level="O3",
optimize_for_gpu=True,
fp16=True
)
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir=output_dir, optimization_config=optimization_config)
# Step 3: Apply quantization for edge deployment
print("Step 3: Applying quantization...")
quantization_config = AutoQuantizationConfig.avx512_vnni(
is_static=False,
per_channel=True
)
quantizer = ORTQuantizer.from_pretrained(output_dir)
quantizer.quantize(
save_dir=output_dir / "quantized",
quantization_config=quantization_config
)
# Step 4: Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_dir)
# Step 5: Create final optimized model
final_model_path = Path("./models/phi4-mini-instruct.onnx")
quantized_files = list((output_dir / "quantized").glob("*.onnx"))
if quantized_files:
import shutil
shutil.copy2(quantized_files[0], final_model_path)
print(f"✓ Phi-4-mini converted and optimized: {final_model_path}")
return final_model_path
if __name__ == "__main__":
    convert_phi4_mini()

version: '3.8'
services:
# Model conversion service (run once)
phi4-converter:
build: .
container_name: phi4-converter
volumes:
- ./models:/app/models
- ~/.cache/huggingface:/root/.cache/huggingface
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
command: python convert_phi4_mini.py
profiles: ["convert"]
# Main reasoning engine
phi4-onnx:
build: .
container_name: phi4-onnx-engine
ports:
- "8080:8080"
volumes:
- ./models:/app/models:ro
environment:
- LOG_LEVEL=INFO
deploy:
resources:
limits:
memory: 8G
cpus: '4'
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
      retries: 3

# Test mathematical reasoning
curl -X POST http://localhost:8080/reasoning \
-H "Content-Type: application/json" \
-d '{
"prompt": "If a car travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, what is the total distance traveled?",
"task_type": "math",
"max_length": 150,
"temperature": 0.3
}'
# Test code generation
curl -X POST http://localhost:8080/reasoning \
-H "Content-Type: application/json" \
-d '{
"prompt": "Create a Python function to find the greatest common divisor of two numbers using the Euclidean algorithm",
"task_type": "coding",
"max_length": 250,
"temperature": 0.5
}'
# Test logical reasoning
curl -X POST http://localhost:8080/reasoning \
-H "Content-Type: application/json" \
-d '{
"prompt": "All cats are mammals. Some mammals are carnivores. Can we conclude that some cats are carnivores?",
"task_type": "reasoning",
"max_length": 200,
"temperature": 0.3
}'

| Feature | vLLM | Ollama | ONNX Runtime |
|---|---|---|---|
| Setup Complexity | Moderate | Easy | Complex |
| Performance (GPU) | Excellent (~25 tok/s) | Very Good (~20 tok/s) | Good (~15 tok/s) |
| Performance (CPU) | Good (~8 tok/s) | Very Good (~12 tok/s) | Excellent (~15 tok/s) |
| Memory Usage | 8-12GB | 6-10GB | 4-8GB |
| API Compatibility | OpenAI Compatible | Custom REST | Custom FastAPI |
| Function Calling | ✅ Native | ✅ Supported | Custom implementation |
| Quantization Support | AWQ, GPTQ | Q4_0, Q5_1, Q8_0 | ONNX Quantization |
| Production Ready | ✅ Excellent | ✅ Very Good | ✅ Good |
| Edge Deployment | Good | Excellent | Outstanding |
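One way to operationalize the comparison is a small selection helper. The rules below are a heuristic distillation of the table rows above, not a definitive policy, and the function name is illustrative:

```python
def recommend_framework(gpu_available: bool,
                        edge_constrained: bool,
                        needs_openai_api: bool) -> str:
    """Pick a deployment framework using the comparison table's heuristics."""
    if needs_openai_api:
        return "vLLM"          # the only option with a native OpenAI-compatible API
    if edge_constrained:
        return "ONNX Runtime"  # lowest memory footprint, best CPU throughput
    if gpu_available:
        return "vLLM"          # highest GPU tokens/sec
    return "Ollama"            # easiest setup, very good CPU performance
```

The ordering matters: edge constraints are checked before GPU availability, since a GPU-equipped edge box may still be memory-bound.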
- Microsoft Phi-4 Model Card: Detailed specifications and usage guidelines
- vLLM Documentation: Advanced configuration and optimization options
- Ollama Model Library: Community models and customization examples
- ONNX Runtime Guides: Performance optimization and deployment strategies
- Hugging Face Transformers: For model interaction and customization
- OpenAI API Specification: For vLLM compatibility testing
- Docker Best Practices: Container security and optimization guidelines
- Kubernetes Deployment: Orchestration patterns for production scaling
- SLM Performance Benchmarking: Comparative analysis methodologies
- Edge AI Deployment: Best practices for resource-constrained environments
- Reasoning Task Optimization: Prompting strategies for mathematical and logical problems
- Container Security: Hardening practices for AI model deployments
After completing this module, you will be able to:
- Deploy Phi-4-mini-instruct model in containerized environments using multiple frameworks
- Configure and optimize SLM deployments for different hardware environments
- Implement security best practices for containerized AI deployments
- Compare and select appropriate deployment frameworks based on specific use case requirements
- Apply monitoring and scaling strategies for production-grade SLM services