This project provides a FastAPI server for generating text embeddings using the Jina Embeddings V3 model.
I just had to create this to offload the embedding generation from my notebook (fans go brrr) to a server with a GPU and couldn't find a simple solution that met my needs. This server is designed to be easy to deploy and use with no extra frills.
Features:

- HTTP API for generating text embeddings
- Support for task-specific embeddings (retrieval, classification, etc.)
- GPU acceleration support via CUDA
- Detailed performance timing metrics
- Interactive demo page
Out of scope:
- No authentication or rate limiting. If you need these features, put a reverse proxy such as Nginx or Traefik in front.
- No caching - embeddings are always regenerated. If that matters for your workload, a simple client-side workaround is sketched below.
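Since the server itself never caches, one low-effort option is to memoize on the client. A minimal sketch, assuming the server from the Docker examples is reachable on `http://localhost` and the `requests` package is installed (the `embed_cached` helper is illustrative, not part of this project):

```python
from functools import lru_cache

import requests

@lru_cache(maxsize=4096)
def embed_cached(text: str, task: str | None = None) -> tuple[float, ...]:
    """Fetch an embedding once per (text, task) pair and reuse it afterwards."""
    payload = {"text": text}
    if task is not None:
        payload["task"] = task
    # Assumed base URL; adjust if you run the server on another host/port.
    resp = requests.post("http://localhost/v1/embed", json=payload, timeout=60)
    resp.raise_for_status()
    # Return a tuple so the result is hashable and safe to cache.
    return tuple(resp.json()["embedding"])
```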
Requirements:

- Python 3.12+
- uv package manager (recommended)
- NVIDIA drivers for GPU support (optional)
Installation:

```bash
# For CPU-only
uv sync --extra cpu

# For NVIDIA GPU support (CUDA 12.6+)
uv sync --extra cuda
```
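After installing the `cuda` extra, you can check that PyTorch (which the Jina model runs on) actually sees your GPU. This is a generic PyTorch check, not a command provided by this project:

```bash
uv run python -c "import torch; print(torch.cuda.is_available())"
```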
Running the server:

```bash
uvicorn jina_server.serve:app --host 0.0.0.0 --port 8000
```

Visit http://localhost:8000 to access the demo web UI.
Docker:

```bash
# CPU image
docker build -t jina-server .
docker run -p 80:80 jina-server

# NVIDIA GPU image
docker build -f Dockerfile.nvidia -t jina-server-gpu .
docker run --gpus all -p 80:80 jina-server-gpu

# Or via Docker Compose
docker-compose -f docker-compose.cpu.yml up
docker-compose -f docker-compose.nvidia.yml up
```
Endpoint: `POST /v1/embed`

Request format:

```json
{
  "text": "Your text to embed",
  "task": "retrieval.query"
}
```

The `task` field selects a fine-tuned task-specific embedding and is optional; it may be omitted.
Valid task options:

- `retrieval.query`
- `retrieval.passage`
- `separation`
- `classification`
- `text-matching`
Example request:

```bash
curl -X POST "http://localhost/v1/embed" \
  -H "Content-Type: application/json" \
  -d '{"text": "Sample text for embedding"}'
```
Response example:

```json
{
  "embedding": [0.9876, ...],
  "model_name": "jinaai/jina-embeddings-v3",
  "queue_duration_seconds": 0.002541,
  "embedding_duration_seconds": 0.041536
}
```
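For programmatic use, a minimal Python client sketch might look like this. It assumes the server from the Docker examples is reachable on `http://localhost` and that the `requests` package is installed; the `embed` and `cosine_similarity` helpers are illustrative, not part of this project:

```python
import math

import requests

BASE_URL = "http://localhost"  # adjust if running via uvicorn (e.g. http://localhost:8000)

def embed(text: str, task: str | None = None) -> list[float]:
    """Request an embedding from the /v1/embed endpoint."""
    payload = {"text": text}
    if task is not None:
        payload["task"] = task
    resp = requests.post(f"{BASE_URL}/v1/embed", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# The asymmetric retrieval tasks embed queries and passages differently,
# so use retrieval.query for the question and retrieval.passage for the document.
query = embed("What is the capital of France?", task="retrieval.query")
passage = embed("Paris is the capital and largest city of France.", task="retrieval.passage")
print(f"similarity: {cosine_similarity(query, passage):.4f}")
```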