Setup guide for enabling the AI assistant in CISO Assistant.
The chat feature is disabled by default. Set the ENABLE_CHAT environment variable to enable it:
export ENABLE_CHAT=true

For Docker deployments, add it to your docker-compose.yml or .env file. This controls:
- Visibility of the chat_mode feature flag in Settings
- Visibility of the Chat/AI settings section (LLM provider, model, etc.)
- Signal handlers for RAG indexing
- Knowledge graph pre-warming at startup
Without ENABLE_CHAT=true, the chat app is installed (for migrations) but completely dormant.
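A deployment script can check the flag before starting the stack. The sketch below is a hypothetical pre-flight helper, not part of CISO Assistant. It assumes that only the literal value true (ignoring case and surrounding whitespace) enables chat; the backend's own parsing may be stricter or looser.

```python
import os


def chat_enabled(env=None):
    """Return True only when ENABLE_CHAT carries the documented value."""
    env = os.environ if env is None else env
    # Assumption: "true" is the only enabling value; anything else,
    # including an unset variable, leaves the chat app dormant.
    return env.get("ENABLE_CHAT", "").strip().lower() == "true"


if __name__ == "__main__":
    print("chat enabled" if chat_enabled() else "chat disabled (default)")
```

Passing a plain dict instead of the real environment makes the helper easy to exercise in a test or a CI step.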
| Component | Purpose | Required? |
|---|---|---|
| LLM server | Local LLM inference — Ollama, LM Studio, MLX, or llama.cpp | Yes (one of them) |
| Qdrant | Vector database for RAG | Yes |
| Huey worker | Background indexing tasks | Recommended |
# macOS
brew install ollama
ollama serve
# Pull a model (mistral is the default)
ollama pull mistral
# Optional: pull an embedding model
ollama pull snowflake-arctic-embed2

You can use any model Ollama supports. Smaller models (mistral, phi3) work but are less reliable at tool selection. Larger models (llama3, mixtral) give better results.
- Download from lmstudio.ai
- Load a model and start the local server (default: http://localhost:1234/v1)
- Set llm_provider to openai_compatible in settings (see step 3)
Best performance on Mac — uses Metal natively.
pip install mlx-lm
mlx_lm.server --model mlx-community/gpt-oss-20b-MXFP4-Q4 --port 8080

Set llm_provider to openai_compatible and openai_api_base to http://localhost:8080/v1.
Lightweight, supports GGUF models.
brew install llama.cpp
llama-server -m ./models/your-model.gguf -c 8192 -ngl 999 --port 8081

Set llm_provider to openai_compatible and openai_api_base to http://localhost:8081/v1.
docker run -d --name qdrant -p 6333:6333 -v qdrant_data:/qdrant/storage qdrant/qdrant

Default URL: http://localhost:6333. Override with the QDRANT_URL environment variable if needed.
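The override rule can be pictured as a simple lookup with a default. This is a minimal sketch with hypothetical helper names (qdrant_url, collections_endpoint); it assumes an empty QDRANT_URL is treated like an unset one, which the guide does not specify.

```python
import os

DEFAULT_QDRANT_URL = "http://localhost:6333"  # default stated in this guide


def qdrant_url(env=None):
    """Return QDRANT_URL when set and non-empty, else the documented default."""
    env = os.environ if env is None else env
    # Assumption: an empty value falls back to the default.
    return (env.get("QDRANT_URL") or DEFAULT_QDRANT_URL).rstrip("/")


def collections_endpoint(env=None):
    """URL of Qdrant's collection listing, useful for a reachability check."""
    return qdrant_url(env) + "/collections"


if __name__ == "__main__":
    print(collections_endpoint())
```

In a Docker Compose setup the variable would typically point at the service name rather than localhost, for example http://qdrant:6333 if the service is called qdrant.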
# Create the Qdrant collection with proper indexes
.venv/bin/python backend/manage.py init_qdrant
# If you need to reset it
.venv/bin/python backend/manage.py init_qdrant --recreate

# Index existing objects (risks, controls, assets, etc.)
.venv/bin/python backend/manage.py index_objects
# Index all framework libraries (150+ frameworks → Qdrant)
.venv/bin/python backend/manage.py index_libraries --sync
index_libraries parses all YAML files in backend/library/libraries/ and indexes requirement nodes, threats, and reference controls. This can take a few minutes.
New objects are indexed automatically via Django signals, but this requires Huey:
cd backend
poetry run python manage.py run_huey -w 2 -k process

Go to the admin panel or use the API to update settings.
In Settings > Feature Flags, enable Chat Mode.
In Settings > General, set:
| Setting | Default | Description |
|---|---|---|
| llm_provider | ollama | ollama or openai_compatible |
| ollama_base_url | http://localhost:11434 | Ollama server URL |
| ollama_model | mistral | Model name for chat generation |
| ollama_embed_model | snowflake-arctic-embed2 | Model for embeddings (if using Ollama embeddings) |
| embedding_backend | sentence-transformers | sentence-transformers (CPU, no setup) or ollama |
| openai_api_base | http://localhost:1234/v1 | For LM Studio / vLLM / llama.cpp |
| openai_model | (empty) | Model identifier for OpenAI-compatible servers |
| openai_api_key | (empty) | API key for authenticated endpoints (optional) |
| chat_system_prompt | (empty) | Custom system prompt (overrides the built-in GRC prompt) |
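To make the table concrete, the sketch below assembles a settings dictionary from the documented defaults and rejects values the table does not list. It is illustrative only: build_settings is a hypothetical helper, and the actual API endpoint and payload shape for updating settings are not described in this guide.

```python
# Defaults copied from the settings table.
DEFAULTS = {
    "llm_provider": "ollama",
    "ollama_base_url": "http://localhost:11434",
    "ollama_model": "mistral",
    "ollama_embed_model": "snowflake-arctic-embed2",
    "embedding_backend": "sentence-transformers",
    "openai_api_base": "http://localhost:1234/v1",
    "openai_model": "",
    "openai_api_key": "",
    "chat_system_prompt": "",
}

# Only these two settings have an enumerated set of values in the table.
ALLOWED = {
    "llm_provider": {"ollama", "openai_compatible"},
    "embedding_backend": {"sentence-transformers", "ollama"},
}


def build_settings(**overrides):
    """Merge overrides into the defaults and validate enumerated values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown setting(s): {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    for key, allowed in ALLOWED.items():
        if settings[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    return settings


# Example: point the assistant at a llama.cpp server on port 8081.
llama_cpp = build_settings(
    llm_provider="openai_compatible",
    openai_api_base="http://localhost:8081/v1",
)
```

Settings that are not overridden keep their defaults, so switching provider does not disturb the embedding configuration.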
Set embedding_backend to sentence-transformers. This uses paraphrase-multilingual-MiniLM-L12-v2 locally on CPU. The model (~130 MB) downloads automatically on first use. Multilingual support included.
This is the default and requires no external service.
Set embedding_backend to ollama and pull an embedding model:
ollama pull snowflake-arctic-embed2

Higher quality than sentence-transformers but requires the Ollama server to be running.
If Ollama embeddings fail, the system automatically falls back to sentence-transformers.
GET /api/chat/status/
Returns the health of the LLM and embedding backends.
GET /api/chat/ollama-models/
Lists models available on your Ollama server.
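Both endpoints can be called from a script. The sketch below uses only the standard library; the base URL (http://localhost:8000), the Token authorization scheme, and the helper names are assumptions, and the response fields are not documented here, so the result is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request
from urllib.parse import urljoin


def chat_url(base_url, path):
    """Join the backend base URL and an API path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def get_json(url, token=None, timeout=5):
    """GET a URL and parse the JSON body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    if token:
        # Assumption: token-based auth with a "Token" prefix.
        request.add_header("Authorization", f"Token {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    base = "http://localhost:8000"  # assumed backend address
    print(get_json(chat_url(base, "/api/chat/status/")))
    print(get_json(chat_url(base, "/api/chat/ollama-models/")))
```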
Click the chat widget in the bottom-right corner of the app (only visible when chat mode is enabled).
- Check that ENABLE_CHAT=true is set in your environment
- Check that the chat_mode feature flag is enabled in Settings > Feature Flags
- Hard-refresh the browser (Ctrl+Shift+R)
- The ENABLE_CHAT env var must be set to true; the Chat/AI settings section is hidden otherwise
The system degrades gracefully: if no LLM is reachable, it returns retrieved context without generation. Check:
- Ollama is running: curl http://localhost:11434/api/tags
- Or LM Studio server is running: curl http://localhost:1234/v1/models
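The same two checks can be scripted. This is a minimal sketch using the default ports from this guide; is_up and reachable_servers are hypothetical helpers, and the opener parameter exists only so the probe can be tested without a live server.

```python
import urllib.error
import urllib.request

# Default endpoints from this guide; adjust if your servers run elsewhere.
LLM_ENDPOINTS = {
    "ollama": "http://localhost:11434/api/tags",
    "lm_studio": "http://localhost:1234/v1/models",
}


def is_up(url, opener=urllib.request.urlopen, timeout=3):
    """True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def reachable_servers(opener=urllib.request.urlopen):
    """Names of the LLM servers that currently respond."""
    return [name for name, url in LLM_ENDPOINTS.items() if is_up(url, opener)]


if __name__ == "__main__":
    print(reachable_servers() or "no LLM server reachable")
```

An empty result corresponds to the degraded mode described above, where retrieved context is returned without generation.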
- Verify Qdrant is running: curl http://localhost:6333/collections
- Check the collection exists: curl http://localhost:6333/collections/ciso_assistant
- Re-run indexing: python manage.py init_qdrant --recreate && python manage.py index_objects && python manage.py index_libraries --sync
The knowledge graph builds lazily on first use from YAML files (~27s). If framework queries return empty results:
- Check that library YAML files exist in backend/library/libraries/
- The graph is independent of Qdrant; it reads YAML directly
Small models (< 7B params) struggle with function calling. Options:
- Use a larger model (ollama pull mixtral or llama3)
- The system has deterministic keyword-based pre-routing for common workflows (suggest controls, risk treatment, evidence guidance) that bypasses the LLM
Chat sessions accumulate over time. Use the management command to purge old ones:
# Preview what would be deleted
.venv/bin/python backend/manage.py cleanup_sessions --days 90 --dry-run
# Delete sessions older than 90 days
.venv/bin/python backend/manage.py cleanup_sessions --days 90

Messages cascade-delete with their sessions.
See ARCHITECTURE.md for the full design, diagrams, and component reference.