Commit 89499ae

Feature/vllm support (#2981)
1 parent 386d8b8 commit 89499ae

File tree

10 files changed: +430 -1 lines changed

docs/components/llms/config.mdx

Lines changed: 2 additions & 0 deletions

@@ -58,6 +58,7 @@ config = {
 
 m = Memory.from_config(config)
 m.add("Your text here", user_id="user", metadata={"category": "example"})
+
 ```

@@ -76,6 +77,7 @@ const config = {
 const memory = new Memory(config);
 await memory.add("Your text here", { userId: "user123", metadata: { category: "example" } });
 ```
+
 </CodeGroup>

 ## Why is Config Needed?
docs/components/llms/models/vllm.mdx

Lines changed: 109 additions & 0 deletions

@@ -0,0 +1,109 @@
---
title: vLLM
---

<Snippet file="paper-release.mdx" />

[vLLM](https://docs.vllm.ai/) is a high-performance inference engine for large language models, designed to maximize throughput and memory efficiency when serving LLMs locally.

## Prerequisites

1. **Install vLLM**:

```bash
pip install vllm
```

2. **Start the vLLM server**:

```bash
# For testing with a small model
vllm serve microsoft/DialoGPT-medium --port 8000

# For production with a larger model (requires GPU)
vllm serve Qwen/Qwen2.5-32B-Instruct --port 8000
```

## Usage

```python
import os
from mem0 import Memory

os.environ["OPENAI_API_KEY"] = "your-api-key"  # used for the embedding model

config = {
    "llm": {
        "provider": "vllm",
        "config": {
            "model": "Qwen/Qwen2.5-32B-Instruct",
            "vllm_base_url": "http://localhost:8000/v1",
            "temperature": 0.1,
            "max_tokens": 2000,
        }
    }
}

m = Memory.from_config(config)
messages = [
    {"role": "user", "content": "I'm planning to watch a movie tonight. Any recommendations?"},
    {"role": "assistant", "content": "How about thriller movies? They can be quite engaging."},
    {"role": "user", "content": "I'm not a big fan of thrillers, but I love sci-fi movies."},
    {"role": "assistant", "content": "Got it! I'll avoid thrillers and suggest sci-fi movies instead."}
]
m.add(messages, user_id="alice", metadata={"category": "movies"})
```
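After `m.add(...)`, the same `Memory` instance can be queried. A brief follow-up sketch, mirroring the search calls in the bundled `examples/misc/vllm_example.py` (the query text is illustrative):

```python
# Retrieve memories relevant to a query; the vLLM-backed LLM handles the underlying calls.
memories = m.search(query="What movies does alice like?", user_id="alice")
for memory_item in memories:
    print(memory_item["memory"])
```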
## Configuration Parameters

| Parameter       | Description                       | Default                       | Environment Variable |
| --------------- | --------------------------------- | ----------------------------- | -------------------- |
| `model`         | Model name running on vLLM server | `"Qwen/Qwen2.5-32B-Instruct"` | -                    |
| `vllm_base_url` | vLLM server URL                   | `"http://localhost:8000/v1"`  | `VLLM_BASE_URL`      |
| `api_key`       | API key (dummy for local)         | `"vllm-api-key"`              | `VLLM_API_KEY`       |
| `temperature`   | Sampling temperature              | `0.1`                         | -                    |
| `max_tokens`    | Maximum tokens to generate        | `2000`                        | -                    |
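For reference, a config that sets every documented parameter explicitly might look like this (values are the defaults from the table; the `api_key` is a placeholder for a local server):

```python
config = {
    "llm": {
        "provider": "vllm",
        "config": {
            "model": "Qwen/Qwen2.5-32B-Instruct",         # model served by vLLM
            "vllm_base_url": "http://localhost:8000/v1",  # OpenAI-compatible endpoint
            "api_key": "vllm-api-key",                    # dummy key for local use
            "temperature": 0.1,
            "max_tokens": 2000,
        }
    }
}
```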
## Environment Variables

You can set these environment variables instead of specifying them in the config:

```bash
export VLLM_BASE_URL="http://localhost:8000/v1"
export VLLM_API_KEY="your-vllm-api-key"
export OPENAI_API_KEY="your-openai-api-key"  # for embeddings
```
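With those variables exported, the vLLM-specific keys can be left out of the config entirely. A minimal sketch, assuming the environment variables above are set:

```python
from mem0 import Memory

# vllm_base_url and api_key are read from VLLM_BASE_URL / VLLM_API_KEY.
config = {
    "llm": {
        "provider": "vllm",
        "config": {
            "model": "Qwen/Qwen2.5-32B-Instruct",
        }
    }
}

m = Memory.from_config(config)
```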
## Benefits

- **High Performance**: 2-24x faster inference than standard implementations
- **Memory Efficient**: Optimized memory usage with PagedAttention
- **Local Deployment**: Keep your data private and reduce API costs
- **Easy Integration**: Drop-in replacement for other LLM providers
- **Flexible**: Works with any model supported by vLLM
## Troubleshooting

1. **Server not responding**: Make sure the vLLM server is running

```bash
curl http://localhost:8000/health
```

2. **404 errors**: Ensure the base URL uses the correct format

```python
"vllm_base_url": "http://localhost:8000/v1"  # Note the /v1
```

3. **Model not found**: Check that the model name in your config matches the one the server is serving (see the check after this list)

4. **Out of memory**: Try a smaller model or reduce `max_model_len`

```bash
vllm serve Qwen/Qwen2.5-32B-Instruct --max-model-len 4096
```
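One quick way to see which models the server exposes is the OpenAI-compatible models endpoint that vLLM serves (the port assumes the default setup above):

```bash
# Lists the model IDs served by the running vLLM instance; the "model" value
# in your mem0 config must match one of these IDs exactly.
curl http://localhost:8000/v1/models
```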
## Config

All available parameters for the `vllm` config are listed in the [Master List of All Params in Config](../config).

docs/docs.json

Lines changed: 2 additions & 1 deletion

@@ -117,7 +117,8 @@
 "components/llms/models/xAI",
 "components/llms/models/sarvam",
 "components/llms/models/lmstudio",
-"components/llms/models/langchain"
+"components/llms/models/langchain",
+"components/llms/models/vllm"
 ]
 }
 ]

examples/misc/vllm_example.py

Lines changed: 144 additions & 0 deletions

@@ -0,0 +1,144 @@
"""
Example of using vLLM with mem0 for high-performance memory operations.

SETUP INSTRUCTIONS:
1. Install vLLM:
   pip install vllm

2. Start vLLM server (in a separate terminal):
   vllm serve microsoft/DialoGPT-small --port 8000

   Wait for the message: "Uvicorn running on http://0.0.0.0:8000"
   (Small model: ~500MB download, much faster!)

3. Verify server is running:
   curl http://localhost:8000/health

4. Run this example:
   python examples/misc/vllm_example.py

Optional environment variables:
   export VLLM_BASE_URL="http://localhost:8000/v1"
   export VLLM_API_KEY="vllm-api-key"
"""

from mem0 import Memory

# Configuration for vLLM integration
config = {
    "llm": {
        "provider": "vllm",
        "config": {
            "model": "Qwen/Qwen2.5-32B-Instruct",
            "vllm_base_url": "http://localhost:8000/v1",
            "api_key": "vllm-api-key",
            "temperature": 0.7,
            "max_tokens": 100,
        }
    },
    "embedder": {
        "provider": "openai",
        "config": {
            "model": "text-embedding-3-small"
        }
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "vllm_memories",
            "host": "localhost",
            "port": 6333
        }
    }
}


def main():
    """
    Demonstrate vLLM integration with mem0
    """
    print("--> Initializing mem0 with vLLM...")

    # Initialize memory with vLLM
    memory = Memory.from_config(config)

    print("--> Memory initialized successfully!")

    # Example conversations to store
    conversations = [
        {
            "messages": [
                {"role": "user", "content": "I love playing chess on weekends"},
                {"role": "assistant", "content": "That's great! Chess is an excellent strategic game that helps improve critical thinking."}
            ],
            "user_id": "user_123"
        },
        {
            "messages": [
                {"role": "user", "content": "I'm learning Python programming"},
                {"role": "assistant", "content": "Python is a fantastic language for beginners! What specific areas are you focusing on?"}
            ],
            "user_id": "user_123"
        },
        {
            "messages": [
                {"role": "user", "content": "I prefer working late at night, I'm more productive then"},
                {"role": "assistant", "content": "Many people find they're more creative and focused during nighttime hours. It's important to maintain a consistent schedule that works for you."}
            ],
            "user_id": "user_123"
        }
    ]

    print("\n--> Adding memories using vLLM...")

    # Add memories - now powered by vLLM's high-performance inference
    for i, conversation in enumerate(conversations, 1):
        result = memory.add(
            messages=conversation["messages"],
            user_id=conversation["user_id"]
        )
        print(f"Memory {i} added: {result}")

    print("\n🔍 Searching memories...")

    # Search memories - vLLM will process the search and memory operations
    search_queries = [
        "What does the user like to do on weekends?",
        "What is the user learning?",
        "When is the user most productive?"
    ]

    for query in search_queries:
        print(f"\nQuery: {query}")
        memories = memory.search(
            query=query,
            user_id="user_123"
        )

        for memory_item in memories:
            print(f"  - {memory_item['memory']}")

    print("\n--> Getting all memories for user...")
    all_memories = memory.get_all(user_id="user_123")
    print(f"Total memories stored: {len(all_memories)}")

    for memory_item in all_memories:
        print(f"  - {memory_item['memory']}")

    print("\n--> vLLM integration demo completed successfully!")
    print("\nBenefits of using vLLM:")
    print("  -> 2.7x higher throughput compared to standard implementations")
    print("  -> 5x faster time-per-output-token")
    print("  -> Efficient memory usage with PagedAttention")
    print("  -> Simple configuration, same as other providers")


if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print(f"=> Error: {e}")
        print("\nTroubleshooting:")
        print("1. Make sure vLLM server is running: vllm serve microsoft/DialoGPT-small --port 8000")
        print("2. Check if the model is downloaded and accessible")
        print("3. Verify the base URL and port configuration")
        print("4. Ensure you have the required dependencies installed")
mem0/configs/llms/base.py

Lines changed: 7 additions & 0 deletions

@@ -44,6 +44,8 @@ def __init__(
         # LM Studio specific
         lmstudio_base_url: Optional[str] = "http://localhost:1234/v1",
         lmstudio_response_format: dict = None,
+        # vLLM specific
+        vllm_base_url: Optional[str] = "http://localhost:8000/v1",
         # AWS Bedrock specific
         aws_access_key_id: Optional[str] = None,
         aws_secret_access_key: Optional[str] = None,
@@ -98,6 +100,8 @@ def __init__(
         :type lmstudio_base_url: Optional[str], optional
         :param lmstudio_response_format: LM Studio response format to be used, defaults to None
         :type lmstudio_response_format: Optional[Dict], optional
+        :param vllm_base_url: vLLM base URL to be used, defaults to "http://localhost:8000/v1"
+        :type vllm_base_url: Optional[str], optional
         """

         self.model = model
@@ -139,6 +143,9 @@ def __init__(
         self.lmstudio_base_url = lmstudio_base_url
         self.lmstudio_response_format = lmstudio_response_format

+        # vLLM specific
+        self.vllm_base_url = vllm_base_url
+
         # AWS Bedrock specific
         self.aws_access_key_id = aws_access_key_id
         self.aws_secret_access_key = aws_secret_access_key
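For illustration, the new parameter can be passed straight to the base config class. A minimal sketch, assuming the class defined in mem0/configs/llms/base.py is the usual `BaseLlmConfig` and keeps the signature shown above:

```python
from mem0.configs.llms.base import BaseLlmConfig

# vllm_base_url now sits alongside the other provider-specific base URLs.
llm_config = BaseLlmConfig(
    model="Qwen/Qwen2.5-32B-Instruct",
    vllm_base_url="http://localhost:8000/v1",
)
print(llm_config.vllm_base_url)  # http://localhost:8000/v1
```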

mem0/llms/configs.py

Lines changed: 1 addition & 0 deletions

@@ -26,6 +26,7 @@ def validate_config(cls, v, values):
             "xai",
             "sarvam",
             "lmstudio",
+            "vllm",
             "langchain",
         ):
             return v
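The practical effect is that "vllm" now passes provider validation. A small sketch, assuming the provider config model exported from mem0/llms/configs.py is named `LlmConfig` (an assumption; the class name is not shown in this diff):

```python
from mem0.llms.configs import LlmConfig  # assumed class name

# "vllm" is now an accepted provider, so this validates instead of raising.
llm_config = LlmConfig(provider="vllm", config={"model": "Qwen/Qwen2.5-32B-Instruct"})
print(llm_config.provider)  # vllm
```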

0 commit comments
