### Context-Enabled Semantic Caching with Spring AI Demo

Semantic Caching is a technique that enhances Large Language Model (LLM) applications by caching responses based on the semantic meaning of queries rather than exact matches.

Even though Semantic Caching can save costs and time, it may come with downsides depending on the business it is applied to.

Sometimes prompts are similar but refer to different contexts. For example: `What kind of beer goes well with meat?` and `What kind of beer goes well with Pizza?`

These two prompts are semantically similar, but they refer to two different contexts: `Meat` and `Pizza`. This is where Context-Enabled Semantic Caching can help.

Instead of relying solely on semantic caching, we can hand the cached response, along with the newly provided information, to a less capable, cheaper, and faster model, so that it can generate a response that satisfies the new prompt while keeping the information, tone, and other characteristics that came from the more capable model.

This demo showcases how to implement Context-Enabled Semantic Caching using Spring AI and Redis Vector Store to improve performance and reduce costs in a beer recommendation system.

## Learning resources

- Video: [What is semantic caching?](https://www.youtube.com/watch?v=AtVTT_s8AGc)
- Video: [What is an embedding model?](https://youtu.be/0U1S0WSsPuE)
- Video: [Exact vs Approximate Nearest Neighbors - What's the difference?](https://youtu.be/9NvO-VdjY80)
- Video: [What is a vector database?](https://youtu.be/Yhv19le0sBw)

## Requirements

To run this demo, you’ll need the following installed on your system:
- Docker – [Install Docker](https://docs.docker.com/get-docker/)
- Docker Compose – Included with Docker Desktop or available via the CLI installation guide
- An OpenAI API Key – You can get one from [platform.openai.com](https://platform.openai.com)

## Running the demo

The easiest way to run the demo is with Docker Compose, which sets up all required services in one command.

### Step 1: Clone the repository

If you haven’t already:

```bash
git clone https://github.com/redis-developer/redis-springboot-recipes.git
cd redis-springboot-recipes/artificial-intelligence/semantic-caching-with-spring-ai
```

### Step 2: Configure your environment

You can pass your OpenAI API key in two ways:

#### Option 1: Export the key via terminal

```bash
export OPENAI_API_KEY=sk-your-api-key
```

#### Option 2: Use a .env file

Create a `.env` file in the same directory as the `docker-compose.yml` file:

```env
OPENAI_API_KEY=sk-your-api-key
```

### Step 3: Start the services

```bash
docker compose up --build
```

This will start:

- redis: for storing both vector embeddings and chat history
- redis-insight: a UI to explore the Redis data
- semantic-caching-app: the Spring Boot app that implements the RAG application

## Using the demo

When all of your services are up and running, go to `localhost:8080` to access the demo.

If you click on `Start Chat` before the embeddings have finished being created, you will see a message asking you to wait for this operation to complete. This is the step where the documents we'll search through are turned into vectors and stored in the database. It runs only the first time the app starts up and is required regardless of the vector database you use.

Once all the embeddings have been created, you can start asking your chatbot questions. It will semantically search through the documents we have stored, try to find the best answer for your questions, and cache the responses semantically in Redis:

If you ask something similar to a question that has already been asked, your chatbot will find the cached answer and pass it, together with the newly retrieved documents, to the cheaper and faster model instead of the more capable one, so you get an answer much faster.

## How It Is Implemented

The application uses Spring AI's `RedisVectorStore` to store and retrieve responses from a semantic cache.

### Configuring the Chat Models

Two chat models are configured: a more capable one that answers new questions, and a cheaper, faster one that adapts cached answers to new prompts.

```kotlin
@Bean
fun openAiExpensiveChatModel(): OpenAiChatModel {
    // The more capable model, used when there is no cache hit
    val modelName = "gpt-5-2025-08-07"
    return openAiChatModel(modelName)
}

@Bean
fun openAiCheapChatModel(): OpenAiChatModel {
    // The cheaper, faster model, used to adapt cached answers to new prompts
    val modelName = "gpt-5-nano-2025-08-07"
    return openAiChatModel(modelName)
}

private fun openAiChatModel(modelName: String): OpenAiChatModel {
    val openAiApi = OpenAiApi.builder()
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .build()
    val openAiChatOptions = OpenAiChatOptions.builder()
        .model(modelName)
        .temperature(0.4)
        .build()

    return OpenAiChatModel.builder()
        .openAiApi(openAiApi)
        .defaultOptions(openAiChatOptions)
        .build()
}
```
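
Both beans expose the same `OpenAiChatModel` type, so the service that consumes them has to pick the right one by bean name. A rough sketch of how they could be injected (the `RagService` name and constructor shape are illustrative, not necessarily the demo's exact code):

```kotlin
import org.springframework.ai.openai.OpenAiChatModel
import org.springframework.beans.factory.annotation.Qualifier
import org.springframework.stereotype.Service

// Hypothetical consumer: @Bean names default to the factory method names,
// so @Qualifier distinguishes the expensive model from the cheap one.
@Service
class RagService(
    @Qualifier("openAiExpensiveChatModel") private val openAiExpensiveChatModel: OpenAiChatModel,
    @Qualifier("openAiCheapChatModel") private val openAiCheapChatModel: OpenAiChatModel
)
```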

### Configuring the Semantic Cache

```kotlin
@Bean
fun semanticCachingVectorStore(
    embeddingModel: TransformersEmbeddingModel,
    jedisPooled: JedisPooled
): RedisVectorStore {
    return RedisVectorStore.builder(jedisPooled, embeddingModel)
        .indexName("semanticCachingIdx")
        .contentFieldName("content")
        .embeddingFieldName("embedding")
        .metadataFields(
            RedisVectorStore.MetadataField("answer", Schema.FieldType.TEXT),
        )
        .prefix("semantic-caching:")
        .initializeSchema(true)
        .vectorAlgorithm(RedisVectorStore.Algorithm.HSNW)
        .build()
}
```

Let's break this down:

- **Index Name**: `semanticCachingIdx` - Redis will create an index with this name for searching cached responses
- **Content Field**: `content` - The raw prompt that will be embedded
- **Embedding Field**: `embedding` - The field that will store the resulting vector embedding
- **Metadata Fields**: `answer` - A TEXT field to store the LLM's response
- **Prefix**: `semantic-caching:` - All keys in Redis will be prefixed with this to organize the data
- **Vector Algorithm**: `HSNW` - The HNSW (Hierarchical Navigable Small World) algorithm for efficient approximate nearest neighbor search
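
The builder above depends on two other beans that are not shown here: a `JedisPooled` connection and a `TransformersEmbeddingModel`. A minimal sketch of what they could look like, assuming the Redis connection details come from environment variables (the variable names and defaults are assumptions, not taken from the demo):

```kotlin
import org.springframework.ai.transformers.TransformersEmbeddingModel
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import redis.clients.jedis.JedisPooled

@Configuration
class RedisConfig {

    // Plain Jedis connection to the Redis container started by Docker Compose;
    // host and port defaults are assumptions for this sketch
    @Bean
    fun jedisPooled(): JedisPooled = JedisPooled(
        System.getenv("REDIS_HOST") ?: "localhost",
        (System.getenv("REDIS_PORT") ?: "6379").toInt()
    )

    // Local ONNX embedding model, so cache lookups don't require extra OpenAI calls
    @Bean
    fun embeddingModel(): TransformersEmbeddingModel = TransformersEmbeddingModel()
}
```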

### Storing Responses in the Semantic Cache

When a user asks a question and the system generates a response, it stores the prompt and response in the semantic cache:

```kotlin
fun storeInCache(prompt: String, answer: String) {
    semanticCachingVectorStore.add(listOf(Document(
        prompt,
        mapOf(
            "answer" to answer
        )
    )))
}
```

This method:
1. Creates a `Document` with the prompt as the content
2. Adds the answer as metadata
3. Stores the document in the vector store, which automatically generates and stores the embedding
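
For example, after the more capable model answers a brand-new question, the cache entry is written with a single call (the values below are illustrative):

```kotlin
// Hypothetical values; in the demo this call happens at the end of the RAG service's retrieve(...) method
semanticCachingService.storeInCache(
    prompt = "What kind of beer goes well with pizza?",
    answer = "A hoppy IPA pairs well with pizza because ..."
)
// Redis now holds an entry under a key starting with "semantic-caching:"
// containing the prompt text, its embedding, and the answer as metadata.
```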

### Retrieving Responses from the Semantic Cache

When a user asks a question, the system first checks if there's a semantically similar question in the cache:

```kotlin
fun getFromCache(prompt: String, similarityThreshold: Double): Pair<String?, String?> {
    val results = semanticCachingVectorStore.similaritySearch(
        SearchRequest.builder()
            .query(prompt)
            .topK(1)
            .build()
    )

    if (results?.isNotEmpty() == true) {
        val score = results[0].score ?: 0.0
        if (score > similarityThreshold) {
            logger.info("Returning cached answer. Similarity score: $score")
            // Return both the cached prompt and its answer so the caller can hand them to the cheaper model
            return Pair(results[0].text, results[0].metadata["answer"] as String)
        }
    }

    return Pair(null, null)
}
```

This method:
1. Performs a vector similarity search for the most similar prompt in the cache
2. Checks if the similarity score is above the threshold (the RAG service uses 0.8)
3. If a match is found, returns both the cached prompt and the cached answer so the cheaper model can produce a new response from the newly retrieved documents and the previously generated answer
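
A short usage sketch (the `0.8` threshold matches what the RAG service below passes in):

```kotlin
// Hypothetical prompt; getFromCache returns a (cached prompt, cached answer) pair, or (null, null) on a miss
val (cachedPrompt, cachedAnswer) =
    semanticCachingService.getFromCache("Which beer pairs well with pizza?", 0.8)

if (cachedPrompt != null && cachedAnswer != null) {
    // Cache hit: the cheaper model rewrites cachedAnswer for the new prompt and documents
} else {
    // Cache miss: the more capable model answers from the RAG prompt alone
}
```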

### Integrating with the RAG System

The RAG service ties the document retrieval, the semantic cache, and the two chat models together:

```kotlin
// Base system prompt, plus a suffix that is appended on a cache hit

private val systemBeerPrompt = """
    You're assisting with questions about products in a beer catalog.
    Use the information from the DOCUMENTS section to provide accurate answers.
    If the answer involves referring to the ABV or IBU of the beer, include the beer name in the response.
    If unsure, simply state that you don't know.

    DOCUMENTS:
    {documents}
    """.trimIndent()

private val semanticCachedAnswerPromptSuffix = """
    A similar prompt has been processed before. Use it as the base for your response with the new document selection and new prompt:

    SIMILAR PROMPT ALREADY PROCESSED:
    SIMILAR PROMPT:
    {similarPrompt}

    SIMILAR ANSWER:
    {similarAnswer}
    """.trimIndent()

fun retrieve(message: String): RagResult {
    // Get documents
    val docs = getDocuments(message)

    // Get potential cached answer
    val (cachedQuestion, cachedAnswer) = semanticCachingService.getFromCache(message, 0.8)

    // Generate system prompt
    val systemMessage = if (cachedQuestion != null && cachedAnswer != null) {
        getSystemMessage(docs, cachedQuestion, cachedAnswer)
    } else {
        getSystemMessage(docs)
    }

    val userMessage = UserMessage(message)

    val prompt = Prompt(listOf(systemMessage, userMessage))

    // Call the expensive or cheap model accordingly
    val response: ChatResponse = if (cachedQuestion != null && cachedAnswer != null) {
        openAiCheapChatModel.call(prompt)
    } else {
        openAiExpensiveChatModel.call(prompt)
    }

    // Store in the semantic cache
    semanticCachingService.storeInCache(message, response.result.output.text.toString())

    return RagResult(
        generation = response.result
    )
}
```

This orchestrates the entire process:
1. Retrieve the relevant documents using vector similarity search
2. Check whether a semantically similar prompt exists in the cache
3. On a cache hit, build a system prompt that includes the retrieved documents plus the cached prompt and answer, and send it to the cheaper, faster model
4. On a cache miss, build the standard RAG system prompt and send it to the more capable model
5. Store the new prompt and response in the semantic cache for future use

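
The `getSystemMessage` helper is not shown above. A hedged sketch of what it could look like, using Spring AI's `SystemPromptTemplate` to fill the `{documents}`, `{similarPrompt}`, and `{similarAnswer}` placeholders (the demo's actual implementation may differ):

```kotlin
import org.springframework.ai.chat.messages.Message
import org.springframework.ai.chat.prompt.SystemPromptTemplate
import org.springframework.ai.document.Document

// Cache miss: base prompt with only the retrieved documents
private fun getSystemMessage(docs: List<Document>): Message {
    val documents = docs.joinToString("\n") { it.text ?: "" }
    return SystemPromptTemplate(systemBeerPrompt)
        .createMessage(mapOf("documents" to documents))
}

// Cache hit: append the suffix so the cheaper model can reuse the cached prompt and answer
private fun getSystemMessage(docs: List<Document>, similarPrompt: String, similarAnswer: String): Message {
    val documents = docs.joinToString("\n") { it.text ?: "" }
    return SystemPromptTemplate(systemBeerPrompt + "\n" + semanticCachedAnswerPromptSuffix)
        .createMessage(
            mapOf(
                "documents" to documents,
                "similarPrompt" to similarPrompt,
                "similarAnswer" to similarAnswer
            )
        )
}
```
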
This approach significantly improves performance and reduces costs by routing semantically similar queries to a cheaper, faster model instead of making unnecessary calls to the more capable one, while still providing accurate and contextually relevant responses.