# ADR-148: Brain Hypothesis Engine — Self-Improving Knowledge System with Gemini, DiskANN, and Auto-Experimentation

## Status

Proposed

## Date

2026-04-13

## Context

The pi.ruv.io brain (10,300+ memories, 38M graph edges, LoRA epoch 41) stores and retrieves knowledge but cannot:
1. Generate hypotheses from cross-domain connections
2. Evaluate quality beyond embedding similarity (quality scores mostly 0.0)
3. Filter noise from curated knowledge (random IEEE events alongside real patterns)
4. Measure whether LoRA training actually improves search quality

The brain runs on Google Cloud Run (`ruvbrain` service, us-central1) backed by `crates/mcp-brain-server/` (Rust/Axum). Current embedding: `ruvllm::RlmEmbedder` at 128-dim. Current index: flat HNSW.

## Decision

Add four capabilities as **additive layers** — no changes to the running brain's core path. All new code is behind feature flags or in separate Cloud Run services.
### Architecture: Four New Capabilities

```
┌─────────────────────────────────────────────────────────┐
│  EXISTING (untouched)                                   │
│  mcp-brain-server: store, search, graph, drift, LoRA    │
│  Embedder: ruvllm::RlmEmbedder (128-dim)                │
│  Index: flat HNSW                                       │
└──────────────┬──────────────────────────────────────────┘
               │ (reads from, writes back to)
               v
┌─────────────────────────────────────────────────────────┐
│  NEW: Hypothesis Engine (separate Cloud Run service)    │
│                                                         │
│  1. HYPOTHESIS GENERATOR                                │
│     - Watches for new cross-domain graph edges          │
│     - Templates: "If X works in domain A,               │
│       then X should work in domain B"                   │
│     - Uses Gemini 2.5 Flash for hypothesis formulation  │
│       and experiment design                             │
│     - Stores hypotheses as "untested" memories          │
│                                                         │
│  2. QUALITY SCORER                                      │
│     - DiskANN index over all 10K+ memory embeddings     │
│     - PageRank via ruvector-solver ForwardPush          │
│     - Multi-signal: centrality + citations + verdicts   │
│       + contributor rep + temporal + surprise           │
│     - Updates quality field via brain API               │
│                                                         │
│  3. NOISE FILTER                                        │
│     - Ingestion gate: regex + embedding dedup           │
│     - Weekly cleanup: archive orphan low-quality        │
│     - Meta-mincut: ruvector-mincut on knowledge graph   │
│       to find noise partition                           │
│                                                         │
│  4. BENCHMARK SUITE                                     │
│     - 50 curated test queries with known-good answers   │
│     - Runs before/after each LoRA epoch                 │
│     - Tracks MRR, precision@5, cross-domain recall      │
│     - Auto-rollback if MRR drops > 5%                   │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

### Component Details

#### Gemini 2.5 Flash for Hypothesis Generation

**Why Gemini, not local LLM:**
- Hypothesis generation is infrequent (triggered by new cross-domain edges, ~10/day)
- Requires reasoning about domain transfer ("if mincut detects seizures, could it detect X?")
- Gemini 2.5 Flash: fast, cheap (~$0.15/1M input tokens), 1M context window
- Local RLM embedder stays for indexing (it's tuned to the corpus) — Gemini is for reasoning only

**API integration:**
```rust
// New module: crates/mcp-brain-server/src/hypothesis.rs
// Feature-gated: #[cfg(feature = "hypothesis")]

use google_generativeai::Client; // or raw REST via reqwest

async fn generate_hypothesis(
    client: &Client,
    edge: &CrossDomainEdge,
) -> anyhow::Result<Hypothesis> {
    let prompt = format!(
        "Given this cross-domain connection:\n\
         Domain A: {}\nDomain B: {}\nBridge concept: {}\n\n\
         Generate a testable hypothesis: if the pattern from domain A \
         works, what specific prediction does it make in domain B? \
         Include: hypothesis statement, test method, expected outcome, \
         null hypothesis, required data.",
        edge.domain_a, edge.domain_b, edge.bridge_concept
    );
    // Call Gemini 2.5 Flash; `?` requires the Result return type above
    let response = client.generate(&prompt).await?;
    parse_hypothesis(&response)
}
```

**Cost estimate:** ~10 hypotheses/day × ~500 tokens each = ~5K tokens/day = ~$0.001/day. Negligible.

#### DiskANN for Scalable Quality Scoring

**Why DiskANN, not current flat HNSW:**
- Current HNSW is in-memory, fine for 10K memories
- At 100K+ memories (projected within months), memory pressure becomes real
- DiskANN stores the graph on SSD, loads only neighbors on demand
- Product Quantization (PQ) compresses vectors 4-8x for candidate filtering
- `ruvector-diskann` already implements Vamana graph + PQ (ADR-146)

**Integration plan:**
```rust
// New module: crates/mcp-brain-server/src/diskann_index.rs
// Feature-gated: #[cfg(feature = "diskann")]

use ruvector_diskann::{DiskAnnIndex, DiskAnnConfig};

pub struct HybridIndex {
    hnsw: HnswIndex,        // Existing, stays as primary for <50K
    diskann: DiskAnnIndex,  // New, activates at >50K memories
    threshold: usize,       // Switch point (default: 50_000)
}

impl HybridIndex {
    pub fn search(&self, query: &[f32], k: usize) -> Vec<(usize, f32)> {
        if self.hnsw.len() < self.threshold {
            self.hnsw.search(query, k)
        } else {
            self.diskann.search(query, k)
        }
    }
}
```
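
The threshold routing above can be exercised in isolation with stub indexes; a minimal stand-alone sketch (the `Ann` trait and `Stub` type are illustrative test scaffolding, not the ruvector API):

```rust
// Stub standing in for both index types, to test routing only.
trait Ann {
    fn len(&self) -> usize;
    fn search(&self, query: &[f32], k: usize) -> Vec<(usize, f32)>;
}

struct Stub { size: usize, tag: usize }

impl Ann for Stub {
    fn len(&self) -> usize { self.size }
    fn search(&self, _q: &[f32], k: usize) -> Vec<(usize, f32)> {
        // Return ids offset by `tag` so a test can see which stub answered.
        (0..k).map(|i| (self.tag + i, 0.0)).collect()
    }
}

/// Same rule as HybridIndex::search: the primary serves below the
/// threshold, the fallback serves at or above it.
fn route<'a>(primary: &'a dyn Ann, fallback: &'a dyn Ann, threshold: usize) -> &'a dyn Ann {
    if primary.len() < threshold { primary } else { fallback }
}
```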

**Benchmark plan:** Run both HNSW and DiskANN on the current 10K corpus, measure:
- Recall@10 (should be >95% for both)
- Query latency (HNSW: ~1ms, DiskANN: ~5-10ms expected)
- Memory usage (HNSW: ~50MB, DiskANN: ~5MB + SSD)
- Index build time
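
To pin down the metric definitions used above, here is a minimal sketch of MRR and precision@k over ranked memory ids (function names are illustrative, not the brain's API):

```rust
/// Mean reciprocal rank: for each query, 1 / rank of the first
/// relevant result (0.0 if none is retrieved), averaged over queries.
fn mean_reciprocal_rank(results: &[Vec<u64>], relevant: &[Vec<u64>]) -> f64 {
    let total: f64 = results.iter().zip(relevant).map(|(ranked, rel)| {
        ranked.iter()
            .position(|id| rel.contains(id))
            .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
    }).sum();
    total / results.len() as f64
}

/// Precision@k: fraction of the top-k result slots that are relevant.
fn precision_at_k(ranked: &[u64], rel: &[u64], k: usize) -> f64 {
    let hits = ranked.iter().take(k).filter(|id| rel.contains(id)).count();
    hits as f64 / k as f64
}
```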

#### Quality Scorer with ForwardPush PageRank

```rust
// crates/mcp-brain-server/src/quality.rs

pub fn compute_quality_scores(brain: &Brain) -> anyhow::Result<Vec<(MemoryId, f64)>> {
    // 1. Build CSR graph from memory edges
    let graph = brain.graph_to_csr();

    // 2. Run ForwardPush PageRank (sublinear, O(1/epsilon))
    let pr = ForwardPushSolver::new(0.85, 0.001); // damping factor, epsilon
    let pagerank = pr.solve(&graph)?;

    // Normalizer for the citation signal
    let max_citations = brain
        .memories()
        .map(|m| m.inbound_edge_count)
        .max()
        .unwrap_or(1) as f64;

    // 3. Compute multi-signal quality
    Ok(brain.memories().map(|m| {
        let centrality = pagerank[m.id];
        let citations = m.inbound_edge_count as f64 / max_citations;
        let verdict = match m.verdict {
            Verdict::Confirmed => 1.0,
            Verdict::Refuted => -0.5,
            Verdict::Untested => 0.0,
        };
        let surprise = 1.0 - m.max_similarity_to_existing;
        let temporal = recency_weight(m.created_at);
        let bridge = if m.crosses_domains { 0.3 } else { 0.0 };

        let quality = 0.25 * centrality
            + 0.20 * citations
            + 0.20 * verdict
            + 0.15 * surprise
            + 0.10 * temporal
            + 0.10 * bridge;

        (m.id, quality.clamp(0.0, 1.0))
    }).collect())
}
```
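
The `recency_weight` helper is left undefined above; a minimal sketch, assuming it receives the memory's age in days and applies exponential half-life decay (the 90-day half-life is an illustrative choice, not a decided parameter):

```rust
/// Hypothetical helper: exponential decay with a 90-day half-life.
/// A fresh memory scores 1.0, a 90-day-old one 0.5, and very old
/// memories approach 0.0.
fn recency_weight(age_days: f64) -> f64 {
    const HALF_LIFE_DAYS: f64 = 90.0;
    (0.5f64).powf(age_days / HALF_LIFE_DAYS)
}
```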

### Safety Constraints (don't break the running system)

1. **All new code is feature-gated.** The existing `mcp-brain-server` binary is unchanged unless `--features hypothesis,diskann,benchmark` is explicitly enabled.

2. **Hypothesis engine runs as a SEPARATE Cloud Run service.** It calls the brain's API; it doesn't modify the brain's process. If it crashes, the brain keeps running.

3. **DiskANN is a fallback, not a replacement.** HNSW stays as primary for <50K memories. DiskANN only activates when memory count exceeds the threshold. Both can be queried in parallel for benchmark comparison.

4. **Quality scores are written to a NEW field (`quality_v2`).** The existing `quality` field is untouched until v2 scores are validated.

5. **Noise filtering is archive-only.** Memories are archived (moved to cold storage), never deleted. Full rollback possible.

6. **Benchmark auto-rollback.** If LoRA epoch N+1 degrades MRR by >5%, the epoch is discarded and the EWC checkpoint is restored automatically.

7. **Gemini API key stored in gcloud secrets.** Already available as `GEMINI_API_KEY`. Rate-limited to 10 calls/hour to avoid cost surprises.

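The auto-rollback constraint reduces to a simple regression gate; a sketch under the stated 5% rule (the surrounding checkpoint-restore machinery is out of scope here):

```rust
/// Keep LoRA epoch N+1 only if its MRR has not dropped by more than
/// 5% relative to epoch N's baseline.
fn keep_epoch(baseline_mrr: f64, new_mrr: f64) -> bool {
    new_mrr >= baseline_mrr * 0.95
}
```

If `keep_epoch` returns false, the epoch is discarded and the EWC checkpoint restored.
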
### Implementation Phases

| Phase | What | Risk | Timeline |
|-------|------|------|----------|
| **P0: ADR + Branch** | This document + feature branch | None | Done |
| **P1: Benchmark suite** | 50 test queries, MRR tracking | None (read-only) | 3 days |
| **P2: Quality scorer** | PageRank + multi-signal scoring | Low (writes to new field) | 1 week |
| **P3: Noise filter** | Ingestion gate + weekly cleanup | Low (archive-only) | 3 days |
| **P4: DiskANN integration** | Hybrid index behind feature flag | Low (fallback only) | 1 week |
| **P5: Hypothesis engine** | Gemini integration + auto-test | Medium (new service) | 2 weeks |

**Total: ~5 weeks, phased. P1-P3 can run in parallel.**

## Consequences

### Positive
- Brain evolves from "smart database" to "scientific reasoner"
- Quality scores become meaningful (currently all 0.0)
- Noise filtering reduces graph pollution
- LoRA training becomes measurable and rollback-safe
- DiskANN prepares for 100K+ memory scale
- Gemini hypothesis generation is the first step toward autonomous discovery

### Negative
- New dependency: Google Gemini API (adds cost, ~$0.03/day estimated)
- DiskANN adds complexity to the index path
- Hypothesis engine needs curation — false hypotheses could pollute if not filtered
- More Cloud Run services to monitor

### Risks
- Gemini may generate low-quality hypotheses → mitigated by verdict system (untested until confirmed)
- DiskANN recall may be lower than HNSW at small corpus → mitigated by hybrid approach with threshold
- Quality scoring may be gamed by circular citations → mitigated by PageRank dampening

## References

- ADR-146: DiskANN Vamana Implementation
- ADR-131: Consciousness Metrics Crate
- ADR-048: Sublinear Graph Attention
- Subramanya et al., "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" (NeurIPS 2019)
- Google Gemini API: https://ai.google.dev/gemini-api
- ForwardPush PPR: Andersen, Chung, Lang, "Local Graph Partitioning using PageRank Vectors" (FOCS 2006)