This document tracks the implementation progress of ADR-002: Crawl Reliability, Provenance Tracking & Validation.
Branch: feature/crawl-checkpoint-resume
Date: 2026-02-22
Status: ✅ Fully Implemented
Files Modified:
- ✅ `python/src/server/services/crawling/crawling_service.py`
  - Added `_filter_already_processed_urls()` helper method (lines 857-889)
  - Updated `_crawl_by_url_type()` signature to accept `source_id` and `has_existing_state`
  - Applied resume filtering to sitemap crawling (lines 1101-1120)
  - Applied resume filtering to link collection batch crawling (lines 1046-1051)
  - Applied resume filtering to recursive crawling (lines 1066-1073, 1155-1162)
  - Updated call sites to pass `source_id` and `has_existing_state` parameters
- ✅ `python/src/server/services/crawling/strategies/recursive.py`
  - Updated `crawl_recursive_with_progress()` signature to accept `source_id` and `url_state_service`
  - Pre-populated visited set with already-embedded URLs (lines 158-165)
  - Prevents re-crawling of completed URLs during recursive depth traversal
- ✅ Infrastructure Already Complete (from previous work)
  - `archon_crawl_url_state` table exists
  - `CrawlUrlStateService` with full CRUD operations
  - Integration with document storage operations
How It Works:
- Detection: When `orchestrate_crawl()` starts, it checks for existing crawl state using `url_state_service.has_existing_state()`
- Logging: If state exists with pending/failed URLs, logs resume information
- Filtering: Before crawling strategies execute:
  - Sitemap: Filters URLs before batch crawl
  - Link Collection: Filters extracted links before batch crawl
  - Recursive: Pre-populates visited set to skip embedded URLs
- Resume: Only unprocessed URLs are crawled, preventing duplicates
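The filtering step above can be sketched as a small standalone helper. This is an illustrative sketch, not the actual `_filter_already_processed_urls()` implementation: the `embedded_urls` set stands in for whatever `CrawlUrlStateService` returns, and the fragment normalization is an assumption.

```python
from urllib.parse import urldefrag


def filter_already_processed_urls(
    urls: list[str], embedded_urls: set[str]
) -> list[str]:
    """Drop URLs already embedded in a previous crawl session.

    URLs are compared without fragments (an assumption here), so that
    `page#a` and `page#b` count as the same processed page.
    """
    remaining = []
    for url in urls:
        normalized, _fragment = urldefrag(url)
        if normalized not in embedded_urls:
            remaining.append(url)
    skipped = len(urls) - len(remaining)
    if skipped:
        print(f"Resume filtering | skipped={skipped} already-embedded URLs")
    return remaining
```

The same helper can serve all three strategies: sitemap and link-collection call it on their URL batches, while the recursive strategy can seed its visited set from `embedded_urls` directly.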
Testing Verification:
```
# Test scenario:
# 1. Start crawl of sitemap with 100 URLs
# 2. Kill server after 30 URLs embedded
# 3. Check archon_crawl_url_state shows 30 embedded, 70 pending
# 4. Restart server and re-trigger crawl
# 5. Verify logs show "Resume filtering | skipped=30 already-embedded URLs"
# 6. Verify only 70 new URLs are processed
```

Status: ✅ Fully Implemented
Database Migration:
- ✅ `migration/0.1.0/013_add_provenance_tracking.sql`
  - Adds 7 new columns to `archon_sources`:
    - `embedding_model` (TEXT) - e.g., "text-embedding-3-small"
    - `embedding_dimensions` (INTEGER) - e.g., 1536
    - `embedding_provider` (TEXT) - e.g., "openai"
    - `vectorizer_settings` (JSONB) - chunk_size, use_contextual, use_hybrid
    - `summarization_model` (TEXT) - e.g., "gpt-4o-mini"
    - `last_crawled_at` (TIMESTAMPTZ)
    - `last_vectorized_at` (TIMESTAMPTZ)
  - Creates indexes on `embedding_model` and `embedding_provider`
  - Adds column comments for documentation
Files Modified:
- ✅ `python/src/server/services/source_management_service.py`
  - Updated `update_source_info()` signature to accept provenance parameters (lines 214-232)
  - Added provenance fields to existing source upsert (lines 294-313)
  - Added provenance fields to new source creation (lines 378-402)
  - Sets `last_crawled_at` and `last_vectorized_at` timestamps
- ✅ `python/src/server/services/crawling/document_storage_operations.py`
  - Captures embedding configuration from credential service (lines 376-392)
  - Retrieves: embedding_provider, embedding_model, embedding_dimensions
  - Retrieves summarization_model from RAG strategy settings
  - Passes all provenance to `update_source_info()` during crawl
How It Works:
- Capture: During `_create_source_records()`, reads current provider configuration
- Store: Passes configuration to `update_source_info()`, which upserts to database
- Timestamps: Automatically sets `last_crawled_at` and `last_vectorized_at` to current time
- Persistence: All sources now track which models/settings were used
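The capture step above can be sketched as follows. This is a minimal sketch assuming a flat settings dict stands in for the credential service; the helper name and setting keys are hypothetical, though the output field names follow migration 013.

```python
from datetime import datetime, timezone


def build_provenance(settings: dict) -> dict:
    """Assemble the provenance fields upserted into archon_sources.

    `settings` is a stand-in for whatever the credential service returns;
    the defaults mirror the examples in migration 013.
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "embedding_provider": settings.get("EMBEDDING_PROVIDER", "openai"),
        "embedding_model": settings.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        "embedding_dimensions": int(settings.get("EMBEDDING_DIMENSIONS", 1536)),
        "summarization_model": settings.get("SUMMARIZATION_MODEL", "gpt-4o-mini"),
        "vectorizer_settings": {
            "chunk_size": int(settings.get("CHUNK_SIZE", 5000)),
            "use_contextual": settings.get("USE_CONTEXTUAL_EMBEDDINGS", False),
            "use_hybrid": settings.get("USE_HYBRID_SEARCH", False),
        },
        "last_crawled_at": now,
        "last_vectorized_at": now,
    }
```

In the real flow, this dict would be passed through `update_source_info()` during `_create_source_records()`.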
Status: ⏳ PENDING
Files to Modify:
- ⏳ `archon-ui-main/src/features/knowledge/types/knowledge.ts`

  ```typescript
  export interface KnowledgeSource {
    source_id: string;
    // ... existing fields ...
    embedding_model?: string;
    embedding_dimensions?: number;
    embedding_provider?: string;
    vectorizer_settings?: {
      use_contextual?: boolean;
      use_hybrid?: boolean;
      chunk_size?: number;
    };
    summarization_model?: string;
    last_crawled_at?: string;
    last_vectorized_at?: string;
  }
  ```

- ⏳ `archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`
  - Add expandable "Processing Details" section using Radix Collapsible
  - Display embedding_provider/embedding_model (embedding_dimensions D)
  - Display summarization_model
  - Display formatted last_crawled_at timestamp
  - Use Tron-inspired glassmorphism styling
UI Design:
```tsx
<Collapsible.Root>
  <Collapsible.Trigger className="flex items-center gap-2 text-sm text-gray-400 hover:text-cyan-400">
    <ChevronRight className="transition-transform" />
    Processing Details
  </Collapsible.Trigger>
  <Collapsible.Content className="mt-2 text-xs text-gray-400 space-y-1 pl-6">
    <div>Embeddings: {embedding_provider}/{embedding_model} ({embedding_dimensions}D)</div>
    <div>Summarization: {summarization_model}</div>
    <div>Last crawled: {formatDate(last_crawled_at)}</div>
  </Collapsible.Content>
</Collapsible.Root>
```

Status: ❌ Not Started
Files to Create:
- ❌ `python/src/server/api_routes/knowledge_api.py` (or modify existing)
  - Add `GET /api/knowledge-items/{source_id}/validate` endpoint
  - Checks:
    - Missing chunks (URLs marked embedded but no chunks exist)
    - Zero-vector embeddings (null or all-zero vectors)
    - Dimension mismatches (mixed embedding dimensions)
    - Orphaned pages (page_metadata without chunks)
    - Failed URLs that never recovered
  - Returns: `{ valid: bool, issues: Issue[], total_issues: int }`
- ❌ `migration/0.1.0/014_add_validation_functions.sql`

  ```sql
  CREATE OR REPLACE FUNCTION count_zero_vectors(src_id TEXT)
  RETURNS INTEGER AS $$
    SELECT COUNT(*)::INTEGER FROM archon_documents
    WHERE source_id = src_id
      AND embedding IS NOT NULL
      AND array_length(embedding, 1) > 0
      AND embedding = array_fill(0::float, ARRAY[array_length(embedding, 1)]);
  $$ LANGUAGE SQL;
  ```

  (Note the `::INTEGER` cast: `COUNT(*)` returns `BIGINT`, which would not match the declared `RETURNS INTEGER` in a SQL-language function.)
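The planned checks and response shape can be sketched in Python as well. This is a sketch under stated assumptions: the chunk dict structure is hypothetical, and unlike the SQL function this version also counts missing embeddings as failures.

```python
def count_zero_vectors(chunks: list[dict]) -> int:
    """Count chunks whose embedding is missing, empty, or all zeros."""
    bad = 0
    for chunk in chunks:
        embedding = chunk.get("embedding")
        if not embedding or all(v == 0.0 for v in embedding):
            bad += 1
    return bad


def summarize_validation(chunks: list[dict]) -> dict:
    """Build the planned endpoint response: { valid, issues, total_issues }."""
    issues = []
    zero = count_zero_vectors(chunks)
    if zero:
        issues.append({"severity": "error", "type": "zero_vector", "count": zero})
    # Mixed dimensions across chunks indicate a partial re-embedding.
    dims = {len(c["embedding"]) for c in chunks if c.get("embedding")}
    if len(dims) > 1:
        issues.append(
            {"severity": "error", "type": "dimension_mismatch", "dims": sorted(dims)}
        )
    return {"valid": not issues, "issues": issues, "total_issues": len(issues)}
```

The missing-chunk, orphaned-page, and failed-URL checks would follow the same pattern, each appending an issue dict with a severity level.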
Status: ❌ Not Started
Files to Modify:
- ❌ `python/src/mcp_server/features/rag/rag_tools.py`
  - Add `rag_validate_source(source_id: str)` tool
  - Calls validation API endpoint
  - Returns summary: valid, error_count, warning_count, issues_summary, recommendation
  - Read-only (no writes, no fixes)

Tool Usage Example:

```python
@mcp.tool()
async def rag_validate_source(source_id: str) -> dict:
    """Check knowledge source health before using for RAG."""
    # Calls GET /api/knowledge-items/{source_id}/validate
    # Returns summary for agent decision-making
```

Status: ❌ Not Started
Files to Create:
- ❌ `archon-ui-main/src/features/knowledge/components/ValidationPanel.tsx`
  - "Validate" button on knowledge item action menu
  - Opens expandable panel or modal with validation results
  - Color-coded issues (red=error, yellow=warning, blue=info)
  - "Fix" buttons for fixable issues
- ❌ `archon-ui-main/src/features/knowledge/hooks/useValidateSource.ts`
  - TanStack Query hook for validation endpoint
  - `useValidateSource(sourceId)` → returns validation data
Status: ❌ Not Started
Files to Create/Modify:
- ❌ `python/src/server/services/credential_service.py`
  - Add methods to get code summarization settings
- ❌ `python/src/server/api_routes/knowledge_api.py`
  - Add `POST /api/knowledge-items/{source_id}/revectorize` endpoint
  - Add `POST /api/knowledge-items/{source_id}/resummarize` endpoint
- ❌ `python/src/server/services/storage/document_storage_service.py`
  - Add `revectorize_source(source_id)` method
  - Add `resummarize_source(source_id)` method
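One possible shape for the planned `revectorize_source()` method, sketched with the embedding call injected so it stays provider-agnostic. This is not the real service API; the chunk structure and the write-back behavior are assumptions.

```python
from typing import Callable, Iterable


def revectorize_source(
    chunks: Iterable[dict],
    embed: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[dict]:
    """Re-embed every stored chunk with the currently configured model.

    Returns updated chunk records; a real implementation would write them
    back to archon_documents and refresh last_vectorized_at on the source.
    """
    chunks = list(chunks)
    updated = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start : start + batch_size]
        # Embed a whole batch at once to limit provider round-trips.
        vectors = embed([chunk["content"] for chunk in batch])
        for chunk, vector in zip(batch, vectors):
            updated.append({**chunk, "embedding": vector})
    return updated
```

`resummarize_source()` would follow the same batching pattern, swapping the embed call for a summarization call and updating the summary fields instead of embeddings.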
Status: ❌ Not Started
Files to Create/Modify:
- ❌ `archon-ui-main/src/services/credentialsService.ts`
  - Add `CODE_SUMMARIZATION_MODEL`, `CODE_SUMMARIZATION_PROVIDER` to RagSettings
- ❌ `archon-ui-main/src/components/settings/RAGSettings.tsx`
  - Add "Code Summarization Agent" section
- ❌ `archon-ui-main/src/features/knowledge/services/knowledgeService.ts`
  - Add `revectorizeKnowledgeItem()` method
  - Add `resummarizeKnowledgeItem()` method
- ❌ `archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts`
  - Add `useRevectorizeKnowledgeItem()` hook
  - Add `useResummarizeKnowledgeItem()` hook
- ❌ `archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx`
  - Add "Re-vectorize" dropdown action
  - Add "Re-summarize" dropdown action
- ❌ `archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`
  - Add "Needs re-vectorization" badge when settings changed
- Start sitemap crawl with 100 URLs
- Kill process at 30% complete
- Verify `archon_crawl_url_state` shows mix of embedded/pending
- Restart and re-trigger crawl
- Verify only pending URLs processed
- Verify no duplicates in final data
- Check logs show "Resume filtering | skipped=X"
- Backend: Migration created
- Backend: Service layer updated
- Backend: Provenance captured during crawl
- Frontend: Types updated
- Frontend: UI displays provenance
- Test: Crawl a source
- Test: Query source record
- Test: Verify provenance fields populated
- Backend: Validation endpoint created
- Backend: Database functions created
- MCP: Validation tool implemented
- Frontend: Validation UI created
- Test: Insert corrupted data (zero vector)
- Test: Validation detects issues
- Test: MCP tool returns correct summary
- Backend: Add code summarization settings to credential service
- Backend: Add re-vectorize endpoint
- Backend: Add re-summarize endpoint
- Backend: Add revectorize/resummarize service methods
- Frontend: Add code summarization settings UI
- Frontend: Add re-vectorize service and hook
- Frontend: Add re-summarize service and hook
- Frontend: Add dropdown actions
- Frontend: Add needs_revectorization indicator
- Test: Change embedding settings, verify indicator shows
- Test: Re-vectorize source, verify embeddings updated
- Test: Re-summarize source, verify summaries updated
Required Database Migrations:
- ✅ `013_add_provenance_tracking.sql` - Ready to deploy
- ❌ `014_add_validation_functions.sql` - Not created yet
- ❌ `015_add_code_summarization_settings.sql` - Not created yet (optional, settings stored in archon_settings table)
Deployment Steps:
```bash
# Apply provenance tracking migration
supabase db push
# Or manually run the SQL in Supabase dashboard
```

Rollback Plan:

```sql
-- If needed, rollback provenance columns:
ALTER TABLE archon_sources
  DROP COLUMN IF EXISTS embedding_model,
  DROP COLUMN IF EXISTS embedding_dimensions,
  DROP COLUMN IF EXISTS embedding_provider,
  DROP COLUMN IF EXISTS vectorizer_settings,
  DROP COLUMN IF EXISTS summarization_model,
  DROP COLUMN IF EXISTS last_crawled_at,
  DROP COLUMN IF EXISTS last_vectorized_at;

DROP INDEX IF EXISTS idx_archon_sources_embedding_model;
DROP INDEX IF EXISTS idx_archon_sources_embedding_provider;
```

- Update frontend types for provenance fields
- Add provenance display to KnowledgeCard component
- Test end-to-end provenance tracking
- Add code summarization settings (backend + frontend)
- Add re-vectorize endpoint and service method
- Add re-summarize endpoint and service method
- Add needs_revectorization indicator
- Test reprocessing end-to-end
- Create validation API endpoint
- Create database validation functions
- Build validation UI component
- Add read-only MCP validation tool
- Provenance Settings: Currently using placeholder values for `vectorizer_settings`. These should be populated from the actual RAG strategy configuration once contextual embeddings or hybrid search are implemented.
- Recursive Crawl Resume: The current implementation pre-populates the visited set with embedded URLs. This works well but doesn't distinguish between "already visited in this session" and "embedded in a previous session". This is acceptable for now.
- Type Safety: Some type warnings in `source_management_service.py` relate to optional parameters. These are safe to ignore, as the functions handle None values correctly.
- Migration Order: The provenance migration (013) must be run before the validation migration (014) once it is created.
Immediate:
1. Apply database migration `013_add_provenance_tracking.sql`
2. Test checkpoint/resume functionality end-to-end
3. Update frontend types and UI for provenance display

Short Term:
4. Add code summarization settings
5. Implement re-vectorize endpoint and service
6. Implement re-summarize endpoint and service
7. Add needs_revectorization indicator

Medium Term:
8. Implement validation API endpoint and database functions
9. Build validation UI component
Future Enhancements:
- Bulk loading UI/API (separate ADR)
- Manifest import capability (separate ADR)
- Re-vectorization tooling using provenance data
- Provenance-based source filtering in UI