A multi-modal embedding service for SlimRAG that processes various file types and generates embeddings for vector search and retrieval applications.
- Multi-format Support: Process documents, images, and videos
- Multiple Embedding Models: bge-m3, Qwen3-Embedding, Chinese-CLIP
- Smart Processing Pipeline: Different handling for each file type
- Metadata Storage: DuckDB for structured metadata
- Search Integration: Meilisearch for fast retrieval
- CLI Interface: Easy-to-use command-line tools
- Markdown (.md): Direct processing with bge-m3
- PDF (.pdf): MinerU extraction, then bge-m3
- Word (.doc, .docx): MinerU extraction, then bge-m3
- HTML (.html): markitdown conversion, then bge-m3
- JPEG (.jpg): Direct processing with CLIP
- PNG (.png): Direct processing with CLIP
- WebP (.webp): Convert to JPG, then CLIP
- RAW (.arw): Demosaic and convert to JPG, then CLIP
- MP4 (.mp4): Keyframe extraction, then CLIP
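
The file-type list above can be read as a lookup table from file extension to a preprocessing step and an embedding model. The following is a minimal sketch of that idea, not the service's actual code: `Route`, `ROUTES`, and `route_for` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Route:
    """Preprocessing step and embedding model for one file type (illustrative)."""
    preprocess: str
    model: str


# Hypothetical table mirroring the supported-format list; the real internals may differ.
ROUTES = {
    ".md": Route("direct", "bge-m3"),
    ".pdf": Route("MinerU extraction", "bge-m3"),
    ".doc": Route("MinerU extraction", "bge-m3"),
    ".docx": Route("MinerU extraction", "bge-m3"),
    ".html": Route("markitdown conversion", "bge-m3"),
    ".jpg": Route("direct", "CLIP"),
    ".png": Route("direct", "CLIP"),
    ".webp": Route("convert to JPG", "CLIP"),
    ".arw": Route("demosaic and convert to JPG", "CLIP"),
    ".mp4": Route("keyframe extraction", "CLIP"),
}


def route_for(path: str) -> Route:
    """Look up the route by lowercased file extension; raise on unsupported types."""
    suffix = Path(path).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {path!r}") from None
```

Keeping the mapping in one table means a new format needs only a new entry, and unsupported files fail early with a clear error instead of reaching a model that cannot handle them.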
- Python 3.12 or higher
- uv package manager
- Clone the repository:
git clone <repository-url>
cd EmebeddingService
- Install dependencies:
uv sync

The service provides a comprehensive CLI through the fire framework:
# Process a single file
uv run python -m embeddingservice process_file "path/to/file.pdf"
# Process all files in a directory
uv run python -m embeddingservice process_directory "path/to/directory"
# Search for similar documents
uv run python -m embeddingservice search "your query here"
# Get file metadata
uv run python -m embeddingservice get_metadata "path/to/file.pdf"
# List all processed files
uv run python -m embeddingservice list_files
# Filter by content type
uv run python -m embeddingservice list_files --content_type "document"
# Get service status
uv run python -m embeddingservice status
# Configuration management
uv run python -m embeddingservice config show
uv run python -m embeddingservice config create --config_path "config.json"

The service can be configured through environment variables or a configuration file:
# Show current configuration
uv run python -m embeddingservice config show
# Create configuration file
uv run python -m embeddingservice config create --config_path "config.json"

- File Type Detection: Automatic identification of file format
- Content Extraction: Format-specific extraction (MinerU, markitdown, etc.)
- Embedding Generation: Model-specific vector generation
- Metadata Storage: Structured data in DuckDB
- Search Indexing: Meilisearch for fast retrieval
- CLI Interface: Command-line interaction via fire

- Processors: Format-specific content extraction
- Embedding Models: Multiple model support (bge-m3, CLIP, etc.)
- Storage Layer: DuckDB for metadata, Meilisearch for search
- Configuration: Flexible configuration system
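
The stages listed above (detection, extraction, embedding, metadata storage, search indexing) can be sketched as one small class. This is an illustration under stated assumptions rather than the project's implementation: `Pipeline`, `detect_content_type`, and `CONTENT_TYPES` are hypothetical names, plain dictionaries stand in for DuckDB and Meilisearch, and the `"image"` and `"video"` labels are assumed by analogy with the `"document"` content type used by the CLI.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable

# Assumed grouping of extensions into content types (illustrative only).
CONTENT_TYPES = {
    "document": {".md", ".pdf", ".doc", ".docx", ".html"},
    "image": {".jpg", ".png", ".webp", ".arw"},
    "video": {".mp4"},
}


def detect_content_type(path: str) -> str:
    """Stage 1: identify the content type from the file extension."""
    suffix = Path(path).suffix.lower()
    for content_type, suffixes in CONTENT_TYPES.items():
        if suffix in suffixes:
            return content_type
    raise ValueError(f"unsupported file type: {path!r}")


@dataclass
class Pipeline:
    # Stage 2: (path, content_type) -> extracted content
    extract: Callable[[str, str], str]
    # Stage 3: (content, content_type) -> embedding vector
    embed: Callable[[str, str], list[float]]
    # Stages 4 and 5: in-memory stand-ins for DuckDB and Meilisearch
    metadata: dict = field(default_factory=dict)
    index: dict = field(default_factory=dict)

    def process_file(self, path: str) -> dict:
        """Run every stage for one file and return its metadata record."""
        content_type = detect_content_type(path)
        content = self.extract(path, content_type)
        vector = self.embed(content, content_type)
        record = {"path": path, "content_type": content_type, "dim": len(vector)}
        self.metadata[path] = record
        self.index[path] = vector
        return record


# Usage with toy stand-ins for the extractor and the embedding model.
pipeline = Pipeline(
    extract=lambda path, content_type: f"contents of {path}",
    embed=lambda content, content_type: [float(len(content)), 0.0],
)
record = pipeline.process_file("notes.md")
```

Passing the extractor and embedder in as callables keeps the stages independently testable, so the heavy models can be replaced by stubs, as in the usage lines at the end.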
- fire: CLI framework
- markitdown: HTML to Markdown conversion
- mineru[core]: Document processing (>=2.1.10)
- torch: Deep learning framework
- transformers: Model loading and inference
- pillow: Image processing
- duckdb: Metadata storage
- meilisearch: Search engine
- opencv-python: Video and image processing
- pytest: Testing framework
- black: Code formatting
- flake8: Linting
- mypy: Type checking
uv run pytest

uv run black src/
uv run flake8 src/
uv run mypy src/

# Add runtime dependency
uv add <package-name>
# Add development dependency
uv add --dev <package-name>

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Run the test suite
- Submit a pull request
This project is part of the SlimRAG ecosystem. Please refer to the main project for licensing information.
For issues and questions, please refer to the main SlimRAG project documentation and support channels.