# document-analysis-system

A comprehensive cross-platform document analysis system supporting legal document analysis, academic paper summarization, financial report analysis, and RAG knowledge base functionality.
- 📤 Multi-format document upload (PDF, DOCX, TXT, XLSX)
- 🧠 Intelligent document processing with chunking
- 🤖 DeepSeek LLM integration for enhanced analysis
- 🔍 RAG-based knowledge base with vector search
- ⚖️ Legal document analysis with clause extraction
- 🎓 Academic paper summarization and analysis
- 💰 Financial report analysis with metrics extraction
- 🌐 Cross-platform web interface
- 📊 Data visualization and reporting
AI-enhanced analysis:
- DeepSeek LLM integration for deep analysis
- Intelligent document comprehension
- Multi-perspective content analysis
- Key insight extraction
- Intelligent question generation
- Document comparison analysis
Legal document analysis:
- Key clause extraction
- Risk assessment
- Compliance scoring
- Named entity recognition
- Contract party identification
Academic paper analysis:
- Automatic abstract generation
- Paper structure analysis
- Keyword extraction
- Citation analysis
- Readability assessment
Financial report analysis:
- Key financial metrics extraction
- Financial ratio calculation
- Trend analysis
- Financial health scoring
- Risk factor identification
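To illustrate the ratio-calculation feature: standard ratios are derived from line items extracted from the report. The sketch below shows the idea only; the field names and the function are hypothetical, not the system's actual schema or API.

```python
def compute_ratios(metrics: dict) -> dict:
    """Derive basic financial ratios from extracted line items.
    Assumes numerator fields are present when their denominator is;
    a real implementation would validate the extracted data first."""
    ratios = {}
    if metrics.get("current_liabilities"):
        ratios["current_ratio"] = metrics["current_assets"] / metrics["current_liabilities"]
    if metrics.get("revenue"):
        ratios["net_margin"] = metrics["net_income"] / metrics["revenue"]
    if metrics.get("total_equity"):
        ratios["debt_to_equity"] = metrics["total_debt"] / metrics["total_equity"]
    return ratios
```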
- Python 3.8+
- 4GB+ RAM
- 2GB+ disk space
- Clone the repository:

```bash
git clone <repository-url>
cd document-analysis-system
```

- Create a virtual environment:

```bash
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download language models:

```bash
# spaCy English model
python -m spacy download en_core_web_sm
```

- Configure environment variables:

```bash
# Copy the template config file
cp .env.example .env
# Edit the .env file and add your API keys:
# - DEEPSEEK_API_KEY (recommended)
# - OPENAI_API_KEY (optional)
```

- Start the application:

Web interface (Streamlit):

```bash
streamlit run app.py
```

API server (FastAPI):

```bash
python api.py
# or
uvicorn api:app --host 0.0.0.0 --port 8000 --reload
```
Access the application:
http://localhost:8501
Upload documents:
- Navigate to the "Document Upload" page
- Select document files
- Choose the document type
- Click "Process"
AI Enhanced Analysis (recommended):
- Navigate to the "AI Enhanced Analysis" page
- Select an uploaded document
- Choose the analysis focus
- Click "Start AI Enhanced Analysis"
- View the deep analysis results generated by DeepSeek AI
Specialized analysis:
- Choose the analysis page matching the document type
- Select an uploaded document
- Click the analysis button
- View the traditional analysis results
Knowledge base search:
- Enter a search query
- Select the search scope
- View the search results
Upload a document:

```bash
curl -X POST "http://localhost:8000/upload" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf" \
  -F "document_type=legal"
```

AI Enhanced Analysis (recommended):

```bash
curl -X POST "http://localhost:8000/analyze/enhanced/{document_id}" \
  -H "Content-Type: application/json" \
  -d '{"analysis_focus": "comprehensive"}'
```

DeepSeek document summary:

```bash
curl -X POST "http://localhost:8000/analyze/summarize/{document_id}" \
  -H "Content-Type: application/json" \
  -d '{"max_length": 300}'
```

Document comparison:

```bash
curl -X POST "http://localhost:8000/compare/documents" \
  -H "Content-Type: application/json" \
  -d '{"document_id1": "doc1_id", "document_id2": "doc2_id"}'
```

Analyze a legal document:

```bash
curl -X POST "http://localhost:8000/analyze/legal/{document_id}"
```

Search the knowledge base:

```bash
curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "合同条款", "document_type": "legal", "num_results": 5}'
```

Project structure:

```
document-analysis-system/
├── src/                       # Source code
│   ├── __init__.py
│   ├── config.py              # Configuration
│   ├── document_processor.py  # Document processing
│   ├── knowledge_base.py      # Knowledge base
│   ├── legal_analyzer.py      # Legal analysis
│   ├── academic_analyzer.py   # Academic analysis
│   └── financial_analyzer.py  # Financial analysis
├── data/                      # Data directory
│   ├── uploads/               # Uploaded files
│   ├── processed/             # Processed files
│   └── chroma_db/             # Vector database
├── tests/                     # Test files
├── config/                    # Config files
├── app.py                     # Streamlit web app
├── api.py                     # FastAPI server
├── requirements.txt           # Dependencies
├── .env.example               # Environment template
└── README.md                  # Project documentation
```
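The REST endpoints shown in the curl examples can also be called from a script. This is a minimal stdlib-only sketch, assuming the API server is running on localhost:8000; the helper names are illustrative, not part of the project.

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def search_payload(query: str, document_type: str = "legal",
                   num_results: int = 5) -> dict:
    """Build the JSON body expected by POST /search."""
    return {"query": query, "document_type": document_type,
            "num_results": num_results}

def search(query: str, **kwargs) -> dict:
    """POST a search query to the running API server and return the JSON reply."""
    body = json.dumps(search_payload(query, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE}/search", data=body,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `search("合同条款", num_results=5)` mirrors the last curl command above.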
```bash
# API keys
DEEPSEEK_API_KEY=your_deepseek_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

# Database configuration
CHROMA_DB_PATH=./data/chroma_db
UPLOAD_DIR=./data/uploads
PROCESSED_DIR=./data/processed

# Server configuration
HOST=0.0.0.0
PORT=8000
DEBUG=True

# Model configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
SUMMARIZATION_MODEL=facebook/bart-large-cnn
MAX_CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```

You can use different AI models by modifying the configuration:
```python
# Modify in config.py
embedding_model: str = "sentence-transformers/all-mpnet-base-v2"
summarization_model: str = "facebook/bart-large-cnn"
```

The system uses ChromaDB as its vector database. You can configure a different storage path:
```python
chroma_db_path: str = "/path/to/your/database"
```

Run the tests:

```bash
# Run all tests
python -m pytest tests/

# Run specific tests
python -m pytest tests/test_legal_analyzer.py
```

- Large documents are automatically chunked
- Adjust `MAX_CHUNK_SIZE` to optimize memory usage
- Use smaller embedding models for faster processing
- Enable GPU acceleration (if available)
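The chunking behavior mentioned above is governed by `MAX_CHUNK_SIZE` and `CHUNK_OVERLAP`. As a rough sketch of sliding-window chunking (the actual logic lives in `src/document_processor.py` and may differ in detail):

```python
from typing import List

def chunk_text(text: str, max_chunk_size: int = 1000,
               chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap
    characters, so context is preserved across chunk boundaries."""
    if chunk_overlap >= max_chunk_size:
        raise ValueError("chunk_overlap must be smaller than max_chunk_size")
    step = max_chunk_size - chunk_overlap  # how far the window advances
    return [text[i:i + max_chunk_size] for i in range(0, len(text), step)]
```

Smaller `MAX_CHUNK_SIZE` values mean more, shorter chunks, which lowers per-chunk memory pressure during embedding at the cost of more vector-store entries.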
- API key security: never hardcode API keys in source code
- File upload security: uploaded file types are validated
- Data privacy: uploaded documents are processed locally only
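The file-type validation mentioned above can be as simple as an extension allow-list; this sketch uses the upload formats listed in the features section, and the helper name `is_allowed_file` is illustrative rather than the project's actual API.

```python
from pathlib import Path

# Matches the supported upload formats: PDF, DOCX, TXT, XLSX
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".xlsx"}

def is_allowed_file(filename: str) -> bool:
    """Return True if the filename's extension is on the allow-list."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```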
- Fork the project
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
1. spaCy model not found

```bash
python -m spacy download en_core_web_sm
```

2. Out of memory

- Reduce `MAX_CHUNK_SIZE`
- Process fewer documents at once

3. API connection error

- Check your OpenAI API key
- Verify the network connection
The system uses loguru for logging. The log level can be controlled via an environment variable:

```bash
LOG_LEVEL=DEBUG python app.py
```

- Create an issue to report problems
- Send an email for support
- Check the documentation for more information
Version: 1.0.0
Last updated: January 2024