📊 Document Analysis System

A comprehensive cross-platform document analysis system supporting legal document analysis, academic paper summarization, financial report analysis, and a RAG knowledge base.

✨ Features

🔧 Core Features

📤 Multi-format document upload (PDF, DOCX, TXT, XLSX)
🧠 Intelligent document processing with chunking
🤖 DeepSeek LLM integration for enhanced analysis
🔍 RAG-based knowledge base with vector search
⚖️ Legal document analysis with clause extraction
🎓 Academic paper summarization and analysis
💰 Financial report analysis with metrics extraction
🌐 Cross-platform web interface
📊 Data visualization and reporting

🔍 Analysis Features

🤖 AI Enhanced Analysis (NEW!)

DeepSeek LLM integration for deep analysis
Intelligent document comprehension
Multi-perspective content analysis
Key insight extraction
Intelligent question generation
Document comparison analysis

⚖️ Legal Document Analysis

Key clause extraction
Risk assessment
Compliance scoring
Named entity recognition
Contract party identification

🎓 Academic Paper Analysis

Automatic abstract generation
Paper structure analysis
Keyword extraction
Citation analysis
Readability assessment

💰 Financial Report Analysis

Key financial metrics extraction
Financial ratio calculation
Trend analysis
Financial health scoring
Risk factor identification
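
To illustrate the kind of calculation that financial ratio analysis involves, here is a minimal sketch. It is not the project's implementation: the function names, the metric keys, and the choice of ratios are all assumptions made for this example.

```python
from typing import Dict, Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return None if denominator == 0 else numerator / denominator


def compute_ratios(metrics: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Compute a few common ratios from extracted metrics.

    The metric keys used here are hypothetical; a real extractor may
    name them differently.
    """
    return {
        "current_ratio": safe_ratio(
            metrics["current_assets"], metrics["current_liabilities"]
        ),
        "debt_to_equity": safe_ratio(
            metrics["total_liabilities"], metrics["shareholders_equity"]
        ),
        "net_profit_margin": safe_ratio(
            metrics["net_income"], metrics["revenue"]
        ),
    }
```

Returning None for a zero denominator keeps a missing or zero figure from raising an error partway through a report.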

🚀 Quick Start

📋 Requirements

Python 3.8+
4GB+ RAM
2GB+ disk space

🔧 Installation

Clone the repository

git clone <repository-url>
cd document-analysis-system

Create a virtual environment

python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Download language models

# spaCy English model
python -m spacy download en_core_web_sm

Configure the environment

# Copy the configuration template
cp .env.example .env

# Edit the .env file and add your API keys:
# - DEEPSEEK_API_KEY (recommended)
# - OPENAI_API_KEY (optional)

Start the application

Web Interface (Streamlit):

streamlit run app.py

API Server (FastAPI):

python api.py
# or
uvicorn api:app --host 0.0.0.0 --port 8000 --reload

📖 Usage Guide

🌐 Web Interface Usage

Access the application: http://localhost:8501
Upload documents:
- Navigate to the "Document Upload" page
- Select the document files
- Choose the document type
- Click "Process"
AI Enhanced Analysis (recommended):
- Navigate to the "AI Enhanced Analysis" page
- Select an uploaded document
- Choose the analysis focus
- Click "Start AI Enhanced Analysis"
- View the in-depth analysis results generated by DeepSeek AI
Specialized Analysis:
- Choose the analysis page that matches the document type
- Select an uploaded document
- Click the analysis button
- View the traditional analysis results
Knowledge base search:
- Enter a search query
- Select the search scope
- View the search results
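
Conceptually, a vector search ranks stored chunks by how similar their embeddings are to the query embedding. The sketch below illustrates that idea with plain Python lists. In the project itself the ranking is delegated to ChromaDB, so the function names, the in-memory index, and the use of cosine similarity here are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in index.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

A real vector database avoids scoring every stored vector by using an approximate nearest-neighbour index, but the ranking principle is the same.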

🔌 API Usage

Upload a document:

curl -X POST "http://localhost:8000/upload" \
  -H "Content-Type: multipart/form-data" \
  -F "[email protected]" \
  -F "document_type=legal"

AI Enhanced Analysis (recommended):

curl -X POST "http://localhost:8000/analyze/enhanced/{document_id}" \
  -H "Content-Type: application/json" \
  -d '{"analysis_focus": "comprehensive"}'

DeepSeek Document Summary:

curl -X POST "http://localhost:8000/analyze/summarize/{document_id}" \
  -H "Content-Type: application/json" \
  -d '{"max_length": 300}'

Document Comparison:

curl -X POST "http://localhost:8000/compare/documents" \
  -H "Content-Type: application/json" \
  -d '{"document_id1": "doc1_id", "document_id2": "doc2_id"}'

Analyze a legal document:

curl -X POST "http://localhost:8000/analyze/legal/{document_id}"

Search the knowledge base:

curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "contract clauses", "document_type": "legal", "num_results": 5}'
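
The same search call can be made from Python. The following is a minimal sketch using only the standard library. The endpoint and JSON fields are taken from the curl example above; the helper names are hypothetical, and since the response format is not documented here, the result is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request
from typing import Any

BASE_URL = "http://localhost:8000"  # default API address from this README


def build_search_request(
    query: str, document_type: str = "legal", num_results: int = 5
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /search endpoint."""
    payload = {
        "query": query,
        "document_type": document_type,
        "num_results": num_results,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, document_type: str = "legal", num_results: int = 5) -> Any:
    """Send the search request and return the decoded JSON response."""
    request = build_search_request(query, document_type, num_results)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server running, search("contract clauses") would send the same request as the curl command.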

📁 Project Structure

document-analysis-system/
├── src/                          # Source code
│   ├── __init__.py
│   ├── config.py                 # Configuration
│   ├── document_processor.py     # Document processing
│   ├── knowledge_base.py         # Knowledge base
│   ├── legal_analyzer.py         # Legal analysis
│   ├── academic_analyzer.py      # Academic analysis
│   └── financial_analyzer.py     # Financial analysis
├── data/                         # Data directory
│   ├── uploads/                  # Uploaded files
│   ├── processed/                # Processed files
│   └── chroma_db/                # Vector database
├── tests/                        # Test files
├── config/                       # Config files
├── app.py                        # Streamlit web app
├── api.py                        # FastAPI API server
├── requirements.txt              # Dependencies
├── .env.example                  # Environment template
└── README.md                     # Documentation

⚙️ Configuration Options

Environment Variables

# API Keys
DEEPSEEK_API_KEY=your_deepseek_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
CHROMA_DB_PATH=./data/chroma_db
UPLOAD_DIR=./data/uploads
PROCESSED_DIR=./data/processed

# Server Configuration
HOST=0.0.0.0
PORT=8000
DEBUG=True

# Model Configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
SUMMARIZATION_MODEL=facebook/bart-large-cnn
MAX_CHUNK_SIZE=1000
CHUNK_OVERLAP=200
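
One way such variables could be read into typed settings is sketched below with the standard library. The project's actual config.py may work differently (for example, through a settings library). The Settings class and load_settings function are hypothetical names, and the defaults simply mirror the values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    chroma_db_path: str
    upload_dir: str
    processed_dir: str
    host: str
    port: int
    debug: bool
    embedding_model: str
    summarization_model: str
    max_chunk_size: int
    chunk_overlap: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, falling back to defaults."""
    return Settings(
        chroma_db_path=env.get("CHROMA_DB_PATH", "./data/chroma_db"),
        upload_dir=env.get("UPLOAD_DIR", "./data/uploads"),
        processed_dir=env.get("PROCESSED_DIR", "./data/processed"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        # Treat "1", "true", and "yes" (any case) as enabled.
        debug=env.get("DEBUG", "True").strip().lower() in {"1", "true", "yes"},
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        summarization_model=env.get(
            "SUMMARIZATION_MODEL", "facebook/bart-large-cnn"
        ),
        max_chunk_size=int(env.get("MAX_CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
    )
```

Passing the environment as a parameter makes the loader easy to test with a plain dictionary.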

🔧 Advanced Configuration

Custom Analysis Models

You can use different AI models by modifying the configuration:

# Modify in config.py
embedding_model: str = "sentence-transformers/all-mpnet-base-v2"
summarization_model: str = "facebook/bart-large-cnn"

Database Configuration

The system uses ChromaDB as its vector database. You can configure a different storage path:

chroma_db_path: str = "/path/to/your/database"

🧪 Testing

# Run all tests
python -m pytest tests/

# Run a specific test file
python -m pytest tests/test_legal_analyzer.py

📊 Performance Optimization

Memory Optimization

Large documents are automatically processed in chunks
Adjust MAX_CHUNK_SIZE to tune memory usage

Speed Optimization

Use a smaller embedding model for faster processing
Enable GPU acceleration (if available)
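
To show how MAX_CHUNK_SIZE and CHUNK_OVERLAP interact, here is a minimal sliding-window sketch. It assumes both settings are measured in characters, which this README does not confirm; the project's document processor may instead split on tokens or sentences. The function name chunk_text is hypothetical.

```python
from typing import List


def chunk_text(text: str, max_chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of at most max_chunk_size characters.

    Each window starts (max_chunk_size - overlap) characters after the
    previous one, so consecutive chunks share `overlap` characters.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")

    step = max_chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + max_chunk_size >= len(text):
            break
    return chunks
```

A smaller max_chunk_size lowers the memory needed per chunk but produces more chunks to embed and store, so the setting is a trade-off rather than a pure saving.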

🔒 Security Considerations

API Key Security: Never hardcode API keys in your code
File Upload Security: The system validates file types
Data Privacy: Uploaded documents are processed locally only
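
As a rough illustration of file type validation, the sketch below checks the extension against the formats listed under Core Features. This is an assumption about how validation might work, not the project's actual check, and the names are hypothetical. An extension test alone does not prove what a file contains, so a production check would usually inspect the content as well.

```python
from pathlib import Path

# Formats listed in this README's feature list.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".xlsx"}


def is_allowed_upload(filename: str) -> bool:
    """Return True if the file's final extension is in the allowed set."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```

Because only the final suffix is examined, a name such as archive.pdf.exe is rejected.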

🤝 Contributing

Fork the project
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Create a Pull Request

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.

🆘 Troubleshooting

Common Issues

1. spaCy model not found

python -m spacy download en_core_web_sm

2. Out of memory

Reduce MAX_CHUNK_SIZE
Process fewer documents at once

3. API connection error

Check your API key (DeepSeek or OpenAI)
Verify your network connection

Logging and Debugging

The system uses loguru for logging. The log level can be controlled via an environment variable:

LOG_LEVEL=DEBUG streamlit run app.py
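
For LOG_LEVEL to take effect, the application has to read it and pass it to loguru. The sketch below shows one way this could be done; it is an assumption, not the project's code, and resolve_log_level is a hypothetical helper. The level names are loguru's standard ones, and the loguru wiring is shown only in comments so that the snippet runs without the library installed.

```python
import os
from typing import Mapping

# loguru's built-in level names.
LOGURU_LEVELS = {"TRACE", "DEBUG", "INFO", "SUCCESS", "WARNING", "ERROR", "CRITICAL"}


def resolve_log_level(env: Mapping[str, str] = os.environ, default: str = "INFO") -> str:
    """Return a valid level name from LOG_LEVEL, or the default if unset or invalid."""
    level = env.get("LOG_LEVEL", default).strip().upper()
    return level if level in LOGURU_LEVELS else default


# Typical wiring with loguru (requires `pip install loguru`):
#   import sys
#   from loguru import logger
#   logger.remove()  # drop the default handler
#   logger.add(sys.stderr, level=resolve_log_level())
```

Falling back to the default on an unknown value avoids a startup failure caused by a typo in the environment.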

📞 Support and Contact

Create an issue to report problems
Send an email for support
Check the documentation for more information

Version: 1.0.0
Last Updated: January 2024
