A web crawler tool built with PocketFlow that crawls websites and analyzes their content using an LLM.
- Crawls websites while respecting domain boundaries
- Extracts text content and links from pages
- Analyzes content using GPT-4 to generate:
  - Page summaries
  - Main topics/keywords
  - Content type classification
- Processes pages in batches for efficiency
- Generates a comprehensive analysis report
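The link extraction and domain-boundary features above can be illustrated with a minimal sketch. It uses only the standard library so it runs on its own, whereas the project itself lists requests and beautifulsoup4; the names `LinkAndTextParser` and `extract_page` are hypothetical and are not taken from `tools/crawler.py`.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collects href values from <a> tags and visible text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page(base_url, html):
    """Return (text, same_domain_links) for one page of HTML."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()

    base_host = urlparse(base_url).netloc
    same_domain = []
    for href in parser.links:
        # Resolve relative links and drop "#fragment" so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if (
            parsed.scheme in ("http", "https")
            and parsed.netloc == base_host
            and absolute not in same_domain
        ):
            same_domain.append(absolute)
    return " ".join(parser.text_parts), same_domain
```

Comparing `netloc` values treats `www.example.com` and `example.com` as different domains; whether the actual crawler does the same is not stated here.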
- Clone the repository
- Install dependencies:
    pip install -r requirements.txt
- Set your OpenAI API key:
    export OPENAI_API_KEY='your-api-key'
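A script can check that the key is present before making any API calls. The sketch below is an assumption about how such a check might look; `require_api_key` is a hypothetical helper, not a function from this repository.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the crawler")
    return key
```

Failing early here avoids crawling a site only to hit an authentication error at the analysis step.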
Run the crawler:
python main.py
You will be prompted to:
- Enter the website URL to crawl
- Specify the maximum number of pages to crawl (default: 10)
The tool will then:
- Crawl the specified website
- Extract and analyze content using GPT-4
- Generate a report with findings
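The crawl step with its page limit, and the batching mentioned in the feature list, can be sketched as follows. This is a simplified illustration, not the code in `nodes.py`: `crawl`, `batches`, and the `fetch_links` callable (URL in, list of same-domain URLs out) are hypothetical, and the breadth-first order is an assumption.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=10):
    """Breadth-first crawl that visits at most max_pages URLs.

    fetch_links is a hypothetical callable: url -> list of same-domain URLs.
    """
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Passing `fetch_links` in as a parameter keeps the traversal logic testable without network access; in practice it would wrap the HTTP fetch and link extraction.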
pocketflow-tool-crawler/
├── tools/
│ ├── crawler.py # Web crawling functionality
│ └── parser.py # Content analysis using LLM
├── utils/
│ └── call_llm.py # LLM API wrapper
├── nodes.py # PocketFlow nodes
├── flow.py # Flow configuration
├── main.py # Main script
└── requirements.txt # Dependencies
- Only crawls within the same domain
- Text content only (no images/media)
- Throughput is rate limited by the OpenAI API
- Basic error handling
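One common way to cope with API rate limits is to retry failed calls with exponential backoff. The sketch below shows that general technique only; it is not a description of what `utils/call_llm.py` does, and `call_with_retry` and its parameters are hypothetical.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    Delays are base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A real wrapper would likely catch only the rate-limit and transient network errors raised by the client library rather than every exception.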
- pocketflow: Flow-based processing
- requests: HTTP requests
- beautifulsoup4: HTML parsing
- openai: GPT-4 API access