This is a concurrent web crawler built with Go, a custom HTML parser, and MongoDB for storage. It uses a worker pool to crawl pages efficiently, extracts useful information from each page, and stores it in a structured format.
I chose BFS to prioritize discovering top-level pages before diving deeper, which better simulates typical user exploration and ensures broader coverage of the site structure.
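As a minimal sketch of what makes the crawl breadth-first: the frontier is a FIFO queue, so links discovered on a page are appended to the back and visited only after everything already queued. The names below are illustrative, not the repo's exact API:

```go
package queue

import "sync"

// Queue is a minimal thread-safe FIFO frontier. Dequeuing from the
// front while appending newly discovered links to the back yields
// breadth-first order.
type Queue struct {
	mu    sync.Mutex
	items []string
}

// Enqueue appends a URL to the back of the frontier.
func (q *Queue) Enqueue(url string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, url)
}

// Dequeue pops the oldest URL; ok is false when the frontier is empty.
func (q *Queue) Dequeue() (url string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	url = q.items[0]
	q.items = q.items[1:]
	return url, true
}
```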
Instead of spawning a goroutine for every URL (which would be unbounded), I implemented a worker pool. This caps the number of concurrent connections, respecting both system resources and the target server's limits.
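A minimal sketch of the pattern, assuming a fixed worker count reading URLs from a shared channel (function and variable names are illustrative, and the feedback loop that enqueues newly discovered links is elided):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// crawl fetches a single URL; parsing and link extraction are elided here.
func crawl(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println(url, resp.Status)
	return nil
}

func main() {
	const numWorkers = 8 // caps concurrent connections
	jobs := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				if err := crawl(url); err != nil {
					fmt.Println("failed:", url, err)
				}
			}
		}()
	}

	for _, u := range []string{"https://example.com", "https://example.org"} {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```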
The project uses golang.org/x/net/html for tokenization, but the parsing logic is custom-built to efficiently extract titles, body text, and links while skipping irrelevant tags (scripts, styles, etc.).
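A sketch of that style of tokenizer loop, built on golang.org/x/net/html's streaming tokenizer. The `Page` type and its fields are assumptions for illustration, not the repo's actual types:

```go
package parser

import (
	"io"
	"strings"

	"golang.org/x/net/html"
)

// Page holds the fields the crawler cares about.
type Page struct {
	Title string
	Text  []string
	Links []string
}

// Parse walks the token stream once, collecting the title, visible
// text, and href attributes while skipping script/style content.
func Parse(r io.Reader) Page {
	var page Page
	z := html.NewTokenizer(r)
	skipDepth := 0 // > 0 while inside <script> or <style>
	inTitle := false

	for {
		switch z.Next() {
		case html.ErrorToken: // io.EOF or malformed input ends the walk
			return page
		case html.StartTagToken:
			t := z.Token()
			switch t.Data {
			case "script", "style":
				skipDepth++
			case "title":
				inTitle = true
			case "a":
				for _, a := range t.Attr {
					if a.Key == "href" {
						page.Links = append(page.Links, a.Val)
					}
				}
			}
		case html.EndTagToken:
			t := z.Token()
			switch t.Data {
			case "script", "style":
				if skipDepth > 0 {
					skipDepth--
				}
			case "title":
				inTitle = false
			}
		case html.TextToken:
			if skipDepth > 0 {
				continue
			}
			text := strings.TrimSpace(string(z.Text()))
			if text == "" {
				continue
			}
			if inTitle {
				page.Title = text
			} else {
				page.Text = append(page.Text, text)
			}
		}
	}
}
```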
- Concurrency: Uses a worker pool for parallel processing.
- Politeness: Checks `robots.txt` before crawling (a simplified sketch follows this list).
- Data Persistence: Stores crawled metadata and content in MongoDB.
- Metrics: Tracks and logs crawl statistics (duration, pages/sec, success/failure rates) upon completion.
- Dockerized: Fully containerized with Docker and Docker Compose for easy setup.
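As an illustration of the politeness check, here is a deliberately simplified `robots.txt` reader that collects Disallow prefixes for the wildcard user-agent. A real parser must also handle per-agent groups, Allow rules, comments, and wildcards:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// disallowedPrefixes fetches robots.txt and returns the Disallow path
// prefixes that apply to all agents ("User-agent: *"). Simplified sketch.
func disallowedPrefixes(site string) ([]string, error) {
	resp, err := http.Get(site + "/robots.txt")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var prefixes []string
	applies := false // true while inside a "User-agent: *" group
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "User-agent:"):
			agent := strings.TrimSpace(strings.TrimPrefix(line, "User-agent:"))
			applies = agent == "*"
		case applies && strings.HasPrefix(line, "Disallow:"):
			p := strings.TrimSpace(strings.TrimPrefix(line, "Disallow:"))
			if p != "" {
				prefixes = append(prefixes, p)
			}
		}
	}
	return prefixes, sc.Err()
}

func main() {
	prefixes, err := disallowedPrefixes("https://example.com")
	if err != nil {
		fmt.Println("robots.txt fetch failed:", err)
		return
	}
	fmt.Println("disallowed prefixes:", prefixes)
}
```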
The project follows a standard Go project layout:
- `cmd/crawler`: Application entry point.
- `internal/crawler`: Core crawler logic and worker pool implementation.
- `internal/queue`: Thread-safe URL queue.
- `internal/storage`: MongoDB storage implementation.
- `internal/parser`: HTML parsing logic.
- Docker & Docker Compose (recommended)
- OR Go 1.23+ and a running MongoDB instance
The easiest way to run the crawler is using the provided Makefile and Docker Compose:
- Start the Crawler:

  ```sh
  make docker-up
  ```

  This will build the image, start MongoDB, and begin crawling the seed URL.

- Stop the Crawler:

  ```sh
  make docker-down
  ```
- Install dependencies:

  ```sh
  go mod tidy
  ```

- Set up Environment: Copy the example environment file and update `.env` with your MongoDB credentials if necessary (a hypothetical example follows these steps):

  ```sh
  cp .env.example .env
  ```

- Run the application:

  ```sh
  make run
  # OR
  go run cmd/crawler/main.go
  ```
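The variable names below are hypothetical and only illustrate the shape of the file; check `.env.example` for the actual keys:

```
# Hypothetical example; see .env.example for the real variable names.
MONGODB_URI=mongodb://localhost:27017
MONGODB_DATABASE=crawler
```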
When the crawl finishes (or is interrupted), the crawler logs statistics to the console:
```text
database connected successfully
Starting crawler...
--- Crawl Statistics ---
Total Duration: 4.500428338s
Total Pages Crawled: 0
Successful Requests: 0
Failed Requests: 1
```
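Counters like these can be updated safely from many workers with atomic operations; a minimal sketch (illustrative, not the repo's exact implementation):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Stats accumulates crawl counters; atomic ops make it safe to update
// from many workers without a mutex.
type Stats struct {
	start   time.Time
	success atomic.Int64
	failed  atomic.Int64
}

// Report prints the summary in the format shown above.
func (s *Stats) Report() {
	elapsed := time.Since(s.start)
	total := s.success.Load() + s.failed.Load()
	fmt.Println("--- Crawl Statistics ---")
	fmt.Printf("Total Duration: %v\n", elapsed)
	fmt.Printf("Total Pages Crawled: %d\n", total)
	fmt.Printf("Successful Requests: %d\n", s.success.Load())
	fmt.Printf("Failed Requests: %d\n", s.failed.Load())
	fmt.Printf("Pages/sec: %.2f\n", float64(total)/elapsed.Seconds())
}

func main() {
	s := &Stats{start: time.Now()}
	s.success.Add(1) // workers would call these as requests complete
	s.failed.Add(1)
	s.Report()
}
```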
Run the test suite using:

```sh
make test
```

- Language: Go 1.23
- Database: MongoDB
- Containerization: Docker
- Orchestration: Docker Compose