
Go Web Crawler

This is a concurrent web crawler built with Go, a custom HTML parser, and MongoDB for data storage. It uses a worker pool to crawl pages efficiently, extracts useful information, and stores it in a structured format.


Design Decisions

Breadth-First Search (BFS)

I chose BFS to prioritize discovering top-level pages before diving deeper, which better simulates typical user exploration and ensures broader coverage of the site structure.
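
As a minimal sketch of the idea (single-threaded for clarity; the real crawler layers the worker pool described below on top, and fetchLinks here is a placeholder for the fetch-and-parse step):

    // crawlBFS visits pages level by level: a FIFO frontier plus a visited set.
    func crawlBFS(seed string, maxPages int, fetchLinks func(string) []string) []string {
        frontier := []string{seed} // FIFO queue: append to push, re-slice to pop
        visited := map[string]bool{seed: true}
        var order []string

        for len(frontier) > 0 && len(order) < maxPages {
            url := frontier[0]
            frontier = frontier[1:] // dequeue from the front => breadth-first
            order = append(order, url)

            for _, link := range fetchLinks(url) {
                if !visited[link] {
                    visited[link] = true
                    frontier = append(frontier, link) // enqueue at the back
                }
            }
        }
        return order
    }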

Worker Pool

Instead of spawning a goroutine for every URL (which is unbounded and can overwhelm both the crawler and the target server), I implemented a worker pool. This caps the number of concurrent connections, respecting both system resources and the target server's limits.
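
A minimal, self-contained version of the pattern (illustrative only, not the project's actual API):

    package main

    import (
        "fmt"
        "net/http"
        "sync"
    )

    func main() {
        jobs := make(chan string)
        var wg sync.WaitGroup

        const numWorkers = 5 // caps concurrent connections
        for i := 0; i < numWorkers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for url := range jobs { // each worker pulls URLs until the channel closes
                    resp, err := http.Get(url)
                    if err != nil {
                        fmt.Println("fetch failed:", err)
                        continue
                    }
                    resp.Body.Close()
                    fmt.Println(url, resp.Status)
                }
            }()
        }

        for _, u := range []string{"https://example.com", "https://example.org"} {
            jobs <- u
        }
        close(jobs)
        wg.Wait()
    }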

Custom HTML Parser

The project uses golang.org/x/net/html for tokenization, but the parsing logic is custom-built to efficiently extract titles, body text, and links while skipping irrelevant tags (scripts, styles, etc.).
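
A condensed sketch of that tokenizer-driven approach (the same technique, not the project's exact parser; the script/style skip here is simplified to a single text token):

    import (
        "strings"

        "golang.org/x/net/html"
    )

    // extractTitleAndLinks walks the token stream once, grabbing the <title>
    // text and every href while skipping <script>/<style> content.
    func extractTitleAndLinks(doc string) (title string, links []string) {
        z := html.NewTokenizer(strings.NewReader(doc))
        inTitle := false
        for {
            switch z.Next() {
            case html.ErrorToken: // io.EOF (or a real error) ends the stream
                return title, links
            case html.StartTagToken:
                t := z.Token()
                switch t.Data {
                case "script", "style":
                    z.Next() // skip the tag's inner text
                case "title":
                    inTitle = true
                case "a":
                    for _, a := range t.Attr {
                        if a.Key == "href" {
                            links = append(links, a.Val)
                        }
                    }
                }
            case html.EndTagToken:
                if z.Token().Data == "title" {
                    inTitle = false
                }
            case html.TextToken:
                if inTitle {
                    title += string(z.Text())
                }
            }
        }
    }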


Features

  • Concurrency: Uses a worker pool for parallel processing.
  • Politeness: Checks robots.txt before crawling (see the sketch after this list).
  • Data Persistence: Stores crawled metadata and content in MongoDB.
  • Metrics: Tracks and logs crawl statistics (duration, pages/sec, success/failure rates) upon completion.
  • Dockerized: Fully containerized with Docker and Docker Compose for easy setup.
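
As a rough illustration of the politeness check referenced above (a deliberately naive, standard-library-only sketch: it honors only Disallow rules under User-agent: *, while a real parser also handles per-agent groups, Allow rules, and wildcards):

    import (
        "bufio"
        "net/http"
        "strings"
    )

    // disallowedPrefixes fetches robots.txt and returns the Disallow path
    // prefixes that apply to all agents (User-agent: *).
    func disallowedPrefixes(site string) ([]string, error) {
        resp, err := http.Get(strings.TrimRight(site, "/") + "/robots.txt")
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        var rules []string
        applies := false
        sc := bufio.NewScanner(resp.Body)
        for sc.Scan() {
            line := strings.TrimSpace(sc.Text())
            switch {
            case strings.HasPrefix(line, "User-agent:"):
                applies = strings.TrimSpace(strings.TrimPrefix(line, "User-agent:")) == "*"
            case applies && strings.HasPrefix(line, "Disallow:"):
                if p := strings.TrimSpace(strings.TrimPrefix(line, "Disallow:")); p != "" {
                    rules = append(rules, p)
                }
            }
        }
        return rules, sc.Err()
    }

    // allowed reports whether path escapes every Disallow prefix.
    func allowed(path string, rules []string) bool {
        for _, r := range rules {
            if strings.HasPrefix(path, r) {
                return false
            }
        }
        return true
    }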

Project Structure

The project follows a standard Go project layout:

  • cmd/crawler: Application entry point.
  • internal/crawler: Core crawler logic and worker pool implementation.
  • internal/queue: Thread-safe URL queue.
  • internal/storage: MongoDB storage implementation (see the sketch after this list).
  • internal/parser: HTML parsing logic.
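
For reference, persisting a crawled page with the official Go driver looks roughly like the sketch below (the database and collection names are assumptions, not necessarily what internal/storage uses):

    import (
        "context"
        "time"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    // savePage inserts one crawled page as a document.
    func savePage(ctx context.Context, uri, url, title, body string) error {
        client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
        if err != nil {
            return err
        }
        defer client.Disconnect(ctx)

        pages := client.Database("crawler").Collection("pages") // assumed names
        _, err = pages.InsertOne(ctx, bson.M{
            "url":        url,
            "title":      title,
            "body":       body,
            "crawled_at": time.Now(),
        })
        return err
    }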

How to Run

Prerequisites

  • Docker & Docker Compose (recommended)
  • OR Go 1.23+ and a running MongoDB instance

Using Docker (Recommended)

The easiest way to run the crawler is using the provided Makefile and Docker Compose:

  1. Start the Crawler:

    make docker-up

    This will build the image, start MongoDB, and begin crawling the seed URL.

  2. Stop the Crawler:

    make docker-down

Running Locally

  1. Install dependencies:

    go mod tidy
  2. Set up Environment: Copy the example environment file:

    cp .env.example .env

    Update .env with your MongoDB credentials if necessary (see the illustrative example after these steps).

  3. Run the application:

    make run
    # OR
    go run cmd/crawler/main.go
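
For illustration, a typical .env might look like the lines below; the authoritative keys are whatever .env.example defines, and these names are placeholders:

    # Placeholder keys for illustration; copy the real ones from .env.example
    MONGO_URI=mongodb://localhost:27017
    SEED_URL=https://example.com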

Metrics

When the crawl finishes (or is interrupted), the crawler logs statistics to the console:

database connected successfully
Starting crawler...

--- Crawl Statistics ---
Total Duration: 4.500428338s
Total Pages Crawled: 0
Successful Requests: 0
Failed Requests: 1
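
Counters like these are simple to track safely across workers with atomics; a minimal sketch (illustrative names, not the project's actual types):

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    // Stats accumulates crawl counters across workers without locks.
    type Stats struct {
        start     time.Time
        succeeded atomic.Int64
        failed    atomic.Int64
    }

    func (s *Stats) Report() {
        d := time.Since(s.start)
        ok, bad := s.succeeded.Load(), s.failed.Load()
        fmt.Println("--- Crawl Statistics ---")
        fmt.Println("Total Duration:", d)
        fmt.Println("Total Pages Crawled:", ok) // here: one page per successful request
        fmt.Println("Successful Requests:", ok)
        fmt.Println("Failed Requests:", bad)
        fmt.Printf("Pages/sec: %.2f\n", float64(ok)/d.Seconds())
    }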

Testing

Run the test suite using:

make test

Tech Stack

  • Language: Go 1.23
  • Database: MongoDB
  • Containerization: Docker
  • Orchestration: Docker Compose
