WARCbench 🛠️

A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

Quickstart

To install WARCbench, use Pip:

# From PyPI (recommended):
pip install warcbench

# Or directly from GitHub using HTTPS...
pip install git+https://github.com/harvard-lil/warcbench.git

# ...or SSH:
pip install git+ssh://git@github.com/harvard-lil/warcbench.git

Once WARCbench is installed, you may run it on the command line...

wb summarize example.com.warc

...or import it in your Python project:

from warcbench import WARCParser

with open('example.com.warc', 'rb') as warc_file:
    parser = WARCParser(warc_file)
    parser.parse()

⇧ Back to top

About

WARCbench has been designed as a resilient, efficient, and highly configurable tool for working with WARC files in all their variety. Among our motivations for the project:

Enable users to explore a WARC without prior knowledge of the format
Support inspection of malformed or misbehaving WARCs
Everything is configurable: plenty of hooks and custom callbacks
Flexibility to optimize for memory, speed, or convenience as needed
As little magic as possible: e.g., don't decode bytes into strings or deserialize headers until you need to

Many other useful open-source WARC packages can be found online. Among the inspirations for WARCbench are:

WARCbench is a project of the Harvard Library Innovation Lab.

⇧ Back to top

Command line usage

After installing WARCbench, you may use wb to interact with WARC files on the command line:

user@host~$ wb inspect example.com.warc

Record bytes 0-280

WARC/1.1
WARC-Filename: archive.warc
WARC-Date: 2024-11-04T19:10:55.900Z
WARC-Type: warcinfo
...

All commands support .warc, .warc.gz, and .wacz file formats.

To view a complete summary of WARCbench commands and options, invoke the --help flag:

user@host~$ wb --help

Usage: wb [OPTIONS] COMMAND [ARGS]...

  WARCbench command framework

Options:
  -o, --out [raw|json]            Format subcommand output as a human-readable
                                  report (raw) or as JSON.
  -v, --verbose                   Logging verbosity; repeatable.
  -d, --decompression [python|system]
                                  Use native Python or system tools for
                                  extracting archives.  [default: python]
  --gunzip / --no-gunzip          Gunzip the input archive before parsing, if
                                  it is gzipped.  [default: no-gunzip]
  -V, --version                   Show the version and exit.
  -h, --help                      Show this message and exit.

Commands:
  compare-headers     Compare the record headers of two archives.
  compare-parsers     Compare all available parsing strategies.
  extract             Extract files of MIMETYPE to disk.
  filter-records      Filter records; optionally extract to a new archive.
  inspect             Get detailed record metadata.
  match-record-pairs  Match requests/responses into pairs.
  summarize           Summarize the contents of an archive.
...

Each subcommand has its own, more-detailed --help text. For example, filter-records:

user@host~$ wb filter-records --help

Usage: wb filter-records [OPTIONS] FILEPATH

  Applies the specified filters (if any) to the archive's records. If no
  filters are specified, all WARC records are considered to match.

  By default, outputs the number of matching records. Use the `--output-*`
  options to include more detailed information about matching records, or
  `--no-output-count` to suppress the count.

  Can also extract the matching records to a new WARC file (`--extract-to-
  warc`, `--extract-to-gzipped-warc`). To ensure the new WARC includes a
  `WARC-Type: warcinfo` record (if present in the original), even if it would
  otherwise be filtered out by any applied filters, run with `--force-include-
  warcinfo`.

  If extracting records to a new WARC file, by default, no other output is
  produced. To produce a summary report as well, run with `--extract-summary-
  to`.

  To apply your own, custom filters, use `--custom-filter-path` to specify the
  path to a python file where the custom filter functions are listed, in
  desired order of application, in `__all__`. See `tests/assets/custom-
  filters.py` for an example. See the "Filters" section of the README for more
  information on constructing filters.

  This command also supports custom record handlers, which can be used to do
  arbitrary work on records that pass through the supplied filters. For
  example, you could use record handlers to construct a custom report, or
  export records one-at-a-time to an upstream service. Use `--custom-record-
  handler-path` to specify the path to a python file where the custom handler
  functions are listed, in desired order of application, in `__all__`. See
  `tests/assets/custom-handlers.py` for an example. See the "Handlers" section
  of the README for more information on constructing handlers.

  ---

  Example:

      $ wb filter-records --filter-by-warc-named-field Type response tests/assets/example.com.warc
      Found 6 records.

Options:
  --filter-by-http-header TEXT...
                                  Find records with WARC-Type: {request,
                                  response} and look for the supplied HTTP
                                  header name and value.
  --filter-by-http-response-content-type TEXT
                                  Find records with WARC-Type: response, and
                                  then filters by Content-Type.
  --filter-by-http-status-code INTEGER
                                  Find records with WARC-Type: response, and
                                  then filters by HTTP status code.
  --filter-by-http-verb TEXT      Find records with WARC-Type: request, and
                                  then filter by HTTP verb.
...

See tests/assets for sample outputs.

⇧ Back to top

Python usage

Parsing a WARC file

The WARCParser class is typically the best way to start interacting with a WARC file in Python:

from warcbench import WARCParser

# Instantiate a parser, passing in a file handle along with any other config
with open('example.com.warc', 'rb') as warc_file:
    parser = WARCParser(warc_file)

    # Iterate lazily over each record in the WARC...
    for record in parser.iterator():
        print(record.bytes)

    # ...or parse the entire file and produce a list of all records
    parser.parse(cache_records=True)
    print(len(parser.records))
    print(parser.records[3].header.bytes)

Parsing a Gzipped WARC file

You can parse and interact with a gzipped WARC file without gunzipping it using the WARCGZParser class.

This is not only for convenience, but for utility: by convention, WARCs are frequently gzipped one record at a time, such that a complete warc.gz file is in fact a series of concatenated, individually valid gzip files, or "members". This makes it possible to extract individual WARC records, if the byte offsets of their members are known in advance, without needing to gunzip the entire file, which in certain applications can be a significant performance improvement.

from warcbench import WARCGZParser

# Instantiate a parser, passing in a file handle along with any other config
with open('example.com.warc.gz', 'rb') as warcgz_file:
    parser = WARCGZParser(warcgz_file)

    # Iterate lazily over each record in the WARC...
    for record in parser.iterator(yield_type="records"):
        print(record.start, record.length)

    # ... or over each gzipped member...
    for member in parser.iterator(yield_type="members"):
        print(member.start, member.length, member.record.bytes)

    # ...or parse the entire file and produce a list of all members and records
    parser.parse(cache_members=True)
    print(len(parser.members))
    print(len(parser.records))
    print(parser.records[3].header.bytes)

Utility functions

For other use cases, such as extracting and working with WARCs in a WACZ file, you may wish to use WARCbench's utility functions:

from warcbench import WARCParser
from warcbench.utils import python_open_archive, system_open_archive

# Slower: uses Python zip/gzip to decompress
with python_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(warcgz_file)

with python_open_archive('example.com.wacz', gunzip=True) as warc_file:
    parser = WARCParser(warc_file)

# Faster: uses system zip/gzip to decompress where possible
with system_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(warcgz_file)

with system_open_archive('example.com.wacz', gunzip=True) as warc_file:
    parser = WARCParser(warc_file)

Filters, handlers, and callbacks

WARCbench includes several additional mechanisms for wrangling WARC records: filters, handlers, and callbacks.

Filters

Record Filters are functions that include or exclude a WARC record based on a given condition. You can pass in any function that accepts a warcbench.models.Record as its sole argument and returns a Boolean value. (A number of built-in filters are included in the warcbench.filters module.) Example:

from warcbench import WARCGZParser
from warcbench.config import WARCGZProcessorConfig
from warcbench.filters import warc_named_field_filter
from warcbench.utils import system_open_archive

with system_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(
        warcgz_file,
        processors=WARCGZProcessorConfig(
            record_filters=[
                warc_named_field_filter('type', 'request'),
            ]
        )
    )

Member Filters (only supported when using the WARCGZParser) behave just like record filters, except they work with warcbench.models.GzippedMember objects instead of Records. Example:

from warcbench import WARCGZParser
from warcbench.config import WARCGZProcessorConfig
from warcbench.utils import system_open_archive

with system_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(
        warcgz_file,
        processors=WARCGZProcessorConfig(
            member_filters=[
                # only yield malformed members
                lambda member: bool(member.uncompressed_non_warc_data),
            ]
        )
    )

Handlers

Record handlers are functions that process a record once it is parsed. For example, you could use a record handler to print each record's content in bytes for debugging purposes, or write each record to disk as a separate file. As with filters, you may pass in an arbitrary handler function that accepts a warcbench.models.Record as its sole argument; a handler's return value is ignored. Example:

from warcbench import WARCParser
from warcbench.config import WARCProcessorConfig
from warcbench.record_handlers import get_record_offsets
from warcbench.utils import system_open_archive

with system_open_archive('example.com.warc') as warc_file:
    parser = WARCParser(
        warc_file,
        processors=WARCProcessorConfig(
            record_handlers=[
                get_record_offsets(),
            ]
        )
    )

To support inspection of WARC files that contain invalid records, WARCbench also includes a way to specify handlers for unparsable lines. Unparsable line handlers behave just like record handlers, except that they accept warcbench.models.UnparsableLine objects instead of Records. You could use these handlers to print information about unparsable lines, or even repair them. Example:

from warcbench import WARCParser
from warcbench.config import WARCProcessorConfig
from warcbench.record_handlers import get_record_offsets
from warcbench.utils import system_open_archive

with system_open_archive('example.com.wacz') as warc_file:
    parser = WARCParser(
        warc_file,
        processors=WARCProcessorConfig(
            unparsable_line_handlers=[
                lambda line: print(line),
            ]
        )
    )

Callbacks

Callbacks are functions that run after the WARCbench parser finishes parsing a WARC file. A callback can be any function that accepts a warcbench.WARCParser or warcbench.WARCGZParser object as its sole argument. You could use a callback to print the number of records parsed, write the records out to disk, pass the full set of records over to another function, and so on.

Combining filters, handlers, and callbacks

Filters, handlers, and callbacks are additive, but you can combine them together to produce output of arbitrary complexity. Example:

from warcbench import WARCGZParser
from warcbench.config import WARCGZProcessorConfig
from warcbench.filters import warc_named_field_filter
from warcbench.utils import system_open_archive

def combo_filter(record):
    is_warc_info = lambda r: warc_named_field_filter('type', 'warcinfo')(r)

    targets_example_page = lambda r: warc_named_field_filter(
        'target-uri',
        'http://example.com/',
        exact_match=True
    )(r)

    return is_warc_info(record) or (
        targets_example_page(record) and
        http_verb_filter('get')(record) and
        http_status_filter(200)(record)
    )

with system_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(
        warcgz_file,
        processors=WARCGZProcessorConfig(
            record_filters=[
                combo_filter,
                record_content_length_filter('2056', 'le'),
            ]
        )
    )

Configuration

WARCbench supports a number of configuration options:

You can parse a WARC file by reading the WARC record headers' Content-Length fields (faster), or by scanning and splitting on the delimiter expected between WARC records (slower; may rarely detect false positives; more robust against mangled or broken WARCs).
You can parse a gzipped WARC file by reading and parsing the file member by member (much slower; simpler), or by gunzipping the entire file while making note of member/record boundaries, and then further processing the bytes of the decompressed records (much faster; may use more disk space).
You can choose whether or not to attempt to split WARC records into headers and content blocks.
You can choose whether to cache record properties (such as the bytes of headers or content blocks) during parsing, or to consume those bytes lazily on access, or both. These features are independent and can be used together.

See config.py for details.

⇧ Back to top

Development setup

We use uv for package dependency management, Ruff for code linting/formatting, and pytest for testing.

To set up a local development environment, follow these steps:

Install uv if it is not already installed
Clone this repository
From the project root, uv sync to set up a virtual environment and install dependencies

Linting/formatting

Run the linting process like so:

uv run ruff check

Run the formatting process like so:

# Check formatting changes before applying
uv run ruff format --check

# Apply formatting changes
uv run ruff format

Tests

Run tests like so:

uv run pytest

Coverage

Run tests with coverage reporting:

# Terminal coverage report
uv run pytest --cov=src/warcbench

# HTML coverage report (opens in browser)
uv run pytest --cov=src/warcbench --cov-report=html

The HTML report will be generated in the htmlcov/ directory. Open htmlcov/index.html in your browser to view the detailed coverage report.

Type checking

Run type checking with mypy:

uv run mypy

The mypy configuration is defined in mypy.ini.

⇧ Back to top

Name		Name	Last commit message	Last commit date
Latest commit History 348 Commits
.github/workflows		.github/workflows
src/warcbench		src/warcbench
tests		tests
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
mypy.ini		mypy.ini
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WARCbench 🛠️

Contents

Quickstart

About

Command line usage

Python usage

Parsing a WARC file

Parsing a Gzipped WARC file

Utility functions

Filters, handlers, and callbacks

Filters

Handlers

Callbacks

Combining filters, handlers, and callbacks

Configuration

Development setup

Linting/formatting

Tests

Coverage

Type checking

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

WARCbench 🛠️

Contents

Quickstart

About

Command line usage

Python usage

Parsing a WARC file

Parsing a Gzipped WARC file

Utility functions

Filters, handlers, and callbacks

Filters

Handlers

Callbacks

Combining filters, handlers, and callbacks

Configuration

Development setup

Linting/formatting

Tests

Coverage

Type checking

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages