A DuckDB extension that adds a cache layer for remote filesystem access.
Since DuckDB v1.0.0, cache httpfs can be loaded as a community extension without requiring the unsigned flag. From any DuckDB instance, the following two commands will allow you to install and load the extension:
INSTALL cache_httpfs from community;
-- Or upgrade to latest version with `FORCE INSTALL cache_httpfs from community;`
LOAD cache_httpfs;

See the cache httpfs community extension page for more information.
This extension provides a read-only filesystem for remote access, which serves as a cache layer on top of DuckDB's httpfs.
Key features:
- Data caching for remote file access, which improves IO performance and reduces egress cost; several cache types and entities are supported
  - in-memory: fetched file content is cached in blocks, and an LRU cache evicts stale blocks
  - on-disk (default): blocks that have already been read are stored on the local filesystem and, when disk space runs low, evicted based on their access timestamp
  - no cache: caching can be disabled to fall back to httpfs without any side effects
- Parallel read: read operations are split into size-tunable chunks to increase the cache hit rate and improve performance
- Apart from data blocks, the extension also caches file handles, file metadata and glob operations
  - The caches for these entities are enabled by default.
- Profiling helps us understand the system better; key metrics include cache access stats and IO operation latency. We plan to support multiple ways to access profile results; as of now there are three types of profiling
  - temp: all access stats are stored in memory and can be retrieved via `SELECT cache_httpfs_get_profile();`
  - duckdb (work in progress): stats are stored in DuckDB so that its rich features can be used for analysis (e.g. a histogram to understand the latency distribution)
  - profiling is disabled by default
- Cache status query functions provide visibility into cache state and access:
  - `cache_httpfs_cache_status_query()`: returns information about all cached entries, including cache filepath, remote filename, byte ranges (start_offset, end_offset), and cache type (in-memory or on-disk)
  - `cache_httpfs_cache_access_info_query()`: returns cache access statistics, including cache hit/miss counts, bytes read, and bytes cached for different cache entities (data, metadata, file handles, glob)
- 100% compatibility with DuckDB `httpfs`
  - The extension is built upon the `httpfs` extension and automatically loads it beforehand, so it is fully compatible with it; the option `SET cache_httpfs_type='noop'; SET enable_external_file_cache=true;` makes the extension fall back to, and behave exactly like, httpfs.
- Interaction with DuckDB's internal "external file cache": DuckDB enables the external file cache by default, so to avoid double caching the cache_httpfs extension disables it by default. It can be re-enabled with `SET enable_external_file_cache=true;`.
- Able to wrap ALL DuckDB-compatible filesystems with one simple SQL statement, `SELECT cache_httpfs_wrap_cache_filesystem(<your-fs>)`, and get all the benefits of caching, parallel read and IO performance stats.
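To make the in-memory mode described above more concrete, here is a minimal Python sketch of a block cache with LRU eviction. It only illustrates the idea and is not the extension's implementation: the class name `LruBlockCache`, the `max_blocks` limit and the `(path, block start)` key are all assumptions chosen for the example.

```python
from collections import OrderedDict
from typing import Optional, Tuple

# Hypothetical cache key: (remote path, start offset of the block).
BlockKey = Tuple[str, int]


class LruBlockCache:
    """Illustrative in-memory block cache that evicts the least recently used block."""

    def __init__(self, max_blocks: int) -> None:
        if max_blocks <= 0:
            raise ValueError("max_blocks must be positive")
        self._max_blocks = max_blocks
        self._blocks: "OrderedDict[BlockKey, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: BlockKey) -> Optional[bytes]:
        data = self._blocks.get(key)
        if data is None:
            self.misses += 1
            return None
        # Mark the block as most recently used.
        self._blocks.move_to_end(key)
        self.hits += 1
        return data

    def put(self, key: BlockKey, data: bytes) -> None:
        self._blocks[key] = data
        self._blocks.move_to_end(key)
        # Evict from the least recently used end until within capacity.
        while len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)


cache = LruBlockCache(max_blocks=2)
cache.put(("s3://bucket/t.parquet", 0), b"block-0")
cache.put(("s3://bucket/t.parquet", 1048576), b"block-1")
cache.get(("s3://bucket/t.parquet", 0))            # hit, block 0 becomes most recent
cache.put(("s3://bucket/t.parquet", 2097152), b"block-2")  # evicts block 1
print(cache.get(("s3://bucket/t.parquet", 1048576)))       # None
```

The hit and miss counters are a simplified stand-in for the kind of cache access stats that profiling reports.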
Caveat:
- The extension is implemented for object storage, which is expected to be a read-heavy workload with (mostly) immutable objects, so it only supports a read cache (at the moment); the cache is not cleared when the same object is written.
  - As a workaround for overwrites, users can call `cache_httpfs_clear_cache` to delete all cache content, or `cache_httpfs_clear_cache_for_file` for a specific object.
  - All cache types provide an eventual consistency guarantee: entries are evicted after a tunable timeout.
- Filesystem requests are split into multiple sub-requests aligned to the block size for parallel IO and cache efficiency, so small requests (e.g. reading 1 byte) can suffer read amplification. A workaround is to reduce the block size via `cache_httpfs_cache_block_size`, or to fall back to native httpfs.
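The read amplification caveat can be illustrated with a little arithmetic. The Python sketch below computes which block-aligned ranges cover a request and how many bytes are fetched per byte requested. The helper names `aligned_blocks` and `read_amplification` are hypothetical, and the sketch assumes every block is fetched in full (it ignores a shorter final block at end of file), so treat it as an approximation rather than the extension's exact behavior.

```python
from typing import List, Tuple

MIB = 1024 * 1024  # 1 MiB block size, used here as the example value


def aligned_blocks(offset: int, length: int, block_size: int) -> List[Tuple[int, int]]:
    """Return the [start, end) block-aligned ranges covering the requested bytes."""
    if length <= 0:
        return []
    first = (offset // block_size) * block_size
    last = ((offset + length - 1) // block_size) * block_size
    return [(start, start + block_size) for start in range(first, last + 1, block_size)]


def read_amplification(offset: int, length: int, block_size: int) -> float:
    """Bytes fetched divided by bytes requested, assuming whole blocks are fetched."""
    blocks = aligned_blocks(offset, length, block_size)
    fetched = sum(end - start for start, end in blocks)
    return fetched / length


print(aligned_blocks(0, 1, MIB))            # [(0, 1048576)]
print(read_amplification(0, 1, MIB))        # 1048576.0
print(read_amplification(0, 1, 64 * 1024))  # 65536.0
```

Under these assumptions, a 1-byte read with a 1 MiB block fetches a full block, and shrinking the block to 64 KiB cuts the fetched bytes by a factor of 16, which is why tuning the block size down helps workloads dominated by small reads.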
-- No need to load httpfs.
D LOAD cache_httpfs;
-- Create S3 secret to access objects.
D CREATE SECRET my_secret ( TYPE S3, KEY_ID '<key>', SECRET '<secret>', REGION 'us-east-1', ENDPOINT 's3express-use1-az6.us-east-1.amazonaws.com');
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true │
└─────────┘
-- Set cache type to in-memory.
D SET cache_httpfs_type='in_mem';
-- Access remote file.
D SELECT * FROM 's3://s3-bucket-user-2skzy8zuigonczyfiofztl0zbug--use1-az6--x-s3/t.parquet';
┌───────┬───────┐
│ i │ j │
│ int64 │ int64 │
├───────┼───────┤
│ 0 │ 1 │
│ 1 │ 2 │
│ 2 │ 3 │
│ 3 │ 4 │
│ 4 │ 5 │
├───────┴───────┤
│ 5 rows │
└───────────────┘

For more examples, check out example usage.
For guidance on diagnosing and optimizing query performance, including how to use profiling, interpret cache metrics, and troubleshoot common issues, see the Performance Troubleshooting Guide.
At the moment, macOS and Linux are supported; file a feature request if you would like to run the extension on other platforms.
| Feature | cache_httpfs | QuackStore | DuckDB External File Cache |
|---|---|---|---|
| Persistence | Supports both in-memory and on-disk, by default on-disk, also provides multi-disk support | On-disk | In-memory |
| Cache granularity | Tunable block size, 1MiB by default | 1 MB aligned blocks | Arbitrary byte ranges |
| Activation | Automatic | Explicit quackstore:// prefix | Automatic |
| Cache invalidation | No support, on the roadmap | Optional (mtime + size) | Always validates (mtime, ETag for HTTP) |
| Explicit cache eviction | Yes | Yes | Yes |
| Data integrity | No support, on the roadmap | Checksum validation + auto-recovery | ETag/version checking |
| Eviction policy | LRU for in-memory cache, LRU or deadline-based eviction for on-disk | LRU with configurable size limit | LRU (memory-based) |
| Observability | Provides cache access stats | No | No |
For development, the extension requires CMake and a C++14-compliant compiler. Run `make` in the root directory to compile the sources, or `make debug` to build a non-optimized debug version. You should also run `make unit`.
Please also refer to our Contribution Guide.