duck-read-cache-fs

A DuckDB extension that adds a cache for remote filesystem access.

Loading cache httpfs

Since DuckDB v1.0.0, cache httpfs can be loaded as a community extension without requiring the unsigned flag. From any DuckDB instance, the following two commands will allow you to install and load the extension:

INSTALL cache_httpfs from community;
-- Or upgrade to latest version with `FORCE INSTALL cache_httpfs from community;`
LOAD cache_httpfs;

See the cache httpfs community extension page for more information.

Introduction

This repository provides a read-only filesystem for remote access, which serves as a cache layer on top of DuckDB's httpfs extension.

Key features:

- Caching for data: remote file content is cached to improve IO performance and reduce egress cost. Several caching options and entities are supported:
  - in-memory: fetched file content is cached in blocks, and an LRU cache evicts stale blocks
  - on-disk (default): blocks that have already been read are stored on the local filesystem and, when disk space runs low, evicted based on their access timestamp
  - no cache: caching can be disabled to fall back to httpfs without any side effects
- Parallel read: read operations are split into size-tunable chunks to increase cache hit rate and improve performance
- Apart from data blocks, the extension also caches file handles, file metadata and glob operations
  - The caches for these entities are enabled by default.
- Profiling helps us understand the system better. Key metrics include cache access stats and IO operation latency. We plan to support multiple ways of accessing profile results; as of now there are three profiling options:
  - temp: all access stats are stored in memory and can be retrieved via SELECT cache_httpfs_get_profile();
  - duckdb (work in progress): stats are stored in DuckDB so its rich feature set can be used for analysis (e.g. using a histogram to understand latency distribution)
  - disabled (default): profiling is off unless explicitly enabled
- Cache status query functions provide visibility into cache state and access:
  - cache_httpfs_cache_status_query(): returns information about all cached entries, including cache filepath, remote filename, byte ranges (start_offset, end_offset), and cache type (in-memory or on-disk)
  - cache_httpfs_cache_access_info_query(): returns cache access statistics, including cache hit/miss counts, bytes read, and bytes cached for the different cache entities (data, metadata, file handles, glob)
- 100% compatibility with DuckDB httpfs
  - The extension is built on top of the httpfs extension and loads it automatically beforehand, so it is fully compatible with it. To fall back to httpfs and behave exactly like it, use SET cache_httpfs_type='noop'; SET enable_external_file_cache=true;
- Interaction with DuckDB's internal "external file cache": DuckDB enables its external file cache by default. To avoid double caching, the cache_httpfs extension disables it by default; it can be re-enabled with SET enable_external_file_cache=true;
- Able to wrap ALL DuckDB-compatible filesystems with one simple SQL statement, SELECT cache_httpfs_wrap_cache_filesystem(<your-fs>), and get all the benefits of caching, parallel read, IO performance stats, you name it.
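
To make the in-memory option concrete, here is a minimal Python sketch of block-level LRU caching. It is not the extension's implementation: the class name BlockLruCache, the fake_fetch helper, the bucket path and the two-block capacity are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class BlockLruCache:
    """Toy LRU cache keyed by (remote path, block index); illustrative only."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()  # oldest entry first, most recent last
        self.hits = 0
        self.misses = 0

    def get(self, path, block_index, fetch):
        key = (path, block_index)
        if key in self._blocks:
            # Cache hit: mark the block as most recently used.
            self._blocks.move_to_end(key)
            self.hits += 1
            return self._blocks[key]
        # Cache miss: fetch the block, store it, then evict if over capacity.
        self.misses += 1
        data = fetch(path, block_index)
        self._blocks[key] = data
        if len(self._blocks) > self.max_blocks:
            self._blocks.popitem(last=False)  # drop the least recently used block
        return data


def fake_fetch(path, block_index):
    # Hypothetical stand-in for a remote read; returns a small labelled payload.
    return f"{path}#{block_index}".encode()


cache = BlockLruCache(max_blocks=2)
cache.get("s3://bucket/t.parquet", 0, fake_fetch)  # miss
cache.get("s3://bucket/t.parquet", 1, fake_fetch)  # miss
cache.get("s3://bucket/t.parquet", 0, fake_fetch)  # hit, block 0 becomes most recent
cache.get("s3://bucket/t.parquet", 2, fake_fetch)  # miss, evicts block 1
print(cache.hits, cache.misses)
```

The hit and miss counters in the sketch are the same kind of numbers that the cache access statistics described above report.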

Caveat:

- The extension is implemented for object storage, which is expected to be a read-heavy workload with (mostly) immutable objects, so it only supports a read cache (at the moment). The cache is not cleared when the same object is written.
  - As a workaround for overwrites, users can call cache_httpfs_clear_cache to delete all cache content, or cache_httpfs_clear_cache_for_file to clear the cache for a specific object.
  - All cache types provide an eventual-consistency guarantee: cached entries are evicted after a tunable timeout.
- Filesystem requests are split into multiple sub-requests and aligned to the block size for parallel IO and cache efficiency, so small requests (e.g. reading 1 byte) can suffer read amplification. To reduce amplification, lower the block size via cache_httpfs_cache_block_size or fall back to native httpfs.
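
The read amplification caveat can be illustrated with a little arithmetic. The Python sketch below is only a model under simplifying assumptions: the function names are hypothetical, requests are assumed to be expanded to whole blocks, and truncation at the end of the file is ignored.

```python
def aligned_block_ranges(offset, length, block_size):
    """Return (start, end) byte ranges of the blocks covering [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return [(i * block_size, (i + 1) * block_size) for i in range(first, last + 1)]


def read_amplification(offset, length, block_size):
    """Ratio of bytes fetched (whole blocks) to bytes actually requested."""
    ranges = aligned_block_ranges(offset, length, block_size)
    fetched = sum(end - start for start, end in ranges)
    return fetched / length


MIB = 1024 * 1024

# A 1-byte read with a 1 MiB block size pulls in one whole block.
print(aligned_block_ranges(10, 1, MIB))
print(read_amplification(10, 1, MIB))

# The same read with a smaller (hypothetical) 4 KiB block size fetches far less.
print(read_amplification(10, 1, 4096))

# A 2-byte read that straddles a block boundary touches two blocks.
print(aligned_block_ranges(MIB - 1, 2, MIB))
```

Under this model, a smaller block size reduces amplification for tiny reads, at the cost of more sub-requests for large sequential reads.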

Example usage

-- No need to load httpfs.
D LOAD cache_httpfs;
-- Create S3 secret to access objects.
D CREATE SECRET my_secret (
      TYPE S3,
      KEY_ID '<key>',
      SECRET '<secret>',
      REGION 'us-east-1',
      ENDPOINT 's3express-use1-az6.us-east-1.amazonaws.com'
  );
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true    │
└─────────┘

-- Set cache type to in-memory.
D SET cache_httpfs_type='in_mem';

-- Access remote file.
D SELECT * FROM 's3://s3-bucket-user-2skzy8zuigonczyfiofztl0zbug--use1-az6--x-s3/t.parquet';
┌───────┬───────┐
│   i   │   j   │
│ int64 │ int64 │
├───────┼───────┤
│     0 │     1 │
│     1 │     2 │
│     2 │     3 │
│     3 │     4 │
│     4 │     5 │
├───────┴───────┤
│    5 rows     │
└───────────────┘

For more examples, check out the example usage document.

More About Benchmark

Performance Troubleshooting

For guidance on diagnosing and optimizing query performance, including how to use profiling, interpret cache metrics, and troubleshoot common issues, see the Performance Troubleshooting Guide.

Platform support

At the moment, macOS and Linux are supported. File a feature request if you would like to run the extension on other platforms.

Comparison with other available caching options

| Feature | cache_httpfs | QuackStore | DuckDB External File Cache |
| --- | --- | --- | --- |
| Persistence | In-memory and on-disk (on-disk by default), with multi-disk support | On-disk | In-memory |
| Cache granularity | Tunable block size, 1 MiB by default | 1 MB aligned blocks | Arbitrary byte ranges |
| Activation | Automatic | Explicit `quackstore://` prefix | Automatic |
| Cache invalidation | Not supported, on the roadmap | Optional (mtime + size) | Always validates (mtime, ETag for HTTP) |
| Explicit cache eviction | Yes | Yes | Yes |
| Data integrity | Not supported, on the roadmap | Checksum validation + auto-recovery | ETag/version checking |
| Eviction policy | LRU for in-memory cache; LRU or deadline-based eviction for on-disk | LRU with configurable size limit | LRU (memory-based) |
| Observability | Provides cache access stats | No | No |

Development

For development, the extension requires CMake and a C++14-compliant compiler. Run make in the root directory to compile the sources, or make debug to build a non-optimized debug version. Run make unit to execute the unit tests.

Please also refer to our Contribution Guide.
