Behavioral analytics functions for DuckDB, inspired by ClickHouse.
Provides `sessionize`, `retention`, `window_funnel`, `sequence_match`, `sequence_count`, `sequence_match_events`, and `sequence_next_node` as a loadable DuckDB extension written in Rust, with complete parity with ClickHouse's behavioral analytics functions.
Personal Project Disclaimer: This is a personal project developed on my own time. It is not affiliated with, endorsed by, or related to my employer or professional role in any way.
AI-Assisted Development: Built with Claude (Anthropic). Correctness is validated by automated testing — not assumed from AI output. See Quality.
- Quick Start
- Functions
- Examples
- Integrations
- Performance
- Community Extension
- Quality
- ClickHouse Parity Status
- Building
- Development
- Documentation
- Requirements
- License
```sql
-- Install from the DuckDB Community Extensions repository
INSTALL behavioral FROM community;
LOAD behavioral;
```

Or build from source:

```shell
cargo build --release
duckdb -unsigned -cmd "LOAD 'target/release/libbehavioral.so';"
```

Verify it works — run these one-liners after loading:
```sql
-- Session IDs (should return 1)
SELECT sessionize(TIMESTAMP '2024-01-01 10:00:00', INTERVAL '30 minutes') OVER () as session_id;

-- Retention (should return [true, false])
SELECT retention(true, false);

-- Funnel progress (should return 2)
SELECT window_funnel(INTERVAL '1 hour', TIMESTAMP '2024-01-01', true, true, false);
```

| Function | Signature | Returns | Description |
|---|---|---|---|
| `sessionize` | `(TIMESTAMP, INTERVAL)` | `BIGINT` | Window function assigning session IDs based on inactivity gaps |
| `retention` | `(BOOLEAN, BOOLEAN, ...)` | `BOOLEAN[]` | Cohort retention analysis |
| `window_funnel` | `(INTERVAL [, VARCHAR], TIMESTAMP, BOOLEAN, ...)` | `INTEGER` | Conversion funnel step tracking with 6 combinable modes |
| `sequence_match` | `(VARCHAR, TIMESTAMP, BOOLEAN, ...)` | `BOOLEAN` | NFA-based pattern matching over event sequences |
| `sequence_count` | `(VARCHAR, TIMESTAMP, BOOLEAN, ...)` | `BIGINT` | Count non-overlapping pattern matches |
| `sequence_match_events` | `(VARCHAR, TIMESTAMP, BOOLEAN, ...)` | `LIST(TIMESTAMP)` | Return matched condition timestamps |
| `sequence_next_node` | `(VARCHAR, VARCHAR, TIMESTAMP, VARCHAR, BOOLEAN, ...)` | `VARCHAR` | Next event value after pattern match |
All functions support 2 to 32 boolean conditions, matching ClickHouse's limit. Detailed documentation, examples, and edge case behavior for each function: Function Reference
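The `(?N)` placeholders and `.*` wildcard used by the sequence functions can be approximated with an ordinary regex over a string of condition digits. A minimal sketch of that idea (illustration only — the extension uses an NFA, and time constraints such as `(?t<=3600)` are not modeled here):

```python
import re

def sequence_match_model(pattern, event_conds):
    """Toy model of sequence_match: event_conds is a time-sorted list of
    condition tuples. Each event becomes the digit of the first condition
    it satisfies ('x' if none), '(?N)' placeholders become literal digits,
    and Python's re scans the resulting string."""
    text = ''.join(
        next((str(i + 1) for i, c in enumerate(conds) if c), 'x')
        for conds in event_conds
    )
    regex = re.sub(r'\(\?(\d+)\)', r'\1', pattern)
    return re.search(regex, text) is not None

# Condition 1 (e.g. page_view) eventually followed by condition 2 (purchase):
print(sequence_match_model('(?1).*(?2)',
                           [(True, False), (False, False), (False, True)]))  # → True
```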
| I want to... | Use |
|---|---|
| Break events into sessions by inactivity gap | sessionize |
| Check if users returned in later time periods | retention |
| Measure how far users get through ordered steps | window_funnel |
| Detect whether a pattern of events occurred | sequence_match |
| Count how many times a pattern occurred | sequence_count |
| Get timestamps of each matched pattern step | sequence_match_events |
| Find what happened immediately after/before a pattern | sequence_next_node |
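As a mental model for `window_funnel`: for each candidate starting event, walk forward within the window and advance a step counter whenever the next condition in order is satisfied. A simplified Python sketch of default-mode counting (not the extension's actual Rust implementation; `window_funnel_model` is a hypothetical name):

```python
from datetime import datetime, timedelta

def window_funnel_model(window, events):
    """events: time-sorted (timestamp, [cond1, cond2, ...]) tuples.
    Returns the deepest step reached within `window` of a step-1 event."""
    best = 0
    for i, (t0, conds0) in enumerate(events):
        if not conds0[0]:
            continue  # a chain must start with condition 1
        step, deadline = 0, t0 + window
        for t, conds in events[i:]:
            if t > deadline:
                break
            # one event may satisfy several consecutive steps
            while step < len(conds) and conds[step]:
                step += 1
        best = max(best, step)
    return best

t = datetime(2024, 1, 1)
# One event satisfying steps 1 and 2 but not 3:
print(window_funnel_model(timedelta(hours=1), [(t, [True, True, False])]))  # → 2
```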
Track how far users progress through a purchase flow within 1 hour:

```sql
SELECT user_id,
       window_funnel(INTERVAL '1 hour', event_time,
         event_type = 'page_view',
         event_type = 'add_to_cart',
         event_type = 'checkout',
         event_type = 'purchase'
       ) as furthest_step
FROM events
GROUP BY user_id;
```

Assign session IDs with a 30-minute inactivity gap, then compute metrics:
```sql
WITH sessionized AS (
    SELECT user_id, event_time,
           sessionize(event_time, INTERVAL '30 minutes') OVER (
               PARTITION BY user_id ORDER BY event_time
           ) as session_id
    FROM events
)
SELECT user_id, session_id,
       COUNT(*) as page_views,
       MIN(event_time) as session_start,
       MAX(event_time) as session_end
FROM sessionized
GROUP BY user_id, session_id;
```

Measure week-over-week retention for signup cohorts:
```sql
SELECT cohort_week,
       COUNT(*) as cohort_size,
       SUM(CASE WHEN r[1] THEN 1 ELSE 0 END) as week_0,
       SUM(CASE WHEN r[2] THEN 1 ELSE 0 END) as week_1,
       SUM(CASE WHEN r[3] THEN 1 ELSE 0 END) as week_2
FROM (
    SELECT user_id, cohort_week,
           retention(
               activity_date >= cohort_week AND activity_date < cohort_week + INTERVAL '7 days',
               activity_date >= cohort_week + INTERVAL '7 days' AND activity_date < cohort_week + INTERVAL '14 days',
               activity_date >= cohort_week + INTERVAL '14 days' AND activity_date < cohort_week + INTERVAL '21 days'
           ) as r
    FROM activity GROUP BY user_id, cohort_week
)
GROUP BY cohort_week ORDER BY cohort_week;
```

Find users who viewed then purchased within 1 hour:
```sql
SELECT user_id,
       sequence_match('(?1).*(?t<=3600)(?2)', event_time,
         event_type = 'page_view',
         event_type = 'purchase'
       ) as converted_within_hour
FROM events GROUP BY user_id;
```

Aggregate funnel results into a conversion report:
```sql
WITH funnels AS (
    SELECT user_id,
           window_funnel(INTERVAL '1 hour', event_time,
             event_type = 'page_view', event_type = 'add_to_cart',
             event_type = 'checkout', event_type = 'purchase'
           ) as step
    FROM events GROUP BY user_id
)
SELECT step as reached_step,
       COUNT(*) as users,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) as pct
FROM funnels GROUP BY step ORDER BY step;
```

Discover what page users visit after Home → Product:
```sql
SELECT
    sequence_next_node('forward', 'first_match', event_time, page,
      page = 'Home', page = 'Home', page = 'Product'
    ) as next_page,
    COUNT(*) as user_count
FROM events GROUP BY ALL ORDER BY user_count DESC;
```

Count how many times users repeat a view → cart cycle:
```sql
SELECT user_id,
       sequence_count('(?1).*(?2)', event_time,
         event_type = 'page_view',
         event_type = 'add_to_cart'
       ) as view_cart_cycles
FROM events GROUP BY user_id ORDER BY view_cart_cycles DESC;
```

Get the exact timestamps when each funnel step was satisfied:
```sql
SELECT user_id,
       sequence_match_events('(?1).*(?2).*(?3)', event_time,
         event_type = 'page_view',
         event_type = 'add_to_cart',
         event_type = 'purchase'
       ) as step_timestamps
FROM events GROUP BY user_id;
```

For 5 complete real-world examples with sample data, see Use Cases. For a comprehensive recipe collection, see the SQL Cookbook.
Python:

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL behavioral FROM community")
conn.execute("LOAD behavioral")

df = conn.execute("""
    SELECT user_id,
           window_funnel(INTERVAL '1 hour', event_time,
             event_type = 'view', event_type = 'cart', event_type = 'purchase'
           ) as steps
    FROM events GROUP BY user_id
""").fetchdf()
```

Node.js:

```javascript
const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');
db.run("INSTALL behavioral FROM community");
db.run("LOAD behavioral");
```

dbt:

```yaml
# profiles.yml
my_project:
  outputs:
    dev:
      type: duckdb
      extensions:
        - name: behavioral
          repo: community
```

```sql
-- Query any file format directly
SELECT user_id,
       window_funnel(INTERVAL '1 hour', event_time,
         event_type = 'view', event_type = 'purchase')
FROM read_parquet('events/*.parquet')
GROUP BY user_id;
```

All measurements below were taken with Criterion.rs 0.8.2, reported with 95% confidence intervals and validated across multiple runs on commodity hardware.
| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
| `sessionize` | 1 billion | 1.20 s | 830 Melem/s |
| `retention` (combine) | 100 million | 274 ms | 365 Melem/s |
| `window_funnel` | 100 million | 791 ms | 126 Melem/s |
| `sequence_match` | 100 million | 1.05 s | 95 Melem/s |
| `sequence_count` | 100 million | 1.18 s | 85 Melem/s |
| `sequence_match_events` | 100 million | 1.07 s | 93 Melem/s |
| `sequence_next_node` | 10 million | 546 ms | 18 Melem/s |
Key design choices:

- 16-byte `Copy` events with `u32` bitmask conditions — four events per cache line, zero heap allocation per event
- O(1) combine for `sessionize` and `retention` via boundary tracking and bitmask OR
- In-place combine for event-collecting functions — O(N) amortized instead of O(N^2) from repeated allocation
- NFA fast paths — common pattern shapes dispatch to specialized O(n) linear scans instead of full NFA backtracking
- Presorted detection — O(n) check skips O(n log n) sort when events arrive in timestamp order
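The event layout described above can be pictured with a hypothetical Rust struct (field names are illustrative, not the extension's actual code):

```rust
// Hypothetical sketch of a 16-byte Copy event with bitmask conditions.
#[derive(Clone, Copy)]
#[repr(C)]
struct Event {
    timestamp_micros: i64, // 8 bytes
    conditions: u32,       // bit i set => condition i+1 matched (32 max)
    _pad: u32,             // explicit padding up to 16 bytes
}

impl Event {
    fn matches(&self, condition: u8) -> bool {
        self.conditions & (1 << condition) != 0
    }
}

fn main() {
    // 16 bytes means four events per 64-byte cache line, and Copy means
    // no per-event heap allocation.
    assert_eq!(std::mem::size_of::<Event>(), 16);
    let e = Event { timestamp_micros: 0, conditions: 0b101, _pad: 0 };
    assert!(e.matches(0) && e.matches(2) && !e.matches(1));
    println!("ok");
}
```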
Optimization highlights:
| Optimization | Speedup | Technique |
|---|---|---|
| Event bitmask | 5–13x | `Vec<bool>` replaced with `u32` bitmask, enabling `Copy` semantics |
| In-place combine | up to 2,436x | O(N) amortized extend instead of O(N^2) merge-allocate |
| NFA lazy matching | 1,961x at 1M events | Swapped exploration order so `.*` tries advancing before consuming |
| `Arc<str>` values | 2.1–5.8x | Reference-counted strings for O(1) clone in `sequence_next_node` |
| NFA fast paths | 39–61% | Pattern classification dispatches common shapes to O(n) linear scans |
Five attempted optimizations were measured, found to be regressions, and reverted.
All negative results are documented in PERF.md.
Full methodology, per-session optimization history with confidence intervals, and
reproducible benchmark instructions: PERF.md.
This extension is listed in the DuckDB Community Extensions repository (PR #1306, merged 2026-02-15). Install with:

```sql
INSTALL behavioral FROM community;
LOAD behavioral;
```

No build tools, compilation, or `-unsigned` flag required.
The community-submission.yml
workflow automates the full pre-submission pipeline in 5 phases:
| Phase | Purpose |
|---|---|
| Validate | description.yml schema, version consistency, required files |
| Quality Gate | cargo test, clippy, fmt, doc |
| Build & Test | make configure && make release && make test_release |
| Pin Ref | Updates description.yml ref to the validated commit SHA |
| Submission Package | Uploads artifact, generates step-by-step PR commands |
Push changes to this repository, re-run the submission workflow to pin the new
ref, then open a new PR against duckdb/community-extensions updating the ref
field in extensions/behavioral/description.yml. When DuckDB releases a new
version, update libduckdb-sys, TARGET_DUCKDB_VERSION, and the
extension-ci-tools submodule.
| Metric | Value |
|---|---|
| Unit tests | 453 + 1 doc-test |
| E2E tests | 11 workflow steps (2 platforms) + 59 SQL queries (against real DuckDB CLI) |
| Property-based tests | 29 (proptest) |
| Mutation testing | 88.4% kill rate (130/147, cargo-mutants) |
| Clippy warnings | 0 (pedantic + nursery + cargo lint groups) |
| CI jobs | 13 (check, test, clippy, fmt, doc, MSRV, bench, deny, semver, coverage, cross-platform, extension-build) |
| Benchmark files | 7 (Criterion.rs, up to 1 billion elements) |
| Release platforms | 4 (Linux x86_64/ARM64, macOS x86_64/ARM64) |
CI runs on every push and PR: 6 workflows across .github/workflows/ including
E2E tests against real DuckDB, CodeQL static analysis, SemVer validation, and
4-platform release builds with provenance attestation.
COMPLETE — All ClickHouse behavioral analytics functions are implemented.
| Function | Status |
|---|---|
| `retention` | Complete |
| `window_funnel` (6 modes) | Complete |
| `sequence_match` | Complete |
| `sequence_count` | Complete |
| `sequence_match_events` | Complete |
| `sequence_next_node` | Complete |
| 32-condition support | Complete |
| `sessionize` | Extension-only (no ClickHouse equivalent) |
Prerequisites: Rust 1.84.1+ (MSRV), a C compiler (for DuckDB sys bindings)
```shell
# Build the extension (release mode)
cargo build --release

# The loadable extension will be at:
#   target/release/libbehavioral.so     (Linux)
#   target/release/libbehavioral.dylib  (macOS)
```

```shell
cargo test                  # Unit tests + doc-tests (453 + 1)
cargo clippy --all-targets  # Zero warnings required
cargo fmt -- --check        # Format check
cargo bench                 # Criterion.rs benchmarks
cargo doc --no-deps         # Build API documentation

# Run all quality checks at once
./scripts/check.sh

# Build extension via community Makefile
git submodule update --init
make configure && make release && make test_release
```

This project follows Semantic Versioning. See the versioning policy for the full SemVer rules applied to SQL function signatures.
- Getting Started — installation, loading, troubleshooting
- Function Reference — detailed docs for all 7 functions
- Use Cases — 5 complete real-world examples with sample data
- SQL Cookbook — practical recipes for common analytics patterns
- Quick Reference — one-page cheat sheet for all functions and patterns
- Engineering Overview — architecture, testing philosophy, design trade-offs
- Performance — benchmarks, optimization history, methodology
- ClickHouse Compatibility — syntax mapping, semantic parity
- Contributing — development setup, testing, PR process
- Rust 1.84.1+ (MSRV)
- DuckDB 1.5.1 (pinned dependency)
- Python 3.x (for extension metadata tooling)
MIT