GitHub - tomtom215/duckdb-behavioral: A DuckDB Community Extension to enable Behavioral Analytics, inspired by ClickHouse.

Behavioral analytics functions for DuckDB, inspired by ClickHouse.

Quick Start • Functions • Examples • Performance • Documentation

Provides sessionize, retention, window_funnel, sequence_match, sequence_count, sequence_match_events, and sequence_next_node as a loadable DuckDB extension written in Rust. Complete ClickHouse behavioral analytics parity.

Personal Project Disclaimer: This is a personal project developed on my own time. It is not affiliated with, endorsed by, or related to my employer or professional role in any way.

AI-Assisted Development: Built with Claude (Anthropic). Correctness is validated by automated testing — not assumed from AI output. See Quality.

Quick Start

-- Install from the DuckDB Community Extensions repository
INSTALL behavioral FROM community;
LOAD behavioral;

Or build from source:

cargo build --release
duckdb -unsigned -cmd "LOAD 'target/release/libbehavioral.so';"

Verify it works — run these one-liners after loading:

-- Session IDs (should return 1)
SELECT sessionize(TIMESTAMP '2024-01-01 10:00:00', INTERVAL '30 minutes') OVER () as session_id;

-- Retention (should return [true, false])
SELECT retention(true, false);

-- Funnel progress (should return 2)
SELECT window_funnel(INTERVAL '1 hour', TIMESTAMP '2024-01-01', true, true, false);

Functions

Function	Signature	Returns	Description
`sessionize`	`(TIMESTAMP, INTERVAL)`	`BIGINT`	Window function assigning session IDs based on inactivity gaps
`retention`	`(BOOLEAN, BOOLEAN, ...)`	`BOOLEAN[]`	Cohort retention analysis
`window_funnel`	`(INTERVAL [, VARCHAR], TIMESTAMP, BOOLEAN, ...)`	`INTEGER`	Conversion funnel step tracking with 6 combinable modes
`sequence_match`	`(VARCHAR, TIMESTAMP, BOOLEAN, ...)`	`BOOLEAN`	NFA-based pattern matching over event sequences
`sequence_count`	`(VARCHAR, TIMESTAMP, BOOLEAN, ...)`	`BIGINT`	Count non-overlapping pattern matches
`sequence_match_events`	`(VARCHAR, TIMESTAMP, BOOLEAN, ...)`	`LIST(TIMESTAMP)`	Return matched condition timestamps
`sequence_next_node`	`(VARCHAR, VARCHAR, TIMESTAMP, VARCHAR, BOOLEAN, ...)`	`VARCHAR`	Next event value after pattern match

All functions support 2 to 32 boolean conditions, matching ClickHouse's limit. Detailed documentation, examples, and edge case behavior for each function: Function Reference

Choosing the Right Function

I want to...	Use
Break events into sessions by inactivity gap	`sessionize`
Check if users returned in later time periods	`retention`
Measure how far users get through ordered steps	`window_funnel`
Detect whether a pattern of events occurred	`sequence_match`
Count how many times a pattern occurred	`sequence_count`
Get timestamps of each matched pattern step	`sequence_match_events`
Find what happened immediately after/before a pattern	`sequence_next_node`

Examples

Conversion Funnel

Track how far users progress through a purchase flow within 1 hour:

SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'checkout',
    event_type = 'purchase'
  ) as furthest_step
FROM events
GROUP BY user_id;

Session Analysis

Assign session IDs with a 30-minute inactivity gap, then compute metrics:

WITH sessionized AS (
    SELECT user_id, event_time,
      sessionize(event_time, INTERVAL '30 minutes') OVER (
        PARTITION BY user_id ORDER BY event_time
      ) as session_id
    FROM events
)
SELECT user_id, session_id,
  COUNT(*) as page_views,
  MIN(event_time) as session_start,
  MAX(event_time) as session_end
FROM sessionized
GROUP BY user_id, session_id;

Weekly Retention

Measure week-over-week retention for signup cohorts:

SELECT cohort_week,
  COUNT(*) as cohort_size,
  SUM(CASE WHEN r[1] THEN 1 ELSE 0 END) as week_0,
  SUM(CASE WHEN r[2] THEN 1 ELSE 0 END) as week_1,
  SUM(CASE WHEN r[3] THEN 1 ELSE 0 END) as week_2
FROM (
  SELECT user_id, cohort_week,
    retention(
      activity_date >= cohort_week AND activity_date < cohort_week + INTERVAL '7 days',
      activity_date >= cohort_week + INTERVAL '7 days' AND activity_date < cohort_week + INTERVAL '14 days',
      activity_date >= cohort_week + INTERVAL '14 days' AND activity_date < cohort_week + INTERVAL '21 days'
    ) as r
  FROM activity GROUP BY user_id, cohort_week
)
GROUP BY cohort_week ORDER BY cohort_week;

Pattern Detection with Time Constraints

Find users who viewed then purchased within 1 hour:

SELECT user_id,
  sequence_match('(?1).*(?t<=3600)(?2)', event_time,
    event_type = 'page_view',
    event_type = 'purchase'
  ) as converted_within_hour
FROM events GROUP BY user_id;

Funnel Drop-off Report

Aggregate funnel results into a conversion report:

WITH funnels AS (
  SELECT user_id,
    window_funnel(INTERVAL '1 hour', event_time,
      event_type = 'page_view', event_type = 'add_to_cart',
      event_type = 'checkout', event_type = 'purchase'
    ) as step
  FROM events GROUP BY user_id
)
SELECT step as reached_step,
  COUNT(*) as users,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) as pct
FROM funnels GROUP BY step ORDER BY step;

User Flow Analysis

Discover what page users visit after Home → Product:

SELECT
  sequence_next_node('forward', 'first_match', event_time, page,
    page = 'Home', page = 'Home', page = 'Product'
  ) as next_page,
  COUNT(*) as user_count
FROM events GROUP BY ALL ORDER BY user_count DESC;

Pattern Frequency

Count how many times users repeat a view → cart cycle:

SELECT user_id,
  sequence_count('(?1).*(?2)', event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart'
  ) as view_cart_cycles
FROM events GROUP BY user_id ORDER BY view_cart_cycles DESC;

Matched Event Timestamps

Get the exact timestamps when each funnel step was satisfied:

SELECT user_id,
  sequence_match_events('(?1).*(?2).*(?3)', event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'purchase'
  ) as step_timestamps
FROM events GROUP BY user_id;

For 5 complete real-world examples with sample data, see Use Cases. For a comprehensive recipe collection, see SQL Cookbook.

Integrations

Python

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL behavioral FROM community")
conn.execute("LOAD behavioral")

df = conn.execute("""
    SELECT user_id,
      window_funnel(INTERVAL '1 hour', event_time,
        event_type = 'view', event_type = 'cart', event_type = 'purchase'
      ) as steps
    FROM events GROUP BY user_id
""").fetchdf()

Node.js

const duckdb = require('duckdb');
const db = new duckdb.Database(':memory:');

db.run("INSTALL behavioral FROM community");
db.run("LOAD behavioral");

dbt

# profiles.yml
my_project:
  outputs:
    dev:
      type: duckdb
      extensions:
        - name: behavioral
          repo: community

Parquet / CSV / JSON

-- Query any file format directly
SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'view', event_type = 'purchase')
FROM read_parquet('events/*.parquet')
GROUP BY user_id;

Performance

All measurements below are Criterion.rs 0.8.2 with 95% confidence intervals, validated across multiple runs on commodity hardware.

Function	Scale	Wall Clock	Throughput
`sessionize`	1 billion	1.20 s	830 Melem/s
`retention` (combine)	100 million	274 ms	365 Melem/s
`window_funnel`	100 million	791 ms	126 Melem/s
`sequence_match`	100 million	1.05 s	95 Melem/s
`sequence_count`	100 million	1.18 s	85 Melem/s
`sequence_match_events`	100 million	1.07 s	93 Melem/s
`sequence_next_node`	10 million	546 ms	18 Melem/s

Key design choices:

16-byte Copy events with u32 bitmask conditions — four events per cache line, zero heap allocation per event
O(1) combine for sessionize and retention via boundary tracking and bitmask OR
In-place combine for event-collecting functions — O(N) amortized instead of O(N^2) from repeated allocation
NFA fast paths — common pattern shapes dispatch to specialized O(n) linear scans instead of full NFA backtracking
Presorted detection — O(n) check skips O(n log n) sort when events arrive in timestamp order

Optimization highlights:

Optimization	Speedup	Technique
Event bitmask	5–13x	`Vec<bool>` replaced with `u32` bitmask, enabling `Copy` semantics
In-place combine	up to 2,436x	O(N) amortized extend instead of O(N^2) merge-allocate
NFA lazy matching	1,961x at 1M events	Swapped exploration order so `.*` tries advancing before consuming
`Arc<str>` values	2.1–5.8x	Reference-counted strings for O(1) clone in `sequence_next_node`
NFA fast paths	39–61%	Pattern classification dispatches common shapes to O(n) linear scans

Five attempted optimizations were measured, found to be regressions, and reverted. All negative results are documented in PERF.md.

Full methodology, per-session optimization history with confidence intervals, and reproducible benchmark instructions: PERF.md.

Community Extension

This extension is listed in the DuckDB Community Extensions repository (PR #1306, merged 2026-02-15). Install with:

INSTALL behavioral FROM community;
LOAD behavioral;

No build tools, compilation, or -unsigned flag required.

Update Process

The community-submission.yml workflow automates the full pre-submission pipeline in 5 phases:

Phase	Purpose
Validate	`description.yml` schema, version consistency, required files
Quality Gate	`cargo test`, `clippy`, `fmt`, `doc`
Build & Test	`make configure && make release && make test_release`
Pin Ref	Updates `description.yml` ref to the validated commit SHA
Submission Package	Uploads artifact, generates step-by-step PR commands

Updating the Published Extension

Push changes to this repository, re-run the submission workflow to pin the new ref, then open a new PR against duckdb/community-extensions updating the ref field in extensions/behavioral/description.yml. When DuckDB releases a new version, update libduckdb-sys, TARGET_DUCKDB_VERSION, and the extension-ci-tools submodule.

Quality

Metric	Value
Unit tests	453 + 1 doc-test
E2E tests	11 workflow steps (2 platforms) + 59 SQL queries (against real DuckDB CLI)
Property-based tests	29 (proptest)
Mutation testing	88.4% kill rate (130/147, cargo-mutants)
Clippy warnings	0 (pedantic + nursery + cargo lint groups)
CI jobs	13 (check, test, clippy, fmt, doc, MSRV, bench, deny, semver, coverage, cross-platform, extension-build)
Benchmark files	7 (Criterion.rs, up to 1 billion elements)
Release platforms	4 (Linux x86_64/ARM64, macOS x86_64/ARM64)

CI runs on every push and PR: 6 workflows across .github/workflows/ including E2E tests against real DuckDB, CodeQL static analysis, SemVer validation, and 4-platform release builds with provenance attestation.

ClickHouse Parity Status

COMPLETE — All ClickHouse behavioral analytics functions are implemented.

Function	Status
`retention`	Complete
`window_funnel` (6 modes)	Complete
`sequence_match`	Complete
`sequence_count`	Complete
`sequence_match_events`	Complete
`sequence_next_node`	Complete
32-condition support	Complete
`sessionize`	Extension-only (no ClickHouse equivalent)

Building

Prerequisites: Rust 1.84.1+ (MSRV), a C compiler (for DuckDB sys bindings)

# Build the extension (release mode)
cargo build --release

# The loadable extension will be at:
# target/release/libbehavioral.so   (Linux)
# target/release/libbehavioral.dylib (macOS)

Development

cargo test                  # Unit tests + doc-tests (453 + 1)
cargo clippy --all-targets  # Zero warnings required
cargo fmt -- --check        # Format check
cargo bench                 # Criterion.rs benchmarks
cargo doc --no-deps         # Build API documentation

# Run all quality checks at once
./scripts/check.sh

# Build extension via community Makefile
git submodule update --init
make configure && make release && make test_release

This project follows Semantic Versioning. See the versioning policy for the full SemVer rules applied to SQL function signatures.

Documentation

Getting Started — installation, loading, troubleshooting
Function Reference — detailed docs for all 7 functions
Use Cases — 5 complete real-world examples with sample data
SQL Cookbook — practical recipes for common analytics patterns
Quick Reference — one-page cheat sheet for all functions and patterns
Engineering Overview — architecture, testing philosophy, design trade-offs
Performance — benchmarks, optimization history, methodology
ClickHouse Compatibility — syntax mapping, semantic parity
Contributing — development setup, testing, PR process

Requirements

Rust 1.84.1+ (MSRV)
DuckDB 1.5.1 (pinned dependency)
Python 3.x (for extension metadata tooling)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 129 Commits
.config		.config
.github		.github
assets		assets
benches		benches
demo		demo
docs		docs
examples		examples
extension-ci-tools @ b35c1e0		extension-ci-tools @ b35c1e0
scripts		scripts
src		src
test/sql		test/sql
.gitignore		.gitignore
.gitmodules		.gitmodules
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LESSONS.md		LESSONS.md
LICENSE		LICENSE
Makefile		Makefile
PERF.md		PERF.md
README.md		README.md
SECURITY.md		SECURITY.md
deny.toml		deny.toml
description.yml		description.yml
rust-toolchain.toml		rust-toolchain.toml

Folders and files

Latest commit

History

Repository files navigation

Table of Contents

Quick Start

Functions

Choosing the Right Function

Examples

Conversion Funnel

Session Analysis

Weekly Retention

Pattern Detection with Time Constraints

Funnel Drop-off Report

User Flow Analysis

Pattern Frequency

Matched Event Timestamps

Integrations

Python

Node.js

dbt

Parquet / CSV / JSON

Performance

Community Extension

Update Process

Updating the Published Extension

Quality

ClickHouse Parity Status

Building

Development

Documentation

Requirements

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages