This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Coz is a causal profiler for native code (C/C++/Rust) that uses performance experiments to predict optimization impact. Unlike traditional profilers, it measures "bang for buck" - showing how optimizing a line affects overall throughput or latency.
Traditional profilers measure where programs spend time (observational), but not whether optimizing that code will improve overall performance (causal). Key limitations of traditional profiling:
- Serial programs: High CPU-consuming code may not be on the critical path
- Parallel programs: Optimizing one thread may just cause it to wait longer at synchronization points
- Misleading hotspots: Code that consumes CPU time may not drive end-to-end performance
Causal profiling answers: "If I optimize this code, will my program actually run faster?"
The approach uses controlled performance experiments with virtual speedups, establishing causality rather than just correlation between code performance and program throughput.
The project uses CMake for building both the profiler library and benchmarks.
```bash
# Build the profiler
cmake .
make
sudo make install
sudo ldconfig

# Build with benchmarks (requires debug info)
cmake -DBUILD_BENCHMARKS=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo .
make
```

Important: Benchmarks must be built with debug information (Debug or RelWithDebInfo). The build will fail if you try to build benchmarks without debug info.
DWARF reminder: the vendored libelfin + `memory_map` logic now handle DWARF 2-5 line tables. DWARF 5 uses an explicit file table (no implicit "file 0 = CU source"), so `line_table::entry::file_index` starts at 0 in that mode. If you touch either libelfin or `libcoz/inspect.cpp`, keep their assumptions in sync and always test against a DWARF-5 build (GCC's default on modern Linux).
Benchmarks are in benchmarks/ and each has its own CMakeLists.txt:
```bash
cd benchmarks
cmake .
make

# Run specific benchmark with coz
coz run --- ./toy/toy
```
- libcoz (`libcoz/`): The profiler library loaded as a MODULE (not linked directly)
  - `profiler.cpp/h`: Main profiler singleton, experiment orchestration
  - `thread_state.h`: Per-thread state for delay tracking
  - `progress_point.h`: Throughput and latency point tracking
  - Platform-specific sampling:
    - `perf.cpp/h`: Linux implementation using the `perf_event_open` syscall
    - `perf_macos.cpp/h`: macOS implementation using the private kperf framework
  - `inspect.cpp/h`: DWARF debug info parsing using libelfin (supports DWARF 2-5). Source filtering obeys `COZ_SOURCE_SCOPE` (defaults to `%`) and the optional `COZ_FILTER_SYSTEM=1` env var to drop `/usr/include`, `/usr/lib`, etc. Keep those flags in mind before hard-coding additional filters.
  - `real.cpp/h`: Wrapper functions to capture real pthread/libc functions
  - `libcoz.cpp`: Library initialization and exported symbols
- coz script (`coz`): Python 3 command-line wrapper
  - Handles `coz run` and `coz plot` commands
  - Sets up LD_PRELOAD to inject libcoz.so
  - Manages profiler environment variables
  - `coz plot` serves a local HTTP server with API endpoints:
    - `/llm-config`: Returns env var availability for LLM providers
    - `/anthropic-models`, `/openai-models`, `/bedrock-models`, `/ollama-models`: Dynamic model listing
    - `/optimize`: Streaming AI optimization suggestions (POST, ndjson)
    - `/source-snippet`: Source code context for viewer panels
  - Bedrock endpoints use the `converse_stream` API (model-agnostic) with inference profile fallback
- `include/coz.h`: Instrumentation macros for target programs
  - `COZ_PROGRESS`/`COZ_PROGRESS_NAMED`: Throughput progress points
  - `COZ_BEGIN(name)`/`COZ_END(name)`: Latency measurement points
  - Uses weak dlsym to locate `_coz_get_counter` at runtime
- viewer (`viewer/`): Web-based profile visualization UI
  - TypeScript/JavaScript single-page application for viewing `.coz` profile files
  - `ts/profile.ts`: Profile parsing, D3.js plot rendering (loess smoothing, interactive tooltips), AI optimization panel, LLM provider management, dynamic model fetching with localStorage caching, cookie-based settings persistence
  - `ts/ui.ts`: UI logic including file loading, drag-and-drop, theme toggle, keyboard shortcuts, resizable sidebar, model refresh handler
  - `index.htm`: Main HTML with sidebar controls, welcome page, help modals, AI provider/model selectors with refresh button
  - `css/ui.css`, `css/plot.css`: Modern dark/light theme styling, code block copy buttons, syntax highlighting
  - Uses Bootstrap 3, D3.js v3, jQuery, and science.js for statistics
Coz uses virtual speedups to measure optimization potential causally rather than observationally. Instead of actually optimizing code, it simulates the effect by slowing down everything else proportionally.
The key insight: slowing down all other code has the same relative effect as speeding up the selected line. When a sample falls within a selected line of code, other threads pause proportionally.
Delay-to-speedup relationship: Δ = d/P
- `d` = delay duration
- `P` = sampling period (1ms)
- Example: Inserting a delay that is 25% of the sampling period virtually speeds up the line by 25%
Effective runtime: t̄ₑ = t̄ · (1 - d/P)
- This accounts for the sampling-based approach where not every execution is instrumented
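A few lines of arithmetic make the two formulas concrete (illustrative numbers, not profiler code):

```python
# Check the delay-to-speedup and effective-runtime formulas (illustrative).
P = 1.0            # sampling period in ms (SamplePeriod = 1000000 ns)
d = 0.25 * P       # delay inserted for each sample in the selected line

speedup = d / P                 # virtual speedup fraction: 25%
t_bar = 2.0                     # average runtime of the line per execution (ms)
t_eff = t_bar * (1 - d / P)     # effective runtime under the virtual speedup

assert speedup == 0.25
assert t_eff == 1.5
```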
For each experiment, Coz:
- Randomly selects a virtual speedup from 0-100% in 5% increments (`SpeedupDivisions = 20`)
- Baseline weighting: 0% speedup is over-weighted (`ZeroSpeedupWeight = 7`, selected in roughly a quarter of experiments); the remaining speedup values share the rest
- Measures impact on progress point visit rates
Measurement formula: 1 - (ps/p0)
- `p0` = period between progress point visits with no speedup (baseline)
- `ps` = period measured with virtual speedup s
- Result shows the percent improvement in throughput
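A worked instance of the formula, with made-up progress-point periods:

```python
# Worked example of 1 - (ps/p0) with illustrative progress-point periods.
p0 = 10.0   # ms between progress-point visits at 0% speedup (baseline)
ps = 8.5    # ms between visits with virtual speedup s applied

improvement = 1 - (ps / p0)
assert abs(improvement - 0.15) < 1e-12   # 15% predicted throughput gain
```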
Instead of instrumenting every line execution (prohibitively expensive):
- Samples the program counter every 1ms (`SamplePeriod = 1000000`)
- Counts samples falling in the selected line
- Other threads delay proportionally to the sample count
- Processes samples in batches of 10 (`SampleBatchSize = 10`) every 10ms for efficiency
Number of samples: s ≈ n · t̄/P
- `n` = execution count
- `t̄` = average runtime per execution
- Approximates how often the line would be sampled
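A quick numeric instance of the estimate (values invented for illustration):

```python
# s ≈ n * t_bar / P: expected number of samples landing in a line.
P = 1.0          # sampling period (ms)
n = 10_000       # how many times the line executes
t_bar = 0.003    # average runtime per execution (ms)

s = n * t_bar / P
assert round(s) == 30   # the line is expected to be sampled ~30 times
```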
Rather than expensive POSIX signals, Coz uses atomic counters:
- Global counter: total pauses required across all threads (`_global_delay`)
- Local counters: each thread's completed pause count (`thread_state->local_delay`)
- Threads pause when `local_delay < _global_delay`
- To signal pauses: increment both the global counter and your own local counter
Coz intercepts blocking operations (pthread_mutex_lock, pthread_cond_wait, blocking I/O) to ensure correctness:
Rule: If thread A resumes thread B from blocking, thread B should be credited for delays inserted in thread A.
Implementation (see profiler.h):
- `pre_block()`: Records the current global delay count
- `post_block(skip_delays)`: If the thread was woken by another thread, skips delays inserted during the blocked period
- `catch_up()`: Forces threads to execute all required delays before potentially unblocking other threads
This prevents virtual speedup measurements from being distorted by blocking behavior.
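A sketch of that bookkeeping (illustrative; the real `pre_block`/`post_block` in profiler.h operate on shared atomic counters):

```python
# Illustrative delay bookkeeping around a blocking call (not the real
# profiler.h code, which works on atomic counters shared across threads).
class DelayBookkeeping:
    def __init__(self):
        self.global_delay = 0  # delays demanded of all threads so far
        self.local_delay = 0   # delays this thread has executed
        self.saved_delay = 0

    def pre_block(self):
        # Remember the delay count at the moment we start blocking.
        self.saved_delay = self.global_delay

    def post_block(self, skip_delays):
        if skip_delays:
            # Woken by another thread: credit the delays inserted while
            # blocked instead of forcing this thread to execute them.
            self.local_delay += self.global_delay - self.saved_delay

state = DelayBookkeeping()
state.pre_block()
state.global_delay += 3                # delays inserted while we were blocked
state.post_block(skip_delays=True)
assert state.local_delay == 3          # blocked thread owes nothing extra
```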
Throughput profiling: Measures rate of visits to progress points
- Use `COZ_PROGRESS` or `COZ_PROGRESS_NAMED("name")`
- Coz tracks visit frequency changes across experiments
Latency profiling: Measures time between paired progress points using Little's Law
- `W = L/λ` where:
  - `W` = average latency
  - `L` = average number of requests in progress
  - `λ` = arrival rate (throughput)
- Use `COZ_BEGIN("name")` and `COZ_END("name")`
- Allows latency measurement without tracking individual transactions
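Plugging illustrative numbers into Little's Law:

```python
# Little's Law with made-up numbers: latency from throughput counters alone.
L = 12.0      # average transactions in flight (between COZ_BEGIN and COZ_END)
lam = 400.0   # throughput: transactions completed per second

W = L / lam   # average latency in seconds
assert abs(W - 0.03) < 1e-12   # 30 ms average latency
```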
For programs with distinct execution phases, Coz applies a correction factor:
ΔP = Δpₐ · (t_obs/s_obs) · (s/T)
This prevents overstating speedup potential for code that only executes during certain program phases.
Key constants in profiler.h:
- `SamplePeriod = 1000000` (1ms between samples)
- `SampleBatchSize = 10` (process every 10ms)
- `SpeedupDivisions = 20` (5% speedup increments)
- `ZeroSpeedupWeight = 7` (~25% of experiments at 0% baseline)
- `ExperimentMinTime = SamplePeriod * SampleBatchSize * 50` (minimum experiment duration)
- `ExperimentCoolOffTime = SamplePeriod * SampleBatchSize` (cooldown between experiments)
- `ExperimentTargetDelta = 5` (minimum progress point visits per experiment)
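The derived constants work out as follows (plain arithmetic on the values above):

```python
# Derived experiment timings from the profiler.h constants (nanoseconds).
SamplePeriod = 1_000_000      # 1 ms
SampleBatchSize = 10

ExperimentMinTime = SamplePeriod * SampleBatchSize * 50
ExperimentCoolOffTime = SamplePeriod * SampleBatchSize

assert ExperimentMinTime == 500_000_000     # 500 ms minimum per experiment
assert ExperimentCoolOffTime == 10_000_000  # 10 ms cooldown between experiments
```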
- Profiler wraps `pthread_create` to inject delay tracking into new threads
- Each thread has a `thread_state` with local delay counters
- A global atomic delay counter (`_global_delay`) coordinates experiments
- Pre/post-block hooks skip delays during blocking operations
Programs must be compiled with debug info (-g) and linked with -ldl. Coz understands modern DWARF line tables (up through DWARF 5), so you can rely on your toolchain's default DWARF version.
```bash
# Build target with debug info
g++ -g -o myapp myapp.cpp -ldl

# Run with coz
coz run --- ./myapp

# View results (automatically opens profile in browser)
coz plot

# Or load a specific profile
coz plot -i /path/to/profile.coz
```

If you only want to collect lines from your own sources (and not the C++ standard library), pass one or more `--source-scope` globs or set `COZ_SOURCE_SCOPE`. Coz also honors `COZ_FILTER_SYSTEM=1` as a quick toggle to drop system headers after the DWARF pass. For example:
```bash
# Limit to project files
coz run --source-scope '/media/psf/Home/git/coz-portage/benchmarks/**' --- ./benchmarks/toy/toy

# Or just drop system headers entirely
COZ_FILTER_SYSTEM=1 coz run --- ./benchmarks/toy/toy
```

Verbose output:
Use `--verbose` (or `-v`) to see what libraries and source files coz is processing:

```bash
coz run --verbose --- ./myapp
```

This prints:
- Bootstrap messages
- MAIN executable path resolution
- Source files found in DWARF debug info
- Libraries being profiled
Useful for debugging when coz isn't finding expected source lines or to verify which files are in scope.
Include `include/coz.h` and add macros:

```cpp
#include "coz.h"

// Throughput: mark end of work unit
void process_request() {
  // ... do work ...
  COZ_PROGRESS; // Or COZ_PROGRESS_NAMED("requests")
}

// Latency: mark transaction boundaries
void handle_transaction() {
  COZ_BEGIN("transaction");
  // ... do work ...
  COZ_END("transaction");
}
```

Coz has Rust bindings in `rust/`:
- Cargo crate published as `coz`
- Use `coz::progress!()`, `coz::scope!()`, `coz::begin!()`, `coz::end!()`
- Must compile with debug info: `debug = 1` in Cargo.toml `[profile.release]`
- Run with: `coz run --- ./target/release/binary`
Projects can import coz targets:

```cmake
find_package(coz-profiler)
# Provides coz::coz (library+includes) and coz::profiler (binary)
target_link_libraries(myapp PRIVATE coz::coz)
```

Run `coz plot` to view profiles. This launches a local web server and automatically opens your `profile.coz` in the browser. Use `coz plot -i /path/to/file.coz` to load a specific profile.
The viewer source is in viewer/ (TypeScript/JavaScript single-page application).
Text-mode output:
For terminal-based analysis (useful for CI/CD pipelines, remote sessions, or code agents), use the --text flag:
```bash
# Print a summary of profiling results to stdout
coz plot --text

# Print detailed results including all data points
coz plot --text --verbose

# Analyze a specific profile file
coz plot --text -i /path/to/profile.coz
```

Text mode outputs:
- Source file and line number for each profiled location
- Linear regression slope (predicted speedup per % optimization)
- Number of data points collected
- With `--verbose`: individual speedup percentages and their measured effects
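The reported slope is a least-squares fit of measured program speedup against line speedup. A hedged sketch of that computation (the actual coz implementation may differ in details):

```python
# Least-squares slope over (line speedup, program speedup) points,
# mirroring what the text-mode slope column conveys (illustrative).
def slope(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# A line where each 10% optimization buys ~8% program speedup has slope 0.8.
pts = [(0.00, 0.00), (0.25, 0.20), (0.50, 0.40), (0.75, 0.60)]
assert abs(slope(pts) - 0.8) < 1e-9
```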
Viewer features:
- Automatic profile loading from current directory
- Drag-and-drop support for loading additional profiles
- Interactive D3.js plots with loess-smoothed trend lines
- Sort by impact, alphabetical, max/min speedup
- Minimum points filter slider (auto-adjusts if needed)
- Dark/light theme toggle (time-of-day default)
- Keyboard shortcuts: `Ctrl+O` to open file, `?` for help
- Resizable sidebar via drag handle (persisted in localStorage)
- Source code snippets with syntax highlighting (click `</>` icon)
- AI optimization suggestions (click magic wand icon)
- Streams responses with progress bar, copyable code blocks
- Supports Anthropic, OpenAI, Amazon Bedrock, and Ollama
- Dynamic model fetching with localStorage cache and manual refresh
- Bedrock uses inference profile IDs via the `list_inference_profiles` API
- Settings (provider, keys, model selections, region) persisted in cookies
- AWS credentials sourced from server env vars (not persisted client-side)
Viewer development:
```bash
cd viewer
npm install               # Install dependencies
npx tsc -p tsconfig.json  # Rebuild after editing ts/*.ts
```

The compiled JavaScript files are in `js/` and committed to the repo.
- Linux 2.6.32+ with `perf_event_open` support
- Set perf paranoia: `echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid`
- Build dependencies: build-essential, cmake, pkg-config (libelfin fetched automatically)
- macOS 10.10+ (kperf framework availability)
- Requires elevated privileges or SIP adjustments for kperf access
- Build dependencies: cmake, pkg-config (libelfin fetched automatically)
- Important: Uses private kperf API which may change without notice
- Cannot be used in App Store applications due to private API usage
- libelfin: DWARF/ELF parsing (fetched automatically during build via CMake FetchContent)
- Build requires: CMake, C++11 compiler, Python 3, pkg-config
- On Debian/Ubuntu: `sudo apt-get install build-essential cmake pkg-config`
- Benchmark dependencies: libbz2-dev, libsqlite3-dev
Profiler writes to profile.coz (configurable). Format includes:
- Experiment records: speedup%, source location, progress delta, duration
- View with the `coz plot` command
Coz profiles show the predicted program speedup (y-axis) from optimizing each line by the percentage on the x-axis. Lines are sorted by linear regression slope:
Steep upward slopes (positive correlation):
- Strong optimization candidates
- Optimizing this code directly improves overall performance
- Example: A line with slope +0.8 means optimizing it by 10% improves program throughput by ~8%
Flat or near-zero slopes:
- Optimization won't improve overall performance
- Code may be fast already or not on critical path
- Investment in optimizing these lines has minimal return
Downward slopes (negative correlation):
- Counter-intuitive result indicating contention
- Optimizing might harm performance (e.g., lock contention increases)
- May indicate the real bottleneck is elsewhere
100% virtual speedup: Represents completely removing the line's runtime (theoretical upper bound).
From the SOSP 2015 paper evaluation:
- Mean overhead: 17.6% across benchmarks
- Delay insertion contributes: 10.2 percentage points
- Sampling and bookkeeping: ~7 percentage points
- Overhead is primarily from the virtual speedup mechanism itself, not instrumentation
The authors demonstrated significant speedups on real applications:
- Memcached: 9% improvement
- SQLite: 25% improvement
- PARSEC benchmarks: Up to 68% acceleration
Example optimization (SQLite): Identified three functions with high indirect call overhead. Converting function pointers to direct calls yielded measurable gains despite minimal computational work in the functions themselves.
Linux:
- Requires perf_event_open support (kernel 2.6.32+)
- Needs appropriate perf_event_paranoid settings
macOS:
- Uses private kperf API (may break in future OS versions)
- Requires elevated privileges or SIP configuration
- Cannot be distributed via Mac App Store
- Sampling implementation differs from Linux (less detailed)
- Requires DWARF debug information (supports DWARF 2-5)
- No support for interpreted languages (Python, Ruby, JavaScript)
- JIT languages need debug info support (not currently implemented)
- Programs must have meaningful progress points for accurate profiling
GitHub releases include pre-built packages for Linux:
- `.tar.gz` - Generic Linux tarball with install script
- `.deb` - Debian/Ubuntu packages (amd64, arm64)
- `.rpm` - Fedora/RHEL/CentOS packages (x86_64, aarch64)
Original Paper: Charlie Curtsinger and Emery D. Berger. 2015. "Coz: Finding Code that Counts with Causal Profiling." In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM. DOI: 10.1145/2815400.2815409
- Paper: https://arxiv.org/abs/1608.03676
- Received Best Paper Award at SOSP 2015
- Project homepage: https://github.com/plasma-umass/coz