dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000/25012 standards -- all with bounded memory usage that lets you profile datasets far larger than your available RAM.
- Rust core -- fast columnar and streaming engines
- ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
- Multi-format -- CSV (auto-delimiter detection), JSON, JSONL, Parquet, databases, DataFrames, Arrow
- True streaming -- bounded-memory profiling with online algorithms (Incremental engine)
- Three interfaces -- CLI binary, Rust library, Python package
- Async-ready -- `async/await` API for embedding in web services and stream pipelines
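The "true streaming" bullet rests on online algorithms: statistics that are updated one value at a time in constant memory, rather than computed over a fully materialized column. As an illustration of the technique (a sketch, not dataprof's actual Rust implementation), Welford's algorithm maintains count, mean, and variance for a stream in O(1) space:

```python
class OnlineStats:
    """Single-pass mean/variance via Welford's algorithm: O(1) memory,
    numerically stable, so a column larger than RAM can be profiled."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count else 0.0


stats = OnlineStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.count, stats.mean, stats.variance)  # count=8, mean≈5.0, variance≈4.0
```

The same pattern extends to min/max, null counts, and approximate quantiles, which is what makes an incremental engine's memory footprint independent of row count.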
```bash
cargo install dataprof

dataprof analyze data.csv --detailed
dataprof schema data.csv
dataprof count data.parquet
```

```rust
use dataprof::Profiler;

let report = Profiler::new().analyze_file("data.csv")?;
println!("Rows: {}", report.execution.rows_processed);
println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));
for col in &report.column_profiles {
    println!("  {} ({:?}): {} nulls", col.name, col.data_type, col.null_count);
}
```

```python
import dataprof

report = dataprof.profile("data.csv")
print(f"{report.rows} rows, {report.columns} columns")
print(f"Quality score: {report.quality_score}")
for col in report.column_profiles:
    print(f"  {col.name} ({col.data_type}): {col.null_percentage:.1f}% null")
```

Install the CLI:

```bash
cargo install dataprof                      # default (CLI only)
cargo install dataprof --features full-cli  # CLI + all formats + databases
```

Use as a Rust library:

```toml
[dependencies]
dataprof = "0.6"  # core library (no CLI deps)
# or, with the async engine:
dataprof = { version = "0.6", features = ["async-streaming"] }
```

Install the Python package:

```bash
uv pip install dataprof
# or
pip install dataprof
```

| Feature | Description |
|---|---|
| `cli` (default) | CLI binary with clap, colored output, progress bars |
| `minimal` | CSV-only, no CLI -- fastest compile |
| `async-streaming` | Async profiling engine with tokio |
| `parquet-async` | Profile Parquet files over HTTP |
| `database` | Database profiling (connection handling, retry, SSL) |
| `postgres` | PostgreSQL connector (includes `database`) |
| `mysql` | MySQL/MariaDB connector (includes `database`) |
| `sqlite` | SQLite connector (includes `database`) |
| `all-db` | All three database connectors |
| `datafusion` | DataFusion SQL engine integration |
| `python` | Python bindings via PyO3 |
| `python-async` | Async Python API (includes `python` + `async-streaming`) |
| `full-cli` | CLI + Parquet + all databases |
| `production` | PostgreSQL + MySQL (common deployment) |
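Since `cli` is a default feature, a build like `minimal` presumably requires disabling defaults in the usual Cargo way (a sketch based on standard Cargo feature semantics, not verified against the crate's manifest):

```toml
[dependencies]
# CSV-only core with no CLI dependencies: drop the default `cli`
# feature, then opt back in to `minimal`
dataprof = { version = "0.6", default-features = false, features = ["minimal"] }
```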
| Format | Engine | Notes |
|---|---|---|
| CSV | Incremental, Columnar | Auto-detects `,` `;` `\|` `\t` delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any AsyncRead source (HTTP, WebSocket, etc.) |
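CSV delimiter auto-detection can be sketched with Python's stdlib `csv.Sniffer` (an illustration of the general approach; dataprof does its own detection in Rust):

```python
import csv


def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from a text sample, restricted to the
    same candidate set listed in the table: , ; | and tab."""
    return csv.Sniffer().sniff(sample, delimiters=",;|\t").delimiter


print(detect_delimiter("a;b;c\n1;2;3\n"))      # ;
print(detect_delimiter("a\tb\tc\n1\t2\t3\n"))  # tab
```

A sniffer like this typically only needs the first few KB of the file, so detection stays cheap even on multi-GB inputs.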
dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:
| Dimension | What it measures |
|---|---|
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |
An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
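A toy version of that scoring, using only two of the five dimensions and made-up weights (the real per-dimension formulas and weights are dataprof internals):

```python
def completeness(rows: list[dict]) -> float:
    """Percentage of non-null cells (one facet of the Completeness dimension)."""
    cells = [v for row in rows for v in row.values()]
    return 100.0 * sum(v is not None for v in cells) / len(cells)


def uniqueness(rows: list[dict]) -> float:
    """Percentage of distinct rows (one facet of the Uniqueness dimension)."""
    distinct = {tuple(sorted(r.items())) for r in rows}
    return 100.0 * len(distinct) / len(rows)


def overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores, on a 0-100 scale."""
    return sum(scores[d] * w for d, w in weights.items()) / sum(weights.values())


rows = [
    {"id": 1, "city": "Rome"},
    {"id": 2, "city": None},
    {"id": 1, "city": "Rome"},  # duplicate row
    {"id": 3, "city": "Milan"},
]
scores = {"completeness": completeness(rows), "uniqueness": uniqueness(rows)}
print(scores)  # completeness: 87.5 (1 null of 8 cells), uniqueness: 75.0 (3 of 4 rows)
print(overall(scores, {"completeness": 0.6, "uniqueness": 0.4}))  # 82.5
```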
- CLI Usage Guide -- every subcommand and flag
- Python API Guide -- `profile()`, report types, async, databases
- Getting Started -- tutorial from zero to profiling
- Examples Cookbook -- copy-pasteable recipes (CLI, Python, Rust)
- Database Connectors -- PostgreSQL, MySQL, SQLite setup
- Contributing
- Changelog
dataprof is the subject of a peer-reviewed paper submitted to IEEE ScalCom 2026:
A. Bozzo, "A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling," IEEE ScalCom 2026 (under review). [Repository & reproducible benchmarks]
The paper benchmarks dataprof against YData Profiling, Polars, and pandas across execution efficiency, memory scalability, energy consumption, and zero-copy interoperability in constrained Edge AI environments.
```bibtex
@inproceedings{bozzo2026compiled,
  author={Bozzo, Andrea},
  title={A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling},
  booktitle={2026 IEEE International Conference on Scalable Computing and Communications (ScalCom)},
  year={2026},
  note={Under review}
}
```

Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.
