dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000/25012 standards -- all with bounded memory usage that lets you profile datasets far larger than your available RAM.
- Rust core -- fast columnar and streaming engines
- ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
- Multi-format -- CSV (auto-delimiter detection), JSON, JSONL, Parquet, databases, DataFrames, Arrow
- True streaming -- bounded-memory profiling with online algorithms (Incremental engine)
- Three interfaces -- CLI binary, Rust library, Python package
- Async-ready -- `async/await` API for embedding in web services and stream pipelines
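The "true streaming" bullet rests on online algorithms: statistics that are updated one value at a time in constant memory, rather than computed over a fully materialized column. As an illustration of the technique (a sketch, not dataprof's actual Rust implementation), Welford's algorithm maintains count, mean, and variance for a stream in O(1) space:

```python
class OnlineStats:
    """Single-pass mean/variance via Welford's algorithm: O(1) memory,
    numerically stable, so a column larger than RAM can be profiled."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count else 0.0


stats = OnlineStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.count, stats.mean, stats.variance)  # count=8, mean≈5.0, variance≈4.0
```

The same pattern extends to min/max, null counts, and approximate quantiles, which is what makes an incremental engine's memory footprint independent of row count.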
```bash
cargo install dataprof

dataprof analyze data.csv --detailed
dataprof schema data.csv
dataprof count data.parquet
```

```rust
use dataprof::Profiler;

let report = Profiler::new().analyze_file("data.csv")?;
println!("Rows: {}", report.execution.rows_processed);
println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));
for col in &report.column_profiles {
    println!("  {} ({:?}): {} nulls", col.name, col.data_type, col.null_count);
}
```

```python
import dataprof

report = dataprof.profile("data.csv")
print(f"{report.rows} rows, {report.columns} columns")
print(f"Quality score: {report.quality_score}")
for col in report.column_profiles:
    print(f"  {col.name} ({col.data_type}): {col.null_percentage:.1f}% null")
```

Install the CLI:

```bash
cargo install dataprof                      # default (CLI only)
cargo install dataprof --features full-cli  # CLI + all formats + databases
```

Use as a Rust library:

```toml
[dependencies]
dataprof = "0.6"  # core library (no CLI deps)
# or, with the async engine:
dataprof = { version = "0.6", features = ["async-streaming"] }
```

Install the Python package:

```bash
uv pip install dataprof
# or
pip install dataprof
```

| Feature | Description |
|---|---|
| `cli` (default) | CLI binary with clap, colored output, progress bars |
| `minimal` | CSV-only, no CLI -- fastest compile |
| `async-streaming` | Async profiling engine with tokio |
| `parquet-async` | Profile Parquet files over HTTP |
| `database` | Database profiling (connection handling, retry, SSL) |
| `postgres` | PostgreSQL connector (includes `database`) |
| `mysql` | MySQL/MariaDB connector (includes `database`) |
| `sqlite` | SQLite connector (includes `database`) |
| `all-db` | All three database connectors |
| `datafusion` | DataFusion SQL engine integration |
| `python` | Python bindings via PyO3 |
| `python-async` | Async Python API (includes `python` + `async-streaming`) |
| `full-cli` | CLI + Parquet + all databases |
| `production` | PostgreSQL + MySQL (common deployment) |
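Since `cli` is a default feature, a build like `minimal` presumably requires disabling defaults in the usual Cargo way (a sketch based on standard Cargo feature semantics, not verified against the crate's manifest):

```toml
[dependencies]
# CSV-only core with no CLI dependencies: drop the default `cli`
# feature, then opt back in to `minimal`
dataprof = { version = "0.6", default-features = false, features = ["minimal"] }
```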
| Format | Engine | Notes |
|---|---|---|
| CSV | Incremental, Columnar | Auto-detects `,` `;` `\|` `\t` delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any AsyncRead source (HTTP, WebSocket, etc.) |
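CSV delimiter auto-detection can be sketched with Python's stdlib `csv.Sniffer` (an illustration of the general approach; dataprof does its own detection in Rust):

```python
import csv


def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from a text sample, restricted to the
    same candidate set listed in the table: , ; | and tab."""
    return csv.Sniffer().sniff(sample, delimiters=",;|\t").delimiter


print(detect_delimiter("a;b;c\n1;2;3\n"))      # ;
print(detect_delimiter("a\tb\tc\n1\t2\t3\n"))  # tab
```

A sniffer like this typically only needs the first few KB of the file, so detection stays cheap even on multi-GB inputs.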
dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:
| Dimension | What it measures |
|---|---|
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |
An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
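A toy version of that scoring, using only two of the five dimensions and made-up weights (the real per-dimension formulas and weights are dataprof internals):

```python
def completeness(rows: list[dict]) -> float:
    """Percentage of non-null cells (one facet of the Completeness dimension)."""
    cells = [v for row in rows for v in row.values()]
    return 100.0 * sum(v is not None for v in cells) / len(cells)


def uniqueness(rows: list[dict]) -> float:
    """Percentage of distinct rows (one facet of the Uniqueness dimension)."""
    distinct = {tuple(sorted(r.items())) for r in rows}
    return 100.0 * len(distinct) / len(rows)


def overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores, on a 0-100 scale."""
    return sum(scores[d] * w for d, w in weights.items()) / sum(weights.values())


rows = [
    {"id": 1, "city": "Rome"},
    {"id": 2, "city": None},
    {"id": 1, "city": "Rome"},  # duplicate row
    {"id": 3, "city": "Milan"},
]
scores = {"completeness": completeness(rows), "uniqueness": uniqueness(rows)}
print(scores)  # completeness: 87.5 (1 null of 8 cells), uniqueness: 75.0 (3 of 4 rows)
print(overall(scores, {"completeness": 0.6, "uniqueness": 0.4}))  # 82.5
```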
- CLI Usage Guide -- every subcommand and flag
- Python API Guide -- `profile()`, report types, async, databases
- Getting Started -- tutorial from zero to profiling
- Examples Cookbook -- copy-pasteable recipes (CLI, Python, Rust)
- Database Connectors -- PostgreSQL, MySQL, SQLite setup
- Contributing
- Changelog
dataprof is the subject of a peer-reviewed paper submitted to IEEE ScalCom 2026:
A. Bozzo, "A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling," IEEE ScalCom 2026 (under review). [Repository & reproducible benchmarks]
The paper benchmarks dataprof against YData Profiling, Polars, and pandas across execution efficiency, memory scalability, energy consumption, and zero-copy interoperability in constrained Edge AI environments.
```bibtex
@inproceedings{bozzo2026compiled,
  author={Bozzo, Andrea},
  title={A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling},
  booktitle={2026 IEEE International Conference on Scalable Computing and Communications (ScalCom)},
  year={2026},
  note={Under review}
}
```

Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.
