@suxiaogang223
Contributor

ORC CLI tools

All binaries are gated behind the cli feature and can be installed with:

cargo install orc-rust --features cli

orc-read

  • Stream ORC data to stdout as CSV (default) or JSON lines.
  • Supports - for stdin, --num-records to cap output, --batch-size to tune read throughput, and --json to switch formats.
  • Example:
    orc-read --json --num-records 5 tests/integration/data/TestOrcFile.test1.orc
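The interaction between --num-records and --batch-size can be pictured with a small sketch. This is not the orc-read implementation; `rows_to_emit` and `plan_output` are hypothetical helpers that only illustrate the idea that a record cap truncates the last batch rather than requiring a particular batch size.

```rust
/// Number of rows to print from a batch of `batch_len` rows when at most
/// `remaining` more rows may be emitted (`None` means no cap).
/// Hypothetical helper, not part of orc-rust.
fn rows_to_emit(batch_len: usize, remaining: Option<usize>) -> usize {
    match remaining {
        Some(limit) => batch_len.min(limit),
        None => batch_len,
    }
}

/// Walks the batch sizes in order and returns how many rows each batch
/// contributes to the output, stopping as soon as the cap is exhausted.
fn plan_output(batch_lens: &[usize], num_records: Option<usize>) -> Vec<usize> {
    let mut remaining = num_records;
    let mut plan = Vec::new();
    for &len in batch_lens {
        if remaining == Some(0) {
            break; // cap reached: later batches are never read
        }
        let take = rows_to_emit(len, remaining);
        plan.push(take);
        if let Some(left) = remaining.as_mut() {
            *left -= take;
        }
    }
    plan
}
```

With a cap of 5 and batches of 3 rows each, `plan_output(&[3, 3, 3], Some(5))` takes 3 rows from the first batch, 2 from the second, and never touches the third.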

orc-schema

  • Print file-level metadata (format version, compression, row index stride, rows, stripes).
  • Shows the logical schema; --verbose adds stripe offsets and row counts.
  • Example:
    orc-schema --verbose tests/integration/data/TestOrcFile.test1.orc

orc-rowcount

  • Report the total row count for one or more ORC files.
  • Example:
    orc-rowcount tests/integration/data/TestOrcFile.test1.orc

orc-index

  • Inspect row index (row group) statistics for a top-level column.
  • Outputs per-stripe row group ranges and available min/max/null metadata.
  • Example:
    orc-index tests/integration/data/TestOrcFile.testPredicatePushdown.orc int1
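For readers unfamiliar with row groups: within a stripe, the row index splits rows into consecutive groups of at most "row index stride" rows. The sketch below shows how such ranges could be derived from a stripe's row count and the stride. `row_group_ranges` is a hypothetical helper, not the code in orc-index, and treating a stride of 0 as "no row index" is an assumption of this sketch.

```rust
/// Returns half-open `(start, end)` row ranges, relative to the start of the
/// stripe, for each row group in a stripe of `stripe_rows` rows.
/// Hypothetical helper for illustration; a stride of 0 yields no ranges.
fn row_group_ranges(stripe_rows: u64, stride: u64) -> Vec<(u64, u64)> {
    if stride == 0 || stripe_rows == 0 {
        return Vec::new();
    }
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < stripe_rows {
        // The last group may be shorter than the stride.
        let end = (start + stride).min(stripe_rows);
        ranges.push((start, end));
        start = end;
    }
    ranges
}
```

For a stripe of 25,000 rows and a stride of 10,000, this yields three row groups, the last covering rows 20,000 to 25,000.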

orc-layout

  • Emit a JSON document describing each stripe: offsets, section sizes, streams (kind/column/offset/length), and encodings.
  • Example:
    orc-layout tests/integration/data/TestOrcFile.test1.orc | jq .
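A note on where stream offsets come from: an ORC stripe footer records each stream's kind, column, and length, and streams are stored back to back, so a stream's offset is the stripe offset plus the lengths of all preceding streams. The sketch below illustrates that accumulation. `StreamSpan` and `stream_spans` are hypothetical names, not types from orc-rust or orc-layout, and the stream kind is omitted for brevity.

```rust
/// Location of one stream inside the file (illustrative type only).
#[derive(Debug, PartialEq)]
struct StreamSpan {
    column: u32,
    offset: u64,
    length: u64,
}

/// Derives absolute offsets for the streams of one stripe.
/// `streams` holds `(column, length)` pairs in stripe-footer order.
fn stream_spans(stripe_offset: u64, streams: &[(u32, u64)]) -> Vec<StreamSpan> {
    let mut offset = stripe_offset;
    streams
        .iter()
        .map(|&(column, length)| {
            let span = StreamSpan { column, offset, length };
            offset += length; // the next stream starts where this one ends
            span
        })
        .collect()
}
```

For a stripe starting at byte 3 with streams of lengths 10, 20, and 5, the derived offsets are 3, 13, and 33.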


Copilot AI left a comment


Pull request overview

This PR adds five new CLI tools for inspecting ORC files: orc-read (stream data as CSV or JSON lines), orc-schema (display metadata and schema), orc-rowcount (report row counts), orc-index (inspect row group statistics), and orc-layout (emit the physical layout as JSON). To support these tools, the proto module is made public to expose the protobuf types, and serde and serde_json are added as dependencies of the cli feature.

Key Changes

  • Added five new CLI binaries with corresponding Cargo.toml bin entries
  • Made proto module public to enable CLI tools to access low-level protobuf structures
  • Added serde and serde_json as optional dependencies under the cli feature
  • Created integration tests in tests/bin/main.rs to verify basic CLI functionality

Reviewed changes

Copilot reviewed 2 out of 8 changed files in this pull request and generated 2 comments.

File | Description
Cargo.toml | Adds serde/serde_json to cli feature dependencies; registers 5 new binaries
src/lib.rs | Changes proto module from private to public
src/bin/orc-read.rs | New CLI tool to stream ORC data as CSV or JSON lines with stdin support
src/bin/orc-schema.rs | New CLI tool to print file metadata and schema with optional verbose mode
src/bin/orc-rowcount.rs | New CLI tool to report total row counts for one or more files
src/bin/orc-index.rs | New CLI tool to inspect row group statistics for a specific column
src/bin/orc-layout.rs | New CLI tool to emit JSON description of stripe physical layout
tests/bin/main.rs | Smoke tests for all new CLI binaries, gated behind cli feature


@WenyXu
Collaborator

WenyXu commented Dec 14, 2025

Nit: Should we include an AI-generated README.md to demonstrate how to use the CLI?

@suxiaogang223
Contributor Author

> Nit: Should we include an AI-generated README.md to demonstrate how to use the CLI?

Good idea.

@suxiaogang223
Contributor Author

Consider adding show_bloom_filter after #72 is merged.
