A modern data lakehouse stack for local development and testing, providing a complete analytics environment with:
- Trino 476: Distributed SQL query engine
- Hive Metastore 3.1.3: Metadata management for tables and schemas
- MinIO: S3-compatible object storage
- MySQL 8.0.34: Backend database for Hive Metastore
Download the following JARs:

- `aws-java-sdk-bundle-1.12.367.jar`
- `hadoop-aws-3.3.4.jar`
- `mysql-connector-java-8.0.23.jar`

Place them in: `data/jars/`
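The downloads can be scripted. A minimal sketch that builds the standard Maven Central URLs for these coordinates; the group IDs (`com.amazonaws`, `org.apache.hadoop`, `mysql`) are the usual ones for these artifacts, but verify them before relying on this:

```python
from pathlib import Path
from urllib.request import urlretrieve

# Maven Central coordinates for the required JARs.
# Group IDs are the conventional ones for these artifacts (assumption: verify).
JARS = [
    ("com.amazonaws", "aws-java-sdk-bundle", "1.12.367"),
    ("org.apache.hadoop", "hadoop-aws", "3.3.4"),
    ("mysql", "mysql-connector-java", "8.0.23"),
]

def maven_url(group: str, artifact: str, version: str) -> str:
    """Build the Maven Central download URL for a coordinate."""
    return (
        "https://repo1.maven.org/maven2/"
        f"{group.replace('.', '/')}/{artifact}/{version}/"
        f"{artifact}-{version}.jar"
    )

if __name__ == "__main__":
    dest = Path("data/jars")
    dest.mkdir(parents=True, exist_ok=True)
    for group, artifact, version in JARS:
        url = maven_url(group, artifact, version)
        print(f"Fetching {url}")
        urlretrieve(url, dest / f"{artifact}-{version}.jar")
```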
The HOT Stack architecture shows the data flow between components:
- Trino serves as the query engine, connecting to both Hive Metastore for metadata and MinIO for data storage
- Hive Metastore manages table schemas and locations, using MySQL as its backend database
- MinIO provides S3-compatible object storage for actual data files
- OPA (Open Policy Agent) provides authorization policies through mounted volumes
- Apache Spark can also connect to MinIO for data processing workloads
- Role: Distributed SQL query engine for analytics
- Port: 8080 (HTTP/Web UI)
- Configuration:
  `/conf/trino/`
- Catalogs:
  - `hive`: Main catalog for data lake tables (backed by Hive Metastore + MinIO)
  - `memory`: In-memory tables for testing
  - `tpch`: Built-in TPC-H benchmark datasets
  - `system`: System catalog for monitoring
Key Features:
- Native S3 filesystem support for MinIO
- Parquet format with Snappy compression
- Support for both managed and external tables
- Partitioned table support
- ANSI SQL compliant
- Role: Centralized metadata repository for table schemas, partitions, and locations
- Port: 9083 (Thrift protocol)
- Configuration:
  `/conf/hive-site.xml`
- Storage: MySQL database backend
- Warehouse Directory: `s3a://com.dldgv2/delta/` (for managed tables)
Key Features:
- S3A filesystem support for MinIO
- Schema versioning and validation
- Supports Hive 3.1.0 metadata schema
- Role: S3-compatible object storage for data lake files
- Ports:
- 9000: S3 API endpoint
- 9001: Web Console UI
- Credentials:
  - Access Key: `minio`
  - Secret Key: `minio_admin`
- Data Location:
./data/minio/
Key Features:
- Full S3 API compatibility
- Web-based console for bucket management
- Path-style access enabled
- No SSL (development mode)
- Role: Backend database for Hive Metastore metadata
- Port: 3306
- Database: `metastore`
- Credentials:
  - User: `dataeng` / Password: `dataengineering_user`
  - Root password: `dataengineering`
- Data Location:
./data/mysqldir/
Start the stack and check service status:

```bash
podman-compose up -d
podman-compose ps
```

All services should show as "healthy" or "running".

Open the Trino CLI:

```bash
podman exec -it trino trino
```

- Trino Web UI: http://localhost:8080
- MinIO Console: http://localhost:9001 (login: minio/minio_admin)
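Beyond the CLI, the stack can be queried programmatically. A sketch using the `trino` Python client (`pip install trino`, an extra dependency, not part of this stack); the user name is arbitrary since this setup runs without authentication:

```python
# Connection parameters matching this stack's defaults (localhost:8080, no auth).
def trino_conn_params(host: str = "localhost", port: int = 8080) -> dict:
    # "admin" is an arbitrary user name; the stack has no authentication.
    return {"host": host, "port": port, "user": "admin", "catalog": "hive"}

if __name__ == "__main__":
    # Requires: pip install trino  (and the stack running)
    from trino.dbapi import connect

    conn = connect(**trino_conn_params())
    cur = conn.cursor()
    cur.execute("SHOW CATALOGS")
    print(cur.fetchall())
```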
External tables allow you to specify exactly where data is stored. When dropped, only metadata is removed.
```sql
-- Create schema with specific location
CREATE SCHEMA IF NOT EXISTS hive.base
WITH (location = 's3a://com.dldgv2/base/');

-- Create external table
CREATE TABLE hive.base.users (
    user_id BIGINT,
    username VARCHAR,
    email VARCHAR,
    created_at TIMESTAMP
)
WITH (
    format = 'PARQUET',
    external_location = 's3a://com.dldgv2/base/users/'
);
```

Use Cases:
- Data shared across multiple systems
- Data preservation required after table drop
- Custom data organization
Managed tables are stored in the default warehouse directory. When dropped, both metadata AND data are deleted.
```sql
-- Create schema (uses default warehouse location)
CREATE SCHEMA IF NOT EXISTS hive.managed_db;

-- Create managed table (no external_location)
CREATE TABLE hive.managed_db.customers (
    customer_id BIGINT,
    first_name VARCHAR,
    last_name VARCHAR,
    email VARCHAR
)
WITH (
    format = 'PARQUET'
);

-- Auto-stored at: s3a://com.dldgv2/delta/managed_db.db/customers/
```

Use Cases:
- Standard analytics tables
- Development and testing
- Full lifecycle management by Trino
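The auto-stored location noted above follows the Hive warehouse convention of `<warehouse>/<schema>.db/<table>/`. A small helper that reproduces it, assuming the warehouse directory configured for this stack:

```python
# Warehouse directory configured for this stack's Hive Metastore.
WAREHOUSE = "s3a://com.dldgv2/delta/"

def managed_table_location(schema: str, table: str, warehouse: str = WAREHOUSE) -> str:
    """Hive convention: managed tables live under <warehouse>/<schema>.db/<table>/."""
    return f"{warehouse.rstrip('/')}/{schema}.db/{table}/"

# e.g. managed_table_location("managed_db", "customers")
# -> "s3a://com.dldgv2/delta/managed_db.db/customers/"
```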
Improve query performance by partitioning data:
```sql
CREATE TABLE hive.base.sales (
    sale_id BIGINT,
    product_name VARCHAR,
    amount DECIMAL(10, 2),
    sale_date DATE
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['sale_date'],
    external_location = 's3a://com.dldgv2/base/sales/'
);
```

See create_table_example.sql and managed_tables_example.sql for more examples.
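On disk, the Hive connector lays partitions out as `key=value` subdirectories under the table location. A sketch of that path scheme (the example date is hypothetical):

```python
def partition_location(table_location: str, **partitions: str) -> str:
    """Hive-style partition layout: <table>/<col1>=<val1>/<col2>=<val2>/."""
    parts = "/".join(f"{col}={val}" for col, val in partitions.items())
    return f"{table_location.rstrip('/')}/{parts}/"

# e.g. one sale_date partition of the sales table above:
# partition_location("s3a://com.dldgv2/base/sales/", sale_date="2024-03-01")
# -> "s3a://com.dldgv2/base/sales/sale_date=2024-03-01/"
```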
Trino uses native S3 filesystem with credentials provided via environment variables:
```bash
AWS_ACCESS_KEY_ID=minio
AWS_SECRET_ACCESS_KEY=minio_admin
AWS_REGION=us-east-1
```
Endpoint and path-style access configured in conf/trino/catalog/hive.properties.
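For orientation, a hedged sketch of what such a catalog file typically contains; the property names follow recent Trino releases' native S3 file system, and the `metastore` hostname is an assumption based on the compose service name, so check the actual file in this repo:

```properties
# conf/trino/catalog/hive.properties (illustrative sketch, not the repo's exact file)
connector.name=hive
hive.metastore.uri=thrift://metastore:9083

# Native S3 file system pointed at MinIO
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=us-east-1
s3.path-style-access=true
```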
- Default: Parquet with Snappy compression
- Supported: Parquet, ORC, Avro, JSON, CSV
- Recommended: Parquet for analytics workloads
- Trino JVM Heap: 4GB
- Query Memory: 4GB max (2GB per node)
- Scheduling: the coordinator is included in query scheduling
- Workers: single-node setup (the coordinator also acts as a worker)
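These limits typically map to settings like the following (an illustrative sketch of `jvm.config` and `config.properties` values, not a verbatim copy of this repo's files):

```properties
# jvm.config
-Xmx4G

# config.properties
query.max-memory=4GB
query.max-memory-per-node=2GB
node-scheduler.include-coordinator=true
```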
```sql
SHOW CATALOGS;
SHOW SCHEMAS IN hive;
SHOW TABLES IN hive.base;
DESCRIBE hive.base.users;
SHOW CREATE TABLE hive.base.users;
SELECT * FROM hive.base.users LIMIT 10;
INSERT INTO hive.base.users VALUES
  (1, 'alice', 'alice@example.com', CURRENT_TIMESTAMP);
```

| Type | Location | Controlled By |
|---|---|---|
| Managed Tables | `s3a://com.dldgv2/delta/<schema>.db/<table>/` | Hive Metastore |
| External Tables | User-specified (e.g., `s3a://com.dldgv2/base/<table>/`) | User |
| MinIO Data | `./data/minio/com.dldgv2/` | Local filesystem |
| MySQL Data | `./data/mysqldir/` | Local filesystem |
```bash
podman logs trino
podman logs metastore
podman exec -it trino curl http://minio:9000/minio/health/live
```

```sql
-- In Trino CLI
SHOW CATALOGS;
SELECT * FROM system.metadata.catalogs;
```

1. "Invalid location URI: s3a://"
- Ensure Trino has restarted after configuration changes
- Verify S3 credentials in docker-compose environment variables
2. Metastore connection failed
   - Check if MySQL is healthy: `podman ps`
   - Verify the metastore service is running
   - Check that `IS_RESUME="true"` is set in docker-compose (after first boot)
3. Configuration property errors
   - Review `podman logs trino` for specific property names
   - Trino 476 may not support older Hive properties
   - Remove or update deprecated properties
- On first boot, Hive Metastore will initialize the MySQL schema
- After successful initialization, ensure `IS_RESUME="true"` is set in docker-compose.yaml
- Create the MinIO bucket `com.dldgv2` via the console or CLI
- Grant the metastore user privileges in MySQL:

  ```sql
  mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'dataeng'@'%' WITH GRANT OPTION;
  ```

All services are connected via the dldg Docker network for internal communication.
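Bucket creation can also be scripted. A sketch using `boto3` (an extra dependency, `pip install boto3`) against the MinIO endpoint and credentials listed earlier:

```python
# Client settings matching this stack's MinIO defaults.
def minio_client_kwargs(
    endpoint: str = "http://localhost:9000",
    access_key: str = "minio",
    secret_key: str = "minio_admin",
) -> dict:
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }

if __name__ == "__main__":
    import boto3  # requires: pip install boto3  (and MinIO running)

    s3 = boto3.client("s3", **minio_client_kwargs())
    s3.create_bucket(Bucket="com.dldgv2")
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```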
```
.
├── docker-compose.yaml          # Service definitions
├── conf/
│   ├── hive-site.xml            # Hive Metastore configuration
│   └── trino/                   # Trino configuration
│       ├── config.properties    # Main Trino config
│       ├── jvm.config           # JVM settings
│       ├── node.properties      # Node configuration
│       └── catalog/             # Catalog configurations
│           ├── hive.properties  # Hive catalog
│           ├── memory.properties # Memory catalog
│           └── tpch.properties  # TPC-H catalog
├── data/
│   ├── minio/                   # MinIO object storage
│   ├── mysqldir/                # MySQL data
│   └── jars/                    # Additional JAR files
├── create_table_example.sql     # External table examples
└── managed_tables_example.sql   # Managed table examples
```
- Use Partitioning: For large tables, partition by date or frequently filtered columns
- Choose Parquet: Best compression and query performance for analytics
- Bucketing: For large tables with frequent joins on specific columns
- Statistics: `hive.collect-column-statistics-on-write=true` (already enabled)
- File Size: Aim for 128 MB-1 GB files (configured max: 1 GB)
- Original source: https://github.com/delta-incubator/delta-lake-definitive-guide/tree/main/ch04
- Trino documentation: https://trino.io/docs/current/
- Hive Metastore: https://hive.apache.org/
This setup is for local development and testing purposes.
