This project demonstrates a production-grade, end-to-end data engineering platform for a banking domain. It implements real-time CDC ingestion, batch ELT, SCD Type-2 modeling, and analytics-ready marts using modern data engineering tools.
Pipeline Flow:
- Data Generator → Simulates banking transactions, accounts & customers (via Faker).
- Kafka + Debezium → Streams change data (CDC) into MinIO (S3-compatible storage).
- Airflow → Orchestrates data ingestion & snapshots into Snowflake.
- Snowflake → Cloud Data Warehouse (Bronze → Silver → Gold).
- DBT → Applies transformations, builds marts & snapshots (SCD Type-2).
- CI/CD with GitHub Actions → Automated tests, build & deployment.
Tech Stack:
- Snowflake → Cloud Data Warehouse
- DBT → Transformations, testing, snapshots (SCD Type-2)
- Apache Airflow → Orchestration & DAG scheduling
- Apache Kafka + Debezium → Real-time streaming & CDC
- MinIO → S3-compatible object storage
- Postgres → Source OLTP system
- Python (Faker) → Data simulation
- Docker & docker-compose → Containerized setup
- Git & GitHub Actions → CI/CD workflows
Key Features:
- PostgreSQL OLTP: Source relational database with ACID guarantees (customers, accounts, transactions)
- Simulated banking system: customers, accounts, and transactions
- Change Data Capture (CDC) via Kafka + Debezium (capturing Postgres WAL)
- Raw → Staging → Fact/Dimension models in DBT
- Snapshots for history tracking (slowly changing dimensions)
- Automated pipeline orchestration using Airflow
- CI/CD pipeline with dbt tests + GitHub Actions
Project Structure:

finops-data-platform/
├── .github/workflows/ # CI/CD pipelines (ci.yml, cd.yml)
├── banking_dbt/ # DBT project
│ ├── models/
│ │ ├── staging/ # Staging models
│ │ ├── marts/ # Facts & dimensions
│ │ └── sources.yml
│ ├── snapshots/ # SCD2 snapshots
│ └── dbt_project.yml
├── consumer/ # Kafka → MinIO consumer
│ └── kafka_to_minio.py
├── data-generator/ # Faker-based data simulator
│ └── faker_generator.py
├── docker/ # Airflow DAGs, plugins, etc.
│ ├── dags/ # DAGs (minio_to_snowflake, scd_snapshots)
├── kafka-debezium/ # Kafka connectors & CDC logic
│ └── generate_and_post_connector.py
├── postgres/ # Postgres schema (OLTP DDL & seeds)
│ └── schema.sql
├── .gitignore
├── docker-compose.yml # Containerized infra
├── dockerfile-airflow.dockerfile
├── requirements.txt
└── README.md
Before starting, make sure you have the following:
- Docker & Docker Compose
- Python 3.10+
- Git
- dbt
- Snowflake account (trial is sufficient)
git clone https://github.com/<your-username>/finops-data-platform.git
cd finops-data-platform
Create a .env file in the project root with the following variables:

# ---------- Postgres (Banking OLTP) ----------
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DB=banking
POSTGRES_USER=banking_user
POSTGRES_PASSWORD=banking_pass
# ---------- Kafka ----------
KAFKA_BOOTSTRAP=kafka:9092
KAFKA_GROUP=banking-consumer-group
# ---------- MinIO ----------
MINIO_ENDPOINT=http://minio:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
MINIO_BUCKET=banking-data
MINIO_LOCAL_DIR=/tmp/minio_downloads
# ---------- Snowflake ----------
SNOWFLAKE_ACCOUNT=xxxxxx.ap-south-1
SNOWFLAKE_USER=xxxx
SNOWFLAKE_PASSWORD=xxxx
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
SNOWFLAKE_DB=BANKING
SNOWFLAKE_SCHEMA=RAW
# ---------- Airflow DB ----------
AIRFLOW_DB_USER=airflow
AIRFLOW_DB_PASSWORD=airflow
AIRFLOW_DB_NAME=airflow
# ---------- MinIO Root ----------
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
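The generator, connector, and consumer scripts read these values from the environment. Below is a small, hypothetical sketch of loading the file and preparing the MinIO landing bucket, assuming python-dotenv and boto3 are installed; the repo's scripts may wire this up differently.

```python
# Sketch: load .env and make sure the MinIO landing bucket exists before any data flows.
# Note: "minio:9000" only resolves inside the Docker network; use localhost:9000 from the host.
import os

import boto3
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env from the current working directory

s3 = boto3.client(
    "s3",
    endpoint_url=os.getenv("MINIO_ENDPOINT", "http://localhost:9000"),
    aws_access_key_id=os.getenv("MINIO_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("MINIO_SECRET_KEY"),
)

bucket = os.getenv("MINIO_BUCKET", "banking-data")
if bucket not in [b["Name"] for b in s3.list_buckets()["Buckets"]]:
    s3.create_bucket(Bucket=bucket)
print(f"Bucket '{bucket}' is ready.")
```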
Start the stack:

docker-compose up -d

This command launches:
- PostgreSQL
- Kafka & Zookeeper
- Debezium Connect
- MinIO
- Airflow (Webserver + Scheduler)
- Airflow Metadata DB
Verify that all containers are running:

docker ps
- Download & Install DBeaver
- Click New Database Connection → Select Postgres → Enter Details
- Run postgres/schema.sql to create the customers, accounts & transactions tables (a command-line alternative is sketched below)
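If you prefer the command line to DBeaver, here is a minimal sketch using psycopg2 to apply the DDL, assuming the connection details from .env; run it from the repo root so the postgres/schema.sql path resolves.

```python
# Hypothetical helper: apply postgres/schema.sql without a GUI client.
import os

import psycopg2

conn = psycopg2.connect(
    host=os.getenv("POSTGRES_HOST", "localhost"),  # "postgres" inside Docker, "localhost" from the host
    port=os.getenv("POSTGRES_PORT", "5432"),
    dbname=os.getenv("POSTGRES_DB", "banking"),
    user=os.getenv("POSTGRES_USER", "banking_user"),
    password=os.getenv("POSTGRES_PASSWORD", "banking_pass"),
)

with conn, conn.cursor() as cur:
    with open("postgres/schema.sql") as f:
        cur.execute(f.read())  # runs the customers/accounts/transactions DDL in one shot

conn.close()
print("Schema applied.")
```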
pip install -r requirements.txt
python data-generator/faker_generator.py
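faker_generator.py is what populates Postgres with synthetic customers, accounts, and transactions. A simplified sketch of the pattern is below; the table and column names here are illustrative, the real schema lives in postgres/schema.sql and the real logic in data-generator/faker_generator.py.

```python
# Simplified sketch of a Faker-based banking data generator (column names are illustrative).
import os
import random

import psycopg2
from faker import Faker

fake = Faker()

conn = psycopg2.connect(
    host=os.getenv("POSTGRES_HOST", "localhost"),
    port=os.getenv("POSTGRES_PORT", "5432"),
    dbname=os.getenv("POSTGRES_DB", "banking"),
    user=os.getenv("POSTGRES_USER", "banking_user"),
    password=os.getenv("POSTGRES_PASSWORD", "banking_pass"),
)

with conn, conn.cursor() as cur:
    # Insert one customer, one account for that customer, and one transaction against the account.
    cur.execute(
        "INSERT INTO customers (name, email, created_at) VALUES (%s, %s, NOW()) RETURNING customer_id",
        (fake.name(), fake.email()),
    )
    customer_id = cur.fetchone()[0]

    cur.execute(
        "INSERT INTO accounts (customer_id, account_type, balance) VALUES (%s, %s, %s) RETURNING account_id",
        (customer_id, random.choice(["SAVINGS", "CHECKING"]), round(random.uniform(100, 10_000), 2)),
    )
    account_id = cur.fetchone()[0]

    cur.execute(
        "INSERT INTO transactions (account_id, amount, txn_type, txn_ts) VALUES (%s, %s, %s, NOW())",
        (account_id, round(random.uniform(-500, 500), 2), random.choice(["DEBIT", "CREDIT"])),
    )

conn.close()
```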
python kafka-debezium/generate_and_post_connector.py
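generate_and_post_connector.py registers the Postgres CDC connector with Debezium Connect. A minimal sketch of that registration is below, assuming Connect's REST API is exposed on localhost:8083; the exact connector name, table list, and topic prefix used by the real script may differ.

```python
# Sketch: register a Debezium Postgres connector via the Kafka Connect REST API.
import json

import requests

CONNECT_URL = "http://localhost:8083/connectors"  # Debezium Connect REST endpoint (assumed port)

connector = {
    "name": "banking-postgres-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                  # logical decoding of the Postgres WAL
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "banking_user",
        "database.password": "banking_pass",
        "database.dbname": "banking",
        "topic.prefix": "banking",                  # Debezium 2.x; older releases use database.server.name
        "table.include.list": "public.customers,public.accounts,public.transactions",
    },
}

resp = requests.post(
    CONNECT_URL,
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print("Connector registered:", resp.json()["name"])
```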
python consumer/kafka_to_minio.py
- Reads CDC events from Kafka
- Buffers records
- Writes Parquet files to MinIO
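A condensed sketch of that consume → buffer → Parquet-to-MinIO loop is below, assuming kafka-python, pandas, pyarrow, and boto3; the topic name, batch size, and object key layout are illustrative, and the real logic lives in consumer/kafka_to_minio.py.

```python
# Sketch: read Debezium CDC events from Kafka, buffer them, and flush Parquet files to MinIO.
import json
import os
import time

import boto3
import pandas as pd
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "banking.public.transactions",                          # CDC topic emitted by Debezium (illustrative)
    bootstrap_servers=os.getenv("KAFKA_BOOTSTRAP", "localhost:9092"),
    group_id=os.getenv("KAFKA_GROUP", "banking-consumer-group"),
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

s3 = boto3.client(
    "s3",
    endpoint_url=os.getenv("MINIO_ENDPOINT", "http://localhost:9000"),
    aws_access_key_id=os.getenv("MINIO_ACCESS_KEY", "minioadmin"),
    aws_secret_access_key=os.getenv("MINIO_SECRET_KEY", "minioadmin"),
)
BUCKET = os.getenv("MINIO_BUCKET", "banking-data")
BATCH_SIZE = 100
buffer = []

def flush(records):
    """Write the buffered CDC payloads to a timestamped Parquet object in MinIO."""
    df = pd.DataFrame(records)
    local_path = f"/tmp/transactions_{int(time.time())}.parquet"
    df.to_parquet(local_path, index=False)                  # needs pyarrow
    s3.upload_file(local_path, BUCKET, f"transactions/{os.path.basename(local_path)}")

for message in consumer:
    envelope = message.value.get("payload", message.value)  # Debezium envelope or plain JSON
    buffer.append(envelope.get("after") or envelope)        # keep the post-change row image
    if len(buffer) >= BATCH_SIZE:
        flush(buffer)
        buffer.clear()
```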
Open the Airflow UI at http://localhost:8081 and log in:
- Username: airflow
- Password: airflow
- DAG Name: minio_to_snowflake_banking
- Turn ON
- Trigger manually (or wait for schedule)
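For reference, here is a skeletal sketch of what a MinIO → Snowflake loading DAG can look like, assuming boto3 and the Snowflake Python connector are available inside the Airflow image; the table names, stages, and COPY options are simplified stand-ins for the real minio_to_snowflake_banking DAG.

```python
# Skeletal sketch of a MinIO → Snowflake ingestion DAG (the real DAG lives in docker/dags/).
import os
from datetime import datetime

import boto3
import snowflake.connector
from airflow import DAG
from airflow.operators.python import PythonOperator


def load_table(table: str, **_):
    """Download a table's Parquet files from MinIO and COPY them into Snowflake RAW."""
    s3 = boto3.client(
        "s3",
        endpoint_url=os.getenv("MINIO_ENDPOINT"),
        aws_access_key_id=os.getenv("MINIO_ACCESS_KEY"),
        aws_secret_access_key=os.getenv("MINIO_SECRET_KEY"),
    )
    local_dir = os.getenv("MINIO_LOCAL_DIR", "/tmp/minio_downloads")
    os.makedirs(local_dir, exist_ok=True)
    objects = s3.list_objects_v2(Bucket=os.getenv("MINIO_BUCKET"), Prefix=f"{table}/").get("Contents", [])

    conn = snowflake.connector.connect(
        account=os.getenv("SNOWFLAKE_ACCOUNT"),
        user=os.getenv("SNOWFLAKE_USER"),
        password=os.getenv("SNOWFLAKE_PASSWORD"),
        warehouse=os.getenv("SNOWFLAKE_WAREHOUSE"),
        database=os.getenv("SNOWFLAKE_DB"),
        schema=os.getenv("SNOWFLAKE_SCHEMA", "RAW"),
    )
    try:
        cur = conn.cursor()
        for obj in objects:
            local_path = os.path.join(local_dir, os.path.basename(obj["Key"]))
            s3.download_file(os.getenv("MINIO_BUCKET"), obj["Key"], local_path)
            cur.execute(f"PUT file://{local_path} @%{table} OVERWRITE = TRUE")  # table stage
            cur.execute(
                f"COPY INTO {table} FROM @%{table} "
                "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
            )
    finally:
        conn.close()


with DAG(
    dag_id="minio_to_snowflake_banking_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    for tbl in ["customers", "accounts", "transactions"]:
        PythonOperator(task_id=f"load_{tbl}", python_callable=load_table, op_kwargs={"table": tbl})
```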
SELECT 'customers' AS table_name, COUNT(*) AS row_count FROM raw.customers
UNION ALL
SELECT 'accounts', COUNT(*) FROM raw.accounts
UNION ALL
SELECT 'transactions', COUNT(*) FROM raw.transactions;
Point your dbt profile (profiles.yml) at the Snowflake account, warehouse, and database from .env, then run:

cd banking_dbt
dbt deps
dbt run
dbt test
This builds:
- Staging models
- Fact & dimension tables
DAG: SCD2_snapshots. This DAG runs:
- dbt snapshot
- dbt run --select marts
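Here is a minimal sketch of an Airflow DAG that shells out to dbt for these two steps; the project path and schedule are assumptions, and the real SCD2_snapshots DAG may differ.

```python
# Minimal sketch: run dbt snapshot (SCD2 history) and then rebuild the marts.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/banking_dbt"  # assumed mount point of the dbt project inside the Airflow container

with DAG(
    dag_id="scd2_snapshots_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    snapshot = BashOperator(
        task_id="dbt_snapshot",
        bash_command=f"cd {DBT_DIR} && dbt snapshot",
    )
    build_marts = BashOperator(
        task_id="dbt_run_marts",
        bash_command=f"cd {DBT_DIR} && dbt run --select marts",
    )

    snapshot >> build_marts  # capture history first, then rebuild the marts on top of it
```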
Query the marts in Snowflake, for example:

SELECT * FROM ANALYTICS.DIM_CUSTOMERS;

Marts built by DBT:
- ANALYTICS.DIM_CUSTOMERS
- ANALYTICS.DIM_ACCOUNTS
- ANALYTICS.FACT_TRANSACTIONS
What this project delivers:
- Automated CDC pipeline from Postgres → Snowflake
- DBT models (facts, dimensions, snapshots)
- Orchestrated DAGs in Airflow
- Synthetic banking dataset for demos
- CI/CD workflows ensuring reliability
Author: Shubham Raju Mergu
LinkedIn: shubham-mergu
Contact: shubhammergu.work@gmail.com