A fully local, cost-friendly analytics pipeline that ingests data from PostgreSQL, lands CDC deltas in GCS as Parquet, enforces schema on load into a BigQuery temp layer, incrementally merges into Bronze with dbt, transforms in Silver and Gold, validates with dbt tests + Elementary, and publishes a static Elementary report.
This README explains the whole process, how to run it locally, the resources provisioned via Terraform, and suggested next steps for production hardening. Check the Data Archicteture overview in Data Architecture.md
- Source (OLTP): PostgreSQL running in Docker
- Extraction & CDC: Dagster + Python (pandas/pyarrow) pull deltas using watermarks
- Landing: Parquet files stored in GCS in a partitioned path
- Staging: Parquet -> BigQuery Temp (unique hashed tables per run + stable table)
- Medallion model (dbt):
- Bronze: incremental merge on unique keys, append-only metadata
- Silver: cleaned/standardized business-ready base
- Gold: curated analytics & risk scoring models
- Testing & Monitoring: dbt tests + Elementary report published to a public GCS website
- Orchestration: Dagster (Python assets, schedules)
- Transformations: dbt (BigQuery)
- Storage/Compute: GCS + BigQuery
- Observability: dbt tests + Elementary report
- Infra-as-code: Terraform (datasets, buckets, IAM)
- Local OLTP: PostgreSQL in Docker, optional one-shot data-loader container
- Docker & local OLTP:
docker-compose.yml(Postgres, optional data-loader)docker/postgres/init/(DB bootstrap SQL)docker/data-loader/(Parquet -> Postgres loader)
- Dagster project:
credix_pipeline/credix_pipeline/credix_pipeline/assets/(extraction to Parquet/GCS/temp tables, dbt bronze/silver/gold hooks, Elementary)resources/(Postgres & GCP clients)utils/(CDC helpers, data & GCS utilities)jobs/(asset jobs: cnpj, installments, full, monitoring)definitions.py(Dagster Definitions: schedules, resources, assets)
- dbt project:
dbt/business_case/(models for bronze/silver/gold, packages, profiles)
- Infra via Terraform:
terraform/(BigQuery datasets, buckets, IAM)
- Postgres service is defined in
docker-compose.yml. It exposes port 5432 and runs an init script fromdocker/postgres/init/01_init_schema.sql. - Optional: a one-shot
data_loadercontainer (docker/data-loader/) loads local Parquet files into Postgres.- It discovers
*.parquetunder a mounted/app/dataand appends to a target table with the same base filename. - The script uses SQLAlchemy to write into Postgres (
parquet_to_postgres.py). Default schema there isoltp.
- It discovers
Important note:
- Ensure the schema used by the source tables exists. Current init script creates
raw_data; Python data-loader defaults tooltp. Either:- change the loader to use
raw_data, or - create
oltpschema and load there, or - point Dagster extraction queries to whichever schema standardize on.
- change the loader to use
- CDC state is maintained inside Dagster via materialization metadata on checkpoint assets.
- Helper:
utils/cdc_helpers.py#get_cdc_last_processed_timereads the last watermark (max_updated_at) from the latest checkpoint materialization.
- Helper:
- Each extractor asset queries only new/changed rows since the last watermark:
- CNPJ: uses
created_at/updated_attimestamp columns. - Installments: uses the max among
invoice_issue_date,paid_datewithGREATEST(...).
- CNPJ: uses
- After Bronze succeeds, checkpoint assets advance the watermark by emitting
max_updated_atin metadata (only after successful downstream ingestion), ensuring exactly-once progression.
- DataFrames are normalized for BigQuery compatibility in
utils/data_processing.py:prepare_dataframe_for_bigquery: coerces date/timestamp columns appropriatelydataframe_to_parquet_bytes: writes Parquet with Arrow, preserving logical types
- Files are uploaded to
gs://data_lake_credix/business_case/landing/ingestion_dt=YYYY-MM-DD/<table>_<ts>.parquet. - On load into BigQuery temp:
- Retrieve schema from a reference table
- Load with explicit schema and
WRITE_TRUNCATEfor the stable table.
- After a successful BigQuery load, the file is moved to the archive bucket
data_lake_credix_archiveunder the mirrored path (archive/prefix). - If the BigQuery load fails, the blob is moved to a
failed/prefix in the original bucket to aid incident response. - Implemented in
GCPResource.move_blob()with copy + delete semantics.
- To guarantee isolation and deterministic Bronze merges, each run writes to a unique hashed temp table in
business_case_temp:- Name generated by
utils/gcs_operations.py#generate_unique_table_name(e.g.,cnpj_ws_ab12cd34ef56). - Alongside, a stable table (e.g.,
business_case_temp.cnpj_ws_dbt) is also overwritten to simplify ad hoc exploration and dbt sources.
- Name generated by
- The hashed table name is passed to dbt as a var (
temp_table_name) by the Dagster dbt asset runner; Bronze models then refer to the stable source, and the checkpoint cleans up the hashed table after success.
- Bronze:
- Incremental merge models with
unique_keyandmerge_update_columns. - Keeps metadata like
_loaded_at,_source_table. - Models:
models/bronze/*.sql(e.g.,cnpj_ws.sql,installments.sql).
- Incremental merge models with
- Silver:
- Cleansing, normalization, and derived flags.
- Examples: standardize state/UF/city, data quality flags.
- Models:
models/silver/*.sql(e.g.,cnpj_ws_clean.sql,installments_clean.sql).
- Gold:
- Business-ready analytics and risk scoring.
- Models:
company_payment_summary.sqlandpayment_analytics_detailed.sql.
- Column-level tests are defined (e.g.,
not_null,unique,accepted_values) inmodels/bronze/schema.yml. - Elementary augments testing with anomaly detection and reporting.
- Two Dagster assets handle Elementary:
edr_monitor_asset: runsedr monitoredr_send_report_asset: runsedr send-reportand uploadsindex.htmlto thecredix-elementary-reportbucket
- Terraform configures that bucket as a static website and makes objects public for read.
- After a run, the report is available at a public URL:
https://storage.googleapis.com/credix-elementary-report/index.html.
Defined under terraform/:
- BigQuery datasets:
business_case_bronze,business_case_silver,business_case_gold,business_case_temp,business_case_elementary(seebigquery.tf). - GCS buckets: raw
data_lake_credix, archivedata_lake_credix_archive, Elementarycredix-elementary-reportwith public viewer IAM and website main page (seestorage.tf). - Backend state in GCS (see
main.tf). Configure your state bucket beforeterraform init.
- Extract from Postgres with CDC
- Dagster assets
cnpj_raw_data/installments_raw_dataquery rows updated since the last watermark. - Results are normalized and written as Parquet.
- Land files in GCS
- Files are written to the partitioned landing path.
- No-change runs short-circuit and emit a
no_changesmarker.
- Load to BigQuery temp
- For each landed file, two loads occur:
- Hashed table (append) for run isolation and traceability
- Stable table (truncate) for consistent dbt sources
- On success: move blob to archive; on failure: move to failed.
- dbt Bronze/Silver/Gold
- Bronze merges incremental deltas keyed by business identifiers.
- Silver applies cleaning and business rules.
- Gold produces curated analytics and risk scoring.
- Advance CDC checkpoint and cleanup
- Only after Bronze success, watermark is advanced and hashed temp table is deleted.
- Tests and monitoring
- dbt tests run inside the dbt builds.
- Elementary monitor and report assets generate & publish a report.
Prerequisites
- macOS with Docker and Docker Compose
- Python 3.11+
- A GCP project with a service account that has access to BigQuery and GCS; JSON keyfile locally
- dbt (BigQuery adapter) and Elementary installed in your environment
Environment variables (examples)
POSTGRES_HOST=localhostPOSTGRES_PORT=5432POSTGRES_DB=credix_transactionsPOSTGRES_USER=postgresPOSTGRES_PASSWORD=postgres123GCP_PROJECT=product-reliability-analyzerGOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/your-sa.json
- Bring up Postgres (and optionally seed it)
- Start Postgres:
docker compose up -d postgres - Optionally load local Parquet into Postgres tables:
docker compose --profile data-loader up --build data_loader- The loader expects Parquet files to be mounted at
./data->/app/datainside the container.
- The loader expects Parquet files to be mounted at
- Prepare Python environment
- Create/activate a virtualenv
- Install the Dagster/dbt project in editable mode (from
credix_pipeline/credix_pipeline/):pip install -e .
- Ensure dbt deps are installed in
dbt/business_case:dbt deps
- Configure dbt profiles
dbt/business_case/profiles.ymlpoints to your keyfile andproduct-reliability-analyzer. Adjust as needed.- Alternatively, set
DBT_PROFILES_DIRto thedbt/folder.
- Run Dagster
- From the repository root, run:
dagster dev -m credix_pipeline.credix_pipeline.definitions - Open the Dagster UI, materialize one of:
cnpj_pipelinegroup (or its jobcnpj_data_pipeline)installments_pipelinegroup (orinstallments_data_pipeline)full_data_pipelinejob to run both + gold groupmonitoring_jobto generate/publish the Elementary report
- Verify results in BigQuery/GCS
- Check
business_case_temp(hashed & stable tables) - Check Bronze/Silver/Gold datasets
- Check GCS archive/failed paths for landed files
- View the Elementary report
- Use the public URL from the asset metadata (or browse the bucket object path as described above).
- Watermarks & reprocessing: CDC state is tied to Dagster materializations. To replay a period, reset the checkpoint asset materialization (or override query filters temporarily).
- Idempotency: Bronze incremental merges ensure repeated loads with the same deltas won’t duplicate rows.
- Observability: Each asset emits metadata such as record counts, max timestamps, table references, and report URLs.
- Environment separation
- Split Terraform state and resources into dev/stage/prod; parameterize bucket/dataset names and IAM per env.
- Use separate Dagster deployments (or workspaces) and dbt targets per environment.
- CI/CD
- GitHub Actions for: Terraform plan/apply (with approvals), dbt build + tests, Dagster code checks and packaging.
- Secrets management
- Replace local env vars with Secret Manager / Vault; avoid embedding credentials in profiles files.
- Alerts & notifications
- Add Dagster sensors to push Slack/Email/PubSub alerts on failure or data quality regressions.
- Backfill strategy
- Implement backfill ingestion from the archive bucket by listing archived blobs for a date range and replaying loads into temp + Bronze.
- Automated testing
- Add Dagster unit tests for assets/utils; include dbt unit tests or macros tests; introduce integration tests that spin up ephemeral Postgres and mock GCS.
- Scalability of ingestion
- This repo uses Dagster + Python extraction because it’s local and cost-efficient for relatively small datasets.
- For higher volumes or multiple sources, keep Dagster as the orchestrator and plug in a managed/elastic ingestion tier:
- Airbyte (managed/source connectors) for CDC into GCS/BQ
- Dataflow or Datastream for streaming or large-scale CDC pipelines
- Choose per business requirement and expected volumetry
- Cost and performance
- Consider clustering/partitioning strategies in BigQuery models; refine incremental filters; add table expiration where applicable.
- Why use both hashed and stable temp tables?
- Hashed tables provide isolation and traceability for each run. The stable table gives a consistent reference for dbt sources and ad hoc validation.
- Where are dbt tests?
- See
dbt/business_case/models/bronze/schema.yml. Extend to silver/gold as you harden the models.
- See
This project prioritizes local development speed and clarity. It is not productionized by default (e.g., secrets, retries, SLAs). Use the “Next steps” section to guide a production rollout.