core_data

License: MIT

A reproducible PostgreSQL 17 platform delivered as code. core_data builds a hardened database image with spatial, vector, and graph extensions, provisions PgHero for observability, and ships a management CLI that automates backups, restores, QA cloning, and upgrades. Everything lives in version control so environments can be rebuilt consistently across laptops, CI, and production.

Why You Want This

  • Run the same Postgres 17 stack everywhere: laptop, CI runner, or production.
  • Ship with the heavy hitters pre-installed—PostGIS, pgvector, AGE, pg_cron, pgBackRest—without custom build scripts.
  • Automate the boring-but-critical tasks: backups, restores, QA clones, log analytics, and even major version upgrades via pgautoupgrade.
  • Treat your database like code with reproducible .env configs, templated init scripts, and a pytest smoke test that catches regressions early.

Published Docker Images

Pre-built images are available from GitHub Container Registry with full SLSA attestations and SBOM for supply chain security.

Quick Start with Published Image

# Pull the latest stable release
docker pull ghcr.io/paudley/core-data-postgres:latest

# Or specify a version
docker pull ghcr.io/paudley/core-data-postgres:17.2-v1.0.0

Available Tags

Images follow a hybrid versioning strategy combining PostgreSQL version with semantic versioning:

| Tag Pattern | Example | Description |
| --- | --- | --- |
| latest | latest | Latest stable release |
| {PG_VERSION}-v{MAJOR}.{MINOR}.{PATCH} | 17.2-v1.0.0 | Exact version (recommended for production) |
| {PG_VERSION}-v{MAJOR}.{MINOR} | 17.2-v1.0 | Latest patch for minor version |
| {PG_VERSION}-v{MAJOR} | 17.2-v1 | Latest minor for major version |
| {PG_VERSION} | 17.2 | Latest semantic version for PostgreSQL version |
| {PG_MAJOR} | 17 | Latest release for PostgreSQL major version |

Version Format: {PostgreSQL_Version}-v{Semantic_Version}

  • Example: 17.2-v1.0.0 means PostgreSQL 17.2 with semantic version 1.0.0
  • See docs/RELEASING.md for complete versioning details
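
When pinning images from scripts, the hybrid tag can be split back into its two halves with plain parameter expansion. A minimal sketch (the tag value and variable names are illustrative):

```shell
# Split a hybrid tag such as "17.2-v1.0.0" into its PostgreSQL version
# and semantic-version components (tag value is illustrative).
tag="17.2-v1.0.0"
pg_version="${tag%%-v*}"   # strip everything from "-v" onward
semver="${tag##*-v}"       # strip everything up to and including "-v"
echo "PostgreSQL $pg_version, release $semver"
```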

Security & Verification

All published images include cryptographic attestations that prove build provenance and supply chain integrity:

# Verify image attestation (requires GitHub CLI)
gh attestation verify oci://ghcr.io/paudley/core-data-postgres:17.2-v1.0.0 \
  --owner paudley

# Pull and verify in one step
docker pull ghcr.io/paudley/core-data-postgres:17.2-v1.0.0
gh attestation verify oci://ghcr.io/paudley/core-data-postgres:17.2-v1.0.0 \
  --owner paudley

# Verify every referenced image from your .env (pretty console output)
./scripts/manage.sh attestation-verify --env-file ci.env.example

GitHub Actions automatically injects a repository-scoped GITHUB_TOKEN when workflows run. If that token cannot read paudley/core_data, the helper retries the same verification anonymously (clearing GH_TOKEN/GITHUB_TOKEN) so public attestations still succeed without extra flags.
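
The fallback can be approximated in your own scripts with a small wrapper that clears the tokens on retry. This is a sketch, not the helper's actual implementation, and retry_anonymously is a hypothetical name:

```shell
# Run a verification command; if it fails (e.g. the injected token lacks
# read access to the repo), retry once with the GitHub tokens cleared so
# the anonymous/public attestation path is used.
# retry_anonymously is a hypothetical helper, not part of manage.sh.
retry_anonymously() {
  "$@" && return 0
  env -u GH_TOKEN -u GITHUB_TOKEN "$@"
}

# Example:
# retry_anonymously gh attestation verify \
#   oci://ghcr.io/paudley/core-data-postgres:17.2-v1.0.0 --owner paudley
```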

Published GHCR images

Every service in the stack is published with the same attested tag:

| Service | Image reference |
| --- | --- |
| PostgreSQL | ghcr.io/paudley/core_data/postgres:&lt;stack-tag&gt; |
| ValKey | ghcr.io/paudley/core_data/valkey:&lt;stack-tag&gt; |
| PgBouncer | ghcr.io/paudley/core_data/pgbouncer:&lt;stack-tag&gt; |
| RabbitMQ | ghcr.io/paudley/core_data/rabbitmq:&lt;stack-tag&gt; |
| Memcached | ghcr.io/paudley/core_data/memcached:&lt;stack-tag&gt; |
| Network Probe | ghcr.io/paudley/core_data/network-probe:&lt;stack-tag&gt; |
| Network Guard | ghcr.io/paudley/core_data/network-guard:&lt;stack-tag&gt; |

Use the optional CORE_DATA_STACK_TAG / CORE_DATA_STACK_REGISTRY environment variables (or .env entries) to pin a specific release across every helper; otherwise the defaults use the latest tag from ghcr.io/paudley/core_data.
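
For example, a .env fragment that pins the whole stack to one attested release (the tag value is illustrative):

```shell
# .env -- pin every helper to a single attested stack release
CORE_DATA_STACK_REGISTRY=ghcr.io/paudley/core_data
CORE_DATA_STACK_TAG=17.2-v1.0.0
```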

Every container publishes /README.md and /SECURITY.md (copies of this repository's documentation) so you can inspect provenance and policy from inside the image with a simple ls.

Expected verification output:

✓ Verification succeeded!

sha256:abc123... was attested by:
REPO              PREDICATE_TYPE                  WORKFLOW
paudley/core_data https://slsa.dev/provenance/v1  .github/workflows/publish-docker.yml@refs/tags/17.2-v1.0.0

What's Verified:

  • SLSA Provenance: Confirms the image was built by GitHub Actions from this repository
  • SBOM: Software Bill of Materials listing all components and dependencies
  • Build Reproducibility: Links image digest to specific source code commit

Using Published Images in Docker Compose

Update your docker-compose.yml to use the published image instead of building locally:

services:
  postgres:
    image: ghcr.io/paudley/core-data-postgres:17.2-v1.0.0  # Use published image
    # Remove or comment out the 'build:' section
    # build:
    #   context: .
    #   dockerfile: postgres/Dockerfile

Or override via .env:

POSTGRES_IMAGE_NAME=ghcr.io/paudley/core-data-postgres
POSTGRES_IMAGE_TAG=17.2-v1.0.0

Important: UID/GID Configuration for Pre-built Images

Pre-built images from ghcr.io have the postgres user baked in with UID/GID 999. Your .env must match:

POSTGRES_UID=999
POSTGRES_GID=999

Why this matters: The volume_prep service uses these values to chown data directories. If they don't match the image's baked-in UID, you'll see permission errors like:

chmod: changing permissions of '/var/lib/postgresql/data': Operation not permitted

Building locally? If you set CORE_DATA_BUILD_IMAGE=1, you can use your host user's UID/GID instead—the image build process will create the postgres user with your specified IDs.

Highlights

  • Custom Docker image with PostGIS, pgvector, Apache AGE, pg_cron, pg_squeeze, pgAudit, pgBadger, pgBackRest, and pgtune baked in.
  • Init scripts render configuration from templates, create application databases, and enable extensions automatically.
  • ./scripts/manage.sh wraps lifecycle tasks: image builds, psql, logical dumps, pgBackRest backups/restores, QA cloning, log analysis, daily maintenance, and major upgrades via pgautoupgrade.
  • PGDATA, WAL, and pgBackRest now live on dedicated Docker named volumes for near-native Linux I/O, while BACKUPS_HOST_PATH (defaults to ./backups) remains a bind mount for easy artifact exports.
  • Secrets stay in Docker secrets (POSTGRES_PASSWORD_FILE) and the container runs as the non-root postgres UID/GID at all times, keeping the least-privilege posture consistent across init and steady state.
  • TLS is enforced by default with auto-generated self-signed certificates (override with your own CA material), and a multi-stage health probe (scripts/healthcheck.sh) guards dependent services before they start.
  • Logging uses Docker's local driver with rotation and non-blocking delivery, preventing runaway JSON logs from filling the host while preserving enough history for incident response.
  • Optional profiles bundle ValKey, PgBouncer, and Memcached with sensible defaults, secrets, and CLI helpers so you can layer caches and pooling alongside PostgreSQL in one step.
  • CI smoke test (python -m pytest -k full_workflow) provisions a stack, exercises critical commands, and verifies upgrade safety.

Default Extension Bundle

core_data provisions a batteries-included extension stack in every non-template database at init time:

  • Performance & Observability — pg_stat_statements, auto_explain, pg_buffercache, pg_prewarm, bloom.
  • Security & Compliance — pgaudit, pgcrypto, "uuid-ossp".
  • Developer Ergonomics — hstore, citext, pg_trgm, btree_gin, btree_gist, hypopg, intarray, ltree, tablefunc, unaccent.
  • Connectivity — postgres_fdw, dblink.
  • Spatial, Vector, Graph — postgis, postgis_raster, postgis_topology, vector, age, cube, earthdistance.
  • Maintenance & Testing — pg_cron (kept in the postgres database), pg_partman, pg_repack, pg_squeeze, pgstattuple, pgtap.
  • Geospatial Extras — postgis_tiger_geocoder, address_standardizer, address_standardizer_data_us, pgrouting, fuzzystrmatch.

The same bundle is installed into template1 so freshly created databases inherit the tooling automatically.

pg_partman_bgw is preloaded with a one-hour interval targeting the postgres database under the postgres superuser. Adjust pg_partman_bgw.dbname/role in postgresql.conf.tpl (or via postgresql.pgtune.conf) if you manage partitions from a different control schema.
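
A retargeted worker might look like this in the rendered config. The setting names are pg_partman's background-worker GUCs; the database and role values are illustrative:

```ini
pg_partman_bgw.interval = 3600         # seconds between maintenance runs
pg_partman_bgw.dbname = 'app_db'       # database(s) the worker maintains
pg_partman_bgw.role = 'partman_admin'  # role used to run maintenance
```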

Use ./scripts/manage.sh partman-show-config to inspect tracked parents, partman-maintenance to run run_maintenance_proc() on demand, and partman-create-parent schema.table control_column '1 day' to bootstrap new partition sets without hand-writing SQL.

Run ./scripts/manage.sh async-queue bootstrap when you want a lightweight background-job queue. It provisions an asyncq.jobs table plus helpers (enqueue, dequeue, complete, fail, extend_lease) that rely on FOR UPDATE SKIP LOCKED, pg_notify, and UUID leasing. Point a worker at the queue with SELECT * FROM asyncq.dequeue('default'); in a loop and call asyncq.complete(...) or asyncq.fail(...) as you process jobs.
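
A worker loop along those lines might look like the following sketch. It assumes the bootstrap above has run; the job_id column name and the asyncq.complete() argument shape are assumptions here, so inspect the generated asyncq schema before relying on them.

```shell
# Minimal polling worker for the asyncq queue (sketch; run against a
# live stack). PSQL defaults to the manage.sh wrapper and can be
# overridden for testing.
PSQL=${PSQL:-"./scripts/manage.sh psql"}

process_one() {
  # Lease one job; an empty result means the queue is currently drained.
  # "job_id" is an assumed column name -- check asyncq.jobs to confirm.
  job=$($PSQL -At -c "SELECT job_id FROM asyncq.dequeue('default');") || return 1
  [ -n "$job" ] || return 1
  # ... perform the actual work here, then acknowledge the lease:
  $PSQL -c "SELECT asyncq.complete('$job'::uuid);"
}
```

On failure paths you would call asyncq.fail(...) instead, and asyncq.extend_lease(...) for jobs that outlive their lease.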

Quick Start

  1. Bootstrap environment config: ./scripts/manage.sh create-env to walk through password creation, host UID/GID selection, and resource sizing (writes .env + secrets).

  2. Build and start the stack:

    ./scripts/manage.sh build-image
    ./scripts/manage.sh up
  3. Verify health (multi-stage probe):

    docker compose exec postgres /opt/core_data/scripts/healthcheck.sh
    ./scripts/manage.sh psql -c 'SELECT 1;'
  4. Explore the CLI: ./scripts/manage.sh help for grouped commands (lifecycle, CI, backups, audits, extensions, cache/messaging, security).

  5. CI runners: see CI_USAGE.md for the pipelines-focused flow (attestations, ci-up/ci-down, pytest markers).

Project Ethos

We optimize for data infrastructure as code ➜ automated admin ➜ human-friendly ➜ best-practices by default (including security and performance). In practice that means:

  1. Data infrastructure as code. Declarative, reproducible Postgres once, versioned forever.
  2. Automated admin. Every recurring task should be scriptable and CI-friendly before we worry about shell ergonomics.
  3. Human-friendly. CLI helpers, sensible prompts, and clear docs matter—but only after the first two goals are met.
  4. Best-practices by default. Security and performance guardrails (cap drops, TLS, tuned memory, backups) are enabled out of the box, with deliberate escape hatches when absolutely required.

See docs/security_philosophy.md for how capability hardening and related controls fit into that priority order.

Operational Defaults

  • Resource guardrails. Container memory, CPU, and shared memory limits come from .env, keeping pgtune advice and runtime constraints aligned. Adjust POSTGRES_MEMORY_LIMIT, POSTGRES_CPU_LIMIT, and POSTGRES_SHM_SIZE to match the host.
  • TLS everywhere. PostgreSQL refuses non-SSL connections from the bridge network. Provide your own certificate/key via Docker secrets or rely on the init hook to mint a self-signed pair under ${PGDATA}/tls.
  • Named volumes for PGDATA/WAL. pgdata, pgwal, and pgbackrest volumes provide near-native I/O on Linux. Override the volume definitions if you pin WAL/data to specific devices.
  • Non-root from the start. A one-shot volume_prep helper chowns the volumes before Postgres launches so the main service and sidecars run as your host user by default (UID/GID ${POSTGRES_UID}), keeping file ownership consistent across deployments. Supply alternative IDs only when required.
  • Automated logical backups. The logical_backup sidecar runs pg_dump/pg_dumpall on the cadence defined by LOGICAL_BACKUP_INTERVAL_SECONDS, writes into ${BACKUPS_HOST_PATH}/logical, prunes according to LOGICAL_BACKUP_RETENTION_DAYS, and skips any databases listed in LOGICAL_BACKUP_EXCLUDE (defaults to postgres). daily-maintenance captures the latest run in logical_backup_status.txt for auditing.
  • Composable health check. scripts/healthcheck.sh verifies readiness, executes SELECT 1, and optionally enforces replication lag ceilings before dependents start.
  • Rotated container logs. Docker's local driver with non-blocking delivery prevents runaway JSON files while retaining compressed history for incident response.
  • Optional service profiles. COMPOSE_PROFILES=valkey,pgbouncer,memcached,rabbitmq brings the cache/pooling stack online; drop profiles from the list to opt out without editing docker-compose.yml.
  • Seccomp baseline. Shipping profiles in seccomp/ cover each service (postgres.json, logical_backup.json, pgbouncer.json, valkey.json, memcached.json, pghero.json, plus docker-default.json reused for RabbitMQ). ./scripts/manage.sh seccomp-status reports the active spec, seccomp-verify gates compose configs, and docs/security_philosophy.md outlines how to regenerate traces when you need to tighten them further.
  • AppArmor (opt-in). Minimal profiles live in apparmor/core_data_minimal.profile. Load them with ./scripts/manage.sh apparmor-load (sudo), then set CORE_DATA_APPARMOR_<SERVICE>=apparmor:core_data_minimal in .env for each service you want to confine. The profile denies access to high-value host paths (/root, /etc/shadow, Docker socket) while leaving normal container paths alone.

Service Add-ons

  • ValKey — Requires authentication by default (valkey_password secret), persists to the valkey_data volume (appendonly yes), exposes valkey-cli/valkey-bgsave, and is tuned via .env knobs such as VALKEY_MAXMEMORY and VALKEY_MAXMEMORY_POLICY.
  • PgBouncer — Uses SCRAM auth backed by a dedicated superuser, renders config/userlist from templates, and offers pgbouncer-stats / pgbouncer-pools helpers. Pool sizing and admin/stat users are driven by the PGBOUNCER_* variables.
  • Memcached — Lightweight hot cache with configurable memory, connection, and thread limits (MEMCACHED_*). The memcached-stats helper pipes stats output for quick validation. All services stay on the internal bridge network by default.
  • RabbitMQ — Delivers AMQP 0.9.1 alongside the management API. Credentials and cluster identity come from secrets/rabbitmq_default_pass and secrets/rabbitmq_erlang_cookie; daily maintenance captures definitions/status snapshots, and on-demand helpers (rabbitmq-ctl, rabbitmq-diagnostics, rabbitmq-export, rabbitmq-overview) streamline broker admin.

Project Layout


core_data/
├── .env.example             # Template for environment-specific settings (never commit real secrets)
├── docker-compose.yml       # Orchestrates PostgreSQL and PgHero services
├── scripts/                 # Operator tooling (manage.sh + lib modules + maintenance workflow)
├── postgres/                # Custom image build assets, configs, and initdb scripts
├── backups/                 # Host output directory for logical dumps and reports
├── secrets/                 # Docker secret material (e.g., postgres_superuser_password)
├── README.md                # This guide
├── THIRD_PARTY_LICENSES.md  # Upstream license attributions for vendored tooling
└── AGENTS.md                # Contributor quick-reference & runbooks

If you override the named volumes with host bind mounts, keep those directories out of version control—they contain live cluster state and pgBackRest archives.

CI Workflow

Use the dedicated CI helpers when you need to spin up the published stack inside automation:

  1. ./scripts/manage.sh ci-verify --env-file ci.env.example --require-attestation confirms Docker is available, checks disk/port prerequisites, and verifies that every referenced image carries a recent attestation.
  2. ./scripts/manage.sh ci-up --env-file ci.env.example --output ./backups/ci-output.json bootstraps secrets/network allow lists, pulls/warms the prebuilt images, and emits a JSON manifest with connection details for downstream jobs.
  3. ./scripts/manage.sh ci-down --volumes --prune-data --prune-secrets stops containers and removes ephemeral state when the pipeline finishes.

ci.env.example shows the minimal environment required for this workflow, and docs/examples/ci-workflow.md includes a ready-to-copy GitHub Actions job that wires everything together.
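
Wired into GitHub Actions, the three steps look roughly like this. This is a sketch of the job shape only; docs/examples/ci-workflow.md has the ready-to-copy version:

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # 1. Preflight: Docker availability, disk/ports, image attestations
      - run: ./scripts/manage.sh ci-verify --env-file ci.env.example --require-attestation
      # 2. Bring the published stack up and record connection details
      - run: ./scripts/manage.sh ci-up --env-file ci.env.example --output ./backups/ci-output.json
      # ... run your integration tests against the emitted manifest here ...
      # 3. Always tear down, even when tests fail
      - if: always()
        run: ./scripts/manage.sh ci-down --volumes --prune-data --prune-secrets
```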

Management CLI

./scripts/manage.sh is the operator entry point. Frequently used commands:

| Command | Description |
| --- | --- |
| build-image | Build the custom PostgreSQL image defined in postgres/Dockerfile. |
| create-env | Interactive wizard that copies .env.example, sizes resources, seeds secrets, and writes .env. |
| up / down | Start or stop the Compose stack (volumes preserved). |
| psql | Open psql inside the container (respects PGHOST, PGUSER, etc.). |
| dump / dump-sql | Produce logical backups (custom or plain format) under /backups. |
| restore-dump | Drop and recreate a database before restoring a .dump.gz. |
| backup [--verify] / stanza-create / restore-snapshot | Manage pgBackRest backups; optionally restore the latest backup into a throwaway data dir for checksum verification. |
| daily-maintenance | Run dumps, log capture, pgBadger analysis, and retention pruning. |
| provision-qa | Differential backup + targeted restore for QA databases. |
| config-render | Re-render postgresql.conf / pg_hba.conf from the templates and restart PostgreSQL (terminates active connections; required for some settings like shared_buffers and max_connections). |
| config-check | Compare live postgresql.conf / pg_hba.conf against rendered templates to catch drift. |
| audit-roles / audit-security | Generate CSV/text reports covering role hygiene, passwords, and HBA/RLS posture. |
| audit-extensions | Confirm bundled extensions are present and on expected versions. |
| audit-autovacuum | Flag tables with high dead tuple counts or ratios. |
| audit-replication | Summarise follower lag and sync state. |
| audit-cron / audit-squeeze | Inspect pg_cron schedules and pg_squeeze activity tables. |
| audit-index-bloat | Estimate index bloat using pgstattuple (supports --min-size-mb). |
| audit-buffercache | Snapshot shared buffer usage per relation (supports --limit). |
| audit-schema | Snapshot schema metadata for drift detection. |
| snapshot-pgstat | Capture a pg_stat_statements baseline (CSV output) for performance trending. |
| diff-pgstat --base --compare | Diff two snapshots (CSV in/CSV out) to highlight hot queries. |
| compact --level N | Layered bloat management: 1=autovacuum audit, 2=pg_squeeze refresh, 3=pg_repack (needs --tables), 4=VACUUM FULL (needs --yes). |
| exercise-extensions | Smoke-test the core extension bundle (vector, PostGIS, AGE, citext, hstore, pgcrypto, hypopg, pg_partman, etc.). |
| pgtap-smoke | Run a micro pgTap plan to confirm key extensions (including hypopg/pg_partman) are registered. |
| test-dataset bootstrap | Stand up a synthetic schema with spatial, vector, routing, and graph fixtures to exercise PgBouncer/GraphQL clients. |
| async-queue bootstrap | Install the lightweight async queue schema (asyncq) with enqueue/dequeue helpers. |
| partman-maintenance | Invoke run_maintenance_proc() for the selected database (defaults to POSTGRES_DB). |
| partman-show-config | Print rows from part_config (optionally filter by --parent schema.table). |
| partman-create-parent | Wrap create_parent to bootstrap managed partitions without manual SQL. |
| valkey-cli | Run valkey-cli inside the ValKey container with secrets wired in. |
| valkey-bgsave | Trigger BGSAVE so the ValKey RDB is flushed to the valkey_data volume. |
| rabbitmq-ctl | Execute rabbitmqctl inside the RabbitMQ container (requires the rabbitmq profile). |
| rabbitmq-diagnostics | Run rabbitmq-diagnostics commands such as status or check_running. |
| rabbitmq-export | Export broker definitions to JSON (defaults to ${BACKUPS_HOST_PATH}/rabbitmq-definitions.json). |
| rabbitmq-overview | Print the summary from rabbitmq-diagnostics status. |
| pgbouncer-stats / pgbouncer-pools | Emit PgBouncer SHOW STATS / SHOW POOLS via the admin console. |
| memcached-stats | Fetch stats output from the Memcached service. |
| version-status | Compare installed Postgres/extension versions with upstream releases (CSV via --output). |
| upgrade --new-version | Orchestrate pgautoupgrade (takes backups, validates base image, restarts). |

The CLI sources modular helpers from scripts/lib/ so each function can be imported by tests or future automation.

daily-maintenance now emits a richer bundle under backups/daily/<YYYYMMDD>/, including pg_stat_statements snapshots, pg_buffercache heatmaps, role/extension/autovacuum/replication CSVs, pg_cron schedules, pg_squeeze activity, and a security checklist alongside logs, dumps, pgBadger HTML, and pgaudit summaries. The workflow also records the most recent sidecar dump run in logical_backup_status.txt, runs partman.run_maintenance_proc() across each database so freshly created partitions land even if the background worker interval has not elapsed, and captures version drift in version_status.csv (focusing on out-of-date components).

Pair those reports with config-check to keep the rendered configs aligned with the templates. Tune the thresholds via DAILY_PG_STAT_LIMIT, DAILY_BUFFERCACHE_LIMIT, DAILY_DEAD_TUPLE_THRESHOLD, DAILY_DEAD_TUPLE_RATIO, and DAILY_REPLICATION_LAG_THRESHOLD as needed.

Nightly cron jobs also refresh pg_squeeze targets, reset pg_stat_statements, and run a safe VACUUM (ANALYZE, SKIP_LOCKED, PARALLEL 4) so statistics stay current without blocking hot tables.

Set DAILY_EMAIL_REPORT=true and [email protected] in .env to have the maintenance job email the HTML summary via sendmail (if available inside the container).

To compare performance snapshots between runs, capture CSVs with snapshot-pgstat --output /backups/pg_stat_before.csv and snapshot-pgstat --output /backups/pg_stat_after.csv, then run ./scripts/manage.sh diff-pgstat --base /backups/pg_stat_before.csv --compare /backups/pg_stat_after.csv --limit 25 for a ranked delta report.

Compacting storage

./scripts/manage.sh compact provides escalating space-recovery options:

  • Level 1 — run the autovacuum audit (no changes, just reporting).
  • Level 2 — rerun core_data_admin.refresh_pg_squeeze_targets() and emit updated pg_squeeze findings.
  • Level 3 — execute pg_repack for specific tables (--tables schema.table[,schema.table...]) without heavy locks.
  • Level 4 — run VACUUM (FULL, ANALYZE, VERBOSE) across all tables or a comma-delimited --scope (requires --yes). Expect exclusive locks; schedule during maintenance windows.

All runs write logs under backups/ for auditing (pg_repack-*.log, vacuum-full-*.log).

Automation & Testing

  • CI Workflow: .github/workflows/ci.yml builds the image, runs python -m pytest -k full_workflow, and uploads generated backups for inspection.
  • Smoke Test: tests/test_manage.py spins up a disposable environment, exercises key CLI commands (including daily-maintenance, pgBackRest, and upgrade), and tears everything down. Run locally with python -m pytest -k full_workflow (Docker required).
  • Fast Tests: tests/test_lightweight.py validates offline flows like help output and the vendored tooling without needing Docker.
  • Extension Smoke: ./scripts/manage.sh exercise-extensions --db <name> plus pgtap-smoke provide quick feedback that the entire core extension bundle (vector/PostGIS/AGE/hstore/citext/pgcrypto/pg_partman/etc.) is ready for use.
  • Synthetic Fixture: ./scripts/manage.sh test-dataset bootstrap --db core_data_testkit --owner demo_user --print-password --force creates repeatable spatial/vector/graph/routing sample data so PgBouncer, REST, and GraphQL clients can run end-to-end queries without touching production schemas.
  • Documentation: AGENTS.md offers contributor runbooks and on-call notes.

Credits

Thank you to the maintainers and communities behind the components that make core_data possible:

  • PostgreSQL – the database at the heart of the platform.
  • Docker & Docker Compose – containerization and orchestration.
  • PgBackRest – resilient backup and restore tooling.
  • PgHero – query insights and monitoring.
  • pgBadger – PostgreSQL log analytics.
  • pg_cron – database-native scheduling.
  • pg_squeeze – automatic bloat mitigation.
  • pgvector – high-dimensional vector search.
  • PostGIS – spatial superpowers for PostgreSQL (including Tiger Geocoder & Address Standardizer).
  • pgRouting – network routing & graph analysis atop PostGIS.
  • Apache AGE – graph database extension.
  • pgaudit – enhanced auditing.
  • pg_partman – automated time/ID partition management.
  • HypoPG – hypothetical index exploration for query tuning.
  • pg_repack & pgtap – maintenance & testing extensions.

Their work powers the database-as-code experience delivered by core_data.
