A reproducible PostgreSQL 17 platform delivered as code. core_data builds a hardened database image with spatial, vector, and graph extensions, provisions PgHero for observability, and ships a management CLI that automates backups, restores, QA cloning, and upgrades. Everything lives in version control so environments can be rebuilt consistently across laptops, CI, and production.
- Run the same Postgres 17 stack everywhere: laptop, CI runner, or production.
- Ship with the heavy hitters pre-installed—PostGIS, pgvector, AGE, pg_cron, pgBackRest—without custom build scripts.
- Automate the boring-but-critical tasks: backups, restores, QA clones, log analytics, and even major version upgrades via pgautoupgrade.
- Treat your database like code with reproducible `.env` configs, templated init scripts, and a pytest smoke test that catches regressions early.
Pre-built images are available from GitHub Container Registry with full SLSA attestations and SBOM for supply chain security.
```bash
# Pull the latest stable release
docker pull ghcr.io/paudley/core-data-postgres:latest

# Or specify a version
docker pull ghcr.io/paudley/core-data-postgres:17.2-v1.0.0
```

Images follow a hybrid versioning strategy combining the PostgreSQL version with semantic versioning:
| Tag Pattern | Example | Description |
|---|---|---|
| `latest` | `latest` | Latest stable release |
| `{PG_VERSION}-v{MAJOR}.{MINOR}.{PATCH}` | `17.2-v1.0.0` | Exact version (recommended for production) |
| `{PG_VERSION}-v{MAJOR}.{MINOR}` | `17.2-v1.0` | Latest patch for minor version |
| `{PG_VERSION}-v{MAJOR}` | `17.2-v1` | Latest minor for major version |
| `{PG_VERSION}` | `17.2` | Latest semantic version for PostgreSQL version |
| `{PG_MAJOR}` | `17` | Latest for PostgreSQL major version |
Version Format: `{PostgreSQL_Version}-v{Semantic_Version}`
- Example: `17.2-v1.0.0` means PostgreSQL 17.2 with semantic version 1.0.0
- See docs/RELEASING.md for complete versioning details
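Because the tag format is purely positional, plain shell parameter expansion is enough to split a tag when automation needs to pin or compare versions. A minimal sketch (the variable names are illustrative, not part of the tooling):

```shell
#!/bin/sh
# Split a core_data image tag into its PostgreSQL and semantic-version parts.
tag="17.2-v1.0.0"

pg_version="${tag%%-v*}"      # drop the "-v..." suffix  -> 17.2
semver="${tag##*-v}"          # drop the "...-v" prefix  -> 1.0.0
pg_major="${pg_version%%.*}"  # first dotted component   -> 17

echo "postgres=$pg_version semver=$semver major=$pg_major"
# -> postgres=17.2 semver=1.0.0 major=17
```

This avoids a dependency on `sed`/`awk` in CI scripts that only need to branch on the PostgreSQL major version.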
All published images include cryptographic attestations that prove build provenance and supply chain integrity:
```bash
# Verify image attestation (requires GitHub CLI)
gh attestation verify oci://ghcr.io/paudley/core-data-postgres:17.2-v1.0.0 \
  --owner paudley

# Pull and verify in one step
docker pull ghcr.io/paudley/core-data-postgres:17.2-v1.0.0
gh attestation verify oci://ghcr.io/paudley/core-data-postgres:17.2-v1.0.0 \
  --owner paudley

# Verify every referenced image from your .env (pretty console output)
./scripts/manage.sh attestation-verify --env-file ci.env.example
```

GitHub Actions automatically injects a repository-scoped GITHUB_TOKEN when workflows run. If that token cannot read paudley/core_data, the helper retries the same verification anonymously (clearing GH_TOKEN/GITHUB_TOKEN) so public attestations still succeed without extra flags.
Every service in the stack is published with the same attested tag:
| Service | Image reference |
|---|---|
| PostgreSQL | ghcr.io/paudley/core_data/postgres:<stack-tag> |
| ValKey | ghcr.io/paudley/core_data/valkey:<stack-tag> |
| PgBouncer | ghcr.io/paudley/core_data/pgbouncer:<stack-tag> |
| RabbitMQ | ghcr.io/paudley/core_data/rabbitmq:<stack-tag> |
| Memcached | ghcr.io/paudley/core_data/memcached:<stack-tag> |
| Network Probe | ghcr.io/paudley/core_data/network-probe:<stack-tag> |
| Network Guard | ghcr.io/paudley/core_data/network-guard:<stack-tag> |
Use the optional CORE_DATA_STACK_TAG / CORE_DATA_STACK_REGISTRY environment variables (or .env entries) to pin a specific release across every helper; otherwise the defaults use the latest tag from ghcr.io/paudley/core_data.
Every container publishes /README.md and /SECURITY.md (copies of this repository's documentation) so you can inspect provenance and policy from inside the image with a simple ls.
Expected verification output:

```text
✓ Verification succeeded!

sha256:abc123... was attested by:
REPO               PREDICATE_TYPE                  WORKFLOW
paudley/core_data  https://slsa.dev/provenance/v1  .github/workflows/publish-docker.yml@refs/tags/17.2-v1.0.0
```
What's Verified:
- SLSA Provenance: Confirms the image was built by GitHub Actions from this repository
- SBOM: Software Bill of Materials listing all components and dependencies
- Build Reproducibility: Links image digest to specific source code commit
Update your docker-compose.yml to use the published image instead of building locally:

```yaml
services:
  postgres:
    image: ghcr.io/paudley/core-data-postgres:17.2-v1.0.0  # Use published image
    # Remove or comment out the 'build:' section
    # build:
    #   context: .
    #   dockerfile: postgres/Dockerfile
```

Or override via `.env`:

```bash
POSTGRES_IMAGE_NAME=ghcr.io/paudley/core-data-postgres
POSTGRES_IMAGE_TAG=17.2-v1.0.0
```

Pre-built images from ghcr.io have the postgres user baked in with UID/GID 999. Your `.env` must match:
```bash
POSTGRES_UID=999
POSTGRES_GID=999
```

Why this matters: The volume_prep service uses these values to chown data directories. If they don't match the image's baked-in UID, you'll see permission errors like:

```text
chmod: changing permissions of '/var/lib/postgresql/data': Operation not permitted
```
Building locally? If you set CORE_DATA_BUILD_IMAGE=1, you can use your host user's UID/GID instead—the image build process will create the postgres user with your specified IDs.
- Custom Docker image with PostGIS, pgvector, Apache AGE, pg_cron, pg_squeeze, pgAudit, pgBadger, pgBackRest, and pgtune baked in.
- Init scripts render configuration from templates, create application databases, and enable extensions automatically.
- `./scripts/manage.sh` wraps lifecycle tasks: image builds, `psql`, logical dumps, pgBackRest backups/restores, QA cloning, log analysis, daily maintenance, and major upgrades via pgautoupgrade.
- PGDATA, WAL, and pgBackRest now live on dedicated Docker named volumes for near-native Linux I/O, while `BACKUPS_HOST_PATH` (defaults to `./backups`) remains a bind mount for easy artifact exports.
- Secrets stay in Docker secrets (`POSTGRES_PASSWORD_FILE`) and the container runs as the non-root `postgres` UID/GID at all times, keeping the least-privilege posture consistent across init and steady state.
- TLS is enforced by default with auto-generated self-signed certificates (override with your own CA material), and a multi-stage health probe (`scripts/healthcheck.sh`) guards dependent services before they start.
- Logging uses Docker's `local` driver with rotation and non-blocking delivery, preventing runaway JSON logs from filling the host while preserving enough history for incident response.
- Optional profiles bundle ValKey, PgBouncer, and Memcached with sensible defaults, secrets, and CLI helpers so you can layer caches and pooling alongside PostgreSQL in one step.
- CI smoke test (`python -m pytest -k full_workflow`) provisions a stack, exercises critical commands, and verifies upgrade safety.
core_data provisions a batteries-included extension stack in every non-template database at init time:
- Performance & Observability — `pg_stat_statements`, `auto_explain`, `pg_buffercache`, `pg_prewarm`, `bloom`.
- Security & Compliance — `pgaudit`, `pgcrypto`, `"uuid-ossp"`.
- Developer Ergonomics — `hstore`, `citext`, `pg_trgm`, `btree_gin`, `btree_gist`, `hypopg`, `intarray`, `ltree`, `tablefunc`, `unaccent`.
- Connectivity — `postgres_fdw`, `dblink`.
- Spatial, Vector, Graph — `postgis`, `postgis_raster`, `postgis_topology`, `vector`, `age`, `cube`, `earthdistance`.
- Maintenance & Testing — `pg_cron` (kept in the `postgres` database), `pg_partman`, `pg_repack`, `pg_squeeze`, `pgstattuple`, `pgtap`.
- Geospatial Extras — `postgis_tiger_geocoder`, `address_standardizer`, `address_standardizer_data_us`, `pgrouting`, `fuzzystrmatch`.
The same bundle is installed into template1 so freshly created databases inherit the tooling automatically.
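One quick way to confirm the template1 inheritance is to create a throwaway database and list its extensions (a psql session sketch; `inherit_check` is an arbitrary name):

```sql
-- A database created after init should already carry the bundle, with no CREATE EXTENSION calls.
CREATE DATABASE inherit_check;
\c inherit_check
SELECT extname, extversion FROM pg_extension ORDER BY extname;
-- Expect rows for age, pgaudit, postgis, vector, and the rest of the bundle.
```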
pg_partman_bgw is preloaded with a one-hour interval targeting the postgres database under the postgres superuser. Adjust pg_partman_bgw.dbname/role in postgresql.conf.tpl (or via postgresql.pgtune.conf) if you manage partitions from a different control schema.
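For example, retargeting the background worker at a different control database could look like the fragment below; the GUC names are pg_partman's documented background-worker settings, while the `app_db`/`partman_admin` values are placeholders:

```ini
# Illustrative pg_partman background-worker overrides (postgresql.pgtune.conf)
pg_partman_bgw.interval = 3600          # seconds between maintenance runs
pg_partman_bgw.dbname = 'app_db'        # comma-separated databases to maintain
pg_partman_bgw.role = 'partman_admin'   # role the worker connects as
```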
Use ./scripts/manage.sh partman-show-config to inspect tracked parents, partman-maintenance to run run_maintenance_proc() on demand, and partman-create-parent schema.table control_column '1 day' to bootstrap new partition sets without hand-writing SQL.
Run ./scripts/manage.sh async-queue bootstrap when you want a lightweight background-job queue. It provisions an asyncq.jobs table plus helpers (enqueue, dequeue, complete, fail, extend_lease) that rely on FOR UPDATE SKIP LOCKED, pg_notify, and UUID leasing. Point a worker at the queue with SELECT * FROM asyncq.dequeue('default'); in a loop and call asyncq.complete(...) or asyncq.fail(...) as you process jobs.
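A producer/worker exchange against that schema might look like the following sketch. Only `asyncq.dequeue('default')` is documented above; the `enqueue`/`complete`/`fail` argument shapes shown here are assumptions — inspect the installed functions (e.g. `\df asyncq.*`) for the real signatures:

```sql
-- Producer: enqueue a job on the 'default' queue (payload shape is up to you).
SELECT asyncq.enqueue('default', '{"task": "send_email"}'::jsonb);

-- Worker loop: lease the next job; FOR UPDATE SKIP LOCKED keeps workers from colliding.
SELECT * FROM asyncq.dequeue('default');

-- Acknowledge success, or record a failure so the job can be retried.
-- '<job-uuid>' stands in for the UUID lease returned by dequeue.
SELECT asyncq.complete('<job-uuid>');
SELECT asyncq.fail('<job-uuid>', 'smtp timeout');
```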
- Bootstrap environment config: `./scripts/manage.sh create-env` to walk through password creation, host UID/GID selection, and resource sizing (writes `.env` + secrets).
- Build and start the stack:
  ```bash
  ./scripts/manage.sh build-image
  ./scripts/manage.sh up
  ```
- Verify health (multi-stage probe):
  ```bash
  docker compose exec postgres /opt/core_data/scripts/healthcheck.sh
  ./scripts/manage.sh psql -c 'SELECT 1;'
  ```
- Explore the CLI: `./scripts/manage.sh help` for grouped commands (lifecycle, CI, backups, audits, extensions, cache/messaging, security).
- CI runners: see `CI_USAGE.md` for the pipelines-focused flow (attestations, `ci-up`/`ci-down`, pytest markers).
We optimize for data infrastructure as code ➜ automated admin ➜ human-friendly ➜ best-practices by default (including security and performance). In practice that means:
- Data infrastructure as code. Declarative, reproducible Postgres once, versioned forever.
- Automated admin. Every recurring task should be scriptable and CI-friendly before we worry about shell ergonomics.
- Human-friendly. CLI helpers, sensible prompts, and clear docs matter—but only after the first two goals are met.
- Best-practices by default. Security and performance guardrails (cap drops, TLS, tuned memory, backups) are enabled out of the box, with deliberate escape hatches when absolutely required.
See docs/security_philosophy.md for how capability hardening and related controls fit into that priority order.
- Resource guardrails. Container memory, CPU, and shared memory limits come from `.env`, keeping pgtune advice and runtime constraints aligned. Adjust `POSTGRES_MEMORY_LIMIT`, `POSTGRES_CPU_LIMIT`, and `POSTGRES_SHM_SIZE` to match the host.
- TLS everywhere. PostgreSQL refuses non-SSL connections from the bridge network. Provide your own certificate/key via Docker secrets or rely on the init hook to mint a self-signed pair under `${PGDATA}/tls`.
- Named volumes for PGDATA/WAL. `pgdata`, `pgwal`, and `pgbackrest` volumes provide near-native I/O on Linux. Override the volume definitions if you pin WAL/data to specific devices.
- Non-root from the start. A one-shot `volume_prep` helper chowns the volumes before Postgres launches so the main service and sidecars run as your host user by default (UID/GID `${POSTGRES_UID}`), keeping file ownership consistent across deployments. Supply alternative IDs only when required.
- Automated logical backups. The `logical_backup` sidecar runs `pg_dump`/`pg_dumpall` on the cadence defined by `LOGICAL_BACKUP_INTERVAL_SECONDS`, writes into `${BACKUPS_HOST_PATH}/logical`, prunes according to `LOGICAL_BACKUP_RETENTION_DAYS`, and skips any databases listed in `LOGICAL_BACKUP_EXCLUDE` (defaults to `postgres`). `daily-maintenance` captures the latest run in `logical_backup_status.txt` for auditing.
- Composable health check. `scripts/healthcheck.sh` verifies readiness, executes `SELECT 1`, and optionally enforces replication lag ceilings before dependents start.
- Rotated container logs. Docker's `local` driver with non-blocking delivery prevents runaway JSON files while retaining compressed history for incident response.
- Optional service profiles. `COMPOSE_PROFILES=valkey,pgbouncer,memcached,rabbitmq` brings the cache/pooling stack online; drop profiles from the list to opt out without editing `docker-compose.yml`.
- Seccomp baseline. Shipping profiles in `seccomp/` cover each service (`postgres.json`, `logical_backup.json`, `pgbouncer.json`, `valkey.json`, `memcached.json`, `pghero.json`, plus `docker-default.json` reused for RabbitMQ). `./scripts/manage.sh seccomp-status` reports the active spec, `seccomp-verify` gates compose configs, and docs/security_philosophy.md outlines how to regenerate traces when you need to tighten them further.
- AppArmor (opt-in). Minimal profiles live in `apparmor/core_data_minimal.profile`. Load them with `./scripts/manage.sh apparmor-load` (sudo), then set `CORE_DATA_APPARMOR_<SERVICE>=apparmor:core_data_minimal` in `.env` for each service you want to confine. The profile denies access to high-value host paths (`/root`, `/etc/shadow`, Docker socket) while leaving normal container paths alone.
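Confining two services would then add one line each to `.env` — assuming the `<SERVICE>` placeholder expands to the upper-cased compose service name:

```ini
# Opt the postgres and pgbouncer services into the minimal AppArmor profile
CORE_DATA_APPARMOR_POSTGRES=apparmor:core_data_minimal
CORE_DATA_APPARMOR_PGBOUNCER=apparmor:core_data_minimal
```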
- ValKey — Requires authentication by default (`valkey_password` secret), persists to the `valkey_data` volume (`appendonly yes`), exposes `valkey-cli`/`valkey-bgsave`, and is tuned via `.env` knobs such as `VALKEY_MAXMEMORY` and `VALKEY_MAXMEMORY_POLICY`.
- PgBouncer — Uses SCRAM auth backed by a dedicated superuser, renders config/userlist from templates, and offers `pgbouncer-stats`/`pgbouncer-pools` helpers. Pool sizing and admin/stat users are driven by the `PGBOUNCER_*` variables.
- Memcached — Lightweight hot cache with configurable memory, connection, and thread limits (`MEMCACHED_*`). The `memcached-stats` helper pipes `stats` output for quick validation.
- RabbitMQ — Delivers AMQP 0.9.1 alongside the management API. Credentials and cluster identity come from `secrets/rabbitmq_default_pass` and `secrets/rabbitmq_erlang_cookie`; daily maintenance captures definitions/status snapshots, and on-demand helpers (`rabbitmq-ctl`, `rabbitmq-diagnostics`, `rabbitmq-export`, `rabbitmq-overview`) streamline broker admin.

All services stay on the internal bridge network by default.
```text
core_data/
├── .env.example              # Template for environment-specific settings (never commit real secrets)
├── docker-compose.yml        # Orchestrates PostgreSQL and PgHero services
├── scripts/                  # Operator tooling (manage.sh + lib modules + maintenance workflow)
├── postgres/                 # Custom image build assets, configs, and initdb scripts
├── backups/                  # Host output directory for logical dumps and reports
├── secrets/                  # Docker secret material (e.g., postgres_superuser_password)
├── README.md                 # This guide
├── THIRD_PARTY_LICENSES.md   # Upstream license attributions for vendored tooling
└── AGENTS.md                 # Contributor quick-reference & runbooks
```
If you override the named volumes with host bind mounts, keep those directories out of version control—they contain live cluster state and pgBackRest archives.
Use the dedicated CI helpers when you need to spin up the published stack inside automation:
- `./scripts/manage.sh ci-verify --env-file ci.env.example --require-attestation` confirms Docker is available, checks disk/port prerequisites, and verifies that every referenced image carries a recent attestation.
- `./scripts/manage.sh ci-up --env-file ci.env.example --output ./backups/ci-output.json` bootstraps secrets/network allow lists, pulls/warms the prebuilt images, and emits a JSON manifest with connection details for downstream jobs.
- `./scripts/manage.sh ci-down --volumes --prune-data --prune-secrets` stops containers and removes ephemeral state when the pipeline finishes.
ci.env.example shows the minimal environment required for this workflow, and docs/examples/ci-workflow.md includes a ready-to-copy GitHub Actions job that wires everything together.
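A trimmed GitHub Actions job wiring those three helpers together might look like this sketch (step names and the checkout action pin are illustrative; the maintained version lives in docs/examples/ci-workflow.md):

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Preflight checks
        run: ./scripts/manage.sh ci-verify --env-file ci.env.example --require-attestation
      - name: Start the stack
        run: ./scripts/manage.sh ci-up --env-file ci.env.example --output ./backups/ci-output.json
      - name: Run the smoke test
        run: python -m pytest -k full_workflow
      - name: Tear down
        if: always()
        run: ./scripts/manage.sh ci-down --volumes --prune-data --prune-secrets
```

Running `ci-down` under `if: always()` keeps the runner clean even when an earlier step fails.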
./scripts/manage.sh is the operator entry point. Frequently used commands:
| Command | Description |
|---|---|
| `build-image` | Build the custom PostgreSQL image defined in `postgres/Dockerfile`. |
| `create-env` | Interactive wizard that copies `.env.example`, sizes resources, seeds secrets, and writes `.env`. |
| `up` / `down` | Start or stop the Compose stack (volumes preserved). |
| `psql` | Open `psql` inside the container (respects `PGHOST`, `PGUSER`, etc.). |
| `dump` / `dump-sql` | Produce logical backups (custom or plain format) under `/backups`. |
| `restore-dump` | Drop and recreate a database before restoring a `.dump.gz`. |
| `backup [--verify]` / `stanza-create` / `restore-snapshot` | Manage pgBackRest backups & optionally restore the latest backup into a throwaway data dir for checksum verification. |
| `daily-maintenance` | Run dumps, log capture, pgBadger analysis, and retention pruning. |
| `provision-qa` | Differential backup + targeted restore for QA databases. |
| `config-render` | Re-render `postgresql.conf` / `pg_hba.conf` from the templates and restart PostgreSQL (terminates active connections; required for some settings like `shared_buffers` and `max_connections`). |
| `config-check` | Compare live `postgresql.conf` / `pg_hba.conf` against rendered templates to catch drift. |
| `audit-roles` / `audit-security` | Generate CSV/text reports covering role hygiene, passwords, and HBA/RLS posture. |
| `audit-extensions` | Confirm bundled extensions are present and on expected versions. |
| `audit-autovacuum` | Flag tables with high dead tuple counts or ratios. |
| `audit-replication` | Summarise follower lag and sync state. |
| `audit-cron` / `audit-squeeze` | Inspect `pg_cron` schedules and `pg_squeeze` activity tables. |
| `audit-index-bloat` | Estimate index bloat using `pgstattuple` (supports `--min-size-mb`). |
| `audit-buffercache` | Snapshot shared buffer usage per relation (supports `--limit`). |
| `audit-schema` | Snapshot schema metadata for drift detection. |
| `snapshot-pgstat` | Capture a `pg_stat_statements` baseline (CSV output) for performance trending. |
| `diff-pgstat --base --compare` | Diff two snapshots (CSV-in/CSV-out) to highlight hot queries. |
| `compact --level N` | Layered bloat management: 1=autovacuum audit, 2=`pg_squeeze` refresh, 3=`pg_repack` (needs `--tables`), 4=`VACUUM FULL` (needs `--yes`). |
| `exercise-extensions` | Smoke-test the core extension bundle (vector, PostGIS, AGE, citext, hstore, pgcrypto, hypopg, pg_partman, etc.). |
| `pgtap-smoke` | Run a micro pgTap plan to confirm key extensions (including hypopg/pg_partman) are registered. |
| `test-dataset bootstrap` | Stand up a synthetic schema with spatial, vector, routing, and graph fixtures to exercise PgBouncer/GraphQL clients. |
| `async-queue bootstrap` | Install the lightweight async queue schema (`asyncq`) with enqueue/dequeue helpers. |
| `partman-maintenance` | Invoke `run_maintenance_proc()` for the selected database (defaults to `POSTGRES_DB`). |
| `partman-show-config` | Print rows from `part_config` (optionally filter by `--parent schema.table`). |
| `partman-create-parent` | Wrap `create_parent` to bootstrap managed partitions without manual SQL. |
| `valkey-cli` | Run `valkey-cli` inside the ValKey container with secrets wired in. |
| `valkey-bgsave` | Trigger `BGSAVE` so the ValKey RDB is flushed to the `valkey_data` volume. |
| `rabbitmq-ctl` | Execute `rabbitmqctl` inside the RabbitMQ container (requires the `rabbitmq` profile). |
| `rabbitmq-diagnostics` | Run `rabbitmq-diagnostics` commands such as `status` or `check_running`. |
| `rabbitmq-export` | Export broker definitions to JSON (defaults to `${BACKUPS_HOST_PATH}/rabbitmq-definitions.json`). |
| `rabbitmq-overview` | Print the summary from `rabbitmq-diagnostics status`. |
| `pgbouncer-stats` / `pgbouncer-pools` | Emit PgBouncer `SHOW STATS` / `SHOW POOLS` via the admin console. |
| `memcached-stats` | Fetch `stats` output from the Memcached service. |
| `version-status` | Compare installed Postgres/extension versions with upstream releases (CSV via `--output`). |
| `upgrade --new-version` | Orchestrate pgautoupgrade (takes backups, validates base image, restarts). |
The CLI sources modular helpers from scripts/lib/ so each function can be imported by tests or future automation.
daily-maintenance now emits a richer bundle under backups/daily/<YYYYMMDD>/, including pg_stat_statements snapshots, pg_buffercache heatmaps, role/extension/autovacuum/replication CSVs, pg_cron schedules, pg_squeeze activity, and a security checklist alongside logs, dumps, pgBadger HTML, and pgaudit summaries. The workflow also records the most recent sidecar dump run in logical_backup_status.txt, runs partman.run_maintenance_proc() across each database so freshly created partitions land even if the background worker interval has not elapsed, and captures version drift in version_status.csv (focusing on out-of-date components). Pair those reports with config-check to keep the rendered configs aligned with the templates. Tune the thresholds via DAILY_PG_STAT_LIMIT, DAILY_BUFFERCACHE_LIMIT, DAILY_DEAD_TUPLE_THRESHOLD, DAILY_DEAD_TUPLE_RATIO, and DAILY_REPLICATION_LAG_THRESHOLD as needed.
Nightly cron jobs also refresh pg_squeeze targets, reset pg_stat_statements, and run a safe VACUUM (ANALYZE, SKIP_LOCKED, PARALLEL 4) so statistics stay current without blocking hot tables.
Set DAILY_EMAIL_REPORT=true and [email protected] in .env to have the maintenance job email the HTML summary via sendmail (if available inside the container).
To compare performance snapshots between runs, capture CSVs with snapshot-pgstat --output /backups/pg_stat_before.csv and snapshot-pgstat --output /backups/pg_stat_after.csv, then run ./scripts/manage.sh diff-pgstat --base /backups/pg_stat_before.csv --compare /backups/pg_stat_after.csv --limit 25 for a ranked delta report.
./scripts/manage.sh compact provides escalating space-recovery options:
- Level 1 — run the autovacuum audit (no changes, just reporting).
- Level 2 — rerun `core_data_admin.refresh_pg_squeeze_targets()` and emit updated `pg_squeeze` findings.
- Level 3 — execute `pg_repack` for specific tables (`--tables schema.table[,schema.table...]`) without heavy locks.
- Level 4 — run `VACUUM (FULL, ANALYZE, VERBOSE)` across all tables or a comma-delimited `--scope` (requires `--yes`). Expect exclusive locks; schedule during maintenance windows.
All runs write logs under backups/ for auditing (pg_repack-*.log, vacuum-full-*.log).
- CI Workflow: `.github/workflows/ci.yml` builds the image, runs `python -m pytest -k full_workflow`, and uploads generated backups for inspection.
- Smoke Test: `tests/test_manage.py` spins up a disposable environment, exercises key CLI commands (including `daily-maintenance`, pgBackRest, and `upgrade`), and tears everything down. Run locally with `python -m pytest -k full_workflow` (Docker required).
- Fast Tests: `tests/test_lightweight.py` validates offline flows like help output and the vendored tooling without needing Docker.
- Extension Smoke: `./scripts/manage.sh exercise-extensions --db <name>` plus `pgtap-smoke` provide quick feedback that the entire core extension bundle (vector/PostGIS/AGE/hstore/citext/pgcrypto/pg_partman/etc.) is ready for use.
- Synthetic Fixture: `./scripts/manage.sh test-dataset bootstrap --db core_data_testkit --owner demo_user --print-password --force` creates repeatable spatial/vector/graph/routing sample data so PgBouncer, REST, and GraphQL clients can run end-to-end queries without touching production schemas.
- Documentation: `AGENTS.md` offers contributor runbooks and on-call notes.
Thank you to the maintainers and communities behind the components that make core_data possible:
- PostgreSQL – the database at the heart of the platform.
- Docker & Docker Compose – containerization and orchestration.
- PgBackRest – resilient backup and restore tooling.
- PgHero – query insights and monitoring.
- pgBadger – PostgreSQL log analytics.
- pg_cron – database-native scheduling.
- pg_squeeze – automatic bloat mitigation.
- pgvector – high-dimensional vector search.
- PostGIS – spatial superpowers for PostgreSQL (including Tiger Geocoder & Address Standardizer).
- pgRouting – network routing & graph analysis atop PostGIS.
- Apache AGE – graph database extension.
- pgaudit – enhanced auditing.
- pg_partman – automated time/ID partition management.
- HypoPG – hypothetical index exploration for query tuning.
- pg_repack & pgtap – maintenance & testing extensions.
Their work powers the database-as-code experience delivered by core_data.