
⚠️ Disclaimer: This README is Work In Progress. Please verify all commands, variable names, and steps before using it in any environment.

OpenShift Virtualization Async DR with VolSync (Ansible-driven)

Orchestrate asynchronous disaster recovery (DR) for KubeVirt VMs across OpenShift clusters using VolSync. This project ships Ansible roles and playbooks that:

  • install the required operators (VolSync, optional MetalLB),
  • discover VM disks and create ReplicationSource/ReplicationDestination,
  • schedule periodic syncs, pick up new VM disks via re-scans,
  • capture sanitized VM specs on the destination cluster for like‑for‑like restore (CPU, RAM, disks, NICs, MACs), and
  • perform failover by pausing RD and restoring the captured VM spec pointing to replicated PVCs.

Architecture

Goal. Keep one or more namespaces / VMs on a source OpenShift cluster asynchronously replicated to a destination cluster using VolSync. Data flows PVC→PVC over a mover transport (typically restic to an object store, or direct rsync), on a cron schedule defined in the VolSync CRs; the operator launches a mover Job for each iteration.

What the automation does.

  1. Installs VolSync (and optionally MetalLB) via OperatorHub API resources.
  2. Discovers a VM’s data volumes / PVCs on the source cluster and generates a matching ReplicationSource.
  3. Creates a ReplicationDestination on the destination cluster with compatible storageClass/size.
  4. Schedules periodic syncs and optional retention.
  5. Captures/exports a sanitized VM manifest on destination and stores it for DR (same CPU, memory, disks, network interfaces, MACs and NADs when possible).
  6. For failover, pauses destination RD, performs a final sync/promote, rebinds PVCs, and recreates VM from the captured manifest.

Note: VolSync does not live-migrate replicated disks; this is asynchronous, point‑in‑time replication. RPO ≈ your sync schedule; RTO depends on PVC promotion plus VM restore time.
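
For orientation, here is a minimal sketch of the CR pair the playbooks generate for a single VM disk (names, schedule, repository Secret, storage class, and size are illustrative; the real values come from the discovery step and your inventory):

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: web-01-rootdisk               # illustrative: one CR per discovered VM PVC
  namespace: my-workload-ns
spec:
  sourcePVC: web-01-rootdisk
  trigger:
    schedule: "*/15 * * * *"          # replication cadence = your RPO target
  restic:
    repository: web-01-rootdisk-restic-secret   # Secret holding repo URL + credentials
    copyMethod: Snapshot
    retain:
      daily: 7
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: web-01-rootdisk
  namespace: my-workload-ns           # on the destination cluster
spec:
  trigger:
    schedule: "*/15 * * * *"
  restic:
    repository: web-01-rootdisk-restic-secret
    destinationPVC: web-01-rootdisk   # PVC the restored VM will point at
    copyMethod: Snapshot
    capacity: 30Gi                    # must be ≥ the source PVC size
    accessModes: [ReadWriteOnce]
    storageClassName: ocs-storagecluster-ceph-rbd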


Prerequisites

Workstation

  • Ansible Core and Python 3 with Kubernetes client libs:

    dnf install -y ansible-core git python3-pip  # or: apt/yum
    pip3 install kubernetes
  • oc CLI installed and logged into both clusters at least once (to seed kubeconfigs/contexts) or provide paths to kubeconfig files in the inventory.
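
  The Kubernetes modules used by the roles come from the kubernetes.core collection; if you ever need to recreate requirements.yml, a minimal equivalent would be the following (an assumption — check the file shipped in the repo):

  collections:
    - name: kubernetes.core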

Clusters

  • Two OpenShift clusters: source (primary) and destination (DR).
  • Working storage classes on both sides with sufficient capacity.
  • Object storage credentials if using the restic transport (S3/compatible) — recommended for geo DR.
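
If you use the restic transport, VolSync reads the repository location and credentials from a Secret in the workload namespace on both clusters. A sketch (secret name, bucket URL, and key values are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: web-01-rootdisk-restic-secret
  namespace: my-workload-ns
stringData:
  RESTIC_REPOSITORY: s3:https://s3.example.com/dr-bucket/web-01-rootdisk
  RESTIC_PASSWORD: "<restic-repo-password>"
  AWS_ACCESS_KEY_ID: "<access-key>"
  AWS_SECRET_ACCESS_KEY: "<secret-key>"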

Repository layout

.
├── ansible.cfg
├── requirements.yml
├── inventories/
│   └── lab/           # example inventory
├── playbooks/         # task entry points (install, discover, configure, capture, failover, etc.)
└── roles/             # role-ized logic used by the playbooks

Tip: Keep your own inventory (e.g., inventories/prod/) separate from the sample lab one.


Quick start

# 1) Clone
git clone https://github.com/linusali/ocp-virt-async-dr
cd ocp-virt-async-dr

# 2) Prepare Python and Ansible bits (once)
pip3 install kubernetes
ansible-galaxy collection install -r requirements.yml

# 3) Copy the sample inventory and edit
cp -r inventories/lab inventories/my-site
$EDITOR inventories/my-site/group_vars/all.yml   # see sections below
$EDITOR inventories/my-site/hosts.ini            # set contexts/kubeconfigs

# 4) Install operators on both clusters (not tested)
ansible-playbook -i inventories/my-site playbooks/install-operators.yml

# 5) Discover PVCs & configure VolSync for the selected VMs
ansible-playbook -i inventories/my-site playbooks/configure-sync.yml

# 6) Test a planned failover (namespaced)
ansible-playbook -i inventories/my-site playbooks/failover.yml 

All playbooks are idempotent. Re-running configure after editing the inventory will reconcile (create/update) the VolSync CRs.


Inventory and variables

The project expects a local connection (you talk to clusters via the Kubernetes API), so hosts.ini usually just targets localhost.

Minimal inventory

inventories/my-site/hosts.ini

[localhost]
127.0.0.1 ansible_connection=local

inventories/my-site/group_vars/all.yml

# Identify clusters by kubeconfig+context
source:
  kubeconfig: "{{ lookup('env', 'HOME') }}/.kube/source.kubeconfig"  # or leave empty to use default
  context:    "admin/source-cluster"

destination:
  kubeconfig: "{{ lookup('env', 'HOME') }}/.kube/destination.kubeconfig"
  context:    "admin/destination-cluster"

# Default storage classes and PVC sizing behavior at DR
storage:
  default_sc: "ocs-storagecluster-ceph-rbd"  # adjust to your DR class
  expand_to_source_size: true                 # ensure dest ≥ source

# Select which namespaces are in scope (optional, otherwise VM list drives scope)
namespaces: ["my-workload-ns"]

Defining which VMs to replicate

You can choose VMs explicitly or by label selectors (a label‑selector sketch follows the explicit example below). The roles will discover the relevant DataVolumes/PVCs for each VM and configure VolSync CRs accordingly.

Explicit list (recommended for first run):

vms:
  - name: web-01
    namespace: my-workload-ns

  - name: db-01
    namespace: my-workload-ns
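
For the label‑selector alternative, a hypothetical sketch is shown below; the variable name vm_selector and its keys are assumptions here, so check the role defaults for the actual interface:

# Hypothetical: select every VM in the namespace carrying a dr=enabled label
vm_selector:
  namespace: my-workload-ns
  match_labels:
    dr: "enabled"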

Typical workflows

Install operators

Installs/ensures VolSync (and optionally MetalLB) operators exist in both clusters. Assumes OperatorHub installation via Subscription/OperatorGroup resources.

ansible-playbook -i inventories/my-site playbooks/install-operators.yml
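
On OpenShift, the VolSync operator is typically installed with a Subscription like the one below. The channel and catalog names reflect the Red Hat catalog at the time of writing and are assumptions about what the playbook applies; openshift-operators already ships a global OperatorGroup, so none needs to be created there:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: volsync-product
  namespace: openshift-operators
spec:
  channel: stable
  name: volsync-product
  source: redhat-operators
  sourceNamespace: openshift-marketplace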

Discover VM disks and configure replication

Discovers the DataVolumes/PVCs for each selected VM on source, then creates/updates ReplicationSource and ReplicationDestination CRs across clusters with your schedule & transport.

ansible-playbook -i inventories/my-site playbooks/configure-replication.yml
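
Once a sync iteration completes, the ReplicationDestination status is the quickest health signal; with copyMethod: Snapshot, the latest replicated point in time is exposed as latestImage (values below are illustrative):

status:
  lastSyncTime: "2024-05-01T10:15:00Z"
  lastSyncDuration: 2m13s
  latestImage:                        # VolumeSnapshot used when rebinding PVCs at failover
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: volsync-web-01-rootdisk-dst # illustrative name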

Planned failover (promote DR)

For a controlled switchover of a namespace:

ansible-playbook -i inventories/my-site playbooks/failover.yml 
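
During a planned switchover, a final sync is usually forced on the source before the VM is recreated at DR. In VolSync this can be done with a manual trigger on the ReplicationSource (shown here for reference; the playbook may handle it for you):

spec:
  trigger:
    manual: final-sync-1   # VolSync runs one iteration, then records the value in .status.lastManualSync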

Running with AWX/Automation Controller

TODO


Operational tips

  • Start with one namespace, one VM, and verify RPO/RTO.
  • For databases, consider application‑level quiesce hooks before sync windows.
  • Ensure time sync (NTP/Chrony) on nodes; VolSync cron scheduling depends on it.
  • If using restic, test repo credentials and retention windows outside of prod.
  • Keep storage classes compatible (block vs filesystem, access modes, volumeModes).

Troubleshooting

TODO


FAQ

TODO

License

Apache-2.0 (see repository for details).
