Backups
Here, we describe the data and file backups kept to enable recovery of the installation. To provide enough material for a complete recovery, we keep copies of the following:
The Rancher server. A Docker-based service running in a small RKE cluster
The production and development cluster _infrastructure_ databases, used by Keycloak for authentication and (in production) by AWX, the Ansible playbook server
The production stack database
The production stack media files
Rancher Server
This service manages the cluster VMs and provides kubectl/k9s/Lens access to the clusters. Backup is relatively complex and is currently a manual operation that requires the server to be stopped. We could automate it, but an automated backup would need to detect errors and alert a human operator; developing one would probably take a day or two.
Backup process: Manual
Backup schedule: We recommend taking a manual backup at the end of the day on Fridays (before the Sunday-night/Monday-morning fs-trim issue)
Backup location: STFC S3 Echo bucket (/nw-rancher)
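Because this backup depends on someone remembering to run it, the "detect errors and alert a human operator" requirement above could start with a simple staleness check. The sketch below is hypothetical: it assumes the timestamp of the newest object in the /nw-rancher bucket has already been obtained (for example from an S3 listing), and the 8-day threshold (a weekly backup plus a day of slack) and function name are illustrative choices, not part of the current setup.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: a weekly manual backup plus one day of slack.
MAX_AGE = timedelta(days=8)


def backup_is_stale(latest_backup: datetime, now: datetime,
                    max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the newest Rancher backup is older than max_age.

    Both datetimes must be timezone-aware so that S3 timestamps (UTC)
    compare correctly with the local clock.
    """
    if latest_backup.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    return now - latest_backup > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Illustrative value only; in practice this would come from the bucket listing.
    latest = now - timedelta(days=10)
    if backup_is_stale(latest, now):
        print("Rancher backup is stale: alert an operator")
```

A check like this only reports a missing backup; it cannot tell whether the backup that does exist is usable, so a periodic test restore would still be needed.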
Infrastructure Database (Development)
This database manages Fragalysis & Squonk2 application logins in the development cluster, providing federated access to CAS.
Backup process: This is handled by a CronJob container using the informaticsmatters/sql-backup container image. As the server hosts multiple databases, it uses the pg_dumpall utility to dump the server contents to a backup volume (in the cluster) and rclone to copy this off the cluster to an S3 bucket.
Backup schedule: Every day at 03:07, keeping 28 copies
Backup location: STFC S3 Echo bucket (/im-infra-backup)
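"Keeping 28 copies" is a count-based retention policy: after each new dump, anything beyond the newest 28 is deleted. The sketch below illustrates that rule only; it is not the sql-backup container's own code. It assumes hypothetical backup names that embed an ISO-8601 timestamp with a common prefix, so that sorting the names also sorts them by age. The CRON_SCHEDULE constant shows how a daily 03:07 run is written in standard cron syntax; the actual CronJob manifest is not reproduced here.

```python
# Standard cron syntax for "every day at 03:07" (minute 7, hour 3).
CRON_SCHEDULE = "7 3 * * *"

KEEP = 28  # number of daily dumps retained, per the schedule above


def backups_to_delete(names: list[str], keep: int = KEEP) -> list[str]:
    """Return the backup names that fall outside the newest `keep` copies.

    Assumes (hypothetically) names such as
    'backup-2025-09-12T03:07:00.sql.gz', where lexical order equals age order.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    newest_first = sorted(names, reverse=True)
    return newest_first[keep:]
```

With 30 daily dumps present, this returns the two oldest names for deletion; with 28 or fewer, it returns an empty list.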
Infrastructure Database (Production)
This database manages Fragalysis & Squonk2 application logins for the production cluster, providing federated access to CAS. It is also used by the AWX Ansible playbook server.
Backup process: This is handled by a CronJob container using the informaticsmatters/sql-backup container image. As the server hosts multiple databases, it uses the pg_dumpall utility to dump the server contents to a backup volume (in the cluster) and rclone to copy this off the cluster to an S3 bucket.
Backup schedule: Every day at 03:07, keeping 28 copies
Backup location: STFC S3 Echo bucket (/im-infra-production-backup)
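A pg_dumpall dump is a plain SQL script, so recovery means replaying it with psql (not pg_restore), connected as a superuser to any existing database; the script itself contains the commands to create and connect to each saved database. The sketch below only builds that command. The host name is hypothetical, and it assumes the dump has already been fetched from the bucket and, if the backup container compresses its output, decompressed first.

```python
import shlex


def pg_dumpall_restore_command(dump_file: str, host: str,
                               user: str = "postgres") -> list[str]:
    """Build a psql command that replays a pg_dumpall SQL script.

    Connecting to the 'postgres' maintenance database is sufficient because
    the script creates and connects to the saved databases itself.
    """
    return ["psql", "-h", host, "-U", user, "-f", dump_file, "postgres"]


if __name__ == "__main__":
    # 'infra-db' is a hypothetical host name used for illustration.
    cmd = pg_dumpall_restore_command("dumpall.sql", "infra-db")
    print(shlex.join(cmd))
```

The same command applies to the development infrastructure dump. Restoring into a server that still holds the old databases will produce errors for objects that already exist, so a clean server is the simpler target.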
Production Stack Database
This is the Fragalysis Django application database.
Backup process: This is handled by a CronJob container using the informaticsmatters/sql-backup container image. This backup covers only the frag database, which is dumped with the pg_dump utility to a backup volume (in the cluster) and copied off the cluster to an S3 bucket with rclone.
Backup schedule: Every hour, keeping 24 copies
Backup location: STFC S3 Echo bucket (/nw-xch-prod-v2-production-stack-backup)
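The hourly schedule trades depth of history for a smaller worst-case data loss. With a dump every hour, at most about an hour of changes can be lost. Because the oldest of the 24 retained copies is 23 intervals older than the newest, a restore can reach back about 23 hours. By contrast, the daily infrastructure backups (28 copies) reach back about 27 days. The helper below is a hypothetical illustration of that arithmetic, not part of the backup container.

```python
from datetime import timedelta


def retention_window(interval: timedelta, copies: int) -> timedelta:
    """Return how far back the oldest retained backup reaches.

    With `copies` backups taken every `interval`, the oldest copy is
    (copies - 1) intervals older than the newest one.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    return interval * (copies - 1)


if __name__ == "__main__":
    print("stack database:", retention_window(timedelta(hours=1), 24))
    print("infrastructure:", retention_window(timedelta(days=1), 28))
```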
Production Stack Media
The production Fragalysis media directory, consisting of about 135,000 individual files occupying about 240 GiB of disk space (September 2025).
Backup process: This is handled by a CronJob container using the informaticsmatters/volume-replicator container image. The container uses rsync to synchronise the data to an NFS volume and rclone to copy it off the cluster to an S3 bucket.
Backup schedule: Every day at 04:04; only the latest copy is kept
Backup location: STFC S3 Echo bucket (/fragalysis-stack-production-media)
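The file count and size quoted above are a September 2025 snapshot and will drift as the media directory grows, which matters when sizing the NFS volume and the bucket. The following is a minimal sketch for re-measuring them. It assumes it runs somewhere the media volume is mounted; the mount path in the usage note is hypothetical.

```python
import os


def directory_usage(root: str) -> tuple[int, float]:
    """Return (file_count, size_in_GiB) for regular files under root.

    Symbolic links are skipped, and os.walk does not descend into
    symlinked directories, so nothing is counted twice.
    """
    count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                count += 1
                total_bytes += os.path.getsize(path)
    return count, total_bytes / 2**30
```

For example, count, gib = directory_usage("/path/to/media") gives figures directly comparable with the roughly 135,000 files and 240 GiB noted above.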