#########################
Installation and recovery
#########################

.. image:: ../images/frag-actions/frag-actions.019.png

*************
Prerequisites
*************

For a Stack **without access to a Graph database** you will need the
following:

*   A ``KUBECONFIG`` file (providing admin access to each cluster)
*   A compatible kubectl client (e.g. kubectl 1.23)
*   An ACME/Let's Encrypt account (for SSL certificates) (`letsencrypt`_)
*   `Poetry`_

.. note::
    You could create a cluster with one worker of 16 cores and 32Gi RAM,
    or two worker nodes, each providing 8 cores and 16Gi RAM. On AWS having
    multiple nodes will probably be no real advantage. AWS EKS is extremely
    robust and resilient and the cost will ultimately depend on the total
    cores and RAM you're using.

*************************
Installation and recovery
*************************

Creating an EKS cluster
=======================

.. warning::
    To prevent the following steps from disturbing any local **KUBECONFIG**
    file you may have defined you should run ``unset KUBECONFIG`` before
    proceeding.

Create a cluster in AWS using `eksctl`_. The best way to do this is by
defining your cluster in a ``cluster.yaml`` file. The following example, which
creates a Kubernetes 1.23 cluster in London (``eu-west-2``), should be
sufficient for our needs::

    ---
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: fragalysis-production
      region: eu-west-2
      version: '1.23'

    availabilityZones:
    - eu-west-2a
    - eu-west-2b
    - eu-west-2c

    managedNodeGroups:
    - name: mng-1
      # The 2xlarge is an 8 core 32Gi instance
      instanceType: m5.2xlarge
      minSize: 1
      maxSize: 1
      desiredCapacity: 1
      volumeSize: 80
      volumeType: gp2
      labels:
        informaticsmatters.com/purpose-core: 'yes'
        informaticsmatters.com/purpose-worker: 'yes'
        informaticsmatters.com/purpose-application: 'yes'

This file can be found in the `fragalysis-stack-kubernetes`_ repository
(as ``eks-relocation/cluster.yaml``).

.. note::
    The schema for the ``cluster.yaml`` file can be found on the
    `eksctl schema`_ page. It is vitally important that the cluster version
    you have chosen is compatible with the Kubernetes cluster we are
    relocating. At the time of writing this is **1.23**.

With a cluster configuration available, create the cluster with the following
command::

    eksctl create cluster -f cluster.yaml

The cluster should be ready in about 15 minutes. Once it is ready you will find
that the cluster credentials have been added to ``~/.kube/config``.

If you need to, you can list the available Kubernetes contexts and select one
by its ``NAME`` using ``kubectl``::

    kubectl config get-contexts
    kubectl config use-context MY-CONTEXT

You'll now be able to inspect your new cluster with ``kubectl``, where you
should discover one node::

    kubectl get nodes

Core components
===============

Before installing Keycloak and the Fragalysis Stack you will need to configure
and install some core components, namely:

*   Configure Amazon EBS CSI driver to create a GP2 **StorageClass**
*   Install an NGINX **Ingress Controller**
*   Install the SSL **Certificate Manager**
*   Configure the cluster's load balancer
*   Setup domain routing
*   Create a cluster admin **ServiceAccount**

But first, if you need to, set the ``KUBECONFIG`` environment variable to point
to your ``KUBECONFIG`` file. This will be used by the ``kubectl`` client, and
by our playbooks, to access your cluster::

    export KUBECONFIG=/path/to/your/kubeconfig

EBS CSI driver
--------------

From EKS 1.23 a Container Storage Interface (CSI) driver is needed in order to
get your **PersistentVolumeClaims** served by a **PersistentVolume**, as you
are used to from earlier EKS versions (see `aws-ebs-csi-driver`_ for more
information).

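
Before installing the driver it may be worth checking what storage support the
cluster already has. The following is a minimal sketch, assuming your current
context points at the new cluster. On a freshly created EKS cluster you would
typically expect to see a default ``gp2`` **StorageClass** but no EBS CSI
driver registered yet::

    # List any CSI drivers already registered with the cluster
    kubectl get csidrivers
    # List the StorageClasses and their provisioners
    kubectl get storageclass

If ``ebs.csi.aws.com`` is already listed as a driver, the deployment step that
follows may not be needed.
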
First, set up the driver permissions using ``kubectl`` to create a secret from
your AWS credentials::

    kubectl create secret generic aws-secret \
        --namespace kube-system \
        --from-literal "key_id=${AWS_ACCESS_KEY_ID}" \
        --from-literal "access_key=${AWS_SECRET_ACCESS_KEY}"

Then, use the ``kubectl`` **kustomize** feature to deploy the driver::

    kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.23"

Ingress Controller
------------------

Use ``kubectl`` to install a recent NGINX Ingress Controller, used as an
in-cluster *load balancer* and required by the various application **Ingress**
definitions::

    repo=https://raw.githubusercontent.com/kubernetes/ingress-nginx
    path=deploy/static/provider/cloud/deploy.yaml
    version=controller-v1.9.1
    kubectl apply -f ${repo}/${version}/${path}

.. note::
    You can check the condition of the installation (which may take
    a few minutes) by inspecting the **Pods** in the ``ingress-nginx``
    namespace::

        kubectl get pods --namespace ingress-nginx

Certificate Manager
-------------------

Use ``kubectl`` to install a recent Certificate Manager, used to automatically
provision SSL certificates for the Kubernetes **Ingress** definitions::

    repo=https://github.com/cert-manager/cert-manager/releases/download
    path=cert-manager.yaml
    version=v1.13.1
    kubectl apply -f ${repo}/${version}/${path}

.. note::
    You can check the condition of the installation (which may take
    a few minutes) by inspecting the **Pods** in the ``cert-manager``
    namespace::

        kubectl get pods --namespace cert-manager

You will also need to provide a **ClusterIssuer** definition that allows the
application **Ingress** definitions to trigger the automatic creation of SSL
certificates. We use ``ACME`` (Let's Encrypt) and suggest you do too. For this
you will need to have registered, and have the email address you used to
register.

Armed with your Let's Encrypt account email address, create a file called
``cluster-issuer.yaml`` with the following content (setting the empty
``email`` value to one appropriate for you)::

    ---
    kind: ClusterIssuer
    apiVersion: cert-manager.io/v1
    metadata:
      name: letsencrypt-nginx-production
    spec:
      acme:
        email:
        privateKeySecretRef:
          name: letsencrypt-nginx-production
        server: https://acme-v02.api.letsencrypt.org/directory
        solvers:
        - http01:
            ingress:
              ingressClassName: nginx

You will find a template file in the ``eks-relocation`` directory that you can
edit. The name of the **ClusterIssuer** is important, and it is expected to be
called ``letsencrypt-nginx-production``.

Once you have a valid **ClusterIssuer** you can then apply the definition to
your cluster::

    kubectl apply -f cluster-issuer.yaml

Configure the cluster's load balancer
-------------------------------------

In the AWS console, locate what is probably an "inactive" *Classic* Load
Balancer that will have been created in your AWS region and then **Migrate** it
by clicking the **Launch NLB migration wizard** button. From the new page simply
click the **Create** button to create a **Network Load Balancer** (**NLB**),
and close the final window upon success.

.. note::
    If you return to the Load Balancers page you will probably find the LB
    **State** to be *Provisioning*. This may take a few minutes so refresh the
    page after a minute or two. When it is *Active* make sure your EKS cluster
    EC2 instances are in the **Listeners Target Group** for the pre-assigned
    Protocols.

Setup domain routing
--------------------

With the cluster prepared, now is the time to arrange for any applicable domain
names to be re-routed to the assigned DNS name of the **NLB** created for your
EKS cluster.

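
The DNS name assigned to the **NLB** is shown in the AWS console, but it can
also be listed with the **AWS CLI**. This is a sketch, assuming your AWS
credentials are configured and that the load balancer was created in
``eu-west-2``, as in the example ``cluster.yaml``::

    # List Network Load Balancers in the region with their DNS names and state
    aws elbv2 describe-load-balancers \
        --region eu-west-2 \
        --query "LoadBalancers[?Type=='network'].[LoadBalancerName,DNSName,State.Code]" \
        --output table

The ``DNSName`` column holds the value to use when routing your domains.
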
For us, we'll need to make sure the following domains are routed to the NLB via
a suitable DNS record::

    fragalysis.diamond.ac.uk
    *.xchem.diamond.ac.uk (for the keycloak server)

The DNS name for the **NLB** will be of the form
``000000-000000.elb.eu-west-2.amazonaws.com``, and this should be used as the
target of a **CNAME** record (or an **A record alias**) for the appropriate
domains.

Do this as soon as you can: DNS changes may take only a few minutes, but they
can also take several hours.

A cluster service account
-------------------------

To allow users other than the cluster creator to access the cluster you will
need to add a **ServiceAccount**, which allows you to create a token that can
be used in the ``KUBECONFIG`` file.

Create the required **Namespace**, **ServiceAccount** and
**ClusterRoleBinding** with the following command, run from the
``eks-relocation`` directory::

    kubectl create -f im-eks-admin

Now, add the service account and its token as a new user definition to the
``KUBECONFIG`` file. You can refer to the documentation for
`Adding a Service Account`_::

    TOKEN=$(kubectl get secrets -n im-eks-admin \
        -o jsonpath="{.items[?(@.metadata.annotations['kubernetes\.io/service-account\.name']=='im-eks-admin')].data.token}" \
        | base64 --decode)
    kubectl config set-credentials im-eks-admin --token=${TOKEN}

Now you can set the new user for future kubectl commands::

    kubectl config set-context --current --user=im-eks-admin

Infrastructure components
=========================

With the base components installed you can now install the infrastructure.
Because we are recovering the infrastructure database from elsewhere, the
creation of the infrastructure will take several steps:

-   Create the infrastructure database server
-   Restore the infrastructure databases
-   Create the keycloak instance

For our application **Pods** we will need to label the worker nodes in the
cluster.
If you've used the example ``cluster.yaml`` file you can skip these labelling
commands, as the ``eksctl`` utility will ensure that any nodes it creates have
the appropriate labels applied.

Labels are applied to each node individually. Run the following for each node
in your cluster::

    # Set this to the name of a node (see 'kubectl get nodes')
    node=
    kubectl label nodes ${node} informaticsmatters.com/purpose-core=yes
    kubectl label nodes ${node} informaticsmatters.com/purpose-worker=yes
    kubectl label nodes ${node} informaticsmatters.com/purpose-application=yes

From this point we rely on Ansible playbooks that are provided in the
Informatics Matters `ansible-infrastructure`_ repository, so you will need to
clone the recommended version now::

    git clone https://github.com/InformaticsMatters/ansible-infrastructure.git
    cd ansible-infrastructure
    git checkout 2023.4

All the playbooks are controlled by variables that we typically define in a
YAML *parameter* file. A number of parameter files exist in the root of the
repository, encrypted using `ansible-vault`_. You will need to create your own
parameter file and decide whether you want to encrypt it. You might want to if
the parameters contain sensitive information (but encryption is not covered
here).

.. note::
    Use ``parameters.template`` in the `ansible-infrastructure`_ repository
    as a template for your own parameter file.

Infrastructure database server
------------------------------

For this exercise the following, written to ``parameters.yaml`` (ignored by the
project gitignore file), should suffice. Set the ``pg_user_password``,
``kc_hostname``, ``kc_user_password``, and ``kc_admin_password`` values as
appropriate::

    ---
    cm_state: absent
    ic_state: absent
    efs_state: absent
    cinder_state: absent
    ax_state: absent

    pg_version: 12.3-alpine
    pg_vol_storageclass: gp2
    pg_vol_size_g: 18
    pg_create_users_and_databases: no
    pg_user: postgres
    pg_user_password:
    pg_database: postgres
    pg_bu_state: absent

    kc_state: absent
    kc_hostname:
    kc_user_password:
    kc_admin_password:

.. warning::
    As we're replicating an existing installation be sure to use a different
    admin user and password (``NEW-ADMIN-PASSWORD``).

With parameters set we should now be able to deploy an "empty" infrastructure
database server::

    ansible-playbook site.yaml -e @parameters.yaml

Restore the database
--------------------

With a new "empty" infrastructure installed we can now restore the database
from a backup of the original. You can use the **AWS CLI** and ``kubectl`` to
copy the backup from S3 to the PostgreSQL Pod's database volume, and then
restore the data using ``psql`` from within the Database **Pod**.

Copy the backup from your AWS S3 bucket onto your control machine and then
write it into the database **Pod**::

    aws s3 cp \
        s3://im-fragalysis/production-keycloak-db/backup-2023-10-16T12\:07\:01Z-dumpall.sql.gz \
        ./dumpall.sql.gz
    kubectl cp ./dumpall.sql.gz \
        database-0:/tmp/dumpall.sql.gz \
        -n im-infra

You can now shell into the **Pod**, and decompress and load the backup::

    kubectl exec -it database-0 -n im-infra -- bash

    cd /tmp
    gzip -d dumpall.sql.gz
    psql -q -U postgres -f dumpall.sql template1

When the load is complete exit the **Pod**::

    exit

With the database restored, use the Database **StatefulSet** to scale down the
**Pod** (to remove it) and then scale it up again (to restart it), essentially
rebooting the database server::

    kubectl scale statefulset database --replicas=0 -n im-infra

Wait for the Pod to terminate and then::

    kubectl scale statefulset database --replicas=1 -n im-infra

Installing Keycloak
-------------------

With the original database restored we can install Keycloak by adjusting our
parameter file and re-running the same infrastructure playbook.
Ensure the following parameter values are now set in your parameter file::

    kc_state: present
    kc_version: 10.0.2

And then re-run the infrastructure playbook::

    ansible-playbook site.yaml -e @parameters.yaml

Verify that you are able to reach the Keycloak server by appending ``/auth``
to the hostname you defined.

Production Stack
================

From this point we rely on Ansible playbooks that are provided in the
`fragalysis-stack-kubernetes`_ repository, so you will need to clone a
recommended version now::

    git clone https://github.com/xchem/fragalysis-stack-kubernetes.git
    cd fragalysis-stack-kubernetes
    git checkout 2025.23

Deploy the database
-------------------

We need a new set of parameters to replicate the database installation.
You will find a ``parameters.template.yaml`` in the ``eks-relocation``
directory. You can use this to create a ``parameters.yaml`` file in the project
root (which is protected by the ``.gitignore``).

Create a ``parameters.yaml`` and populate it with the following (always check
with the installation you are relocating to ensure the parameters are
compatible)::

    ---
    database_image_tag: '15.8'
    database_vol_size_g: 18
    database_vol_storageclass: gp2
    database_root_user: postgres
    database_root_password: anything-you-like
    database_create_users_and_databases: no

    database_bu_state: present
    database_bu_vol_storageclass: gp2
    database_bu_vol_size_g: 18

    stack_namespace: production-stack
    stack_is_for_developer: no
    stack_skip_deploy: yes
    stack_discourse_host: ''

    install_prerequisite_python_modules: no

The root user password can be any value you like; the database has no
public-facing surface. Only those with access to the cluster will be able to
access it.

And then run the stack playbook. Because we are including sensitive material
that's encrypted in this repository we'll need to provide a vault password.
(more on this later)::

    ansible-playbook site-fragalysis-stack.yaml \
        -e @parameters.yaml \
        --ask-vault-password

Restore the database
--------------------

Just as we did with the infrastructure database, we restore the database from a
backup of the original production stack.

Copy the backup from your AWS S3 bucket onto your control machine and then
write it into the database **Pod**::

    aws s3 cp \
        s3:///nw-xch-prod-v2-production-stack-backup/backup-2023-10-16T12\:51\:01Z-frag.sql.gz \
        ./frag.sql.gz
    kubectl cp ./frag.sql.gz \
        database-0:/tmp/frag.sql.gz \
        -n production-stack

This is likely to be a large file, so it may take a while to copy into the
**Pod**. Once done you can shell into the **Pod**, and decompress and load the
backup::

    kubectl exec -it database-0 -n production-stack -- bash

    cd /tmp
    gzip -d frag.sql.gz
    psql -X -U postgres frag -f frag.sql

This is likely to be a large file, so it may take a while to load.
When the load is complete exit the **Pod**::

    exit

With the database restored, use the Database **StatefulSet** to scale down the
**Pod** (to remove it) and then scale it up again (to restart it), essentially
rebooting the database server::

    kubectl scale statefulset database --replicas=0 -n production-stack

Wait for the Pod to terminate and then::

    kubectl scale statefulset database --replicas=1 -n production-stack

Deploy the Stack
----------------

Now we can adjust our ``parameters.yaml`` so that the playbook can be re-run to
install the stack against the recovered database.

Importantly, set your existing ``stack_skip_deploy`` to ``no`` and then add the
following to your ``parameters.yaml``. The memory and volume sizes are correct
for the production stack deployed at the time of writing. Set
``stack_image_tag`` to the tag in use when the database backup was taken::

    stack_image_tag:
    stack_mem_limit: 15Gi
    stack_mem_request: 15Gi
    stack_media_vol_size_g: 200
    stack_media_vol_storageclass: gp2

Remember to check that the ``stack_media_vol_size_g`` suits your needs.

.. note::
    A number of crucial Ansible variables, including configuration values, are
    also encrypted in the file ``roles/fragalysis-stack/vars/sensitive.vault``.
    You can view the sensitive file, without permanently decrypting it, using
    the command ``ansible-vault view roles/fragalysis-stack/vars/sensitive.vault``.

With suitable values in our revised ``parameters.yaml`` file, which will
complement those in the ``sensitive.vault`` file, we can re-run the stack
playbook::

    ansible-playbook site-fragalysis-stack.yaml \
        -e @parameters.yaml \
        --ask-vault-password

Populate the media directory
----------------------------

As the media directory resides on a volume in the stack **Pod**, which is a
Python container, it will be faster to copy the media from your chosen S3
bucket directly to the ``/code/media`` directory from within the **Pod**
(rather than downloading to your control machine and then uploading into the
Pod).

Shell into the **Pod**::

    kubectl exec -it stack-0 -n production-stack -- bash

Add your AWS credentials (ones that allow you to access the S3 bucket)::

    export AWS_ACCESS_KEY_ID=00000000
    export AWS_SECRET_ACCESS_KEY=00000000
    export AWS_DEFAULT_REGION=eu-west-2

Then install the **AWS CLI** and copy the media from your S3 bucket::

    pip install awscli
    cd /code/media
    aws s3 cp --recursive s3://im-fragalysis/production-stack-media/ .

The media directory is likely to consist of a lot of files. Expect the copy
to take a while, probably 20 to 30 minutes per 150Gi.

**Congratulations!** Your relocated production stack should be ready to use.

.. _adding a service account: https://docs.cloud.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengaddingserviceaccttoken.htm
.. _ansible-infrastructure: https://github.com/InformaticsMatters/ansible-infrastructure
.. _ansible-vault: https://docs.ansible.com/ansible/latest/vault_guide/index.html
.. _fragalysis-stack-kubernetes: https://github.com/xchem/fragalysis-stack-kubernetes
.. _poetry: https://python-poetry.org
.. _letsencrypt: https://letsencrypt.org
.. _eksctl: https://eksctl.io/getting-started
.. _eksctl schema: https://eksctl.io/usage/schema
.. _aws-ebs-csi-driver: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/install.md