################################# Performance Monitoring (sysstats) ################################# ********** Collection ********** For clusters where the linux `sysstats`_ monitoring tool is available you can run our playbook to collect daily machine statistics and write them to an S3 bucket and path. The playbook uses the standard tool ``sadf`` to generate raw statistics, which are then written to a ``.csv`` file. Wew collect **CPU** and **Memory** statistics. To simplify post-processing the columns in the files are separated by the TAB character. New values are available based on an *interval*, which is set when ``sysstats`` is installed, typically 10 minutes. ``sysstats`` is installed and configured by default on our cluster hardware. The ``.csv`` created by the playbook are kept on S3 until removed (currently a manual process). For CPU stats ``sadf`` generates a number of values at each time interval in the file:: nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %user 2.33 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %nice 0.02 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %system 2.11 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %iowait 0.22 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %steal 0.15 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %idle 95.17 A number of values For **Memory** are also generated:: nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbmemfree 1186804 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbavail 6711360 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbmemused 905168 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - %memused 11.14 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbbuffers 378148 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbcached 4528596 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbcommit 3476356 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - %commit 42.77 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbactive 3458768 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbinact 2217484 nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbdirty 240 Columns are: - - Machine name - Interval (seconds) - Data and time - ``all`` for CPU and ``-`` for Memory - Statistic name - Statistic value The playbook and its corresponding **Schedule** in AWX creates ``.csv`` files with a name that's ``[vm]-cpu-[YYYY-MM-DD].csv`` for CPU stats, and ``[vm]-mem-[YYYY-MM-DD].csv`` for memory stats. e.g.:: nw-xch-prod-xchem-2404-etcd-y3-mem-2025-11-19.csv nw-xch-prod-xchem-2404-etcd-y3-cpu-2025-11-19.csv Worker machines, where our Pods run, will have the word ``app`` or ``worker`` in the ``[VM]`` section of the filename. All VMs in the **dummy**, **dev**, and **prod** clusters are *scraped* for statistics at 5am each day, where the the previous day's stats are collected. The statistics files can be found in the `sysstats` bucket path. ******* Summary ******* A second playbook, which runs shortly after the collection of raw statistics, takes the previous day's files and generates a summary that consists of: - - Average non-idle VCPU cores used - Maximum non-idle VCPU cores used - Total VCPU cores available - Maximum memory consumed - Total memory available The total VCPU cores and memory is the sum of all cores and memory on each machine in the summary. The playbook and its corresponding **Schedule** in AWX creates a ``summary.csv`` file that contains a header line and then a line for each day that has been processed:: Date, nproc (avg), nproc (peak), nproc (total), mem (MB peak), mem (MB total) 2025-11-20, 16.7, 28.1, 392, 562793, 2624644 2025-11-21, 16.6, 25.0, 392, 555470, 2624644 2025-11-22, 16.6, 25.2, 392, 552781, 2624644 2025-11-23, 16.6, 24.9, 392, 559744, 2624644 2025-11-24, 16.7, 33.1, 392, 579070, 2624644 2025-11-25, 17.1, 29.4, 392, 566701, 2624644 2025-11-26, 17.3, 29.0, 392, 567824, 2624644 ********* Playbooks ********* The playbooks are located in the ``topology/playbooks`` directory of this repo. The playbook that is Scheduled to run to collect the statistics is ``generate-sadf-for-yesterday.yaml``, which is expected to write collected results to S3. The playbook that is Scheduled to run to summarise the statistics is ``summarise-collected-sadf.yaml``. It updates the ``sysstats/summary.csv`` file. ********************** Displaying the summary ********************** If you have credentials you can display the ``summary.csv`` file with ``rclone``. For example, from the bastion:: export AWS_ACCESS_KEY_ID=00000000 export AWS_SECRET_ACCESS_KEY=00000000 rclone cat dls-echo:/sysstats/summary.csv Playbook variables ================== The following variables (and environment variables) need to be provided to run the *yesterday* playbook. Ansible variables unless otherwise stated: - - ``s3_bucket`` - ``s3_url`` - ``AWS_ACCESS_KEY_ID`` (Environment variable) - ``AWS_SECRET_ACCESS_KEY`` (Environment variable) These variables are provided by the **Job Template** configured in AWX. Inventory ========= The playbooks rely on an Ansible *inventory* to identify the machines (VMs) that form the clusters. The inventory is imported as an *inventory source* in AWX (the inventory in AWX is called **Clusters**). The inventory (a YAML file) can be found in this repository:: ``topology/inventory.yaml`` If you add ro remove nodes in any of the clusters you **MUST** update this topology file (re-tag the repository) and *synchronise* the AWX **Clusters** *inventory source* so that AWX can operate on the new machines. *** AWX *** You will find two items in the AWX server: an **Inventory** and a **Hob Template**. Inventory ========= The **Clusters** *Inventory* on AWX has a *Source* that is this repository's *Project* with the *Inventory File* set to ``topology/inventory.yaml``. .. warning:: If changes are made to the inventory you **MUST** *Sync* the AWX *Inventory* so that the playbooks can collect statistics for the correct machines. Job Templates ============= The **Collect cluster sysstats (sadf)** *Job Template* on AWX uses the playbooks in this repo, providing suitable credentials and variables. It has a *Schedule* that ensures it is run at 05:00 (UTC) every day. The **Summarise cluster sysstats (sadf)** *Job Template* on AWX has a *Schedule* that ensures it is run at 05:15 (UTC) every day. .. _sysstats: https://en.wikipedia.org/wiki/Sysstat