Performance Monitoring (sysstats)
Collection
For clusters where the linux sysstats monitoring tool is available you can run our playbook to collect daily machine statistics and write them to an S3 bucket and path.
The playbook uses the standard tool sadf to generate raw statistics, which are
then written to a .csv file.
Wew collect CPU and Memory statistics. To simplify post-processing the columns
in the files are separated by the TAB character. New values are available based on
an interval, which is set when sysstats is installed, typically 10 minutes.
sysstats is installed and configured by default on our cluster hardware.
The .csv created by the playbook are kept on S3 until removed
(currently a manual process).
For CPU stats sadf generates a number of values at each time interval in the file:
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %user 2.33
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %nice 0.02
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %system 2.11
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %iowait 0.22
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %steal 0.15
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC all %idle 95.17
A number of values For Memory are also generated:
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbmemfree 1186804
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbavail 6711360
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbmemused 905168
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - %memused 11.14
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbbuffers 378148
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbcached 4528596
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbcommit 3476356
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - %commit 42.77
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbactive 3458768
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbinact 2217484
nw-xch-prod-xchem-2404-etcd-y3 133 2025-11-19 23:58:26 UTC - kbdirty 240
Columns are: -
Machine name
Interval (seconds)
Data and time
allfor CPU and-for MemoryStatistic name
Statistic value
The playbook and its corresponding Schedule in AWX creates .csv files with a
name that’s [vm]-cpu-[YYYY-MM-DD].csv for CPU stats, and [vm]-mem-[YYYY-MM-DD].csv
for memory stats. e.g.:
nw-xch-prod-xchem-2404-etcd-y3-mem-2025-11-19.csv
nw-xch-prod-xchem-2404-etcd-y3-cpu-2025-11-19.csv
Worker machines, where our Pods run, will have the word app or worker in the
[VM] section of the filename.
All VMs in the dummy, dev, and prod clusters are scraped for statistics at 5am each day, where the the previous day’s stats are collected.
The statistics files can be found in the sysstats bucket path.
Summary
A second playbook, which runs shortly after the collection of raw statistics, takes the previous day’s files and generates a summary that consists of: -
Average non-idle VCPU cores used
Maximum non-idle VCPU cores used
Total VCPU cores available
Maximum memory consumed
Total memory available
The total VCPU cores and memory is the sum of all cores and memory on each machine in the summary.
The playbook and its corresponding Schedule in AWX creates a summary.csv file
that contains a header line and then a line for each day that has been processed:
Date, nproc (avg), nproc (peak), nproc (total), mem (MB peak), mem (MB total)
2025-11-20, 16.7, 28.1, 392, 562793, 2624644
2025-11-21, 16.6, 25.0, 392, 555470, 2624644
2025-11-22, 16.6, 25.2, 392, 552781, 2624644
2025-11-23, 16.6, 24.9, 392, 559744, 2624644
2025-11-24, 16.7, 33.1, 392, 579070, 2624644
2025-11-25, 17.1, 29.4, 392, 566701, 2624644
2025-11-26, 17.3, 29.0, 392, 567824, 2624644
Playbooks
The playbooks are located in the topology/playbooks directory of this repo.
The playbook that is Scheduled to run to collect the statistics
is generate-sadf-for-yesterday.yaml, which is expected to write collected
results to S3.
The playbook that is Scheduled to run to summarise the statistics
is summarise-collected-sadf.yaml. It updates the sysstats/summary.csv file.
Displaying the summary
If you have credentials you can display the summary.csv file with rclone.
For example, from the bastion:
export AWS_ACCESS_KEY_ID=00000000
export AWS_SECRET_ACCESS_KEY=00000000
rclone cat dls-echo:/sysstats/summary.csv
Playbook variables
The following variables (and environment variables) need to be provided to run the yesterday playbook. Ansible variables unless otherwise stated: -
s3_buckets3_urlAWS_ACCESS_KEY_ID(Environment variable)AWS_SECRET_ACCESS_KEY(Environment variable)
These variables are provided by the Job Template configured in AWX.
Inventory
The playbooks rely on an Ansible inventory to identify the machines (VMs) that form the clusters. The inventory is imported as an inventory source in AWX (the inventory in AWX is called Clusters).
The inventory (a YAML file) can be found in this repository:
``topology/inventory.yaml``
If you add ro remove nodes in any of the clusters you MUST update this topology file (re-tag the repository) and synchronise the AWX Clusters inventory source so that AWX can operate on the new machines.
AWX
You will find two items in the AWX server: an Inventory and a Hob Template.
Inventory
The Clusters Inventory on AWX has a Source that is this repository’s Project
with the Inventory File set to topology/inventory.yaml.
Warning
If changes are made to the inventory you MUST Sync the AWX Inventory so that the playbooks can collect statistics for the correct machines.
Job Templates
The Collect cluster sysstats (sadf) Job Template on AWX uses the playbooks in this repo, providing suitable credentials and variables. It has a Schedule that ensures it is run at 05:00 (UTC) every day.
The Summarise cluster sysstats (sadf) Job Template on AWX has a Schedule that ensures it is run at 05:15 (UTC) every day.