The problem was most likely triggered by adding more storage nodes (disks) to the cluster. This creates extra IO load as the cluster rebalances data across the entire disk set, including the new, empty disks. The software handling the nodes started sporadically dying when the IO load was particularly high, creating further IO load on restart.
The underlying problem is still present, but the cluster has been reported as more responsive again. Backfills will continue over the next week or two. We will open a new incident if further problems occur.
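For operators of similar Ceph clusters: backfill pressure of this kind is commonly throttled by lowering the OSD recovery settings so client IO takes priority. A minimal sketch, with illustrative values only (not our production settings, and defaults vary per Ceph release):

```shell
# Limit concurrent backfills and recovery operations per OSD
# so client IO is prioritized over rebalancing traffic.
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Optionally pause rebalancing entirely during peak hours...
ceph osd set norebalance

# ...and re-enable it afterwards so the backfill can finish.
ceph osd unset norebalance
```

Note that pausing rebalancing only defers the IO load; the backfill must eventually complete for the new disks to carry their share of the data.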
Posted May 07, 2020 - 15:58 CEST
We are currently investigating high IO utilization on our Ceph cluster. This will manifest as slower response times from the filesystem.