Partial downtime/failures FRA3 Monday 27th November 2023 18:26:00


We are observing partial failures at FRA3.

Everything is up and running, and the last host is also showing normal performance again.

The root cause was likely a package that was unintentionally installed on the Ceph nodes during an OS update, which caused the IP addresses in Ceph to get mixed up.

All hosts are reachable again, and their services are running. However, we still observe high load on some of them. We are addressing this issue, but overall, the situation has already significantly eased.

Following a network/configuration problem with our Ceph cluster at FRA3, we have the situation under control again, but there are still some follow-up errors to fix. Some hosts and services are still down. We are working on it.