---
title: Ceph
description: Troubleshooting Ceph storage
weight: 11
---

Kaktus HCI nodes rely on [Ceph](https://ceph.io/en/) for underlying distributed storage.

Ceph provides both:

- RBD block-device images for **Kompute** virtual instances
- CephFS distributed file system for **Kylo** storage

Ceph is awesome. Ceph is fault-tolerant. Ceph hashes your file objects into thousands of pieces, distributed and replicated over dozens if not hundreds of SSDs on countless machines. And yet, Ceph sometimes crashes or fails to recover (even though it has incredible self-healing capabilities).

While Ceph perfectly survives the occasional node failure, wait until you face a complete network or power-supply outage in your region, and you'll figure it out ;-)

So let's see how we can restore a Ceph cluster.

## Unable to start OSDs

If Ceph OSDs can't be started, it is likely because of undetected (and unmounted) LVM partitions.

A proper **mount** command should provide the following:

```sh
$ mount | grep /var/lib/ceph/osd
tmpfs on /var/lib/ceph/osd/ceph-0 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-2 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-1 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-3 type tmpfs (rw,relatime,inode64)
```

If not, it means the **/var/lib/ceph/osd/ceph-X** directories are empty and the OSDs can't run.

Run the following command to re-scan all LVM partitions, remount them and start the OSDs:

```sh
$ sudo ceph-volume lvm activate --all
```

Check the **mount** output (and/or re-run the command) until all target disks are mounted.
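
If some mounts are still missing after the first pass, a small retry loop can automate the check. This is only a sketch: `EXPECTED` is a hypothetical value you must set to the number of OSDs hosted on this node (4 in the example above).

```sh
# Re-run LVM activation until all expected OSD tmpfs mounts are present.
# EXPECTED is hypothetical: set it to the number of OSDs on this node.
EXPECTED=4
for attempt in 1 2 3 4 5; do
  mounted=$(mount | grep -c '/var/lib/ceph/osd')
  [ "$mounted" -ge "$EXPECTED" ] && break
  sudo ceph-volume lvm activate --all
  sleep 5
done
```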

## Fix damaged filesystem and PGs

In case of a health error with damaged filesystem/PGs, one can easily fix those:

```sh
$ ceph status

  cluster:
    id:     be45512f-8002-438a-bf12-6cbc52e317ff
    health: HEALTH_ERR
            25934 scrub errors
            Possible data damage: 7 pgs inconsistent
```

Isolate the damaged PGs:

```sh
$ ceph health detail
HEALTH_ERR 25934 scrub errors; Possible data damage: 7 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 25934 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 7 pgs inconsistent
    pg 2.16 is active+clean+scrubbing+deep+inconsistent+repair, acting [5,11]
    pg 5.20 is active+clean+scrubbing+deep+inconsistent+repair, acting [8,4]
    pg 5.26 is active+clean+scrubbing+deep+inconsistent+repair, acting [11,3]
    pg 5.47 is active+clean+scrubbing+deep+inconsistent+repair, acting [2,9]
    pg 5.62 is active+clean+scrubbing+deep+inconsistent+repair, acting [8,1]
    pg 5.70 is active+clean+scrubbing+deep+inconsistent+repair, acting [11,2]
    pg 5.7f is active+clean+scrubbing+deep+inconsistent+repair, acting [5,3]
```

Proceed with PG repair (iterate over all inconsistent PGs):

```sh
$ ceph pg repair 2.16
```
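
Rather than repairing each PG by hand, you can loop over the inconsistent PGs parsed from `ceph health detail`; a sketch, assuming the `pg X.Y is ...inconsistent...` line format shown above:

```sh
# Extract each inconsistent PG id (2nd field of the "pg X.Y is ..." lines)
# and ask Ceph to repair it.
for pg in $(ceph health detail | awk '/^ *pg .*inconsistent/ {print $2}'); do
  ceph pg repair "$pg"
done
```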

and wait until everything's fixed:

```sh
$ ceph status
  cluster:
    id:     be45512f-8002-438a-bf12-6cbc52e317ff
    health: HEALTH_OK
```
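
Repairs can take a while; a simple polling loop can wait for the cluster to go back to **HEALTH_OK** (the one-minute interval and attempt cap are arbitrary choices, adjust to taste):

```sh
# Poll cluster health up to 60 times, once a minute, until HEALTH_OK.
for i in $(seq 1 60); do
  ceph health | grep -q HEALTH_OK && break
  sleep 60
done
```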

## MDS daemon crashloop

If your Ceph MDS daemon (i.e. CephFS) is in a crashloop, probably because of a corrupted journal, let's see how to proceed.

### Get State

Check the global CephFS status, including the client list, number of active MDS servers, etc.:

```sh
$ ceph fs status
```

Additionally, you can get a dump of all filesystems, showing each MDS daemon's status (laggy, replay ...):

```sh
$ ceph fs dump
```

### Prevent client connections

If you suspect the filesystem is damaged, the first thing to do is to prevent any further corruption.

Start by stopping all CephFS clients, if under your control.

For Kowabunga, that means stopping the NFS Ganesha server on all Kaktus instances:

```sh
$ sudo systemctl stop nfs-ganesha
```

Prevent all client connections from the server side (i.e. Kaktus).

We assume the filesystem name is **nfs**:

```sh
$ ceph config set mds mds_deny_all_reconnect true
$ ceph config set mds mds_heartbeat_grace 3600
$ ceph fs set nfs max_mds 1
$ ceph fs set nfs refuse_client_session true
$ ceph fs set nfs down true
```

Stop server-side MDS instances on all Kaktus servers:

```sh
$ sudo systemctl stop ceph-mds@$(hostname)
```

### Fix metadata journal

You may refer to the [Ceph Troubleshooting guide](https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/) for more details on disaster recovery.

Start by backing up the journal:

```sh
$ cephfs-journal-tool --rank=nfs:all journal export backup.bin
```

Inspect the journal:

```sh
$ cephfs-journal-tool --rank=nfs:all journal inspect
```

Then proceed with dentries recovery and journal truncation:

```sh
$ cephfs-journal-tool --rank=nfs:all event recover_dentries summary
$ cephfs-journal-tool --rank=nfs:all journal reset
```
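
You can then re-run the inspect step to confirm the journal is readable again. A sketch, assuming the tool's `Overall journal integrity: OK` summary line as the success marker:

```sh
# Re-inspect the journal and report whether it is clean again.
if cephfs-journal-tool --rank=nfs:all journal inspect | grep -q 'Overall journal integrity: OK'; then
  echo "journal clean"
else
  echo "journal still damaged" >&2
fi
```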

Optionally reset session entries:

```sh
$ cephfs-table-tool all reset session
$ ceph fs reset nfs --yes-i-really-mean-it
```

Verify the Ceph MDS can be brought up again:

```sh
$ sudo /usr/bin/ceph-mds -f --cluster ceph --id $(hostname) --setuser ceph --setgroup ceph
```

If OK, then kill it (Ctrl+C) ;-)

### Resume Operations

Flush all OSD-blocklisted MDS clients:

```sh
$ for i in $(ceph osd blocklist ls 2>/dev/null | cut -d ' ' -f 1); do ceph osd blocklist rm $i; done
```

Ensure we're all fine:

```sh
$ ceph osd blocklist ls
```

There should be no entries left.

Start the server-side MDS instances on all Kaktus servers:

```sh
$ sudo systemctl start ceph-mds@$(hostname)
```

Re-enable client connections:

```sh
$ ceph fs set nfs down false
$ ceph fs set nfs max_mds 2
$ ceph fs set nfs refuse_client_session false
$ ceph config set mds mds_heartbeat_grace 15
$ ceph config set mds mds_deny_all_reconnect false
```

Restart all CephFS clients, if under your control.

For Kowabunga, that means starting the NFS Ganesha server on all Kaktus instances:

```sh
$ sudo systemctl start nfs-ganesha
```