Commit f9655ba

add troubleshooting guide
1 parent 872fd8d commit f9655ba

2 files changed

Lines changed: 229 additions & 0 deletions

Lines changed: 16 additions & 0 deletions

---
title: Troubleshooting
description: Always have a plan B ...
weight: 15
---

Google's [Site Reliability Engineering](https://sre.google/workbook/preface/) book says it well:

```txt
Hope is not a strategy; wish for the best, but prepare for the worst.
```

We're working hard to make Kowabunga as resilient and fault-tolerant as possible, but human nature will always prevail. There will always be a point in time when your database gets corrupted, when you face a major power-supply incident, when you have to bring everything back from the ashes, in a timely manner ...

Take a deep breath, let's see how we can help!

Lines changed: 213 additions & 0 deletions

---
title: Ceph
description: Troubleshooting Ceph storage
weight: 11
---

Kaktus HCI nodes rely on [Ceph](https://ceph.io/en/) for underlying distributed storage.

Ceph provides both:

- RBD block-device images for **Kompute** virtual instances
- CephFS distributed file system for **Kylo** storage.
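
As a starting point for any troubleshooting session, it helps to know what the cluster currently serves. A quick sanity check with plain Ceph commands (nothing Kowabunga-specific) could look like:

```sh
# Overall cluster health (monitors, OSDs, PGs)
$ ceph -s

# Pools backing RBD images and CephFS data/metadata
$ ceph osd pool ls detail

# CephFS filesystems and their MDS ranks
$ ceph fs ls
```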

Ceph is awesome. Ceph is fault-tolerant. Ceph hashes your file objects into thousands of pieces, distributed and replicated over dozens if not hundreds of SSDs on countless machines. And yet, Ceph sometimes crashes or fails to recover (even though it has incredible self-healing capabilities).

While Ceph perfectly survives the occasional node failure, wait until you face a complete network or power-supply outage in your region, and you'll figure it out ;-)

So let's see how we can restore a Ceph cluster.

## Unable to start OSDs

If Ceph OSDs can't be started, it is likely because of undetected (and unmounted) LVM partitions.

A proper **mount** output should look like the following:

```sh
$ mount | grep /var/lib/ceph/osd
tmpfs on /var/lib/ceph/osd/ceph-0 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-2 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-1 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-3 type tmpfs (rw,relatime,inode64)
```

If not, it means that the **/var/lib/ceph/osd/ceph-X** directories are empty and the OSDs can't run.

Run the following command to re-scan all LVM partitions, remount them and start the OSDs:

```sh
$ sudo ceph-volume lvm activate --all
```

Check the **mount** output (and/or re-run the command) until all target disks are mounted.
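
Once everything is mounted, you can also confirm from the cluster side that every OSD is back up and in (plain Ceph commands, nothing specific to this setup):

```sh
# Compact count: "N osds: N up, N in" is what you want to see
$ ceph osd stat

# Per-host view, to spot any OSD still down
$ ceph osd tree
```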

## Fix damaged filesystem and PGs

In case of a health error with a damaged filesystem or PGs, one can easily fix those:

```sh
$ ceph status
  cluster:
    id:     be45512f-8002-438a-bf12-6cbc52e317ff
    health: HEALTH_ERR
            25934 scrub errors
            Possible data damage: 7 pgs inconsistent
```

Isolate the damaged PGs:

```sh
$ ceph health detail
HEALTH_ERR 25934 scrub errors; Possible data damage: 7 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 25934 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 7 pgs inconsistent
    pg 2.16 is active+clean+scrubbing+deep+inconsistent+repair, acting [5,11]
    pg 5.20 is active+clean+scrubbing+deep+inconsistent+repair, acting [8,4]
    pg 5.26 is active+clean+scrubbing+deep+inconsistent+repair, acting [11,3]
    pg 5.47 is active+clean+scrubbing+deep+inconsistent+repair, acting [2,9]
    pg 5.62 is active+clean+scrubbing+deep+inconsistent+repair, acting [8,1]
    pg 5.70 is active+clean+scrubbing+deep+inconsistent+repair, acting [11,2]
    pg 5.7f is active+clean+scrubbing+deep+inconsistent+repair, acting [5,3]
```

Proceed with PG repair (iterate over all inconsistent PGs):

```sh
$ ceph pg repair 2.16
```
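
When many PGs are flagged, issuing the repairs one by one gets tedious. A small shell loop over the `ceph health detail` output (a sketch, relying on the output format shown above) can queue them all:

```sh
# Extract the PG ids from the "pg X.Y is ... inconsistent ..." lines and repair each one
$ for pg in $(ceph health detail | awk '/^[[:space:]]*pg .*inconsistent/ {print $2}'); do ceph pg repair "$pg"; done
```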

Then wait until everything's fixed:

```sh
$ ceph status
  cluster:
    id:     be45512f-8002-438a-bf12-6cbc52e317ff
    health: HEALTH_OK
```

## MDS daemon crashloop

If your Ceph MDS daemon (i.e. CephFS) is in a crashloop, probably because of a corrupted journal, let's see how we can proceed:

### Get State

Check the global CephFS status, including the clients list, the number of active MDS servers, etc.:

```sh
$ ceph fs status
```

Additionally, you can get a dump of all filesystems, looking for the MDS daemons' status (laggy, replay ...):

```sh
$ ceph fs dump
```

### Prevent client connections

If you suspect the filesystem to be damaged, the first thing to do is to prevent any further corruption.

Start by stopping all CephFS clients, if under your control.

For Kowabunga, that means stopping the NFS Ganesha server on all Kaktus instances:

```sh
$ sudo systemctl stop nfs-ganesha
```

Then prevent all client connections from the server side (i.e. Kaktus).

We assume here that the filesystem name is **nfs**:

```sh
$ ceph config set mds mds_deny_all_reconnect true
$ ceph config set mds mds_heartbeat_grace 3600
$ ceph fs set nfs max_mds 1
$ ceph fs set nfs refuse_client_session true
$ ceph fs set nfs down true
```
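
Before going further, you can double-check that these flags took effect by dumping the filesystem map (the exact output layout varies between Ceph releases):

```sh
# max_mds should now be 1 and the filesystem should be flagged as down
$ ceph fs get nfs
```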

Stop the server-side MDS instances on all Kaktus servers:

```sh
$ sudo systemctl stop ceph-mds@$(hostname)
```

### Fix metadata journal

You may refer to the [Ceph Troubleshooting guide](https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/) for more details on disaster recovery.

Start by backing up the journal:

```sh
$ cephfs-journal-tool --rank nfs:all journal export backup.bin
```

Inspect the journal:

```sh
$ cephfs-journal-tool --rank nfs:all journal inspect
```

Then proceed with dentries recovery and journal truncation:

```sh
$ cephfs-journal-tool --rank=nfs:all event recover_dentries summary
$ cephfs-journal-tool --rank=nfs:all journal reset
```

Optionally reset session entries:

```sh
$ cephfs-table-tool all reset session
$ ceph fs reset nfs --yes-i-really-mean-it
```

Verify the Ceph MDS can be brought up again:

```sh
$ sudo /usr/bin/ceph-mds -f --cluster ceph --id $(hostname) --setuser ceph --setgroup ceph
```
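
While this foreground daemon is running, you can check from another shell that the MDS rank actually makes it through replay and becomes active again (assuming the filesystem is still named **nfs**):

```sh
# Rank 0 should walk through replay/reconnect/rejoin and settle as active
$ ceph fs status nfs
```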

If OK, then kill it ;-) (Ctrl+C)

### Resume Operations

Flush all OSD-blocklisted MDS clients:

```sh
$ for i in $(ceph osd blocklist ls 2>/dev/null | cut -d ' ' -f 1); do ceph osd blocklist rm $i; done
```

Ensure we're all fine:

```sh
$ ceph osd blocklist ls
```

There should be no entries left.

Start the server-side MDS instances on all Kaktus servers:

```sh
$ sudo systemctl start ceph-mds@$(hostname)
```

Re-enable client connections:

```sh
$ ceph fs set nfs down false
$ ceph fs set nfs max_mds 2
$ ceph fs set nfs refuse_client_session false
$ ceph config set mds mds_heartbeat_grace 15
$ ceph config set mds mds_deny_all_reconnect false
```

Finally, restart all CephFS clients, if under your control.

For Kowabunga, that means starting the NFS Ganesha server on all Kaktus instances:

```sh
$ sudo systemctl start nfs-ganesha
```
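
As a last sanity check, confirm that the cluster is healthy again and that CephFS clients are reconnecting:

```sh
# Global health should be back to HEALTH_OK
$ ceph -s

# MDS ranks should be active and the client count should grow back
$ ceph fs status nfs
```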
