I have a home lab and run a docker swarm across 3 Linux (Debian) nodes. The nodes also host a ceph cluster serving a CephFS. The CephFS uses a pool with x3 replication (in my head, this means each node has a 100% copy of the data - I'm sure this logic is incorrect). I mount the CephFS at `/mnt/ceph-fs-storage` on each node, and the docker containers use it for bind-mounted volumes. This means the docker swarm can place a container on any node and it will still have access to its data.
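As a rough example of that pattern (the service name, image, and subdirectory are made up for illustration), a swarm service just bind-mounts a directory under the CephFS mount:
```
docker service create \
  --name example-app \
  --mount type=bind,source=/mnt/ceph-fs-storage/example-app,target=/data \
  nginx:latest
```
Because every node mounts the same CephFS path, the source directory exists no matter where the swarm schedules the task.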
Now, I'm not rocking a 45 Drives rack or anything. I have these [ODROID H-3](https://www.hardkernel.com/shop/odroid-h3/) computers. They are quite beefy: 4 CPUs, 64GB RAM, and I put a 1TB Samsung m.2 SSD in each.
On each node I backed ceph's storage with a big empty file (`/vdisk/ceph01`) turned into a loop block device:
```
# bind that empty file as a block device using loop
losetup /dev/loop0 /vdisk/ceph01
# create LVM volume group on the loop device
vgcreate cephVolGrp /dev/loop0
# create LVM volume in the group
lvcreate -n cephVol01 -L 200g cephVolGrp
# backup the LVM config - no seriously, DO THIS
vgcfgbackup -f /root/cephVolGrp-cephVol01
```
Now I have a logical volume on the server that I can use for OSD data. Note that ceph, rightfully, doesn't present it when scanning for disks. Instead, I have to add it manually - but it still works fine.
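I'm not reproducing the exact commands I used here, but manually adding an LVM logical volume as an OSD looks roughly like this (`node01` is a placeholder hostname, and `ceph-volume lvm create` would be the non-orchestrator equivalent):
```
# `ceph orch device ls` won't show the loop-backed LV, so point ceph at it directly
ceph orch daemon add osd node01:/dev/cephVolGrp/cephVol01
```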
I recently encountered a failure with my ceph cluster. I'm accepting blame as it was 100% my fault for various reasons:
* Used Debian's main repo for ceph packages (I should have used ceph's own repo instead).
* Failed to upgrade ceph. I was running v16-pacific instead of v18-reef (which was already the latest/stable release back when I built the cluster, too...).
* Was running old ceph (v16-pacific) on Debian Bookworm (the old binaries arguably _shouldn't_ have even worked).
* Followed ceph's documentation for setting up a cluster. I'll only take half-blame here as their documentation is:
  * Moderately lacking in relevant information - there are lots of commands that are out-of-date or unsupported, and some commands have to be executed inside podman/docker containers (without the docs indicating as much).
  * Failing to convey the importance of their cluster setup configurations.
* Upon realizing I was out-of-date, I switched to their package repo and upgraded from 16-pacific to 18-reef without any consideration.
* Destroyed my cluster without backing up its configuration (except I did back up /var/lib/ceph, though it didn't matter).
* Didn't back up my data.
That said, there were some saving graces:
* I didn't encrypt my cluster/OSDs. Had I encrypted them, I likely would not have been able to recover anything.
* My disks were still "good", meaning I didn't have a hardware failure.
* The ceph tooling allows you to fuse-mount an OSD. More on this below.
# Time Taken
I spent around 36 hours troubleshooting and effectively _becoming_ a ceph master (journeyman?) simply by trying to recover my cluster. I want to make this repo a place where ceph noobs can come for a sliver of hope in recovering their data.
# Considerations
Obviously, the holy grail recovery would be to magically do _something_ where my ceph cluster comes back. However, this is likely not going to happen. Instead, I set myself up with some realistic goals in descending order of hope (which, coincidentally, is ascending order of likelihood):
* I want to set up a new ceph cluster and re-attach the old OSDs.
* I want to set up a new ceph cluster with new OSDs and clone the data from the old OSDs.
* I want to extract my files from my old OSDs into a new filesystem.
* I want to extract file contents from my old OSDs.
* I want to get my documents/pictures back :(
# Troubleshooting Journey
I spent 60% of those 36 hours trying to re-import the old OSDs using the approaches below (none of which worked, BTW):
## Recreation of the old ceph cluster
```
:'(
```
## New ceph cluster with _same_ cluster GUID (ie. re-use old cluster's fsid)
```
# bootstrap new cluster with old cluster's fsid
#NOTE: make sure to move the /var/lib/ceph/d8c1c426-cf4c-11ee-aa6f-001e06453165 dir to a new name: /var/lib/ceph/d8c1c426-cf4c-11ee-aa6f-001e06453165_orig
cephadm bootstrap \
--fsid d8c1c426-cf4c-11ee-aa6f-001e06453165 \
--allow-overwrite \
--mon-ip 192.168.60.4 \
--cluster-network 192.168.60.0/24 \
--skip-mon-network \
--skip-ssh;
# ... skipping other ceph cluster setup details ...
# now try to activate the OSDs on each node:
# node 01 - had osd.0 with an osd fsid 9f570bd7-ea93-44ff-9628-67a9c0a05b51
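# (a hedged sketch - I'm not reproducing my exact invocation here; ceph-volume activates
#  an OSD by its id and osd fsid, run through cephadm so it executes inside a container)
cephadm ceph-volume -- lvm activate --no-systemd 0 9f570bd7-ea93-44ff-9628-67a9c0a05b51
# node 02 / node 03: same idea with osd.1 / osd.2 and their respective osd fsids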
```
## Brand new shiny ceph cluster (ie. new cluster fsid)
```
# bootstrap new cluster with new cluster fsid
cephadm bootstrap \
--allow-overwrite \
--mon-ip 192.168.60.4 \
--cluster-network 192.168.60.0/24 \
--skip-mon-network \
--skip-ssh;
# try to adopt old OSDs on each node:
# node 01
cephadm adopt --style legacy --name osd.0
# node 02
cephadm adopt --style legacy --name osd.1
# node 03
cephadm adopt --style legacy --name osd.2
```
I spent the remaining 40% simply trying to recover the files. Since I was using this ceph cluster to store files for docker mounts, they are simple "files" and nothing crazy like sym/hardlinks, etc. This makes recovery _easier_ (again, in my head).
## First up
It seems that, no, you cannot simply browse the OSDs. The way that ceph OSDs work (to my understanding) is that they don't create a filesystem as you know it. Yes, the "filesystem" is called Bluestore and it is a really real file system, but tools like (g)parted can't read it. So I have to rely on ceph tooling.
I knew I would fall into a recurring pattern of extracting/decompressing things based on guesses about what I was looking at, when that may not even be what's happening. I also couldn't guarantee that I would have filenames or paths. I really needed all 3:
* data
* filename
* path
I don't care _too_ much about file owner/permissions or modified dates. If I get them, then bonus. But oftentimes the docker services will perform all the relevant commands to fix permissions when they start.
## ceph-objectstore-tool
`ceph-objectstore-tool` is a ceph tool that is used to work with OSDs. It is probably the tool I need to use to extract my files from an OSD. There are a few problems though:
* Documentation is __severely__ lacking. I'm not kidding.
* Half the commands will only work on OSDs that are already part of a cluster, which is exactly what I don't have.
There are a few command options that seem like they may be able to help me. Coincidentally, these are the commands with the _least_ documentation:
* `--op fsck` and `--op repair` - I hoped that I simply needed to repair the OSD's filesystem to join it to my new cluster - nope
* `--op export` - I was hoping this exports the files - it does not
* `--op dup` - This was promising. This appears to duplicate an OSD. However, it requires the src and dest OSDs to have the __same__ OSD fsid... what?
  * NOTE: I did a bad thing and used a hex editor to replace all references of the old OSD's fsid with the new OSD's fsid. It did not like it lol
* `--op fuse` - This mounts the OSD to a mountpoint using fuse
__WAIT WHAT?__
You read that right folks. There is a way to mount the OSD using fuse. This is the __closest__ I have gotten to seeing my data!
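For reference, fuse-mounting an offline OSD looks roughly like this. The data path below is the classic non-containerized layout (under cephadm it lives at `/var/lib/ceph/<cluster fsid>/osd.<id>`), the OSD daemon has to be stopped first, and `/mnt/test` is the mountpoint I use throughout the rest of this writeup:
```
mkdir -p /mnt/test
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fuse --mountpoint /mnt/test
```
Browsing the mountpoint, each placement group shows up as a directory, and every object inside it contains (among other things):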
```
+- data (this is a binary file of the file's data)
|
+- attr
   |
   +- _parent (this is a binary file which contains the file's full path within the CephFS)
```
Holy cow, I had my data!! The problem now was extracting it. I took a peek inside the `_parent` file. It looks something like this:
```
<redacted lol, I've included it as a file in this repo>
```
Using a hex editor, I took a few hours to map out the binary file and figure out how to pick out each folder name. I'm still missing __a lot__, but I was able to determine the following (a parsing sketch follows this list):
* There are "file/folder definition" blocks
* The first block always starts at address `0x12`
* The "file/folder definition" block always starts with a `0x02 0x02`. Address mappings from here are relative:
  * address `0x03 - 0x04`: I thought the next 2 bytes were file permissions, but it wasn't consistent. ignoring for now
  * address `0x05 - 0x06`: this is 2 byte null: `0x00 0x00`
  * address `0x07 - 0x0A`: I thought these 4 bytes were a date, but it wasn't consistent. ignoring for now
  * address `0x0B - 0x0B`: this is 1 byte null: `0x00`
  * address `0x0C - 0x0C`: "_the one true byte_" - my guess is that it is some kind of alignment byte: `0x01`
  * address `0x0D - 0x0E`: this is 2 byte null: `0x00 0x00`
  * address `0x0F - 0x0F`: this is the length of the file/folder name. it tells you how many bytes to select in the next area.
    * __EXAMPLE: `0x04` for a file name that is 4 bytes long__
  * address `0x10 - 0x13`: this is 3 byte null: `0x00 0x00 0x00`
  * address `0x14 - 0x18`: this is the start of the file/folder name. its length is determined by the above result.
    * __NOTE: I'm using an example of 4 byte file name length__
  * address `0x19 - 0x19`: this byte is unknown. it has a value, but I didn't see any point in collecting it
  * address `0x1A - 0x23`: honestly, I didn't care about the information at this point. I had what I needed, which was the file/folder names. Importantly, from this byte to the next file/folder definition block is 8 bytes
* The "file/folder definition" blocks define the file's full path, but in reverse order (ie. file/parent/parent_parent/etc.../) so it needs to be reversed
After mapping it out, I then proceeded to write a Python script (a skeleton of it is sketched after this list) that would:
* Walk over each placement group folder (ie. 2.0_head)
  * NOTE: exclude the metadata folder: `/mnt/test/meta`
* Find the `_parent` file in the `attr` directory
* Scrape the file's full path information
* Copy the data file to the new cluster with the path information
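My actual `scrape.py` isn't reproduced here - this is just a hedged skeleton of that logic. It assumes a `parse_parent()` helper like the sketch above, plus the `/mnt/test` fuse mountpoint and the `/mnt/ceph-fs-storage` destination from my setup; everything else is illustrative:
```
import os
import shutil

# parse_parent() is the sketch from the previous section
FUSE_MOUNT = "/mnt/test"                 # fuse-mounted OSD
DEST_ROOT = "/mnt/ceph-fs-storage"       # freshly mounted CephFS on the new cluster

for pg in os.listdir(FUSE_MOUNT):
    if pg == "meta":                     # skip the metadata folder
        continue
    pg_all = os.path.join(FUSE_MOUNT, pg, "all")
    if not os.path.isdir(pg_all):
        continue
    for obj in os.listdir(pg_all):
        parent_file = os.path.join(pg_all, obj, "attr", "_parent")
        data_file = os.path.join(pg_all, obj, "data")
        if not os.path.isfile(parent_file):
            continue                     # skip objects without a _parent attr
        with open(parent_file, "rb") as f:
            rel_path = parse_parent(f.read()).lstrip("/")   # e.g. "data/some_dir/some_file"
        new_file = os.path.join(DEST_ROOT, rel_path)
        os.makedirs(os.path.dirname(new_file), exist_ok=True)
        shutil.copyfile(data_file, new_file)
```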
Once I ran the script to start scraping the data, I found that there were several cases where it would crash with an error similar to:
```
old file: /mnt/test/5.1b_head/all/#5:d8b4db0d:::1000009b975.00000000:head#/data
new file: /mnt/ceph-fs-storage/data/some_dir/some_file
Traceback (most recent call last):
File "/root/scrape.py", line 158, in <module>
shutil.copyfile(pgDataFile, newFile)
File "/usr/lib/python3.11/shutil.py", line 258, in copyfile
with open(dst, 'wb') as fdst:
^^^^^^^^^^^^^^^
NotADirectoryError: [Errno 20] Not a directory: '/mnt/ceph-fs-storage/data/some_dir/some_file'
```
This occurs when:
* Previously, the script found a `_parent` and `data` file for `/mnt/ceph-fs-storage/data/some_dir`.
* The `data` file had a size of 0 bytes.
* The script created a __file__ at the path `/mnt/ceph-fs-storage/data/some_dir`.
* Now, the script is trying to create a file at `/mnt/ceph-fs-storage/data/some_dir/some_file` only to find that `some_dir` is a file and not a directory.
There was a byte in the `_parent` file that tells me the file type (ie. `0x04` for folder, `0x05` for file). I parsed that out and fixed the error.
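As a rough sketch, the fix boiled down to branching on that type byte instead of guessing (the `0x04`/`0x05` values come straight from my notes; the function and argument names here are just illustrative):
```
import os
import shutil

FOLDER_TYPE = 0x04   # type byte value for folders (per my notes)
FILE_TYPE = 0x05     # type byte value for regular files

def restore_entry(entry_type: int, data_file: str, new_file: str) -> None:
    """Recreate one CephFS entry based on the type byte from its _parent blob."""
    if entry_type == FOLDER_TYPE:
        os.makedirs(new_file, exist_ok=True)    # it's a directory, not a 0-byte file
    elif entry_type == FILE_TYPE:
        os.makedirs(os.path.dirname(new_file), exist_ok=True)
        shutil.copyfile(data_file, new_file)
```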
## Large Files > 4MB
At this point, I thought I had recovered all my files. However, upon starting services (namely DBs), I found that there was data corruption. I investigated and found that files that were supposed to be larger than 4MB were truncated to 4MB. To be honest, I should have seen this coming as ceph operates by striping data.
### Investigation
I had to dive deeper into the binary files to find out more about the structure of the Bluestore filesystem. I will work with this example path for this part:
* `5.f_head` - I still think this is the placement group
* `all` - I think this means "all" files in this placement group
* `#5:f1988779:::10002413871.00000000:head#` - This is broken down into the sections below. Note that I think the `#` here are actually start/end markers for the name. Also, I think the `:` is a separator for each component.
  * `5` - The `5` matches the placement group major number
  * `f1988779` - This is a bitwise hash for the data file. I won't bother calculating matching hashes as that would take __forever__
  * `10002413871` - This is a unique file identifier
  * `00000000` - This is the (hexadecimal) sequential chunk number for the unique file
    * Only the `00000000` folder has the `_parent` file
The way this works is that a file larger than 4MB is ripped into 4MB chunks. The chunks are then spread across the OSD. Here is an example of what that could look like: