Merge pull request #1 from chucklz1515/add-story-to-readme

add-story-to-readme
This commit is contained in:
chucklz
2025-01-31 12:13:43 -05:00
committed by GitHub
2 changed files with 248 additions and 0 deletions

README.md
@@ -1,2 +1,250 @@
# ceph-osd-file-extractor
This is a Python script which extracts files from a fuse-mounted ceph OSD.
# Background
I have a home lab and run a docker swarm across 3 Linux (Debian) nodes. The nodes also run a ceph cluster serving a CephFS. The CephFS uses a pool with x3 replication (in my head, this means that each node has a 100% copy of the data - I'm sure this logic is incorrect). I mount the CephFS at `/mnt/ceph-fs-storage` on each node, and the docker containers use it for bind volumes. This means the docker swarm can place a container on any node and it still has access to the data it needs to mount.
Now, I'm not rocking a 45Drives rack or anything. I have these [ODROID H-3](https://www.hardkernel.com/shop/odroid-h3/) computers. They are quite beefy: 4 CPU cores, 64GB RAM, and I put a 1TB Samsung M.2 SSD in each.
The trouble is: I don't have external disks.
So, I created them, _virtually_:
```
# run below on each node
# ~205GiB empty file (2100 * 100MiB)
dd if=/dev/zero of=/vdisk/ceph01 bs=100M count=2100
# bind that empty file as a block device using loop
losetup /dev/loop0 /vdisk/ceph01
# create LVM volume group on the loop device
vgcreate cephVolGrp /dev/loop0
# create LVM volume in the group
lvcreate -n cephVol01 -L 200g cephVolGrp
# backup the LVM config - no seriously, DO THIS
vgcfgbackup -f /root/cephVolGrp-cephVol01
```
Now I have a logical block volume on the server that I can use for OSD data. Note that ceph, rightfully, doesn't present it when scanning for disks. Instead, I have to add it manually. But it still works fine:
```
ceph orch daemon add osd node01:/dev/cephVolGrp/cephVol01
ceph orch daemon add osd node02:/dev/cephVolGrp/cephVol01
ceph orch daemon add osd node03:/dev/cephVolGrp/cephVol01
```
I recently encountered a failure with my ceph cluster. I'm accepting blame as it was 100% my fault for various reasons:
* Used Debian's main repo for ceph packages (should have used ceph's repo instead).
* Failed to upgrade ceph. Was running v16-pacific instead of v18-reef (which was already latest/stable at the time I built the cluster...).
* Was running old ceph (v16-pacific) on Debian Bookworm (old binaries _should_ not have even worked).
* Followed ceph's documentation for setting up a cluster. I'll only take half-blame here as their documentation is:
* Moderately lacking in relevant information - there are lots of commands that are out-of-date or unsupported. Some commands must be executed inside podman/docker containers (without the docs indicating as much).
* Failed to convey the importance of their cluster setup configurations.
* Upon realizing I was out-of-date, I switched to their package repo and upgraded from 16-pacific to 18-reef without any consideration.
* Destroyed my cluster without backing up configurations (except I did back up /var/lib/ceph, though it didn't matter).
* Not backing up my data.
That said, there were some saving graces:
* I didn't encrypt my cluster/OSDs. I likely would not have been able to recover anything if I had encrypted them.
* My disks were still "good". Meaning I didn't have a hardware failure.
* The ceph tooling allows fuse-mounting an OSD. More on this below.
# Time Taken
I spent around 36 hours troubleshooting and effectively _becoming_ a ceph master (journeyman?) simply by trying to recover my cluster. I want this repo to be a place where ceph noobs can come for a sliver of hope in recovering their data.
# Considerations
Obviously, the holy grail recovery would be to magically do _something_ and have my ceph cluster come back. However, this is likely not going to happen. Instead, I set myself up with some realistic goals in descending order of hope (which, coincidentally, is increasing order of likelihood):
* I want to set up a new ceph cluster and re-attach the old OSDs.
* I want to set up a new ceph cluster with new OSDs and clone the data from the old OSDs.
* I want to extract my files from my old OSDs into a new filesystem.
* I want to extract file contents from my old OSDs.
* I want to get my documents/pictures back :(
# Troubleshooting Journey
I spent 60% of those 36 hours trying to re-import the old OSDs using one of the approaches below (none of which worked, BTW):
## Recreation of the old ceph cluster
```
:'(
```
## New ceph cluster with _same_ cluster GUID (ie. re-use old cluster's fsid)
```
# bootstrap new cluster with old cluster's fsid
#NOTE: make sure to move the /var/lib/ceph/d8c1c426-cf4c-11ee-aa6f-001e06453165 dir to a new name: /var/lib/ceph/d8c1c426-cf4c-11ee-aa6f-001e06453165_orig
cephadm bootstrap \
--fsid d8c1c426-cf4c-11ee-aa6f-001e06453165 \
--allow-overwrite \
--mon-ip 192.168.60.4 \
--cluster-network 192.168.60.0/24 \
--skip-mon-network \
--skip-ssh;
# ... skipping other ceph cluster setup details ...
# now try to activate the OSDs on each node:
# node 01 - had osd.0 with an osd fsid 9f570bd7-ea93-44ff-9628-67a9c0a05b51
ceph-volume lvm activate 0 9f570bd7-ea93-44ff-9628-67a9c0a05b51
# node 02 - had osd.1 with an osd fsid 66e38bde-3ee3-46ba-ad01-2201e534fdea
ceph-volume lvm activate 1 66e38bde-3ee3-46ba-ad01-2201e534fdea
# node 03 - had osd.2 with an osd fsid dbe70c6f-f235-4616-ae48-a47839355eb6
ceph-volume lvm activate 2 dbe70c6f-f235-4616-ae48-a47839355eb6
```
## Brand new shiny ceph cluster (ie. new cluster fsid)
```
# bootstrap new cluster with new cluster fsid
cephadm bootstrap \
--allow-overwrite \
--mon-ip 192.168.60.4 \
--cluster-network 192.168.60.0/24 \
--skip-mon-network \
--skip-ssh;
# try to adopt old OSDs on each node:
# node 01
cephadm adopt --style legacy --name osd.0
# node 02
cephadm adopt --style legacy --name osd.1
# node 03
cephadm adopt --style legacy --name osd.2
```
I spent the remaining 40% simply trying to recover the files. Since I was using this ceph cluster to store files for docker mounts, they are simple "files" and nothing crazy like sym/hardlinks, etc. This makes recovery _easier_ (again, in my head).
## First up
It seems that, no, you cannot simply browse the OSDs. The way that ceph OSDs work (to my understanding) is that they don't create a filesystem as you know it. Yes, the "filesystem" is called BlueStore and it is a really real file system, but tools like (g)parted can't read it. So I have to rely on ceph tooling.
### OR DO I?
```
> binwalk /dev/cephVolGrp/cephVol01
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
6160384 0x5E0000 JPEG image data, EXIF standard
6160396 0x5E000C TIFF image data, little-endian offset of first image directory: 8
...
```
Yay! I see file markers! That means I should be able to extract them, right?
```
> binwalk -e /dev/cephVolGrp/cephVol01
<results in a 200GB zlib file>
```
...okay, that's not what I was expecting. Oh! Maybe the OSD was compressed and I have a compressed file. Let's try to decompress it!
```
> pigz -dc _extracted/file.zlib
<results in a 200GB uncompressed file>
```
...there are no files here, are there?
## I DO!
I knew I would fall into a recurring pattern of extracting/decompressing for every possible scenario when that may not even be what's happening. I also couldn't guarantee that I would have filenames or paths. I really needed all 3:
* data
* filename
* path
I don't care _too_ much about file owner/permissions or modified date. If I get them, then bonus. But oftentimes the docker services will run all the relevant commands to fix permissions when they start.
Enter: [ceph-objectstore-tool](https://docs.ceph.com/en/reef/man/8/ceph-objectstore-tool/)
This is a ceph tool that is used to work with OSDs. It is probably the tool I need to use to extract my files from an OSD. There are a few problems though:
* Documentation is __severely__ lacking. I'm not kidding.
* Half the commands will only work on OSDs that are already part of a cluster, which is exactly what I don't have.
There are a few command options that seem like they may be able to help me. Coincidentally, these are the commands with the _least_ documentation:
* `--op fsck` and `--op repair` - I hoped that I simply needed to repair the OSD's filesystem to join it to my new cluster - nope
* `--op export` - I was hoping this exports the files - it does not
* `--op dup` - This was promising. This appears to duplicate an OSD. However, it requires the src and dest OSDs to have the __same__ OSD fsid... what?
* NOTE: I did a bad thing and used a hexeditor to replace all references of the old OSD's fsid with the new OSD's fsid. It did not like it lol
* `--op fuse` - This mounts the OSD to a mountpoint using fuse
__WAIT WHAT?__
You read that right folks. There is a way to mount the OSD using fuse. This is the __closest__ I have gotten to seeing my data!
```
ceph-objectstore-tool --no-mon-config --op fuse --data-path /var/lib/ceph/d8c1c426-cf4c-11ee-aa6f-001e06453165_orig/osd.2/ --mountpoint /mnt/test
ls -la /mnt/test
total 0
drwx------ 0 root root 0 Dec 31 1969 2.0_head
drwx------ 0 root root 0 Dec 31 1969 3.0_head
drwx------ 0 root root 0 Dec 31 1969 3.10_head
drwx------ 0 root root 0 Dec 31 1969 3.11_head
drwx------ 0 root root 0 Dec 31 1969 3.12_head
...
```
However, the fight is not over yet.
## The Fuse-Mounted OSD
I poked around a bit inside the fuse-mounted OSD. After a short while I was able to piece some things together:
* The top-level folder names (ie. 2.0_head) _appear_ to be the placement group data.
* There is a `_parent` binary file which encodes the file's full path within the CephFS
* There is a `data` binary file which contains the file's data
Folder structure looks like this:
```
/mnt/test
|
+- 3.4_head (I think this is the placement group)
   |
   +- all
      |
      +- #3:20000000::::head#
         |
         +- data (this is a binary file of the file's data)
         |
         +- attr
            |
            +- _parent (this is a binary file which contains the file's full path within the CephFS)
```
Holy cow, I had my data!! The problem now is extracting it. I took a peek inside the `_parent` file. It looks something like this:
```
<redacted lol, I've included it as a file in this repo>
```
Using a hex editor, I took a few hours to map out the binary file and work out how to select each folder name. I'm still missing __a lot__, but I was able to determine the following (a parser sketch follows this list):
* There are "file/folder definition" blocks
* The first block always starts at address `0x12`
* The "file/folder definition" block always starts with a `0x02 0x02`. Address mappings from here are relative:
* address `0x00 - 0x01`: beginning `0x02 0x02` block marker
* address `0x03 - 0x04`: I thought the next 2 bytes were file permissions, but it wasn't consistent. Ignoring for now
* address `0x05 - 0x06`: this is 2 byte null: `0x00 0x00`
* address `0x07 - 0x0A`: I thought these 4 bytes were a date, but it wasn't consistent. Ignoring for now
* address `0x0B - 0x0B`: this is 1 byte null: `0x00`
* address `0x0C - 0x0C`: "_the one true byte_" - my guess is that it is some kind of alignment byte: `0x01`
* address `0x0D - 0x0E`: this is 2 byte null: `0x00 0x00`
* address `0x0F - 0x0F`: this is the length of the file/folder name. it tells you how many bytes to select in the next area.
* __EXAMPLE: `0x04` for file name that is 4 bytes long__
* address `0x10 - 0x13`: this is 3 byte null: `0x00 0x00 0x00`
* address `0x14 - 0x18`: this is the start of the file/folder name. its length is determined by the above result.
* __NOTE: I'm using an example of 4 byte file name length__
* address `0x19 - 0x19`: this byte is unknown. It has a value, but I didn't see any point in collecting it
* address `0x1A - 0x23`: honestly, I didn't care about the information at this point. I had what I needed, which was the file/folder names. Importantly, from this byte to the next file/folder definition block is 8 bytes
* The "file/folder definition" blocks define the file's full path, but in reverse order (ie. file/parent/parent_parent/etc.../) so it needs to be reversed
After mapping it out, I then proceeded to write a Python script that would (roughly sketched below the list):
* Walk over each placement group folder (ie. 2.0_head)
* NOTE: exclude the metadata folder: `/mnt/test/meta`
* Find the `_parent` file in the `attr` directory
* Scrape the file's full path information
* Copy the data file to the new cluster with the path information
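A rough outline of that walk is below. Treat it as a sketch rather than the real script: `/mnt/test` is the fuse mountpoint from earlier, `/mnt/recovered` is a placeholder destination, and `parse_parent` refers to the sketch in the previous section.
```
import os
import shutil

# parse_parent() is the sketch from the previous section; paste or import it here.
MOUNT = "/mnt/test"       # fuse-mounted OSD
DEST = "/mnt/recovered"   # placeholder destination filesystem

for pg in os.listdir(MOUNT):                          # e.g. "2.0_head", "3.4_head"
    if pg == "meta":                                  # skip the OSD metadata folder
        continue
    all_dir = os.path.join(MOUNT, pg, "all")
    if not os.path.isdir(all_dir):
        continue
    for obj in os.listdir(all_dir):                   # e.g. "#3:20000000::::head#"
        data_file = os.path.join(all_dir, obj, "data")
        parent_file = os.path.join(all_dir, obj, "attr", "_parent")
        if not (os.path.isfile(data_file) and os.path.isfile(parent_file)):
            continue
        rel_path = parse_parent(parent_file)          # reconstruct the CephFS path
        if not rel_path:
            continue
        dest = os.path.join(DEST, rel_path.lstrip("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copyfile(data_file, dest)              # data only; owner/perms/dates are lost
```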
# Conclusion
I was able to successfully recover my files. Granted, they have no metadata (correct permissions, datetime, etc.), but I haven't lost anything.
# Final Words
__BACKUP YOUR DATA YOU FOOLS!!!__

BIN
_parent (binary file, not shown)