Project Saturn (Daphnis/v2) - HC2 and MooseFS - Scale-out Storage

Late night coder
Posts: 3
Joined: Fri Jul 03, 2020 11:14 am
languages_spoken: english
ODROIDs: HC2
Location: Sydney, Australia
Has thanked: 2 times
Been thanked: 6 times
Contact:

Project Saturn (Daphnis/v2) - HC2 and MooseFS - Scale-out Storage

Post by Late night coder »

Hello,


Although there seems to be interest in this kind of build, I haven't seen any end-to-end build guides, so I thought I would share this with you.

Project Saturn was started in 2006 to construct a robust Network Attached Storage capability: a NAS device that caters for the most common data storage needs (distributed/scalable, replicated/available and striped/performant storage models), within a native storage architecture that enables multi-box, multi-rack, multi-site configurations. Generation one ("Pan") was built from GlusterFS on x86/x64; the most recent incarnation ("Daphnis") has been built from MooseFS on the Odroid HC2.

The build guide describes a highly available, passively cooled, scale-out storage fabric built on 16TB drives (ST16000VE000), with whole-disk encryption and hardware security token authentication. In addition to a step-by-step build process for both the hardware and software stages, the document includes guidance for operations (covering security, performance and thermal management). For those interested, it also includes a complete Bill of Materials and cost benchmarking against the market for solution context at the time of writing.

Please find the Project Saturn (Daphnis) web site here: http://midnightcode.org/projects/saturn/daphnis/

Please find the current Saturn Installation and Operations Manual (v2.0) here: http://midnightcode.org/papers/Saturn%2 ... aphnis.pdf

There will be future updates of that manual, so keep an eye on the project site for newer versions if you're interested.

We are huge fans of the HC2, and can't wait to see what the HC3 brings when the time comes. Thanks for making such awesome equipment, team HardKernel.


Cheers,


From your friends at Midnight Code.
Attachments
A ten (10) node rack of Odroid HC2 MooseFS storage nodes powered up!
daphnis-015-full.png (3.4 MiB)
A ten (10) node rack of Odroid HC2 MooseFS storage nodes
daphnis-014-full.png (3.46 MiB)
These users thanked the author Late night coder for the post (total 3):
odroid (Fri Jul 03, 2020 12:12 pm) • mad_ady (Fri Jul 03, 2020 1:41 pm) • joy (Tue Jul 07, 2020 8:38 am)

mad_ady
Posts: 8316
Joined: Wed Jul 15, 2015 5:00 pm
languages_spoken: english
ODROIDs: XU4, C1+, C2, C4, N1, N2, H2, Go, Go Advance
Location: Bucharest, Romania
Has thanked: 573 times
Been thanked: 434 times
Contact:

Re: Project Saturn (Daphnis/v2) - HC2 and MooseFS - Scale-out Storage

Post by mad_ady »

Wow, quite a build you have there! I've gone through the documentation and it's great! Clearly this should be featured in Odroid Magazine.

One thing I didn't see (or maybe I missed it) is how it handles node failure, or how you restore a broken node.
These users thanked the author mad_ady for the post:
Late night coder (Fri Jul 03, 2020 4:44 pm)

Late night coder
Posts: 3
Joined: Fri Jul 03, 2020 11:14 am
languages_spoken: english
ODROIDs: HC2
Location: Sydney, Australia
Has thanked: 2 times
Been thanked: 6 times
Contact:

Re: Project Saturn (Daphnis/v2) - HC2 and MooseFS - Scale-out Storage

Post by Late night coder »

Thanks mad_ady, much appreciated.

Regarding node failure, I can tell you from first-hand experience that it handles failure very well indeed!

In April, while experimenting with passive cooling techniques, I put four HC2s on a book shelf that had a smooth plastic coating, with the CPU end of each HC2 overhanging the shelf. A few days into a week-long test cycle one of the HC2s fell off the shelf; it turned out that the mass of the "hanging" power and ethernet cables, combined with the drive vibration, produced a "guided walk" that drove the HC2 off the shelf and destroyed the hard drive it contained (though notably, not the HC2 itself!). That broken week-old 16TB drive was the most expensive "unnecessary new technology destruction" I've caused since I broke my old man's turntable in the mid 80s. ;-)

As expensive as that little catastrophe was, there was no data lost in the accident.

Node failure is only covered lightly in the document, mostly because it is covered in existing MooseFS material; the most relevant sections are 3.3.1 Key Concepts and 4.3.1 Storage Tiers and Storage Policy. I will add some extra anecdotal detail here for technical context.

It comes down to the MooseFS "Chunk Server". The Chunk Server stores the data (file contents) in 64MB "chunks". Communicating with the Metadata Server, the Chunk Server periodically validates chunk integrity using stored checksums and optionally replicates chunks amongst the other chunk servers in the cluster; if a chunk is found to be corrupt, it is automatically invalidated and another copy is replicated (if replication is configured, by policy). To store the chunks on disk it relies on the underlying local file system (XFS is recommended).
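
For concreteness, here is a minimal sketch of what a chunk server node's configuration can look like. The master address (192.168.1.111) matches the mount example further down; the /srv/mfs/chunks path, and the idea of one XFS-formatted drive per node mounted there, are my illustrative assumptions rather than the exact values from the Saturn manual.

# /etc/mfs/mfshdd.cfg - one path per line; each path is a directory where this
# chunk server keeps its 64MB chunks (here, an XFS-formatted drive mounted at /srv/mfs/chunks)
/srv/mfs/chunks

# /etc/mfs/mfschunkserver.cfg - point the chunk server at the Metadata (master) Server
MASTER_HOST = 192.168.1.111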

We tested the effectiveness of this checksum validation approach, because this was a key feature we wanted from the first generation of Saturn but never landed (mitigation against "bit rot"). MooseFS continually and actively assesses chunks on disk, both when they're accessed by a client and randomly over time in the background. When even a single bit is flipped in a chunk, the chunk is invalidated and removed. MooseFS notes the loss, as the number of available copies of that chunk is now below goal (i.e. there is only 1 of 2 copies in the cluster), and automatically replicates it to the next available node.

MooseFS has a labelling and policy scheme that it refers to as storage classes. There is a native MooseFS manual (the MooseFS Storage Classes Manual, at https://moosefs.com/Content/Downloads/m ... manual.pdf) dedicated entirely to storage classes. The Saturn build leverages the MooseFS manual's example of "Scenario 1: Two server rooms (A and B)" - on page 21 of their document - only we do it across two shelves to assure predictable availability within a site.

Out of the box, MooseFS will write two copies of each chunk and distribute chunks between nodes based on available storage capacity. So when one node fell off a shelf under the default storage class, for every chunk of data on the floor there was at least one other copy on the remaining nodes. Not only does this approach "survive failure", it also self-recovers. Within a couple of hours MooseFS decides that the missing Chunk Server has been gone for too long, and the chunks that are now "below goal" (i.e. there's only 1 of each chunk when there should be 2) are replicated amongst the remaining/available nodes. Within a few hours (the time depending on the amount of data to be copied), all chunks are back at goal (i.e. 2 copies each) and the infrastructure is as resilient as it was prior to the accident, but with less spare capacity available to the user.
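
If you want to watch this happening, the standard MooseFS client tools will show you where the copies of a given file's chunks currently live (the mount point and file name below are just examples):

# show which chunk servers hold each chunk (and each copy) of a file
mfsfileinfo /mnt/daphnis/some/large/file.bin

# summarise how many valid copies exist for the chunks of a file
mfscheckfile /mnt/daphnis/some/large/file.bin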

The default behaviour is an excellent model for high availability but it lacks predictability. Turning off any two nodes simultaneously is likely to result in the user missing data. Turning off 50% of the nodes simultaneously is likely to leave the user with almost no "whole" files.

The Saturn manual sets out to produce highly available storage via two shelves of storage, where one shelf is "mirrored" to/from the other. In order to align the logical grouping with physical availability features, such as switch ports and uninterruptible power supplies (UPS), we need to know which group of physical devices will always have "the other copy".

As a part of the node installation process (section 3.3.4 Activity 3: Install and Configure the Software; Step 10, Chunk Server sub-section) the nodes were grouped: nodes c1, c2, c3, c4 and c5 were given the label "A", and nodes c6, c7, c8, c9 and c10 were given the label "B". All of the "A" nodes are to be deployed to one "location" (a shelf with its own power and network infrastructure), while all of the "B" nodes are to be deployed to a separate, independent "location" (shelf).
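
As a rough illustration of how that grouping is expressed (the exact step is the manual section referenced above), the label is set per node in the chunk server configuration; treat this as a sketch rather than a copy of the manual's config:

# /etc/mfs/mfschunkserver.cfg on nodes c1-c5 (shelf "A")
LABELS = A

# /etc/mfs/mfschunkserver.cfg on nodes c6-c10 (shelf "B")
LABELS = B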

To ensure that one complete copy of the data set is available when either "location" (shelf) goes offline, we need to create a storage class that defines that policy intent. We call this storage class "Gold", as we're going to give this data Gold Class treatment. Mounting the presented MooseFS file system (mount -t moosefs 192.168.1.111:/ /mnt/daphnis), creating the new storage class (mfsscadmin /mnt/daphnis create A,B Gold), and then applying that class to the root directory (mfssetsclass Gold /mnt/daphnis) ensures that every chunk is in two places: 1 copy on any A node and 1 copy on any B node. The result is that the logical availability grouping now matches the physical availability grouping - and you can test this by turning off a whole shelf and seeing that you have not lost any content.
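
Putting those three steps together in order (the first three commands are straight from the paragraph above; the -r note and the mfsgetsclass check are my additions based on the standard MooseFS client tools, so verify them against the manual):

# mount the MooseFS namespace presented by the Metadata Server
mount -t moosefs 192.168.1.111:/ /mnt/daphnis

# create the "Gold" storage class: one copy on any A-labelled node, one on any B-labelled node
mfsscadmin /mnt/daphnis create A,B Gold

# apply the class to the root directory (-r would also re-class anything already stored)
mfssetsclass Gold /mnt/daphnis

# sanity check: confirm which storage class the directory now carries
mfsgetsclass /mnt/daphnis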

There is one caveat to all of the above, and that is the lone Metadata Server in the cluster; this is also addressed in the document, within the constraints of the software (MooseFS Pro is a commercial version of the software that doesn't have those constraints).

For historical context, the above disk/node/policy approach (which is the two copy version of what we have referred to as the "three copy principle", aka "store it once, plus one copy in the same rack, and another copy in a different rack") was originally (to my knowledge) described by Google: http://web.archive.org/web/200902201517 ... s/gfs.html
These users thanked the author Late night coder for the post:
mad_ady (Fri Jul 03, 2020 5:08 pm)

mad_ady
Posts: 8316
Joined: Wed Jul 15, 2015 5:00 pm
languages_spoken: english
ODROIDs: XU4, C1+, C2, C4, N1, N2, H2, Go, Go Advance
Location: Bucharest, Romania
Has thanked: 573 times
Been thanked: 434 times
Contact:

Re: Project Saturn (Daphnis/v2) - HC2 and MooseFS - Scale-out Storage

Post by mad_ady »

Thanks for the extra details.

Though I value my data, not all of it is that important in case of failure. So, I guess you could have different storage classes - one where important data is backed up on more disks, and one where it's not.
I can see this as important for "enterprise"-class use cases, but for my home needs, I'm against having the disks always on, or running bit-rot checks periodically, considering they're probably being used about 1-2 hours per day.
Can your setup (or moosefs) keep disks spun down to minimize energy usage and prolong disk life, or is it against the main goals?

Late night coder
Posts: 3
Joined: Fri Jul 03, 2020 11:14 am
languages_spoken: english
ODROIDs: HC2
Location: Sydney, Australia
Has thanked: 2 times
Been thanked: 6 times
Contact:

Re: Project Saturn (Daphnis/v2) - HC2 and MooseFS - Scale-out Storage

Post by Late night coder »

Yes, you're definitely able to apply different storage classes - even by directory. I.e. you could have /my/bills stored at 1 copy, /my/work stored at 2 copies and /my/familyphotos stored at 3 copies - all from a single mount point to the user. Storage classes can also be applied at any time, without forcing the user to remount; meaning /my/familyphotos could be changed to a policy of 4 copies, or 2 copies, while in active use.
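
As a sketch of what that can look like in practice - the directory names are from the example above, while the class names and label expressions are made up for illustration, following the same pattern as the Gold class earlier:

# 1 copy on shelf A only, 2 copies (one per shelf), 3 copies (one on A, two on B)
mfsscadmin /mnt/daphnis create A Bronze
mfsscadmin /mnt/daphnis create A,B Silver
mfsscadmin /mnt/daphnis create A,B,B Photos

# apply per directory, recursively, under the one mount point
mfssetsclass -r Bronze /mnt/daphnis/my/bills
mfssetsclass -r Silver /mnt/daphnis/my/work
mfssetsclass -r Photos /mnt/daphnis/my/familyphotos

# changing policy later is just another mfssetsclass, even while the data is in active use
mfssetsclass -r Silver /mnt/daphnis/my/familyphotos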

The constant chunk evaluations were a key driver for us. We came from Saturn Pan, which had no bit rot detection/correction, and we saw single bit flips in large files with proprietary data formats. So between two sites we had four copies of the same file, where one site had two copies with the same sha1sum and the other site had two copies with the same sha1sum, but the sha1sums from the two sites didn't match. One site had locally replicated a bit-flipped copy - but as to which site .. ?

To your point though, the benefit there was that we were able to let the drives spin down.

Purely from the MooseFS perspective, it isn't an option I've looked at. You might find that some sneaky tuning of /etc/mfs/mfschunkserver.cfg may allow you to disable the chunk tests (perhaps extremely low values of HDD_TEST_SPEED and/or extremely high values of HDD_MIN_TEST_INTERVAL), but I couldn't tell you what the implications of those changes would be for downstream functions (goals/replications).
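
For anyone who wants to experiment with that, the two knobs mentioned above live in the chunk server config; the values below are purely illustrative guesses in the spirit of the paragraph above, not something we have tested:

# /etc/mfs/mfschunkserver.cfg - candidate knobs for quieting the background surface scan.
# Check mfschunkserver.cfg(5) for exact units and defaults before relying on these values.
# HDD_TEST_SPEED = 0.1            # very low scan rate to slow the read-back tests
# HDD_MIN_TEST_INTERVAL = 604800  # large minimum gap between re-tests (assumed to be in seconds)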

Regarding drive life, one of the advantages of choosing the Seagate surveillance line is that it is intended for continuous use (https://www.seagate.com/www-content/dat ... -en_AU.pdf). And when layering capability on top, the limits of that product line - like having no encryption, or a maximum supported number of drives per NAS enclosure - are almost irrelevant (depending on your workload).
These users thanked the author Late night coder for the post (total 2):
mad_ady (Sat Jul 04, 2020 1:44 pm) • odroid (Mon Jul 06, 2020 9:52 am)

JeanSmith
Posts: 1
Joined: Sun Apr 19, 2020 1:54 am
languages_spoken: english
Has thanked: 0
Been thanked: 1 time
Contact:

Re: Project Saturn (Daphnis/v2) - HC2 and MooseFS - Scale-out Storage

Post by JeanSmith »

I like this project, I think it is very useful.
These users thanked the author JeanSmith for the post:
Late night coder (Tue Jul 07, 2020 7:31 am)


