Computer data storage

📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.

Filesystems

(with a focus on distributed filesystems)

http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_parallel_fault-tolerant_file_systems

http://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems

Distinctions around multiple disks - LVM, NAS, SAN, and distributed, syncing storage, etc.

NFS notes

See NFS notes

Relevant here: pNFS / PanFS / Panasas

SMB notes

See SMB, CIFS, Samba, Windows File Sharing notes

GlusterFS

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Fault tolerance:  with replication there is a self-healing scrub operation 
Speed:            scales well. No central metadata server, so some metadata operations have to ask several bricks and are not as fast as read/write
Load balancing:   implied by striping, distribution
Access:           library (gfapi), POSIX-semantics mount (via FUSE), built-in NFSv3, or block device (since 2013)
Expandable:       yes   (add bricks, update configuration. with some important side notes, though)
Networking:       TCP/IP, Infiniband (RDMA), or SDP

CAP-wise it seems to be AP.

Pretty decent streaming throughput, but some metadata operations are slowish, so it is arguably good for large-dataset work and less ideal for many-client work.

Relatively easy to manage compared to various others, though resolving problems is reportedly more involved.

Fairly widely used, which means a good deal of support in the form of experience (forums, IRC, Red Hat support).

No authentication

You can use your generic-purpose firewall to not allow the wrong machines in (...or...)

gluster itself can have per-volume host whitelists, e.g. gluster volume set volname auth.allow 192.168.1.*
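
A minimal sketch of setting and checking such a whitelist. The volume set and volume info subcommands are regular gluster CLI; the volume name, the address patterns, and the GLUSTER dry-run variable are assumptions made up for this example.

```shell
# Dry run by default: commands are printed, not executed.
# Set GLUSTER=gluster in the environment to run them for real.
GLUSTER="${GLUSTER:-echo gluster}"

# restrict_volume VOLNAME PATTERNS
# PATTERNS is a comma-separated list such as '192.168.1.*,10.0.0.5'.
# Quote it so the shell does not expand the '*' against local file names.
restrict_volume() {
    $GLUSTER volume set "$1" auth.allow "$2"
}

# show_volume VOLNAME: options that were changed show up in the volume info.
show_volume() {
    $GLUSTER volume info "$1"
}
```

For example, restrict_volume volname '192.168.1.*,10.0.0.5' followed by show_volume volname. This only filters by address; it is not authentication of users.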

The POSIX interface means it stores numeric UIDs and GIDs, so you probably want those to mean the same thing on all participating hosts.

That probably means YP (NIS) or similar.
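
If you are not sure whether hosts already agree, one way to check is to compare their account lists. The sketch below assumes you have saved the output of getent passwd from two hosts into two files (the file names in the usage note are hypothetical); the function name is made up for this example.

```shell
# uid_mismatches PASSWD_A PASSWD_B
# Both files are in passwd(5) format (name:password:UID:GID:...).
# Prints "name uid_in_A uid_in_B" for every account that is present
# in both files but has a different numeric UID.
uid_mismatches() {
    awk -F: '
        FILENAME == ARGV[1] { uid[$1] = $3; next }          # remember UIDs from the first file
        ($1 in uid) && uid[$1] != $3 { print $1, uid[$1], $3 }  # report differences in the second
    ' "$1" "$2"
}
```

Usage would be something like uid_mismatches passwd.server1 passwd.server2. No output means the accounts both hosts share have matching UIDs; the same idea applies to GIDs with getent group.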

Any host that runs glusterfsd is effectively a server. Each server can have one or more storage bricks (see terms), which can be dynamically included into one storage pool.

How to use bricks (e.g. stripe, mirror) is up to configuration, which is client-controlled, can be changed at will, and part of a storage pool's shared state.

To deal (consistency-wise) with bricks that were temporarily offline, a daemon that handles heal operations was introduced. (Before that, healing was more manual, so you really didn't want bricks to drop out.)
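
A rough sketch of poking at healing from the command line. The heal subcommands below exist in gluster releases that have the self-heal daemon, but check them against your version; the function names and the GLUSTER dry-run variable are assumptions for illustration.

```shell
# Dry run by default: commands are printed, not executed.
# Set GLUSTER=gluster in the environment to run them for real.
GLUSTER="${GLUSTER:-echo gluster}"

# heal_pending VOLNAME: list entries that still need healing.
heal_pending() {
    $GLUSTER volume heal "$1" info
}

# heal_now VOLNAME: ask for a heal of the entries known to need it.
heal_now() {
    $GLUSTER volume heal "$1"
}

# heal_full VOLNAME: ask for a heal pass over the whole volume (slower).
heal_full() {
    $GLUSTER volume heal "$1" full
}
```

After a brick has been offline, heal_pending volname is probably the first thing to look at.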

On latency

Some operations (mostly on metadata) slow down in proportion to RTT. Some do so because all servers involved in a volume must be contacted, for example for self-heal checks, which are done at file open time. Others do so because the classical POSIX interface makes them sequential (consider e.g. ls). The latter is true for any distributed filesystem's POSIX interface, which is also why some choose not to have one.

This also depends on the translators in place - consider for example what replication implies. It also means that geo-replication using the basic replication translator is probably a bad idea (there is clever geo-replication you can use instead).
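
To get a feel for the numbers, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that each operation waits for exactly one network round trip before the next one starts; real workloads will differ, and the function name is made up.

```shell
# sequential_cost_ms OPS RTT_US
# Lower bound, in milliseconds, for OPS operations that each wait for
# one round trip of RTT_US microseconds before the next can start.
sequential_cost_ms() {
    echo $(( $1 * $2 / 1000 ))
}
```

Under that assumption, sequential_cost_ms 10000 500 (ten thousand files, 0.5 ms LAN round trip) gives 5000, so five seconds. The same listing over a 50 ms link, sequential_cost_ms 10000 50000, gives 500000, which is over eight minutes. No amount of bandwidth helps here, only fewer or parallel round trips.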

Terminology

storage pool - a trusted network of storage servers.

basically a cooperating set of hosts, which may cooperatively manage zero or more volumes

server - a host that can expose bricks to clients

client - will have a configuration file mentioning bricks on servers, and translators to use.

brick - typically corresponds to a distinct backing storage disk. Any particular network node may easily have a few. In more concrete terms, it is a directory on an existing local filesystem location exposed as usable by gluster.

translator - given a brick/subvolume, applies some feature/behaviour, and presents it as a subvolume

volume - the final subvolume, the one considered mountable

subvolume - a brick once processed by a translator

internal terminology you may not really need to care about

On translators

Translators are client-side (!), pluggable, stackable things, including:

Storage stuff:

distribute - different files go to different bricks

each file goes to just one brick - gives no data safety, only gives an apparently larger drive

sort of file-level RAID0

stripe - different parts of a file go to different bricks

like distribute, but more fine-grained. Often faster for concurrent and/or random access to large files

sort of block-level RAID0

replicate - stores every file on more than one brick (2 to all, depending on settings)

sort of file-level RAID1 (if set to all; something in between if not). (If you want more efficient use of your storage, look at things like RozoFS, which is like network RAID6.)

And some functional things like

load balancing (between bricks)

Volume failover

Scheduling and disk caching

Storage quotas

Note that a brick is often just a backing directory with files.

In many cases (distribute, replicate) those files correspond 1:1 to the files the volume presents (which can be nice for recovery, and maybe some informed cheating).

In other cases (stripe) they do not.
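
To make the distribute idea concrete, here is a toy placement function: the brick is chosen from the file name alone, so any client can work out where a whole file lives without asking anyone. This is a sketch only. cksum (a CRC) and the plain modulo are stand-ins chosen for this example, not the hash or layout logic gluster actually uses.

```shell
# pick_brick FILENAME BRICK_COUNT
# Prints a brick index in 0..BRICK_COUNT-1 derived from the file name.
# The same name always maps to the same brick.
pick_brick() {
    sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)   # checksum of the name, not the content
    echo $(( sum % $2 ))
}
```

For example, pick_brick report.pdf 4 prints one index between 0 and 3, and prints the same one every time. It also shows why this gives no data safety by itself: each file exists on exactly one brick.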

Getting started

Peers within a storage pool

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

To see the list (and status) of peer servers:

gluster peer status

To add hosts to the storage pool:

gluster peer probe servername

You can do this from any current member.

Do a gluster peer status again to see what happened. (...on all nodes if you wish. If you don't use DNS, you may find that the host you probed knows the prober only by IP; you may wish to do an explicit probe from the probed host back to the prober, just to make it realize it has a name.)

You can detach peers too, though not while they are part of a volume.
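
The peer commands above, put together as a small sketch. peer probe, peer status and peer detach are regular gluster CLI; the host names in the usage note, the function names, and the GLUSTER dry-run variable are assumptions for this example.

```shell
# Dry run by default: commands are printed, not executed.
# Set GLUSTER=gluster in the environment to run them for real.
GLUSTER="${GLUSTER:-echo gluster}"

# add_peers HOST...: probe each host from a current pool member,
# then show the resulting peer list.
add_peers() {
    for host in "$@"; do
        $GLUSTER peer probe "$host"
    done
    $GLUSTER peer status
}

# remove_peer HOST: refused while HOST still has bricks in a volume.
remove_peer() {
    $GLUSTER peer detach "$1"
}
```

For example, add_peers server2 server3 run on server1.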

Creating volumes
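
A minimal sketch of creating and starting a replicated volume. The volume create and volume start subcommands are regular gluster CLI; the volume name, replica count and brick paths in the usage note are placeholders, and the function name and the GLUSTER dry-run variable are introduced only for this example.

```shell
# Dry run by default: commands are printed, not executed.
# Set GLUSTER=gluster in the environment to run them for real.
GLUSTER="${GLUSTER:-echo gluster}"

# create_replicated VOLNAME REPLICA_COUNT BRICK...
# Each BRICK is written as host:/path/to/brick/directory.
# The number of bricks should be a multiple of REPLICA_COUNT.
create_replicated() {
    vol=$1
    count=$2
    shift 2
    $GLUSTER volume create "$vol" replica "$count" "$@"
    # A volume has to be started before clients can mount it.
    $GLUSTER volume start "$vol"
}
```

For example, create_replicated volname 2 server1:/data/brick1/volname server2:/data/brick1/volname. Leaving out the replica part in the create command should give a plain distributed volume.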

Mounting volumes

One-shot:

mount -t glusterfs host:/volname /mnt/point

fstab:

host:/volname /mnt/point glusterfs defaults,_netdev 0 0
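
If you script this, you may want adding the fstab line to be repeatable. A small sketch, with made-up function names, that writes the same line as above and only appends it when it is not already there:

```shell
# fstab_entry HOST VOLNAME MOUNTPOINT
# Prints an fstab line for a gluster volume.
# _netdev delays the mount until networking is up.
fstab_entry() {
    printf '%s:/%s %s glusterfs defaults,_netdev 0 0\n' "$1" "$2" "$3"
}

# add_fstab_entry FSTAB HOST VOLNAME MOUNTPOINT
# Appends the line to FSTAB unless an identical line already exists.
add_fstab_entry() {
    line=$(fstab_entry "$2" "$3" "$4")
    grep -qxF -- "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >> "$1"
}
```

Try it against a scratch file first, e.g. add_fstab_entry /tmp/fstab.test host volname /mnt/point, before pointing it at /etc/fstab.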

Maintenance, failure

expand, shrink; migrate, rebalance
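
A sketch of the usual command sequence for growing and shrinking a volume. add-brick, rebalance and remove-brick are regular gluster CLI subcommands, but details vary per version and volume type (for replicated volumes, bricks are added and removed in multiples of the replica count), so treat this as an outline. Function names and the GLUSTER dry-run variable are assumptions for this example.

```shell
# Dry run by default: commands are printed, not executed.
# Set GLUSTER=gluster in the environment to run them for real.
GLUSTER="${GLUSTER:-echo gluster}"

# expand_volume VOLNAME BRICK...: add bricks, then spread existing data over them.
expand_volume() {
    vol=$1
    shift
    $GLUSTER volume add-brick "$vol" "$@"
    $GLUSTER volume rebalance "$vol" start
}

# rebalance_status VOLNAME: progress of a running rebalance.
rebalance_status() {
    $GLUSTER volume rebalance "$1" status
}

# shrink_volume_step VOLNAME BRICK STEP, with STEP one of start, status, commit.
# start migrates data off the brick; commit only once status reports completion.
shrink_volume_step() {
    $GLUSTER volume remove-brick "$1" "$2" "$3"
}
```

For example, expand_volume volname server3:/data/brick1/volname, then rebalance_status volname until it is done.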

MooseFS

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Similar to Google File System, Lustre, Ceph

Fault tolerance: replication (per file or directory)
Speed: Striping (for more aggregate bandwidth)
Load balancing: Yes
Security: user auth, POSIX-style permissions

Userspace.

Fault tolerance: MooseFS uses replication. Data can be replicated across chunkservers, and the replication ratio (N) is set per file or directory.
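
As a sketch of what per-directory replication looks like in practice: classic MooseFS client tools include mfssetgoal and mfsgetgoal for this, though newer releases may organize it differently, so check your version. The mount path in the usage note, the function names, and the RUN dry-run variable are assumptions for this example.

```shell
# Dry run by default: commands are printed, not executed.
# Run with RUN set to an empty string (RUN= sh script.sh) to execute them.
RUN="${RUN-echo}"

# set_copies COPIES PATH: ask for COPIES copies of everything under PATH.
# -r applies the goal recursively.
set_copies() {
    $RUN mfssetgoal -r "$1" "$2"
}

# show_copies PATH: show the goal currently set on PATH.
show_copies() {
    $RUN mfsgetgoal "$1"
}
```

For example, set_copies 3 /mnt/mfs/important for data you care about, and a lower number for things that are cheap to regenerate.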

Easy enough to set up.

Hot-add/remove

Single metadata server?

http://en.wikipedia.org/wiki/Moose_File_System

LizardFS

Fork of MooseFS

Ceph FS

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Fault tolerance: Replication (settings are pool-wide (verify)), journaling (verify).
Speed:           scales well, generally good
Load balancing:  implied by striping
Access:          POSIX-semantics mount, library, block device. Integrates with some VMs.
Expandable:      yes
Networking:      
License? Paid?   Open source and free. Paid support offered.

Seems to focus more on scalability, failure resistance, and some features useful in virtualization environments, and to some degree on easy management.

...at some cost of throughput in typical use, e.g. compared to gluster, though some of that can be mitigated if you know what you are doing.

Drive failure is dealt with well, so there is no critical replacement window as there is with RAID5, RAID6.

Commonly used, apparently still a bit ahead of gluster in that respect.

Still marked as a work in progress with hairy bits, but quite mature in many ways. Opinion seems to be "a bit of a bother, but works very well".

Its documentation is not yet as mature, so it is not the easiest to set up.

Can be used as a block device as well as for files (verify)

Lustre

Sheepdog

BeeGFS (previously Fraunhofer Parallel File System, FhGFS)

RozoFS

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Fault tolerance: can deal with missing nodes
Speed:           seems good, and better than some on small IO
Load balancing:  distributed storage
Security:
Access:          POSIX-like

Roughly: like gluster, but deals with missing nodes RAID-style (more specifically, with an erasure coding algorithm).

Has a single(?) metadata server

SeaweedFS

MogileFS

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Fault tolerance:  configurable replication
                  avoids single point of failure - all components can be run on multiple machines
Speed:            
Load balancing:   
Access:           
Expandable:       
Networking:

Userspace (no kernel modules)

Files are replicated as configured for the class they are in, so you can have some kinds of files be safer, while saving disk space on things you could cheaply rebuild.

XtreemFS

HDFS

Gfarm

CXFS (Clustered XFS)

https://en.wikipedia.org/wiki/CXFS

pCIFS - clustered Samba

(with gluster underneath?) http://wiki.samba.org/index.php/CTDB_Setup

OCFS2

PVFS2

OrangeFS

OpenAFS

Tahoe-LAFS

DFS

http://www.windowsnetworking.com/articles-tutorials/windows-2003/Windows2003-Distributed-File-System.html

GPFS (IBM General Parallel File System)

Object stores

Block devices

Computer data storage - Network storage
