Computer data storage - Network storage

📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.
Filesystems
(with a focus on distributed filesystems)
http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_parallel_fault-tolerant_file_systems
http://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
Distinctions around multiple disks - LVM, NAS, SAN, distributed filesystems, syncing storage, etc.
NFS notes
See NFS notes
Relevant here: pNFS / PanFS / Panasas
SMB notes
See SMB, CIFS, Samba, Windows File Sharing notes
GlusterFS
Fault tolerance: with replication there is a self-healing scrub operation
Speed: scales well. Metadata is not distributed, so some filesystem operations are not as fast as read/write.
Load balancing: implied by striping, distribution
Access: library (gfapi), POSIX-semantics mount (via FUSE), built-in NFSv3, or block device (since 2013)
Expandable: yes (add bricks, update configuration - with some important side notes, though)
Networking: TCP/IP, InfiniBand (RDMA), or SDP
CAP-wise it seems to be AP.
Pretty decent streaming throughput, though some metadata commands are slowish - so arguably good for large-dataset stuff, less ideal for many-client stuff.
Relatively easy to manage compared to various others, though resolving problems is reportedly more involved.
Fairly widely used, which implies a bunch of support in the form of experience (forums, IRC, Red Hat support).
No authentication.
- You can use your general-purpose firewall to keep the wrong machines out, or...
- gluster itself can keep per-volume host whitelists, e.g. gluster volume set volname auth.allow 192.168.1.*
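For example, a minimal sketch of restricting one volume to a subnet and then checking that the option took (volname and the address range are placeholders):

gluster volume set volname auth.allow 192.168.1.*
gluster volume info volname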
POSIX interface means it stores UIDs and GIDs, so you probably want to synchronize what those mean among participating hosts.
That probably means YP or similar.
Any host that runs glusterfsd is effectively a server. Each server can have one or more storage bricks (see terms), which can be dynamically included into one storage pool.
How to use bricks (e.g. stripe, mirror) is up to configuration, which is client-controlled, can be changed at will, and is part of a storage pool's shared state.
To deal (consistency-wise) with bricks that were temporarily offline, a daemon dealing with heal operations was introduced. (Before that, healing was more manual, a.k.a. you didn't want that to happen.)
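You can also inspect and trigger heals by hand. A minimal sketch, assuming a volume named volname (heal is a standard gluster subcommand):

gluster volume heal volname info
gluster volume heal volname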
- On latency
Some operations (mostly on metadata) slow down in proportion to network RTT. Some are slow because all servers involved in a volume must be contacted (for example the self-heal checks done at file open time), and some because they are sequential by nature of classical POSIX interface design (consider e.g. ls) - the latter is true for any distributed filesystem's POSIX interface, and is why some choose not to have one.
This also depends on the translators in place - consider, for example, what replication implies. It also means that geo-replication using the basic replication translator is probably a bad idea (there is cleverer geo-replication you can use instead).
Terminology
storage pool - a trusted network of storage servers.
- basically a cooperating set of hosts, which may cooperatively manage zero or more volumes
server - a host that can expose bricks to clients
client - has a configuration file mentioning bricks on servers, and the translators to use
brick - typically corresponds to a distinct backing storage disk. Any particular network node may easily have a few. In more concrete terms, it is a directory on an existing local filesystem location exposed as usable by gluster.
translator - given a brick/subvolume, applies some feature/behaviour, and presents it as a subvolume
volume - the final subvolume, the one that is considered mountable
subvolume - a brick once processed by a translator
- internal terminology you may not really need to care about
On translators
Translators are client-side (!), pluggable, stackable things, including:
Storage stuff:
- distribute - different files go to different bricks
- each file goes to just one brick, so this gives no data safety, only an apparently larger drive
- sort of file-level RAID0
- stripe - different parts of a file go to different bricks
- like distribute, but more fine-grained. Often faster for concurrent and/or random access to large files
- sort of block-level RAID0
- replicate - stores every file on more than one brick (2 to all, depending on settings)
- sort of file-level RAID1 (if set to all; something in between if not). (If you want more efficient use of your storage, look at things like RozoFS, which is like network RAID6)
And some functional things like
- load balancing (between bricks)
- Volume failover
- Scheduling and disk caching
- Storage quotas
Note that a brick is often just a backing directory with files.
In many cases (distribute, replicate) those will have a 1:1 correspondence to the files the volume presents (which can be nice for recovery, and maybe some informed cheating).
For other cases (stripe) they do not.
Getting started
Peers within a storage pool
To see the list (and status) of peer servers:
gluster peer status
To add hosts to the storage pool:
gluster peer probe servername
You can do this from any current member.
Do a gluster peer status again to see what happened (...on all nodes, if you wish; if you don't use DNS, you may find that the host you probed knows the prober only by IP. You may wish to do an explicit probe in the other direction, just to make it learn the name.)
You can detach peers too, though not while they are part of a volume.
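For example (servername is a placeholder):

gluster peer detach servername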
Creating volumes
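A minimal sketch, assuming two servers already in the pool and existing brick directories (all names here are placeholders). replica 2 mirrors every file onto both bricks; leaving it out gives plain distribute:

gluster volume create myvol replica 2 server1:/data/brick1 server2:/data/brick1
gluster volume start myvol
gluster volume info myvol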
Mounting volumes
One-shot:
mount -t glusterfs host:/volname /mnt/point
fstab:
host:/volname /mnt/point glusterfs defaults,_netdev 0 0
Maintenance, failure
expand, shrink; migrate, rebalance
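A minimal sketch of expanding and rebalancing (names are placeholders; add-brick, remove-brick, and rebalance are standard gluster subcommands):

gluster volume add-brick myvol server3:/data/brick1
gluster volume rebalance myvol start
gluster volume rebalance myvol status
gluster volume remove-brick myvol server3:/data/brick1 start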
See also
and/or TOREAD myself:
MooseFS
Similar to Google File System, Lustre, Ceph
Fault tolerance: replication (per file or directory)
Speed: striping (for more aggregate bandwidth)
Load balancing: yes
Security: user auth, POSIX-style permissions
Userspace.
Fault tolerance: MooseFS uses replication; data can be replicated across chunkservers, and the replication ratio (N) is set per file/directory.
Easy enough to set up.
Hot-add/remove
Single metadata server?
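A minimal sketch of per-directory replication goals, assuming the standard MooseFS client tools and a mount at /mnt/mfs (paths and the goal value are placeholders):

mfssetgoal -r 2 /mnt/mfs/somedir
mfsgetgoal /mnt/mfs/somedir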
http://en.wikipedia.org/wiki/Moose_File_System
LizardFS
Fork of MooseFS
Ceph FS
Fault tolerance: replication (settings are pool-wide (verify)), journaling (verify)
Speed: scales well, generally good
Load balancing: implied by striping
Access: POSIX-semantics mount, library, block device. Integrates with some VMs.
Expandable: yes
Networking:
License: open source and free; paid support is offered
Seems to focus more on scalability, failure resistance, some features useful in virtualization environments, and to some degree easy management.
...at some cost of throughput in typical use, e.g. compared to gluster, though some of that can be mitigated if you know what you are doing.
Drive failure is dealt with well, so there is no critical replacement window as there is with RAID5, RAID6.
Common, apparently still leading gluster a bit.
Still marked as a work in progress with hairy bits - but quite mature in many ways. Opinion seems to be "a bit of a bother, but works very well".
Its documentation is not quite as mature yet, so it is not the easiest to set up.
Can be used as a block device as well as for files (verify)
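A minimal sketch of the block-device side via RBD, assuming a running cluster and an existing pool called mypool (names and size are placeholders; rbd is Ceph's block-device tool). Once mapped, it shows up as a regular block device you can format and mount:

rbd create mypool/myimage --size 1024
rbd map mypool/myimage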
See also:
- http://en.wikipedia.org/wiki/Ceph_(file_system)
- http://www.anchor.com.au/blog/2012/09/a-crash-course-in-ceph/
- http://www.penguincomputing.com/blog/post/why-we-love-ceph
Lustre
Sheepdog
BeeGFS (previously Fraunhofer Parallel File System, FhGFS)
RozoFS
Fault tolerance: can deal with missing nodes
Speed: seems good, and better than some on small IO
Load balancing: distributed storage
Security:
Access: POSIX-like
Roughly: like gluster, but deals with missing nodes RAID-style (more specifically, via an erasure coding algorithm).
Has a single(?) metadata server
SeaweedFS
MogileFS
Fault tolerance: configurable replication avoids a single point of failure - all components can be run on multiple machines
Speed:
Load balancing:
Access:
Expandable:
Networking:
Userspace (no kernel modules)
Files are replicated (according to the wishes of the class they are in), so you can have different kinds of files be safer, while saving disk space on things you could cheaply rebuild.
See also:
XtreemFS
HDFS
Gfarm
CXFS (Clustered XFS)
https://en.wikipedia.org/wiki/CXFS
pCIFS - clustered Samba
(with gluster underneath?) http://wiki.samba.org/index.php/CTDB_Setup
OCFS2
PVFS2
OrangeFS
OpenAFS
Tahoe-LAFS
DFS