Computer data storage - Network storage

Computer data storage
These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.


Filesystems

(with a focus on distributed filesystems)

http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_parallel_fault-tolerant_file_systems http://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems

NFS notes

See NFS notes

Relevant here: pNFS / PanFS / Panasas

SMB notes

See SMB, CIFS, Samba, Windows File Sharing notes


GlusterFS

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
Fault tolerance:  with replication there is a self-healing scrub operation 
Speed:            scales well. Metadata not distributed so some fs operations not as fast as read/write
Load balancing:   implied by striping, distribution
Access:           library (gfapi), POSIX-semantics mount (via FUSE), built-in NFSv3, or block device (since  2013)
Expandable:       yes   (add bricks, update configuration. with some important side notes, though)
Networking:       TCP/IP, Infiniband (RDMA), or SDP


CAP-wise it seems to be AP.


Pretty decent streaming throughput. Some metadata commands are slowish (so arguably good for large-dataset stuff, less ideal for many-client stuff).

Relatively easy to manage compared to various others - though resolving problems is reportedly more involved.

Fairly widely used, which implies a decent amount of support in the form of experience (forums, IRC, Red Hat support).


No authentication

You can use your general-purpose firewall to keep the wrong machines out, and/or
gluster itself can keep per-volume host whitelists, e.g.
gluster volume set volname auth.allow 192.168.1.*
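
For the firewall option mentioned above, a minimal iptables sketch; the subnet is an example, and the port numbers are assumptions based on common gluster defaults (24007 for the management daemon, 49152 and up for brick processes in recent versions), so check your version's documentation:

# allow the trusted subnet to reach glusterd and brick ports (ports assumed, verify for your version)
iptables -A INPUT -p tcp -s 192.168.1.0/24 --dport 24007 -j ACCEPT
iptables -A INPUT -p tcp -s 192.168.1.0/24 --dport 49152:49251 -j ACCEPT
# ...and refuse gluster traffic from everyone else
iptables -A INPUT -p tcp --dport 24007 -j DROP
iptables -A INPUT -p tcp --dport 49152:49251 -j DROP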


POSIX interface means it stores UIDs and GIDs, so you probably want to synchronize what those mean among participating hosts.

That probably means YP or similar.
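
As a quick sanity check (the hostnames and account name here are placeholders), you can verify that an account resolves to the same UID on every participating host:

# should print the same number on every node
for h in node1 node2 node3; do ssh "$h" id -u someuser; done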


Any host that runs glusterfsd is effectively a server. Each server can have one or more storage bricks (see terms), which can be dynamically included into one storage pool.

How to use bricks (e.g. stripe, mirror) is up to configuration, which is client-controlled, can be changed at will, and is part of a storage pool's shared state.

To deal (consistency-wise) with bricks that were temporarily offline, a daemon dealing with heal operations was introduced (before that it was more manual, i.e. you didn't want that to happen).
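
The heal state can be inspected, and a heal kicked off, from the gluster CLI; for example (the volume name is a placeholder):

gluster volume heal volname info    # list entries still needing healing
gluster volume heal volname         # trigger healing of pending entries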


On latency
Some operations (mostly on metadata) slow down in proportion to RTT: some because all servers involved in a volume must be contacted (for example self-heal checks, done at file open time), and some because they are sequential due to classical POSIX interface design (consider e.g. ls). The latter is true for any distributed filesystem's POSIX interface, and is also why some choose not to have one.

This also depends on the translators in place - consider for example what replication implies. It also means geo-replication using the basic replication translator is probably a bad idea (there is clever geo-replication you can use instead).


Terminology

storage pool - a trusted network of storage servers.

basically a cooperating set of hosts, which may cooperatively manage zero or more volumes

server - a host that can expose bricks to clients

client - will have a configuration file mentioning bricks on servers, and translators to use.

brick - typically corresponds to a distinct backing storage disk. Any particular network node may easily have a few. In more concrete terms, it is a directory on an existing local filesystem, exposed as usable by gluster.

translator - given a brick/subvolume, applies some feature/behaviour, and presents it as a subvolume

volume - the final subvolume, the one considered mountable

subvolume - a brick once processed by a translator

internal terminology you may not really need to care about

On translators

Translators are client-side (!), pluggable, stackable things, including:

Storage stuff:

  • distribute - different files go to different bricks
each file goes to just one brick, so this gives no data safety, only an apparently larger drive
sort of file-level RAID0
  • stripe - different parts of a file go to different bricks
like distribute but more fine-grained; often faster for concurrent and/or random access to large files
sort of block-level RAID0
  • replicate - stores every file on more than one brick (2 up to all, depending on settings)
sort of file-level RAID1 if set to all, something in between if not. (If you want more efficient use of your storage, look at things like RozoFS, which is like network RAID6)


And some functional things like

  • load balancing (between bricks)
  • Volume failover
  • Scheduling and disk caching
  • Storage quotas


Note that a brick is often just a backing directory with files.

In many cases (distribute, replicate) the files on a brick have a 1:1 correspondence to the files the volume presents (which can be nice for recovery, and maybe for some informed cheating).

In other cases (stripe) they do not.
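
For example (paths are hypothetical), on a distribute or replicate volume you can look at a brick directly on the server hosting it:

ls /data/brick1/volname
# shows regular files/directories matching (part of) what the mounted volume presents,
# plus gluster's internal .glusterfs directory - leave that one alone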

Getting started
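
Roughly, getting from nothing to a mounted volume looks like the sketch below (hostnames, volume name, and brick paths are examples); the subsections that follow cover the steps in more detail:

# on every node: install the gluster package(s); the package providing the 'gluster' CLI usually pulls in the server
# from any one node: form the trusted storage pool (prefer DNS names over DHCP'd IPs)
gluster peer probe server2
gluster peer probe server3
# create a volume from bricks, then start it
gluster volume create myvol replica 2 transport tcp server1:/data/brick1 server2:/data/brick1
gluster volume start myvol
# on a client: mount it
mount -t glusterfs server1:/myvol /mnt/myvol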

Peers within a storage pool
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

To see the list (and status) of peer servers:

gluster peer status


To add hosts to the storage pool:

gluster peer probe servername

You can do this from any current member.

Do a gluster peer status again to see what happened (on all nodes if you wish; if you don't use DNS you may find that the host you probed knows the prober only by IP. You may wish to do an explicit probe back from that host, just to make it realize it has a name).


You can detach peers too, though not while they are part of a volume.
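
For example (the server name is a placeholder):

gluster peer detach servername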


Creating volumes
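
The translators you want are settled when you create a volume, so figure that out first. It also helps sysadmin sanity to pick a short brick path, ideally identical on all nodes. A few sketches (server names, volume name, and paths are examples; without a replica/stripe count you seem to get a plain distribute volume, i.e. no redundancy):

# distribute (the default): files spread over bricks; larger apparent drive, no data safety
gluster volume create volname transport tcp server1:/exp1 server2:/exp2

# replicate: every file stored on two bricks
gluster volume create volname replica 2 transport tcp server1:/exp1 server2:/exp2

# stripe: parts of each file spread over two bricks
gluster volume create volname stripe 2 transport tcp server1:/exp1 server2:/exp2

# listing more bricks than the replica/stripe count needs gives the distributed
# variant, e.g. 'replica 2' over six bricks keeps two copies of each file

# a volume has to be started before it can be mounted
gluster volume start volname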

Mounting volumes

One-shot:

mount -t glusterfs host:/volname /mnt/point

fstab:

host:/volname /mnt/point glusterfs defaults,_netdev 0 0
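
You mount by a host:/volname URL; the client does not have to be a server itself. Mount options you can pass (via -o, or in the fstab options field) include things like the transport, a log file, and direct-IO mode; for example (values are illustrative):

mount -t glusterfs -o log-file=/var/log/gluster-mnt.log,direct-io-mode=disable host:/volname /mnt/point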



Maintenance, failure
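
If you can still read a brick but expect its drive to fail soon, you can migrate it to a new brick; if a brick is already gone and the volume had replication, you heal and/or reshuffle (if it had no replication, curse loudly). A rough sketch (volume, brick, and host names are placeholders, and exact syntax varies a bit between gluster versions):

# anticipated failure: migrate a still-readable brick to a new one
gluster volume replace-brick volname server2:/exp2 server5:/exp2 start

# brick already dead, volume was replicated:
gluster volume heal volname info          # see what is out of sync
gluster volume remove-brick volname server2:/exp2 start
gluster volume add-brick volname server5:/exp2
gluster volume rebalance volname start    # long-running; check with '... status'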

expand, shrink; migrate, rebalance
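
You can change a volume's layout while it is in use, but it needs some decisive attention from you: existing data is not reshuffled automatically, and new files in existing directories may keep going only to the old bricks until you fix the layout. A sketch (volume and brick names are placeholders):

# expand: add a brick, then update the layout (and optionally move data) across it
gluster volume add-brick volname server4:/exp1
gluster volume rebalance volname fix-layout start   # fix the layout only, don't move existing data
gluster volume rebalance volname start              # ...or migrate data too
gluster volume rebalance volname status             # these are long-running; poll progress

# shrink: removing a brick doesn't delete its files, but files that require it
# disappear from the mountpoint, so do this on replicated volumes and rebalance after
gluster volume remove-brick volname server4:/exp1 start
gluster volume remove-brick volname server4:/exp1 status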

See also

and/or TOREAD myself:

MooseFS

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Similar to Google File System, Lustre, Ceph

Fault tolerance: replication (per file or directory)
Speed: Striping (for more aggregate bandwidth)
Load balancing: Yes
Security: user auth, POSIX-style permissions

Userspace.

Fault tolerance: MooseFS uses replication; data can be replicated across chunkservers, and the replication ratio (N) is set per file/directory.
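
For example (from memory, so treat the tool names and syntax as an assumption to verify against the MooseFS docs), the per-file/per-directory replication goal is managed with the mfs client tools:

mfssetgoal -r 3 /mnt/mfs/important      # ask for 3 copies of everything under this directory
mfsgetgoal /mnt/mfs/important/somefile  # show the goal set on one file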

Easy enough to set up.

Hot-add/remove

Single metadata server?

http://en.wikipedia.org/wiki/Moose_File_System

LizardFS

Fork of MooseFS


Ceph FS

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
Fault tolerance: Replication (settings are pool-wide(verify)), journaling (verify).
Speed:           scales well, generally good
Load balancing:  implied by striping
Access:          POSIX-semantics mount, library, block device. Integrates with some VMs.
Expandable:      yes
Networking:      
License? Paid?   Open source and free. Paid support offered.

Seems to focus more on scalability, failure resistance, and some features useful in virtualization environments, and to some degree on easy management.

...at some cost of throughput in typical use, e.g. compared to gluster, though some of that can be mitigated with informed tuning.

Drive failure is dealt with well, so there is no critical replacement window as there is with RAID5, RAID6.

Common, apparently still leading gluster a bit.

Still marked as a work in progress with hairy bits - but quite mature in many ways. Opinion seems to be "a bit of a bother, but works very well".

Its documentation is not quite as mature yet, so it is not the easiest to set up.


Can be used as a block device as well as for files (verify)
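
For the block device side, the usual route is RBD; a rough sketch from memory (pool and image names are placeholders, and the exact commands should be checked against the Ceph docs for your release):

rbd create mypool/myimage --size 10240   # size in MB
rbd map mypool/myimage                   # exposes it as a /dev/rbd* device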


See also:


Lustre

Sheepdog

BeeGFS (previously Fraunhofer Parallel File System, FhGFS)

RozoFS

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Fault tolerance:  can deal with missing nodes
Speed:            seems good, and better than some on small IO
Load balancing:   distributed storage
Security:         
Access:           POSIX-like

Roughly: like gluster, but deals with missing nodes RAID-style (more specifically, an erasure coding algorithm).

Has a single(?) metadata server

SeaweedFS

MogileFS

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
Fault tolerance:  configurable replication
                  avoids single point of failure - all components can be run on multiple machines
Speed:            
Load balancing:   
Access:           
Expandable:       
Networking:       


Userspace (no kernel modules)

Files are replicated according to the wishes of the class they are in, so you can have different kinds of files be safer while saving disk space for things you could cheaply rebuild.


See also:


XtreemFS

HDFS

Gfarm

CXFS (Clustered XFS)

https://en.wikipedia.org/wiki/CXFS


pCIFS - clustered Samba

(with gluster underneath?) http://wiki.samba.org/index.php/CTDB_Setup


OCFS2

PVFS2

OrangeFS

OpenAFS

Tahoe-LAFS

DFS

http://www.windowsnetworking.com/articles-tutorials/windows-2003/Windows2003-Distributed-File-System.html


GPFS (IBM General Parallel File System)

Object stores

Block devices