Computer data storage - Partitioning and filesystems

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.


On 512B versus 4K (Advanced Format, AF) sectors

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

More technically

More practically

On Hard drive size barriers

In the past, there were some hardware constraints related to addressing.

Initially there were BIOS CHS and int13 limits -- 99% irrelevant these days, since you won't have a motherboard that old.



The one that you may still run into today is the 2.2TB limit.

This one is largely a software limit, due to 32-bit LBA: 2^32 times 512-byte sectors is 2.2TB.

Keep in mind that in some cases it's only the BIOS that won't understand the disk (showing it as a smaller size), while the OS driver actually talks to the controller properly.

So if it does show properly in the OS (and preferably you're not booting off it - that can be troublesome), you're likely fine.

If it does not show properly in the OS, it will typically show up as its real size modulo 2.2TB, e.g. 3TB as 0.8TB (3TB - 2.2TB), 6TB as 1.6TB (6TB - 4.4TB). When you see this, unplug it now and think very hard about your next step.
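
If you want to check what size the OS actually sees, on linux something like the following works (the device name is an example):

blockdev --getsize64 /dev/sdb    # size in bytes, as the kernel sees it
lsblk -d -o NAME,SIZE            # human-readable sizes for all disks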

Things can get funny when some part of the system mixes 512 and 4096-byte sectors - old OSes, drive firmware, USB-to-SATA controllers.

This can cause anything from

  • not being able to use more than a fraction of the drive, to
  • corrupting the drive when you plug it into something else (verify), to
  • being able to safely use a many-TB external USB disk on WinXP even though that would never work with an internal disk.


Software limit: addresses in partitioning

Even if the BIOS and OS don't have this problem, use of MBR-style partitioning still does, because its address fields are 32-bit, so the same 2^32 * 512-byte sectors gives the same 2.2TB. So with MBR you can safely use the first 2.2TB, but not the rest.

This would be a good reason to use GPT: GPT uses 64-bit LBA addressing.

You do need an OS that can read GPT. Initially it also mattered whether it could boot off GPT (mostly a WinXP-era concern: only the 64-bit variant of XP can read/write GPT disks, and no XP variant can boot from them).

Partitioning notes

Things like floppies are not partitioned, they are one volume.

CDs have their own standards.

USB Flash sticks are usually one-partition things. They can be partitioned, but it seems not every OS understands the result equally well.


Alignment

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Some recent tools align to 1MiB into the device, because that covers the cases below (and more) with a single (somewhat coarse) rule.

If you don't care about the details, that'll usually do. The only setups that may need further consideration are RAID setups with very large stripe blocks, and RAID arrays with many disks.


Bulk storage is done on block devices, which means the units involved in operations are rather larger than a byte.

Historically it made little difference, because most things used 512-byte units.

These days, aligning things (like partitions) to the underlying things can help avoid some unnecessary work, meaning the system can use the device a little bit more efficiently.

Think a few percent on performance benchmarks on drives, on RAID sometimes a little more, and on SSDs some small effect on longevity.

Recent platter drives (since approx. 2010) use Advanced Format, currently meaning 4KiB sector size. (They are still reported in units of 512-byte sectors because there is too much code that thinks that way. That will probably change in the pretty-long run.)

SSDs are more interesting, but an overly simple summary is that they think in 4KiB units as well. ...so for SSDs and recent platter drives you want to align to a multiple of 4KiB. Start on a (512-byte-)sector number that is divisible by 8.
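
One way to check existing partitions (assuming common tools; the device and partition number are examples):

# start sectors are in 512-byte units; divisible by 8 means 4KiB-aligned
# (the common default of 2048 = 1MiB qualifies)
fdisk -l /dev/sda

# or have parted do the check, here for partition 1
parted /dev/sda align-check optimal 1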


For RAID (in particular parity RAID) it is a good idea to align to the underlying stripe blocks (often 64KiB, 128KiB, or 256KiB), to avoid cases where a write has to do more work just because it touches a little data in the next underlying block.
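
Following the 1MiB rule mentioned above covers both the 4KiB and the common stripe sizes at once. A minimal sketch with parted (the device name is an example):

parted /dev/sdb mklabel gpt
# a 1MiB start is a multiple of 4KiB and of 64/128/256KiB stripes
parted /dev/sdb mkpart primary 1MiB 100%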





Partition styles

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

MBR

The historically common variant on PCs (from the DOS era) and still used, though there now is movement toward EFI GPT.

The Master Boot Record refers to the first 512 bytes of the disk, which contain basic boot code and a simple four-entry partition table (later extended to some degree, e.g. via extended partitions).

Can address at most 2TiB (~2.2TB): 32-bit addressing of 512-byte sectors, i.e. 2^32 * 2^9 = 2^41 bytes. For larger drives, most use GPT instead.




GPT (GUID Partition Table)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

GPT is part of the EFI standard that various OSes are now starting to follow.

Popular for large drives and RAID arrays, where MBR/msdos style is limited to 2TiB.


GPT can coexist with an MBR-style partition table, although this isn't necessarily a good idea.

It's arguably a good idea for OSes that understand MBR but not GPT, which may otherwise view the drive as unpartitioned and offer one-click formatting.

It can be a bad idea because the two can go out of sync. Disk utilities that don't know about both may alter only the MBR or only the GPT, which makes it very confusing to know what is actually on the disk.


Whether GPT disks are bootable depends

  • on whether the BIOS follows EFI or not
  • on whether the OS supports it


APM, Apple partition map



BSD disklabel



Others / unsorted

Practical partitioning notes

GPT

Related booting stuff

Partition recovery

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

There are some tools that can scan a disk for known filesystem types (usually by something recognizable that they start with), which can be useful when your partition table is, say, corrupted by a stupid utility or an administrator mistake.

These tools include testdisk and gpart.

I personally had more luck with testdisk than gpart, but this probably varies with the type of filesystems you lost.
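
For reference, basic invocations look like this (the device name is an example; testdisk is interactive, so expect menus):

testdisk /dev/sdb    # scan the whole device, not a single partition
gpart /dev/sdb       # guesses the partition table from filesystem signatures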

Filesystem choice

simple filesystems

ext2

OSes: *nix, mostly linux.


Journaling filesystems

ext3, ext4

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


reiser, reiser4

XFS

ZFS

See Computer data storage - ZFS notes (it was getting a bit too large for this page)

btrfs

OSes: *nix

Values features like robustness and scalability over speed.

Going in the same direction as ZFS (and sharing half its features already). If ZFS remains GPL-incompatible, this may become a realistic alternative to ZFS for the feature set both of those provide.

Probably the direction linux will take; ext3 and ext4 are extensions to ext2 but sort of dead-ended.

As of this writing (early 2014), still considered beta/experimental/"Not yet production ready", but it seems many of its developers run it as their primary filesystem, so seems to be well on its way.



HammerFS

Non-*nix

FAT (FAT32, FAT16, FAT12)

OSes: Most of them

FAT without a number regularly refers to FAT16 (or, on floppies, FAT12)

FAT16 itself has seen a history of not-always-compatible variants.

FAT32

  • Readable and writable by most versions of common OSes (Windows, OSX, Linux)
  • Max file size: 4GB
  • Max filesystem size: 2TB



NTFS

exFAT

New, proprietary, licensed (Microsoft).

Not all that related to FAT32 and earlier (though people informally call it FAT64). Partly meant as a useful filesystem for USB flash drives.

Basically something between FAT32 and NTFS: larger, deals a little better with large files, has ACLs, simpler than NTFS.


  • not yet widely supported (for media players you'd want FAT32)
    • Windows: Supported in Win7, WinCE6, Vista since SP1. WinServer2003 and XP only with a driver[1]
    • OSX since 10.6.5 (late 2010; updated versions of Snow Leopard)
    • Linux: FUSE version in beta, kernel version under development (verify)
  • Max filesystem size: 64ZB theoretical, 512TB recommended
  • Max file size: 16EB
  • Well suited to Flash (why?(verify))



Storage Spaces

Basically a thin-provisioned LVM, with redundancy via software RAID1 / RAID5 (block level), and which lets you mix different-sized drives.

Has its limitations, though.


ReFS

http://en.wikipedia.org/wiki/ReFS

More specific filesystem notes

fuse and fuseblk

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


In particular automounters may mount something as type fuse or fuseblk, which is basically a wrapper that hands the work over to be done in userspace, via FUSE.


Has a few of its own mount options.


Permission related:

  • uid, gid
  • umask
  • default_permissions
    • default behaviour (omitting this option) leaves checking to the underlying filesystem driver, regularly meaning no permission checking(verify)
    • use of this option means permission checking based on file mode (meaning? the one defaulted to / specified by (verify))


  • allow_other - instead of the default behaviour of restricting permissions to the mounting user, this allows other users too. (how exactly?)
  • allow_root - instead of allowing only the mounting user, allow them and root. Exclusive with allow_other.



(what are these default permissions, and/or what are they based on?(verify))
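
For illustration, mounting an NTFS volume via ntfs-3g (a FUSE filesystem) using a few of these options - the device, mount point, and uid/gid values here are examples:

ntfs-3g /dev/sdb1 /mnt/win -o allow_other,uid=1000,gid=1000,umask=022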



The 5% reserve

The ext[234] family of filesystems reserves 5% of capacity by default. Only (programs running as) root can use this space.

Inspect:

tune2fs -l /dev/sda1  | grep eserved
# that's in blocks. Multiply by the block size to get it in bytes

Change:

# probably easiest to do via a percentage (can be fractional, e.g. 0.5):
tune2fs -m 1 /dev/sda1


Before you change, you should know why having reserved space can be a good idea.


One reason is that it helps OS stability on system partitions: when a non-privileged process accidentally fills up the disk, it and all other non-privileged processes will be denied space, while system processes can still happily write. This may give the sysadmin ample time to notice and clean up before system processes stop or crash because their writes fail.


The other reason is fragmentation. Modern filesystems have clever allocators that try their best to avoid fragmentation. However, once a filesystem is very nearly full, the speed at which it fragments necessarily increases, simply because of the scarcity of the space left to give away.

The 5% reserve is a sort of blunt solution that lessens this to a degree. You are trading the ability to use that space for better performance in the longer run. Details still vary with use patterns, and with the relation of file size to drive size.

If you really dislike fragmentation, you may wish to use 10%, or even 20% or 30%. If this is a write-only archive disk where you don't care about performance, you can set it to 0.


Note that for large disks, 5% can be a lot, larger than justifiable by either of the above reasons. For example, on a 60TB RAID array, 5% is 3TB. That's probably overkill unless your files are typically hundreds of GB.
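
In such cases you can set the reserve as a fixed number of blocks rather than a percentage. A sketch (the device name and figures are made up; check your block size first):

tune2fs -l /dev/md0 | grep 'Block size'
# with 4096-byte blocks, 26214400 blocks = 100GiB reserved
tune2fs -r 26214400 /dev/md0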


There are other valid reasons to lower the reserve:

  • If it's a storage drive, the superuser reason doesn't really apply.
  • If it's long-term archiving, you can argue that fragmentation doesn't really matter.
  • If it's a scratch drive you could lower it, as long as you remember to clean it out completely every now and then.

Filename limitations (characters, length)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Notes below are for various filesystems. (For programmers like me, it's interesting to know what is safe for all common filesystems.)


Note that in many cases, the API you use is more restrictive than the backing filesystem, which can mean that strange things may exist without problems, and that you can (usually) rename them from unsafe to safe, but not from safe to unsafe. In my experience this has happened multiple times under NTFS, where I had nearly undeletable files.


Limitations from filesystem and/or APIs

FAT

  • maximum file/folder name length: up to 255 (UTF-16) characters with LFN, otherwise fixed at 8.3
  • maximum path length: not defined
  • Invalid characters:
    • ...in general:
      \/:*?"<>|
      , control characters (0x00..0x20 and 0x80)
    • Allowed in LFN but not in 8.3 names:
      + , . ; = [ ]
      and lowercase a-z

NTFS:

  • maximum file/folder name length: 255 (UTF-16) characters
  • maximum path length: 32767 UTF-16 characters - seems to be an arbitrarily imposed restriction (windows itself allowed you to create longer paths by moving into deeper directories - but that then confused programs)
  • Invalid characters:
    \/:*?"<>|
    and NUL (U+0000)

ext2 / ext3 / ext4:

  • Invalid characters:
    /
    (as per POSIX) and presumably NUL
  • file/folder names: 255 bytes


Windows:

  • doesn't like a space at the end of a filename, which can act up indirectly. For example, "test .zip" is a valid name, but trying to expand it to a new folder named "test " (with the trailing space) rather than "test" would be a problem.
  • doesn't like a period at the end of a filename
  • doesn't like a period at the start of a filename

Apple OS:

  • OS9 and the Carbon API only restrict
    :
  • OSX and the POSIX API do not allow
    /
  • File/folder length of 31 in OS9, 255 in OSX

POSIX:

  • disallows
    /
    (and NUL) and pretty much nothing else





More notes on invalid characters

In general, stay away from
\/:*?"<>|
to be fairly safe on windows, linux, and OSX.


There are further things that can cause problems (even if they're valid), such as in shell scripting (where it's usually damn hard to be completely safe).

These include:

  • newlines
  • control characters (ASCII bytes 0x01 through 0x1F, and 0x7F. The same set in Unicode codepoints)
  • backspace
  • tab
  • [
    and
    ]
    - usually fine
  • "
  • =
  • +
  • :
  • ;
  • ,
  • *
  • "
  • $
  • !
  • &
  • -
    at the start of a filename - shells and tools can interpret this as options. There are a few workarounds, such as the -- end-of-options convention (though not everything supports it), or prefixing ./ (see the examples after this list)
  • .
    in some places (Explorer doesn't like to rename to dotfiles, but otherwise seems not to mind them)


  • In web-publishing contexts there may be further restrictions but they are usually arbitrary software restrictions
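
The examples mentioned above - two ways to defuse a leading dash, and a safe way to iterate over arbitrary names (NUL separators survive newlines, spaces, and other shell-hostile characters):

rm -- -weirdname     # '--' ends option parsing (most tools support it)
rm ./-weirdname      # ...or make the name unambiguous with a path
find . -type f -print0 | xargs -0 ls -l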




lost+found

Huge filesystems - practicalities

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

for reference: [2]

  • GB is 1000^3, 10^9 (giga)
  • TB is 1000^4, 10^12 (tera)
  • PB is 1000^5, 10^15 (peta)
  • EB is 1000^6, 10^18 (exa)
  • ZB is 1000^7, 10^21 (zetta)


Partitioning

If you have a drive (or a RAID array) larger than 2TiB, ~2.2TB, then you probably want GPT partitioning instead of MBR/msdos-style.

To boot from GPT, you'll need UEFI support (both BIOS and OS?(verify)), which is getting more common but not yet ubiquitous.


Filesystem size limits

(note: These things change, this is 'at the time of writing' stuff)

JFS

  • apparently limited to 32PB
  • ...but there was one bug that made for problems for partitions larger than 32TB
  • if its fsck reports "Unrecoverable error writing M to /dev/sdb1", this seems to be a bug in jfsutils (1.1.12?), not in the filesystem data (verify)

XFS

  • limit is 16EB on 64-bit platforms, 16TB on 32-bit platforms
  • keep in mind xfs_check is memory-hungry in proportion to the amount of entries, although xfs_repair -n (dry run) is essentially equivalent and tends to work much better

ZFS

  • limit is 16EB

ext4

  • default: 32-bit addressing (*4KiB blocks = 16TiB)
    • Trying to mkfs.ext4 on something over 16TB yields "size of device /dev/sdb1 too big to be expressed in 32 bits using a blocksize of 4096"
  • You can use 64-bit addressing (*4KiB blocks = 64ZiB, though it seems it's currently 48-bit for 1EiB)
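
So for a larger-than-16TiB device you ask for 64-bit addressing explicitly at mkfs time (needs a reasonably recent e2fsprogs; the device name is an example):

mkfs.ext4 -O 64bit /dev/sdb1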

ext2 and ext3

  • depends on block size at mkfs time:
    • block sizes of 1KB make for 4/2TB, 2KB for 8TB, 4KB (the default?) for 16TB
    • ...and 8KB for 32TB, but this is only available on architectures with 8KB pages, so on Intel you're limited to 16TB(verify).

btrfs

Linux admin

Intro

First some background: The concept of a file on a filesystem typically means that you have

  • a chunk of data (in a large pool of space)
  • referenced by some unique number (in linux this is an inode)
  • a name in a directory tree referring to that inode


Most general-purpose filesystems have

  • directory entries (which can contain directory entries and file entries)
  • file entries, conceptually each (name, at_location, bunch_of_metadata)

This combination is enough to build what you think of as a tree.


You can use names to fetch the metadata (stat by name), find the location of the data (fetch inode by name), open the data there (open via name), and so on. (Open-by-inode is rarely implemented, as that would bypass permission checks.)

As filesystems typically entangle with other kernel-level subsystems, filesystem code is usually part of the kernel - or at least managed by it, as with FUSE's userspace filesystems.


The kernel exposes a relatively minimal set of syscalls, and libraries usually add to that, to make things like "open file by path" easier.

Usually the deepest you dive into any of this is when you want to walk directories (and perhaps avoid unnecessary stat() calls), understand the symlink/hardlink difference, and such.


Filesystems are databases

On symlinks and hardlinks

Note that the separation between name and inode can in theory allow for more than one file entry to point to the same inode, meaning two filenames can refer to the same stored data.

Since that's extra bookkeeping (in particular "only free the data after all names referring to it are removed"), a filesystem has to explicitly support this. These additional entries are called hardlinks.

(Note that they are not really links at all. Hardlinks were not called hardlinks until symlinks existed. Yes, you create the additional entries by referring to the first, but once they exist they are all created equal.)

Since inodes are unique only per filesystem, hardlinks cannot point to things on other filesystems.


Symlinks (a.k.a. softlinks) are a special type of entry in a directory. On a low level they just store the path string (in early implementations in the file data, now optionally in the metadata, which is faster). It's like a redirect, and most file/path libraries will read, resolve, and actually open the path it points to.

Symlinks work across filesystems, because they use path strings instead of inodes. For the same reason, a symlink is not necessarily valid at all times - e.g. after you delete the file it points to. There is no related filesystem bookkeeping.


The symlink path may be relative or absolute. Absolute can be a little more secure, but potentially more fragile. Consider e.g. what happens when the target is replaced or the symlink is moved around.


Symlinks do not have permissions of their own; they always appear as lrwxrwxrwx, and these permissions are not used. It seems that typical use of the symlink means the permissions of the target file is used, while access to the symlink itself (e.g. to alter it) is based on the permissions on the directory it is in(verify).


While there are clear technical differences, in practice there isn't much you can do with hardlinks that you can't do with symlinks.

There are practical details and security issues to both. For example, consider the combination with a chroot jail - symlinks won't work, hardlinks give access to content that before this was outside the jail (remember, equal status).
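
A quick demonstration of the difference (filenames are arbitrary):

echo hello > original
ln original hardname       # second directory entry for the same inode
ln -s original softname    # new entry that stores the path string
ls -li                     # first column: same inode for original and hardname
rm original
cat hardname               # still works: data survives while any name remains
cat softname               # fails: the path it stores no longer resolves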

There's also a significant detail to symlinks and trailing slashes:

rm symlinktodir         #removes the symlink
rm symlinktodir/        #implies the directory pointed to

If you meant to remove the symlink, but you added -rf out of (bad) habit, and path autocompletion added the slash, then you might've just thrown away a lot of data...



fstab


fstab specifies mount points and options. 'Mounting' can be read as "pretend contents are available here, at this mount point" (in fact masking whatever was there, but mount points are conventionally empty directories). This allows a single directory tree in which many filesystems can be positioned.


The /etc/fstab entries imitate the options to mount. Each entry pivots around the filesystem type: a specific mounter is chosen based on it, which is a program (mount.something), which gets the device, mount point, and options handed to it. (I wouldn't be surprised if there is a very simple rewrite)

#device / fs          mountpoint      fstype       options                      dump check
/dev/sda1             /               ext3         defaults,errors=remount-ro      0 1
none                  /dev/shm        tmpfs        defaults                        0 0
/dev/sda2             none            swap         sw                              0 0
/dev/hda              /media/cdrom0   udf,iso9660  user,noauto                     0 0
//192.168.6.7/ddrive  /mnt/music      smbfs        username=Administrator,noauto   0 0

The columns are:

  • Device: what to mount (some sort of pointer to where a filesystem sits)
  • Mount point: where to put it
  • filesystem type: how to interpret it
  • options: additional arguments to the mounter. Sometimes central to functionality, sometimes tweaks.

And the two numbers:

  • "dump": you can ignore this. Historically used by (tape) backup programs.
  • "check": determines whether (0 means don't) and the order in which (>1, increasing order) to do filesystem checks (fsck) at bootup time(verify).


Device

At one point this was typically a device file for a hard or floppy drive, almost always something in /dev, though these days it's "anything that the specified mounter understands".

Options now include:

  • device files, e.g.:
    • /dev/hd[letter][number], for example
      /dev/hda1
      (letters refer to drives, numbers to partitions. hdsomething indicates a classical parallel ATA device)
    • /dev/sd[letter][number], for example
      /dev/sda1
      (partition reference on SCSI or SATA, or PATA when accessed via libata)
    • LVM references
    • etc.
  • symlinks like
    /dev/cdrom
    -- usually themselves convenience links to a device file
  • a special case like
    none
    (e.g. for shm, which always just takes some RAM to create a RAM drive) or proc (which is a special case)
  • some sort of network reference, for example for SMB, NFS shares and such
  • UUID=longhexstring
    , compared to UUIDs in partitions. Lets you mount the same thing regardless of how it is connected (different cables/ports/interfaces, different device enumeration). Can be handy for external drives, but also for resistance to plugging in extra drives and such.
  • LABEL=something
    , compared to labels in partitions. Similar function and upsides to UUID. More readable, but a higher likelihood of human-caused collisions.

Note that the last two are useful when you have known partitions you want in a consistent place. In the case of removable media (CD/DVD/floppy), you don't want that.
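
For illustration, an fstab entry mounting by UUID (the UUID here is made up - substitute the one your tools report, see below):

UUID=0a1b2c3d-e4f5-6789-abcd-0123456789ab  /data  ext4  defaults  0  2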


Finding UUIDs for your filesystems:

  • gparted's 'Information' window shows them.

On the command line:

  • for ext2, ext3, ext4, use blkid (part of e2fsprogs):
    blkid /dev/sda1
    . It's also in the
    tune2fs -l
    output.
  • for XFS:
    xfs_admin -u /dev/sda1
    . Not automatically added at mkfs time. You can add a random one with
    xfs_admin -U generate /dev/sda1
  • for btrfs:
    btrfs filesystem show uuid /dev/sda1
  • for JFS: it's in the output of
    jfs_tune -l /dev/sdc1
    . Not added at mkfs time. You can add a random one with
    jfs_tune -U random /dev/sda1
  • for reiserfs: it's in the output of
    debugreiserfs /dev/sda1

Mount point

The place in your filesystem tree that this filesystem should become visible.

You always have one filesystem mounted as the root (/) (the system wouldn't have booted if you didn't).

Additional mounts are placed somewhere inside that root filesystem, usually in recognizable places like /mnt/something, /media/something, sometimes /cdrom and/or /floppy, and such. You don't have to, though - sometimes it can be handy to keep the same filesystem layout by, say, mounting a new data drive where an old data directory used to be.

Mounting at a specific directory masks the contents of the directory that is there. Mount point directories are usually kept empty for this reason.


Filesystem type

mount options

Mount options can be specific to the filesystem type, and their meaning may vary somewhat. Still, most of the options used in practice are shared between most filesystems (local and networked).

Defaults are often roughly equivalent to auto,nouser,rw,async,exec,suid,dev.

common mount options

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Options shared between many filesystems include:

  • ro, rw: Mount read-only or read-write. Default behaviour is often rw, though ro where it makes sense, e.g. cdroms.
  • sync, async: basically, whether to use a writeback cache(verify). Async is often the default.
    • sync means operations block until present on disk, which you may want for floppies (async on floppies would often delay most/all writes until you umount), and possibly for USB sticks
    • async implies a writeback cache, which is often faster, implies less waiting on operations, and can lengthen drive lifetime when there are a limited number of write cycles.


  • suid, nosuid: allow/disallow SUID/SGID bits to apply
  • exec (default), noexec: allow binaries to run from this partition. Not a security solution since binaries can be copied, but can be convenient.


  • who can mount/umount:
    • nouser (default): only root can mount/umount
    • owner: allow the device file's user to mount/umount (implies nosuid, nodev, but they can be overridden again)
    • user: everyone can mount, but when mounted (username written to mtab), only the mounting user can umount (implies noexec, nosuid, and nodev, but they can be overridden again)
    • users: all users can mount and umount
    • if you want a subset of users, perhaps look at sudoers


When to try what:

  • auto (default), noauto: whether or not to try mounting this when
    mount -a
    is run - which includes bootup.
noauto can make sense for CD and floppy drives, and for network mounts that are not always reachable.
  • fsck will skip fstab lines with non-existing devices and filesystem type auto,
    which sounds like it can sometimes make more sense than the two below:
  • nofail is about fscking:
    • if an entry is set to be fscked, but the device isn't present, this says to not consider that an error
    • so it won't halt boot because it couldn't do this fsck
    • seems a feature of various *nix filesystems, but not present in some others
    • (also means it will silently fail to mount in general)
    • will apparently still repeatedly try to mount, until a timeout (verify)
  • nobootwait is about mounting (and only on ubuntu's mountall? (verify)):
    • if an entry is set to be mounted, but cannot be for any reason, this says to not halt the boot
    • it seems that if the reason is a fsck, it will do the fsck in the background
    • which is not what you want for drives that a service would fail on if missing
    • though it can be preferable on large, non-boot-essential drives


The last two are mostly important for headless servers that you don't want hanging at bootup (particularly when you have no remote management).


Ubuntu's mountall isn't entirely the same and will act subtly differently - try to avoid it; it was meant to be temporary anyway.

In the meantime, the fairly-safe thing to do is to use both nofail and nobootwait, test this, and remove the options that the underlying filesystem fails on - e.g. xfs whines about nobootwait, so use nofail.

Also, avoid having services depend on disks that you use these on.
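
For example, an entry for a large, non-boot-essential data array might look like the following (device, mount point, and filesystem are examples; per the above, xfs gets nofail rather than nobootwait):

/dev/md0   /data   xfs   defaults,nofail   0   0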


Other shared mount options

atime

for context

There are three times associated with each directory entry:

  • mtime, the modification time
    • set whenever the file is written to, i.e. content changes
    • By default, ls -l shows the mtime.
  • ctime, the status change time (not creation time as is commonly assumed, though apparently this was true long ago)
    • is updated for content changes, and also inode changes: permission changes, ownership changes, hardlink changes
    • For uses like backup this is probably the most meaningful of the three. (The other two can be changed with as little as touch. It seems the only way to do so for ctime is to make and delete a hardlink)
    • ls -lc to see
    • (not all sources I've read mention that content change affects ctime. I should check whether there are systems where this is not true, but it seems you can assume it does.)
  • atime, the access time
    • set whenever the file is opened, read from, or written to (verify)
    • (only happens when you have write permissions?(verify))
    • ls -lu to see


If you want to inspect all of them, the stat utility is probably easiest:
# touch test && stat test
  File: ‘test’
  Size: 0               Blocks: 8          IO Block: 4096   regular file
Device: 812h/2066d      Inode: 147605508   Links: 1
Access: (0664/-rw-rw-r--)  Uid: (   33/www-data)   Gid: (   33/www-data)
Access: 2015-03-28 15:30:46.736127320 +0100
Modify: 2015-03-28 15:30:46.736127320 +0100
Change: 2015-05-29 14:58:54.206870022 +0200
 Birth: -

(apparently birth is actually recorded on ext, but not exposed in syscalls yet?(verify))

If you want to see this more often, e.g. have an ls variant, try e.g. something like this:

alias lsac='find . -maxdepth 1 -printf "%M %-10u %-10g %10s\tatime: %Ab %Ad %AH:%AM\tctime: %Cb %Cd %CH:%CM\t%P\n"'
Why you may not want it

Yes, atime means a disk-write operation for every file-open and file-read operation, even when the data comes entirely from the page cache.

Yes, atime means that under certain types of use, and particularly on platter drives, performance is going to take a hit, particularly when there are many small reads, or many files being read from.

Yes, this speed hit is basically unnecessary.


How big the speed hit is depends both on what the program actually does, and on how (and where) the backing filesystem stores (and journals) the metadata. (Which is also why postponing atime updates to flush them in batches does not help much - they still need to go to many locations.)

In the best case (IO load low enough that the filesystem can happily do its thing while the program does its own), the speed hit is negligible; in pathological cases you may become IO-bound purely because of atime, and may easily see ~50% off of throughput. For everyday use it might be perhaps ~10%, as an order of magnitude.


An important point is that very few things depend on atime being updated. Not doing atime updates usually lowers latency of disk operations, and implicitly helps throughput somewhat.

If you know you don't need it at all, or can work around it (e.g. mutt relies on it by default, but can be configured otherwise, and by now often is), you can use the noatime mount option on all relevant mounts.

For data drives it's often a no-brainer. For your system drive you can usually get away with it, but if you suspect some programs still use atime, then you probably want to use relatime instead - which these days is the default.


Mount options:

  • atime
    - asks for default behaviour. Implied if no atime-related option is given. Up to approx. 2009 this meant the behaviour described above. After ~2009, the default became relatime, and if you want strict POSIX compliant behaviour, ask for it with strictatime.
  • noatime
    - never update atime
  • nodiratime
    - Like noatime, but for directory entries only (...why?)
    • Implied by noatime, and has for some time now(verify)
  • strictatime
    - the 'update on everything' behaviour described above. The option to ask for this previously default behaviour was introduced around 2009, when the default moved to be relatime.
  • relatime
    (default since 2009ish) - Updates atime only if the file's current atime is older than its mtime or ctime (and, in more recent kernels, also when the atime is over a day old). This basically means atime is only updated on writes, not on reads.
    • Not strictly standards-compliant and all that, but breaks fewer things than noatime, and means fewer writes in mostly-read IO loads
    • Useful for apps that watch atime basically as if it were mtime, like mutt(verify)
    • Still breaks things that need to know whether a file was read since it was last modified. There are few of those (apparently tmpwatcher was one) and most of those have been fixed, or have known workarounds.


automounters

autofs

gnome-mount

Now replaced by udisks?

udisks2

Ubuntu

Filesystem notes and choice

See Computer data storage - Partitioning and filesystems


linux swap

Total effective swap and usage can be shown using free, top, and similar utilities.


To view currently active swap, use swapon -s, which basically reports the contents of /proc/swaps.


You can prepare a partition for swap use with something like
mkswap /dev/sda2

Note that you can also create a swap file (instead of a partition), which you would probably only do when you can't use a partition for some reason. Use of dd (from /dev/zero) is probably the easiest way to create the empty file.

To have swap mounted automatically at boot, add it to your fstab.

To activate without a reboot, do something like swapon /dev/sda2; to deactivate (which may take a while, and can even fail) use swapoff /dev/sda2. swapon -a activates all swaps listed in fstab.

(note: swapctl in some systems)


Use of swap is usually distributed, in that all swap devices are used (to try to minimize wait time while swapping). You can give priorities to individual swap devices (a number between 0 and 32k; higher means preferred), if you know something about their performance.
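
For example (the priority values are arbitrary):

swapon -p 10 /dev/sda2
# or persistently, in fstab:
#   /dev/sda2   none   swap   sw,pri=10   0 0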

fsck

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Partitioning

The hardcore, oldschool way of partitioning is fdisk.

The shinier, easier-to-use way is something like gparted (useful on a livecd / USB stick), which also makes it easier to do things like resizing partitions with data still on them.


Notes:

  • If you edit the partition table outside of fdisk (including some less-usual tools), it's useful to know about partprobe: it tells the kernel to rescan the partition table for a device, and update the device nodes.
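
For example (the device name is an example):

partprobe /dev/sdb      # ask the kernel to re-read the partition table
cat /proc/partitions    # see what the kernel now believes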




Semi-sorted

File sorting

  • ls: sorts alphabetically by default. Beware aliases. From the man page:
      Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
  • bash expansion: sorts alphabetically. From the man page:
      After word splitting, unless the -f option has been set, bash scans each word
      for the characters *, ?, and [. If one of these characters appears, then the
      word is regarded as a pattern, and replaced with an alphabetically sorted list
      of file names matching the pattern.


ext3/ext4 "bad entry in directory"

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #5556142: rec_len is smaller than minimal
- offset=0, inode=2553887680, rec_len=0, name_len=0
EXT4-fs error (device sdb1): htree_dirblock_to_tree:587: inode #2827419650: block 11309464893: comm
filename: bad entry in directory: rec_len % 4 != 0 - offset=0(32768), inode=3612842463, rec_len=32077,
name_len=85


At this time, the following is an untested theory:

fsck does not seem to find or correct these - so you have to delete the directory. I'm guessing you can't copy out the files, but I don't know. In my case this problem was so annoying I was glad enough to be rid of it...


On a mounted filesystem, find the directory by inode:

find /mnt/point -inum 5556142

This should give you the name of one directory. Remove it (probably meaning rm -rf)

Then umount, fsck -f until no errors are reported (which may be the first time), and mount again.

On large filesystems

See Partitioning#Huge_filesystems_-_practicalities


On filesystem slowness

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Filesystems will by nature be better at some operations than others.

In some specific uses, such as those in servers that have to store very large files, store very many files, or do a lot of creations and deletes, you'll find some filesystems slightly better tuned than others. The difference for everyday use is probably easily overvalued, though.


A common cause of reported filesystem slowness is very long directory listings (particularly if the underlying filesystem stores these as a simple list, not using an index or balanced-tree type structure), because each entry lookup takes time linear in the directory size. For most directories this isn't noticeable (and a complex balanced structure may even be slower for directories with up to a few dozen files), but operations you may think of as simple, say ls, can be pretty bad cases: ls goes through the list of items and stat()s each item as it lists it -- and each stat is itself a separate linear-time lookup -- combining to time quadratic in the number of entries.

Note that anything looking for a known filename will be only linear with number of directory entries, which itself stays acceptable much longer (but can still be faster).


One thing you occasionally see is programs dividing a load of items in a directory tree, to reduce the amount of items in each directory. There are a few different aspects to this:

  • can also be useful to avoid (filesystem/OS) limits imposed on the number of directory entries (you occasionally see a max of 64k entries in a directory)
  • means you add one (or more, with deeper structures) directory stat()s on each file operation. If you are opening a predetermined filename, this may sometimes even turn out slower than opening it from a large directory
  • may be nice to file browsers that want to look at these directories (windows explorer and various others can freeze up for minutes when looking at 20K+ items)

Common *nix filesystem organisation

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
This article is an attempt at a decent overview, introduction, summary, or such.

It may currently just contain a mess of information. If it is likely to keep sucking it will probably be removed later.

Perhaps-useful-to-knows

apparent size versus disk usage

Apparent size is the byte size of the file - the amount of data that, for example, cat filename would output.

Disk usage is the amount of bytes used on disk for this file, and is something reported by filesystem code.


The two will differ:

  • when the file size is not an exact multiple of the underlying block size, the last allocated block will only be partially used.
  • for sparse files, disk usage can be much lower
  • for fragmented files, disk usage is somewhat higher
  • use of indirect blocks[3]


Most shell utilities show apparent size.

Interestingly, du shows disk usage, at least by default.


If you want to find the difference, do a stat:

st_size is the apparent size
st_blocks*512 is the disk usage
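
With GNU stat you can pull out both at once (the filename is an example):

stat -c '%s bytes apparent, %b blocks of %B bytes allocated' somefile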


MBR backup and restore

(Note: This does not apply to GPT - those are larger)

When you install Windows, it overwrites the boot code in the MBR without asking. If you had a boot menu there, tough luck.

Being the first 512 bytes on a disk, it's pretty easy to back up beforehand and restore after.


Keep in mind that the partition table is also in there. If you change that during your windows install, you don't want to revert that afterwards.

With foresight, you can install windows first and linux after.

If you installed linux first and windows after, you'll probably either want to:

  • partition the way you want it beforehand, and make sure windows didn't change it anyway, or...
  • know how to install a new grub based on existing configuration (often using a liveCD and two or three commands, there are various pages about this).


Back up the MBR:

  • boot in some *nix
  • Back up the MBR (Master Boot Record), probably somewhere that you can easily get to later
dd if=/dev/hda of=/mbrbackup bs=512 count=1

Restore the MBR:

  • Boot from a *nix liveCD
  • get to the MBR backup (often: mount the linux partition that you saved it on)
  • write it back verbatim to restore the MBR:
dd if=/mnt/mylinux/mbrbackup of=/dev/hda bs=512 count=1


Note/warning: You read the MBR from and write the MBR to the drive device, not one of the partitions on it.
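
If you're unsure whether the partition table changed, you can restore only the boot code: the first 446 bytes of the MBR are boot code, followed by the 64-byte partition table and a 2-byte signature.

dd if=/mnt/mylinux/mbrbackup of=/dev/hda bs=446 count=1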


Note that if you didn't back up the MBR, tools like ms-sys let you write various generic MBRs (mostly windows ones).

Deleted file content removed only when file closed

It seems that in most filesystems, deleting an open file, in itself, removes only the directory entry.

The data is not freed on disk until the last process that has a handle to it closes that handle.


This can be useful to know in the context of large files and log files.

It can also be useful for temporary files. If you create, open, delete the file, no other process can open this file. You can still pass the file handle around. It's essentially an anonymous disk file that will be removed as soon as you close its handle.
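
A quick demonstration in bash (paths are examples):

echo hello > /tmp/demo
exec 3< /tmp/demo    # keep a read handle open
rm /tmp/demo         # the directory entry is now gone...
cat <&3              # ...but the data still reads fine
exec 3<&-            # closing the last handle frees the space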

volume management using LVM

See LVM notes

md: software RAID

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

One argument in the hardware RAID versus software RAID discussion is that once a hardware controller fails, you may have lost all your data - not because another controller won't be able to read it, but because you need the same brand and model: the actual on-disk storage methods are proprietary, usually differ per company, and often also over time.

Since it is easier to duplicate a software setup, that makes software RAID the safer, easier-to-restore option. This comes at the cost of doing all the data processing on the main CPU instead of on the RAID controller, introducing potential bottlenecks if you do it over PCI (bus saturation with RAID was pretty easy), and not really having the battery-backed flushing option (useful if you don't have a UPS).