Computer data storage - Partitioning and filesystems

📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.
On 512B versus 4K (Advanced Format, AF) sectors

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

More technically

More practically

On Hard drive size barriers

In the past, there were some hardware constraints related to addressing.

Initial BIOS CHS and int13 stuff -- 99% irrelevant these days since you won't have a motherboard that old.



The one that you may still run into today is the 2.2TB limit.

This one is largely a software limit, due to 32-bit LBA: 2^32 times 512-byte sectors is 2.2TB (2199023255552 bytes).


Keep in mind that in some cases it's only the BIOS that won't understand the disk (showing it as a smaller size), while the OS driver actually talks to the controller properly.


So if it does show properly in the OS (and preferably you're not booting off it, as that can be troublesome), you're likely fine.

If it does not show properly in the OS, it will typically show up as its real size modulo 2.2TB, e.g. 3TB as 0.8TB (3TB - 2.2TB), 6TB as 1.6TB (6TB - 4.4TB). When you see this, unplug it now and think very hard about your next step.
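
The arithmetic behind the limit and the "modulo 2.2TB" symptom can be sketched as follows. This is only an illustration: the function names are made up here, and it assumes the truncating layer simply drops the high bits of the sector count.

```python
SECTOR_BYTES = 512   # logical sector size assumed by 32-bit LBA setups
LBA_BITS = 32        # width of the sector address
TB = 1000 ** 4       # decimal terabyte, as used on drive labels


def lba_limit_bytes(lba_bits=LBA_BITS, sector_bytes=SECTOR_BYTES):
    """Largest byte count addressable with the given LBA width."""
    return (2 ** lba_bits) * sector_bytes


def apparent_size_bytes(real_bytes, lba_bits=LBA_BITS, sector_bytes=SECTOR_BYTES):
    """Size reported when the high address bits are dropped (assumption):
    the real size modulo the addressing limit."""
    return real_bytes % lba_limit_bytes(lba_bits, sector_bytes)


print(lba_limit_bytes())                              # 2199023255552
print(round(apparent_size_bytes(3 * TB) / TB, 1))     # 0.8
print(round(apparent_size_bytes(6 * TB) / TB, 1))     # 1.6
```

So a 3TB drive that shows up as roughly 0.8TB is a strong hint that something in the chain is truncating to 32 bits.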

Things can get funny when some part of the system mixes 512 and 4096-byte sectors - old OSes, drive firmware, USB-to-SATA controllers.

This can cause anything from

not being able to use more than a fraction of the drive,
corrupting the drive when you plug it into something else (verify), to
being able to safely use a many-TB external USB disk on WinXP even though that would never work with an internal disk.


Software limit: addresses in partitioning

Even if the BIOS and OS don't have this problem, MBR-style partitioning still does, because its address fields are 32-bit, so the same 2^32 * 512-byte sectors gives the same 2.2TB. So with MBR you can safely use the first 2.2TB, but not the rest.

This would be a good reason to use GPT: GPT uses 64-bit LBA addressing.

You do need an OS that can read GPT. Initially it also mattered whether it could also boot off GPT (This is only an issue when you use WinXP, where only the 64-bit variant can read/write it, and no variant can boot from it).

Partitioning notes

Things like floppies are not partitioned, they are one volume.

CDs have their own standards.

USB Flash sticks are usually one-partition things. They can be partitioned, but it seems not every OS understands the result equally well.


Alignment

Some recent tools align to 1MiB into the device, because it covers block size and some further details with a single (somewhat coarse) rule.

If you don't care about the details, that's a perfectly good solution. The only setups that may need further consideration are RAID setups with very large stripe blocks, and RAID arrays with many disks.


Bulk storage is done on block devices, which means the units involved in operations are rather larger than a byte.

Historically it made little difference, because most things used 512-byte units.

These days, aligning things (like partitions) to the underlying block sizes can help avoid some unnecessary work, meaning the system can use the device a little bit more efficiently.

Think a few percent on performance benchmarks on drives, on RAID sometimes a little more, and on SSDs some small effect on longevity.

Recent platter drives (since approx. 2010) use Advanced Format, currently meaning 4KiB sector size. (They are still reported in units of 512-byte sectors because there is too much code that thinks that way. That will probably change in the pretty-long run).

SSDs are more interesting, but an overly simple summary is that they think in 4KiB units as well. ...so for SSDs and recent platter drives you want to align to a multiple of 4KiB. Start on a (512-byte-)sector number that is divisible by 8.
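
A quick way to check a partition's start sector against these rules, as a minimal sketch. The function name is hypothetical, and sector 63 (the old DOS-era default start) is mentioned only as a well-known misaligned example, not something from the notes above.

```python
SECTOR_BYTES = 512  # start sectors are still reported in 512-byte units


def is_aligned(start_sector, unit_bytes=4096, sector_bytes=SECTOR_BYTES):
    """True if the partition's byte offset is a multiple of unit_bytes."""
    return (start_sector * sector_bytes) % unit_bytes == 0


print(is_aligned(63))                            # False: old DOS-era default
print(is_aligned(64))                            # True: divisible by 8
print(is_aligned(2048))                          # True: the 1MiB rule
print(is_aligned(2048, unit_bytes=256 * 1024))   # True: also fits 256KiB stripe blocks
```

The last line shows why the 1MiB rule is a reasonable default: it is a multiple of 4KiB and of the common RAID stripe block sizes at the same time.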


For RAID (in particular parity RAID) it is a good idea to align to the underlying stripe blocks (often 64KiB, 128KiB, or 256KiB) to avoid cases where a write has to do more work just because it touches a little data in the next underlying block.




Partition styles

MBR

MBR is the historically common variant on PCs. It dates from the DOS era and is still used, though there is now a significant shift towards GPT.


The Master Boot Record refers to all the contents of the first 512 bytes of the disk, which contains

basic bootstrapping code
a simple four-entry partition table


That four-partition thing was later extended into a sort of linked-list-into-a-partition construction.(verify)

Because addresses in MBR are 32 bit, and it counts 512-byte sectors, this can refer to at most 2^41 bytes (2TiB, ~2.2TB), so for larger drives, most use GPT instead.


GPT (GUID Partition Table)

GPT is part of the EFI standard that various OSes are now largely following.

GPT is popular for large drives and RAID arrays because it is a necessity, given MBR/msdos's 2TiB limitation.


GPT can coexist with an MBR-style partition table. This is useful when you have multiple OSes installed, some of which understand only MBR. When you are not limited in this way you probably do not want this, because of the question of what to do when the two tables disagree. Using disk utilities that know about only one of them is one way to get them out of sync.


Whether GPT disks are bootable depends

  • on whether the BIOS follows EFI to this degree or not
  • on whether the OS supports this

These days, the answer to both is that most do.

...but this can be an interesting


There is also an extra footnote: the OS bootloader may install only for the mode the BIOS is currently set to, so if you later toggle it between EFI and non-EFI, things may stop booting. You may want to think about this briefly when doing a complete reinstall.


APM, Apple partition map


BSD disklabel


Others / unsorted

Practical partitioning notes

GPT

Related booting stuff

Names and labels

filesystem name

Most filesystems can store a name within that filesystem.

Setting (and one way of reading) the label will be a filesystem-specific tool, see e.g.:

https://wiki.archlinux.org/index.php/persistent_block_device_naming#by-label

Disks may be exposed via labels, e.g. in /dev/disk/by-label but note this is not necessarily unique so doesn't always make sense.


GPT partition label

When using GPT, you can also assign a name to each partition.

May be exposed e.g. via /dev/disk/by-partlabel


Keep in mind that if these names are not unique, the above symlinks will refer to only one device with that name.

Partition recovery

There are some tools that can scan a disk for known filesystem types (usually by something recognizable that they start with), which can be useful when your partition table is, say, corrupted by a stupid utility or an administrator mistake.

These tools include testdisk and gpart.

I personally had more luck with testdisk than gpart, but this probably varies with the type of filesystems you lost.

Filesystem choice

simple filesystems

ext2

OSes: *nix, mostly linux.


Journaling filesystems

ext3, ext4


reiser, reiser4

XFS

ZFS

See Computer data storage - ZFS notes (it was getting a bit too large for this page)

btrfs

OSes: *nix

Values features like robustness and scalability over speed.

Going in the same direction as ZFS (and sharing half its features already). If ZFS remains GPL-incompatible, this may become a realistic alternative to ZFS for the feature set both of those provide.

Probably the direction linux will take; ext3 and ext4 are extensions to ext2 but sort of dead-ended.

As of this writing (early 2014), still considered beta/experimental/"Not yet production ready", but it seems many of its developers run it as their primary filesystem, so seems to be well on its way.


HammerFS

Non-*nix

FAT (FAT32, FAT16, FAT12)

OS support: Most of them


The name (File Allocation Table) refers to there being one main table, kept in one place, that records which clusters belong to which file. (The relative simplicity of that implementation is why it's still used in some contexts)

If it contains a number it relates to the size of the addresses (and therefore also of the maximum size of the filesystem). Say, usually FAT12 on floppies, and FAT16 and later FAT32 on larger media.

Due to being common for a long time, there are many flavours of in particular FAT16, but also FAT32.


FAT12 (partition type 0x01)

effectively a 32 MiB limit, which was much more than most floppy formats needed
(eventually extended in such use, but then also replaced by the somewhat more practical FAT16(verify))

FAT16 (partition type 0x04)

was used so much for a while that it has seen a history of not-always-compatible variants.
initially effectively a 32MiB partition size limit (or 2GB in extended form?)

FAT16B (partition type 0x06 - to avoid trouble from earlier unaware FAT16 code)

'Final FAT16', introduced by Compaq
mostly like FAT16 but sector counts are 32-bit, not 16-bit
in part solves some of that early incompatibility

FAT32 (partition type 0x0b for CHS or 0x0c for LBA flavours)

  • Readable and writable by most versions of common OSes (Windows, OSX, Linux)
  • Max file size: 4GB
  • Max filesystem size:
Theoretically 2TB in specs - 2^32 512-byte sectors = 2TiB
...but due to there being flavours, some contexts limit this for other reasons (order of dozens of GByte)
  • Microsoft used to charge for implementations; the patent basis for that seems to have expired in 2013 so now you can implement it any which way

exFAT (partition type 0x07)



NTFS

exFAT

New, proprietary, licensed (Microsoft).

Not very related to FAT32 and earlier. Partly meant as a useful filesystem for USB flash drives.

Basically something between FAT32 and NTFS: larger, deals a little better with large files, has ACLs, simpler than NTFS.


  • not yet widely supported (for media players you'd want FAT32)
    • Windows: Supported in Win7, WinCE6, Vista since SP1. WinServer2003 and XP only with a driver[1]
    • OSX since 10.6.5 (late 2010; updated versions of Snow Leopard)
    • Linux: FUSE version in beta, kernel version under development (verify)
  • Max filesystem size: 64ZB theoretical, 512TB recommended
  • Max file size: 16EB
  • Well suited to Flash (why?(verify))

Storage Spaces

Basically a thin-provisioned LVM, with redundancy via software RAID1 / RAID5 (block level), and which lets you mix different-sized drives.

Has its limitations, though.


ReFS

http://en.wikipedia.org/wiki/ReFS

More specific filesystem notes

fuse and fuseblk


In particular automounters may mount something as type fuse or fuseblk, which is basically a wrapper that hands the work over to be done in userspace, via FUSE.


Has a few of its own mount options.


Permission related:

  • uid, gid
  • umask
  • default_permissions
    • default behaviour (omitting this option): permission checking is left to the underlying filesystem driver, which regularly means no permission checking(verify)
    • use of this option means permissions checking based on file mode. (meaning? the one defaulted to / specified by (verify))


  • allow_other - instead of the default behaviour of restricting permissions to the mounting user, this allows other users too. (how exactly?)
  • allow_root - instead of allowing only the mounting user, allow them and root. Exclusive with allow_other.



(what are these default permissions, and/or what are they based on?(verify))


The 5% reserve

The ext[234] family of filesystems reserve 5% of their capacity.

That is, only programs running as root can put files in this space.


To see:

tune2fs -l /dev/sda1 | grep -i reserved
# that's in blocks. Multiply by the block size to get it in bytes

To change:

# probably easiest to do via a percentage (can be fractional, e.g. 0.5):
tune2fs -m 1 /dev/sda1


Before you change this to 0, you should know when reserved space can be a good idea.


...mostly on partitions used by the system (and possibly services). It means that when a user does a weird thing that uses all space, they will be denied allocation before anything running as root will be denied.

That tends to mean system services will continue working as they can happily write a good while longer - rather than crash in a mess of IO errors. This gives the sysadmin time to notice and deal with this situation.


Another reason is fragmentation.

Modern filesystems have clever allocators that try their best to avoid fragmentation, yes, but once a filesystem is very nearly full, the speed at which it fragments necessarily increases regardless, largely because of the scarcity of the space left to give away.

The 5% reserve works surprisingly decently to avoid that (given how much of a blunt-edged solution it is). You are trading the ability to use space for better performance in the longer run. Details still vary with use patterns, and also the relation of file size and drive size.

If you really dislike fragmentation, you may wish to increase it to 10%, even 20% or 30%.


However:

  • On SSD, fragmentation is not the speed-reducer it is on platter disks.
It still makes some difference, but not a lot.
  • On disks used for archives, you can afford to not care about fragmentation or about the superuser reason, if it gives you a few hundred more GB of usable space.
  • On particularly large disks, 5% can be a lot, larger than justifiable by either of the above reasons.
For example, on a 60TB RAID array, 5% is 3TB. That's probably overkill unless your files are typically hundreds of GB - and probably even then.
  • on scratch disks you can also lower it (though you may still want to clean it out completely every now and then)
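
To get a feel for what a given percentage costs on a given filesystem, a small sketch of the arithmetic. The helper names are made up for illustration. The second function mirrors the "multiply by the block size" step for the block count that tune2fs reports.

```python
TB = 1000 ** 4


def reserved_bytes(fs_bytes, percent=5.0):
    """Bytes set aside for root at the given reserve percentage."""
    return int(fs_bytes * percent / 100)


def reserved_bytes_from_blocks(reserved_block_count, block_size):
    """Convert a reserved block count into bytes."""
    return reserved_block_count * block_size


print(reserved_bytes(60 * TB) / TB)        # 3.0
print(reserved_bytes(60 * TB, 1) / TB)     # 0.6
print(reserved_bytes_from_blocks(1000000, 4096))   # 4096000000
```

On the 60TB example, dropping from 5% to 1% frees 2.4TB while still leaving 0.6TB for root-owned services to keep writing.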

Filename limitations (characters, length)

Notes below are for various filesystems. (For programmers like me, it's interesting to know what is safe for all common filesystems.)


Note that in many cases, the API you use is more restrictive than the backing filesystem, which can mean that strange things may exist without problems, and you can rename them from unsafe to safe (usually), but not from safe to unsafe. In my experience, this has happened multiple times under NTFS, where I had nearly undeletable files.


Limitations from filesystem and/or APIs

FAT

  • maximum file/folder name length: up to 255 (UTF-16) characters with LFN, otherwise fixed at 8.3
  • maximum path length: not defined
  • Invalid characters:
    • ...in general: \/:*?"<>|, and control characters
    • Allowed in LFN but not 8.3 names: +,.;=[], lowercase a-z,

NTFS:

  • maximum file/folder name length: 255 (UTF-16) characters
  • maximum path length: 32767 UTF-16 characters - seems to be an arbitrary imposed restriction (windows itself allowed you to create longer paths by moving into deeper directories - but that then confused programs)
  • Invalid characters: \/:*?"<>| and NUL (U+0000)

ext2 / ext3 / ext4:

  • Invalid characters: / (as per POSIX) and presumably NUL
  • file/folder names: 255 bytes for all three


Windows:

  • doesn't like a space at the end of the filename (which can sometimes act up indirectly). For example, "test .zip" is valid, but trying to expand that to a new folder "test " (with the trailing space) instead of "test" would be a problem.
  • doesn't like a period at the end of a filename
  • doesn't like a period at the start of a filename

Apple OS:

  • OS9 and the Carbon API only restrict :
  • OSX and the POSIX API do not allow /
  • File/folder length of 31 in OS9, 255 in OSX

POSIX:

  • disallows / and pretty much nothing else




More notes on invalid characters

In general, stay away from \/:*?"<>| to be fairly safe on windows, linux, and OSX.
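
For programs that generate filenames, the rules collected in this section can be turned into a rough check. This is a sketch only: the function name and the messages are made up, and using 255 UTF-8 bytes as the length limit is a conservative assumption (ext counts bytes, NTFS counts UTF-16 units).

```python
UNSAFE_CHARS = set('\\/:*?"<>|')


def portable_name_problems(name):
    """Return a list of reasons why name may be unsafe on common filesystems."""
    problems = []
    if not name:
        problems.append("empty name")
    # Reserved characters plus ASCII control characters (below 0x20, and 0x7F)
    bad = sorted(c for c in set(name)
                 if c in UNSAFE_CHARS or ord(c) < 0x20 or ord(c) == 0x7F)
    if bad:
        problems.append("unsafe characters: " + " ".join(repr(c) for c in bad))
    if name.endswith((" ", ".")):
        problems.append("ends with a space or period (Windows dislikes this)")
    if name.startswith("-"):
        problems.append("starts with '-' (shells may read it as an option)")
    if len(name.encode("utf-8")) > 255:
        problems.append("longer than 255 bytes in UTF-8")
    return problems


for candidate in ["report.txt", "a:b?.txt", "test ", "-rf"]:
    print(repr(candidate), portable_name_problems(candidate))
```

An empty list means none of the checks fired, which is not a guarantee: device names reserved by Windows and application-specific rules are not covered here.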


There are further things that may be entirely valid but can cause other problems, such as in shell scripting (where it's usually hard to be completely safe).

These include:

  • newlines
  • control characters (ASCII bytes 0x01 through 0x1F, and 0x7F. The same set in Unicode codepoints)
  • backspace
  • tab
  • [ and ] - usually fine
  • "
  • =
  • +
  • :
  • ;
  • ,
  • *
  • "
  • $
  • !
  • &
  • - at the start of a filename - shells can interpret this as options. There are a few workarounds, such as the -- end-of-options marker (but not everything supports it), or prefixing the name with ./
  • . in some places (Explorer doesn't like to rename to dotfiles, but otherwise seems not to mind them)


  • In web-publishing contexts there may be further restrictions but they are usually arbitrary software restrictions


lost+found

Huge filesystems - practicalities

for reference: [2]

  • GB is 1000^3, 10^9 (giga)
  • TB is 1000^4, 10^12 (tera)
  • PB is 1000^5, 10^15 (peta)
  • EB is 1000^6, 10^18 (exa)
  • ZB is 1000^7, 10^21 (zetta)


Partitioning

If you have a drive (or a RAID array) larger than 2TiB, ~2.2TB, then you probably want GPT partitioning instead of MBR/msdos-style.

To boot from GPT, you'll need UEFI support (both BIOS and OS?(verify)), which is getting more common but not yet ubiquitous.


Filesystem size limits

(note: These things change, this is 'at the time of writing' stuff)

JFS

  • apparently limited to 32PB
  • ...but there was one bug that made for problems for partitions larger than 32TB
  • if its fsck reports "Unrecoverable error writing M to /dev/sdb1", this seems to be a bug in jfsutils (1.1.12?), not in the filesystem data (verify)

XFS

  • limit is 16EB on 64-bit platforms, 16TB on 32-bit platforms
  • keep in mind xfs_check is memory-hungry in proportion to the amount of entries, although xfs_repair -n (dry run) is essentially equivalent and tends to work much better

ZFS

  • limit is 16EB

ext4

  • default: 32-bit addressing (*4KiB blocks = 16TiB)
    • Trying to mkfs.ext4 on something over 16TB yields "size of device /dev/sdb1 too big to be expressed in 32 bits using a blocksize of 4096"
  • You can use 64-bit addressing (*4KiB blocks = 64ZiB, though it seems it's currently 48-bit for 1EiB)
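
The ext4 figures above follow directly from "number of addressable blocks times block size". A minimal sketch of that calculation, with a made-up function name:

```python
KiB = 1024
TiB = 1024 ** 4
EiB = 1024 ** 6
ZiB = 1024 ** 7


def max_fs_bytes(address_bits, block_bytes):
    """Upper bound on filesystem size for a given block-address width."""
    return (2 ** address_bits) * block_bytes


print(max_fs_bytes(32, 4 * KiB) // TiB)   # 16  (32-bit addressing)
print(max_fs_bytes(48, 4 * KiB) // EiB)   # 1   (48-bit addressing)
print(max_fs_bytes(64, 4 * KiB) // ZiB)   # 64  (full 64-bit addressing)
```

The same function reproduces the ext2/ext3 table below when given smaller block sizes, e.g. 32 bits with 1KiB blocks gives 4TiB.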

ext2 and ext3

  • depends on block size at mkfs time:
    • block sizes of 1KB makes for 4/2TB, 2KB for 8TB, 4KB (default?) for 16TB
    • ...and 8KB for 32TB, but this is only available on architectures with 8KB pages, so on Intel you're limited to 16TB(verify).

btrfs

Linux admin

Intro

First some background: The concept of a file on a filesystem typically means that you have

  • a chunk of data (in a large pool of space)
  • referenced by some unique number (in linux this is an inode)
  • a name in a directory tree referring to that inode


Most general-purpose filesystems have

  • directory entries (which can contain directory entries and file entries)
  • file entries, conceptually each (name, at_location, bunch_of_metadata)

This combination is enough to build what you think of as a tree.


You can use a name to fetch the metadata (stat by name), find the location of the data (fetch inode by name), open the data there (open via name), and so on. (open by inode is rarely implemented, as that would bypass permissions)

As filesystems typically need to tie into other kernel-level subsystems, filesystem code is often part of the kernel, or at least still managed by it, e.g. with FUSE's userspace filesystems.


The kernel exposes a relatively minimal set of syscalls, and libraries usually add to that, to make things like "open file by path" easier.

Usually the deepest you dive into any of this is when you want to walk directories (and perhaps avoid unnecessary stat() calls), understand the symlink/hardlink difference, and such.


Filesystems are databases

Text coding

On symlinks and hardlinks

See Filesystem_links_on_different_OSes#Linux

fstab

fstab specifies mount points and options. 'Mounting' can be read as "pretend contents are available here, at this mount point" (in fact masking whatever was there, but mount points are conventionally empty directories). This allows there to be a single directory tree in which many filesystems can be positioned.


The /etc/fstab entries imitate the options on mount. Each entry pivots around the filesystem type: a specific mounter is chosen based on it, which is a program (mount.something) that gets the device, mount point and options handed to it. (I wouldn't be surprised if there is a very simple rewrite)

#device / fs (what)   mountpoint (where)   fstype       options                      dump check
/dev/sda1             /                    ext3         defaults,errors=remount-ro      0 1
none                  /dev/shm             tmpfs        defaults                        0 0
/dev/sda2             none                 swap         sw                              0 0
/dev/hda              /media/cdrom0        udf,iso9660  user,noauto                     0 0
//192.168.6.7/ddrive  /mnt/music           smbfs        username=Administrator,noauto   0 0

The columns are:

  • Device: what to mount (some sort of pointer to where a filesystem sits)
  • Mount point: where to put it
  • filesystem type: how to interpret it
  • options: additional arguments to the mounter. Sometimes central to functionality, sometimes tweaks.

And the two numbers:

  • "dump": you can ignore this. Historically used by (tape) backup programs.
  • "check": determines whether (0 means don't) and the order in which (1 or higher, in increasing order) to do filesystem checks (fsck) at bootup time(verify).
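
The six columns can be pulled apart with a few lines of code, which also makes the structure explicit. A minimal sketch, with hypothetical names: it treats the last two fields as optional (defaulting to 0, as fstab(5) describes) and ignores the \040-style escapes that real fstab files use for spaces in paths.

```python
from collections import namedtuple

FstabEntry = namedtuple("FstabEntry",
                        "device mountpoint fstype options dump check")


def parse_fstab_line(line):
    """Parse one fstab line; returns None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields: %r" % line)
    # dump and check may be omitted, in which case they count as 0
    fields += ["0"] * (6 - len(fields))
    device, mountpoint, fstype, options, dump, check = fields[:6]
    return FstabEntry(device, mountpoint, fstype,
                      options.split(","), int(dump), int(check))


entry = parse_fstab_line(
    "/dev/sda1  /  ext3  defaults,errors=remount-ro  0 1")
print(entry.fstype, entry.options, entry.check)
```

Splitting on whitespace is enough here because fstab columns are separated by any run of spaces or tabs.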


Device

At one point this was typically a device file for a hard or floppy drive, almost always something in /dev, though these days it's "anything that the specified mounter understands".

Options now include:

  • device files, e.g.:
    • /dev/hd[letter][number], for example /dev/hda1 (letters refer to drives, numbers to partitions. hdsomething indicates a classical parallel ATA device)
    • /dev/sd[letter][number], for example /dev/sda1 (partition reference on SCSI or SATA, or PATA when accessed via libata)
    • LVM references
    • etc.
  • symlinks like /dev/cdrom -- usually themselves convenience links to a device file
  • a special case like none (e.g. for shm, which always just takes some RAM to create a RAM drive) or proc (which is a special case)
  • some sort of network reference, for example for SMB, NFS shares and such
  • UUID=longhexstring, compared to UUIDs in partitions. Lets you mount the same thing regardless of how it is connected (different cables/ports/interfaces, different device enumeration). Can be handy for external drives, but also for resistance to plugging in extra drives and such.
  • LABEL=something, compared to labels in partitions. Similar function and upsides to UUID. More readable, more likeliness of human-caused collision.

Note that the last two are useful when you have known partitions you want in a consistent place. In the case of removable media (CD/DVD/floppy), you don't want that.


Finding UUIDs for your filesystems:

  • gparted's 'Information' window shows them.

On the command line:

  • for ext2, ext3, ext4, use blkid (originally part of e2fsprogs, now in util-linux): blkid /dev/sda1. It's also in the tune2fs -l output.
  • for XFS: xfs_admin -u /dev/sda1. Not automatically added at mkfs time. You can add a random one with xfs_admin -U generate /dev/sda1
  • for btrfs: btrfs filesystem show /dev/sda1 (the uuid is in its output)
  • for JFS: it's in the output of jfs_tune -l /dev/sdc1. Not added at mkfs time. You can add a random one with jfs_tune -U random /dev/sda1
  • for reiserfs: it's in the output of debugreiserfs /dev/sda1

Mount point

The place in your filesystem tree that this filesystem should become visible.

You always have one thing as the root (/) (the system wouldn't have booted if you didn't).

Additional mounts are placed somewhere inside that root filesystem, usually in recognizable places like /mnt/something, /media/something, sometimes /cdrom and/or /floppy, and such. You don't have to. Sometimes it can be handy to keep the same filesystem layout by, say, mounting a new data drive where an old data directory used to be.

Mounting at a specific directory masks the contents of the directory that is there. Mountpoint directories are usually empty for this reason.


Filesystem type

mount options

Mount options can be specific to the filesystem type, and their meaning may vary somewhat. Still, most of the options used in practice are shared between most filesystems (local and networked).

Defaults are often roughly equivalent to auto,nouser,rw,async,exec,suid,dev.

common mount options

Options shared between many filesystems include:

  • ro, rw: Mount read-only or read-write. Default behaviour is often rw, though ro where it makes sense, e.g. cdroms.
  • sync, async: basically, whether to use a writeback cache(verify). Async is often the default.
    • sync means operations block until present on disk, which you may want for floppies (async on floppies would often delay most/all writes until a few minutes later, or until you umount if that's sooner), and possibly for USB sticks
    • async implies a writeback cache, which is often faster, implies less waiting on operations, and can lengthen drive lifetime when there are a limited number of write cycles.


  • suid, nosuid: allow/disallow SUID/SGID bits to apply
  • exec (default), noexec: allow binaries to run from this partition. Not a security solution since binaries can be copied, but can be convenient.


  • who can mount/umount:
    • nouser (default): only root can mount/umount
    • owner: allow the device file's owner to mount/umount (implies nosuid, nodev, but they can be overridden again)
    • user: everyone can mount, but when mounted (username written to mtab), only the mounting user can umount (implies noexec, nosuid, and nodev, but they can be overridden again)
    • users: all users can mount and umount
    • if you want to allow a subset of users to mount, look at sudoers


When to try what:

  • auto (default), noauto: whether or not to try mounting this when mount -a is run - which includes bootup.
since trying this early may delay or halt boot, noauto can make sense for CD and floppy drives, and for network mounts that are not always reachable.
  • fsck will skip fstab lines with non-existing devices and filesystem type auto
which sounds like this can sometimes make more sense than the two below:
  • nofail is about fscking
if an entry is set to be fscked, but isn't currently present, this says to not consider that an error
so it won't halt boot because it couldn't do this fsck
(also means it will silently fail to mount in general)
will apparently still repeatedly try to mount, until a timeout (verify)
seems a feature of various *nix filesystems, not present in some others
  • nobootwait is about mounting (and only on ubuntu's mountall? (verify))
if an entry is set to be mounted, but cannot be for any reason, this says not to halt the boot
it seems that if the reason is a fsck, it will do so in the background
which is not what you want for drives that a service would fail on if missing
yet can be preferable on large, non-boot-essential drives


The last two are mostly important for headless servers that you don't want hanging at bootup (particularly when you have no remote management).


Ubuntu's mountall isn't entirely the same and will act subtly differently - try to avoid it, it was meant to be temporary anyway.

In the meantime, the fairly-safe thing to do is to use both nofail and nobootwait, test this, and remove the options that the underlying filesystem fails on - e.g. xfs whines about nobootwait, so use nofail.

Also, avoid having services depend on disks that you use these on.

Other mount options used in ≥2 filesystems

  • _netdev
man page:
"The filesystem resides on a device that requires network access 
(used to prevent the system from attempting to mount these filesystems
 until the network has been enabled on the system)."
The option is ignored by mount commands.
the option was intended for initscripts, which could be told "mount all local filesystems"
initscripts and other mounter tools will generally
either whitelist known local filesystems, or blacklist known network filesystems plus anything that says _netdev
know 90% of filesystem types you use, so _netdev is usually only necessary for very new things, or edge cases like network block filesystems (think iSCSI) - but also can't hurt


atime

for context

There are three times associated with each directory entry:

  • mtime, the modification time
    • set whenever the file is written to, i.e. content changes
    • By default, ls -l shows the mtime.
  • ctime, the status change time (not creation time as is commonly assumed, though apparently this was true long ago)
    • is updated for content changes AND
    • is updated for just-inode changes: permission changes, ownership changes, hardlink changes
    • For uses like backup this is probably the most meaningful of the three. (The other two can be changed with as little as touch. It seems the only way to do so for ctime is to make and delete a hardlink)
    • ls -lc to see
    • (not all sources I've read mention that content change affects ctime. I should check whether there are systems where this is not true, but it seems you can assume it does.)
  • atime, the access time
    • set whenever the file is opened, read from, or written to (verify)
    • (only happens when you have write permissions?(verify))
    • ls -lu to see
    • various filesystems let you disable updating atime for IO-efficiency reasons


Those are the more standard classics. There is also

  • birth, or crtime (depending on context)
which is recorded on some filesystems (e.g. ext)
but not exposed in syscalls yet, so even if it's there on disk utilities like stat may not actually show it
debugfs should


If you want to inspect the classics, the stat utility is probably easiest

# touch test && stat test
  File: ‘test’
  Size: 0               Blocks: 8          IO Block: 4096   regular file
Device: 812h/2066d      Inode: 147605508   Links: 1
Access: (0664/-rw-rw-r--)  Uid: (   33/www-data)   Gid: (   33/www-data)
Access: 2015-03-28 15:30:46.736127320 +0100
Modify: 2015-03-28 15:30:46.736127320 +0100
Change: 2015-05-29 14:58:54.206870022 +0200
 Birth: -
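
The same three values are also available programmatically. A minimal sketch, assuming a POSIX-like system and Python's standard os.stat; the helper name file_times and the throwaway temp file are made up for illustration. It shows that a chmod, being an inode-only change, moves ctime but leaves mtime alone:

```python
import os
import tempfile
import time


def file_times(path):
    """Return atime, mtime and ctime of path, in seconds since the epoch."""
    st = os.stat(path)
    return {"atime": st.st_atime, "mtime": st.st_mtime, "ctime": st.st_ctime}


# Create a throwaway file to experiment on.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

before = file_times(demo_path)
time.sleep(0.05)              # give the clock a chance to advance
os.chmod(demo_path, 0o600)    # metadata-only change: touches ctime, not mtime
after = file_times(demo_path)
os.remove(demo_path)

print("mtime unchanged:", after["mtime"] == before["mtime"])
print("ctime moved by :", after["ctime"] - before["ctime"], "seconds")
```

How far ctime moves depends on the filesystem's timestamp granularity, so on coarse filesystems the difference can print as 0.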


If you want to see more than one of these times at once, try an alias like this:

alias lsac='find . -maxdepth 1 -printf "%M %-10u %-10g %10s\tatime: %Ab %Ad %AH:%AM\tctime: %Cb %Cd %CH:%CM\t%P\n"'


For crtime on linux, where stat doesn't show it you have to be somewhat indirect (for a long time it was unclear whether it would be adopted, because it was unclear people even wanted it). It's multiple steps, roughly debugfs -R "stat <`stat -c %i FILENAME`>" DEVNAME, so you probably need a small script for that (or e.g. the bash function mentioned here)

Why you may want atime disabled

The behaviour alternatives:

  • strictatime
implies a metadata write for every file open/read (even reads when the data comes entirely from the page cache)
which means more disk operations
is POSIX compliant
but almost no programs use this behaviour
the default while mounting before ~2009 kernels(verify)
  • relatime
(basically) atime is updated only on the first access after the file changed (and, since 2.6.30, when the stored atime is more than a day old)
...which means mostly-read loads cause almost no extra metadata writes...
and doesn't break (most of) the few programs that do count on atime getting updated
the default while mounting since ~2009 kernels(verify)
  • noatime
never update atime field


The reason to have a discussion is that since atime goes to the file metadata, this is easily in a different location from the data you just accessed, which means extra IO operations (and on platter extra seeks).

This varies with access patterns - are you doing lots of small reads or a big one? Is the IO subsystem able to merge the small ones?

On some filesystems it can help a lot to postpone the metadata updates and write them in batches, on others it may not.



The size of the speed hit from strictatime, and so the improvement from noatime or relatime, depends on lots of things.

It's suggested that in pathological cases you can take off 50% of throughput, while on most everyday cases you take off 10%, possibly less.

...but regardless of how large the speed hit is and how much you can alleviate it, it should be pointed out very few things depend on atime being updated in the first place.


If you know you don't need it at all, or can work around it (e.g. mutt historically relies on it by default, but can be configured otherwise and now often is) you can use the noatime mount option on all relevant mounts.

For data drives it's often a no-brainer. For your system drive you can usually get away with it, but if you suspect some programs still use atime, then you probably want to use relatime instead - which these days is the default.


Mount options:

  • atime - default, and asks for system-default behaviour. Before ~2009 this meant the update-always behaviour described above. After ~2009 it meant relatime.
  • strictatime - the 'update on everything' behaviour described above. The option to ask for this previously default behaviour was introduced around 2009, when the default moved to be relatime.
  • relatime (default since 2009ish) - Updates atime only if the file's current atime is not newer than its mtime or ctime (ctime should normally be ≥ the mtime, because content changes update both), or, since 2.6.30, if the current atime is more than a day old.
    • This basically amounts to "atime is updated on the first read after a change, not on every read"
    • Not strictly standards-compliant and all that, but breaks fewer things than noatime, and means fewer writes in mostly-read IO loads
    • Useful for apps that compare atime against mtime to see whether a file was read since it last changed, like mutt(verify)
    • Still breaks things that need to know exactly when a file was last read. There are few of those (apparently tmpwatch was one) and most of those have been fixed, or have known workarounds.
  • noatime - never update atime
  • nodiratime - Like noatime, but for directories only (...why?)
    • Implied by noatime, and has been for some time now(verify)
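
The relatime decision is simple enough to write out. A minimal sketch, assuming the rule is "update if mtime or ctime is at or after the stored atime, or if the stored atime is at least a day old"; the function name and the plain-seconds timestamps are made up for illustration, and this mimics the logic rather than asking a real kernel:

```python
DAY = 24 * 60 * 60  # seconds


def relatime_needs_update(atime, mtime, ctime, now):
    """Would a read at time `now` update atime under relatime? (all in seconds)"""
    if mtime >= atime:        # content changed since the last recorded access
        return True
    if ctime >= atime:        # inode changed since the last recorded access
        return True
    if now - atime >= DAY:    # stored atime is at least a day stale
        return True
    return False


# First read after a write: atime gets updated.
print(relatime_needs_update(atime=100, mtime=200, ctime=200, now=300))
# Second read shortly after: atime is already newer, so no metadata write.
print(relatime_needs_update(atime=300, mtime=200, ctime=200, now=400))
# Same file read again more than a day later: updated after all.
print(relatime_needs_update(atime=300, mtime=200, ctime=200, now=300 + DAY))
```

Under strictatime the function would simply always return True, and under noatime always False.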

See also

automounters

autofs

gnome-mount

Now replaced by udisks?

udisks2

Ubuntu

systemd

linux swap

Total effective swap and usage can be shown using free, top, and similar utilities.


To view currently active swap, use swapon -s, which basically reports the contents of /proc/swaps.
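
If you want those numbers in a script, /proc/swaps is easy to parse. A minimal sketch, assuming the usual five columns (Filename, Type, Size, Used, Priority, with sizes in KiB); the function name and the sample text are made up for illustration, on a real system you would pass in the contents of /proc/swaps:

```python
def parse_swaps(text):
    """Parse /proc/swaps-style text into a list of dicts (sizes in KiB)."""
    entries = []
    for line in text.splitlines()[1:]:      # first line is the column header
        fields = line.split()
        if len(fields) < 5:                 # skip blank or unexpected lines
            continue
        name, kind, size, used, priority = fields[:5]
        entries.append({
            "name": name,
            "type": kind,
            "size_kib": int(size),
            "used_kib": int(used),
            "priority": int(priority),
        })
    return entries


# Illustrative sample, not taken from a real machine.
SAMPLE = (
    "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"
    "/dev/sda2                               partition\t2097148\t1024\t-2\n"
    "/swapfile                               file\t\t1048572\t0\t-3\n"
)

swaps = parse_swaps(SAMPLE)
total_kib = sum(entry["size_kib"] for entry in swaps)
print(len(swaps), "swap areas,", total_kib, "KiB in total")
```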


You can prepare a partition for swap use with something like mkswap /dev/sda2

Note that you can also create a swap file (instead of partition), which you would probably only do when you can't use a partition for some reason. Use of dd (from /dev/zero) is probably the easiest way to create a zero-filled file of the size you want.


To have swap enabled automatically at boot, add it to your fstab. To activate without a reboot, do something like swapon /dev/sda2. To deactivate (may take a while, and even fail), do swapoff /dev/sda2.

swapon -a mounts all swaps listed in fstab.

(note: swapctl in some systems)


When several swap devices have the same priority, use is distributed over all of them (to try to minimize wait time while swapping); otherwise higher-priority devices are used first. You can give priorities to individual swap devices (a number between 0 and 32k, higher is preferred), if you know something about their performance.

fsck

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Partitioning

The hardcore, oldschool way of partitioning is fdisk.

The shinier, easier-to-use way is something like gparted (useful on a livecd / USB stick), which also makes it easier to do things like resizing partitions with data still on them.


Notes:

  • If you edit the partition table outside of fdisk (including some less-usual tools), it's useful to know about partprobe: it tells the kernel to rescan the partition table for a device, and update the device nodes.




Semi-sorted

more mount

bind mounts

A bind mount makes a directory from an already-mounted filesystem appear at a second location, as if it were a mounted filesystem there (note: --bind is basically a shorthand for -o bind(verify))

mount -o bind olddir newdir


This seems primarily useful

when symlinks are out because a program doesn't like them,
when hardlinks are out because it's cross-filesystem
with isolation (think containers) -- see also --make-*


There's also --rbind, which also looks for mounts under that path and binds those too.

This e.g. makes sense when making / appear elsewhere, because if things like /boot, /var, /usr, /home are separate mounts, you probably want those too.



--make-*

Since 2.6.15

This largely comes from implications of namespace isolation (think containers and such)

Say you've shared /container/mnt/ to some containers. If you later mount something in /container/mnt/ on the host, you probably want it shown in containers - but namespace isolation means that specifically won't happen and you would have to mount it in each.

What you'd want is the controlled option to have this propagate in.


https://lwn.net/Articles/689856/

--move

You can --move a mountpoint elsewhere




File sorting

  • ls: ls sorts alphabetically by default.
Beware aliases.
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.       
  • bash expansion: sorts alphabetically
After  word  splitting,  unless  the -f option has been set, bash scans
each word for the characters *, ?, and [.  If one of  these  characters
appears,  then  the word is regarded as a pattern, and replaced with an
alphabetically sorted list of file names matching the pattern.


ext3/ext4 "bad entry in directory"

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #5556142: rec_len is smaller than minimal
- offset=0, inode=2553887680, rec_len=0, name_len=0
EXT4-fs error (device sdb1): htree_dirblock_to_tree:587: inode #2827419650: block 11309464893: comm
filename: bad entry in directory: rec_len % 4 != 0 - offset=0(32768), inode=3612842463, rec_len=32077,
name_len=85


At this time, the following is an untested theory:

fsck does not seem to find or correct these - so you have to delete the directory. I'm guessing you can't copy out the files, but I don't know. In my case this problem was so annoying I was glad enough to be rid of it...


On a mounted filesystem, find the directory by inode:

find /mnt/point -inum 5556142

This should give you the name of one directory. Remove it (probably meaning rm -rf)

Then umount, fsck -f until no errors are reported (which may be the first time), and mount again.


On filesystem slowness

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Filesystems will by nature be better at some operations than others.

In some specific uses, such as those in servers that have to store very large files, store very many files, do a lot of creations and deletes or such, you'll find some filesystems slightly better tuned than others. The difference to everyday use is probably easily overvalued, though.


A common cause for reported slowness caused by filesystems comes from very long directory listings (particularly if the underlying filesystem stores this in a simple list, not using an index or balanced-tree type thing), because each entry lookup takes an amount of time linear in the directory size. For most directories this isn't noticeable (and a complex balanced structure may even be slower for directories with up to a few dozen files), but operations you may think of as simple, say, ls, can be pretty bad cases: ls will go through the list of items and stat() each item as it lists it, and each stat will itself be a separate linear-time lookup, so the total is quadratic in the number of entries.

Note that anything looking for a known filename will be only linear with number of directory entries, which itself stays acceptable much longer (but can still be faster).


One thing you occasionally see is programs dividing a load of items in a directory tree, to reduce the amount of items in each directory. There are a few different aspects to this:

  • can also be useful to avoid (filesystem/OS) limits imposed on directory entry amount (you occasionally see a max of 64k entries in a directory)
  • means you add one (or more, with deeper structures) directory stat()s on each file operation. If you are opening a predetermined filename, this may sometimes even turn out slower than opening it from a large directory
  • may be nice to file browsers that want to look at these directories (windows explorer and various others can freeze up for minutes when looking at 20K+ items)
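
One common way to do that division is to derive the subdirectory names from a hash of the filename. A minimal sketch, assuming SHA-1 and two levels of two hex characters each (so at most 256 entries per intermediate directory); the function name, root path and parameters are made up for illustration:

```python
import hashlib
import os


def sharded_path(root, filename, levels=2, width=2):
    """Map filename to root/ab/cd/filename, using leading hex digits of its hash."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *shards, filename)


path = sharded_path("/data/store", "report-2015-03.pdf")
print(path)
# The caller still has to create the directories before writing, e.g.:
# os.makedirs(os.path.dirname(path), exist_ok=True)
```

Because the location is computed from the name, opening a known file needs no directory listing at all, only the one or two extra directory lookups mentioned above.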

Common *nix filesystem organisation

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
This article is an attempt at a decent overview, introduction, summary, or such.

It may currently just contain a mess of information. If it is likely to keep sucking it will probably be removed later.

Perhaps-useful-to-knows

apparent size versus disk usage

Apparent size is the byte size of the file. It is, for example, the number of bytes that cat filename would output.

Disk usage is the amount of bytes used on disk for this file, and is something reported by filesystem code.


The two will differ:

  • when the file size is not an exact multiple of the underlying block size, the last allocated block will only be partially used.
  • for sparse files, disk usage can be much lower
  • for fragmented files, disk usage is somewhat higher
  • use of indirect blocks[3]


Most shell utilities show apparent size.

Interestingly, du shows disk usage, at least by default.


If you want to find the difference, do a stat:

st_size is the apparent size
st_blocks*512 is the disk usage
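
The difference is easiest to see with a sparse file. A minimal sketch, assuming a Unix-like system (st_blocks is not available everywhere) and a temp directory on a filesystem that supports sparse files; the helper name sizes is made up for illustration:

```python
import os
import tempfile


def sizes(path):
    """Return (apparent size, disk usage) of path, both in bytes."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512   # st_blocks counts 512-byte units


# Seek far past the start and write one byte: everything before it is a hole.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    sparse_path = tmp.name
    tmp.seek(10 * 1024 * 1024 - 1)
    tmp.write(b"\0")

apparent, on_disk = sizes(sparse_path)
os.remove(sparse_path)

print("apparent size:", apparent, "bytes")
print("disk usage   :", on_disk, "bytes")
```

The apparent size is always 10 MiB here. The disk usage is typically a few KiB where holes are supported, and close to the apparent size where they are not.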


MBR backup and restore

(Note: This does not apply to GPT - those are larger)

When you install Windows, it overwrites the boot code in the MBR without asking. If you had a boot menu there, tough luck.

Being the first 512 bytes on a disk, it's pretty easy to back up beforehand and restore after.


Keep in mind that the partition table is also in there. If you change that during your windows install, you don't want to revert that afterwards.

With foresight, you can install windows first and linux after.

If you installed linux first and windows after, you'll probably either want to:

  • partition the way you want it beforehand, and make sure windows didn't change it anyway, or...
  • know how to install a new grub based on existing configuration (often using a liveCD and two or three commands, there are various pages about this).


Back up the MBR:

  • boot in some *nix
  • Back up the MBR (Master Boot Record), probably somewhere that you can easily get to later
dd if=/dev/hda of=/mbrbackup bs=512 count=1

Restore the MBR:

  • Boot from a *nix liveCD
  • get to the MBR backup (often: mount the linux partition that you saved it on)
  • write it back verbatim to restore the MBR:
dd if=/mnt/mylinux/mbrbackup of=/dev/hda bs=512 count=1


Note/warning: You read the MBR from and write the MBR to the drive device, not one of the partitions on it.
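
If the partition table did change in between, you can restore just the boot code and keep the table that is on disk now. A minimal sketch, assuming the classic MBR layout (446 bytes of boot code, then 64 bytes of partition table, then the two signature bytes 0x55 0xAA); the function name is made up for illustration, and it only combines two 512-byte blobs in memory, reading them from and writing the result to the drive is left to you:

```python
BOOT_CODE_SIZE = 446   # classic layout; modern MBRs keep a disk signature in bytes 440-445
SECTOR_SIZE = 512


def restore_boot_code_only(backup, current):
    """Return a sector with boot code from backup and partition table from current."""
    if len(backup) != SECTOR_SIZE or len(current) != SECTOR_SIZE:
        raise ValueError("expected two 512-byte sectors")
    if backup[510:512] != b"\x55\xaa":
        raise ValueError("backup lacks the 0x55AA boot signature")
    # Bytes 0-445 come from the backup; the partition table and signature
    # (bytes 446-511) are taken from what is on disk right now.
    return backup[:BOOT_CODE_SIZE] + current[BOOT_CODE_SIZE:]


# Synthetic sectors, just to show which part ends up where.
old = b"A" * 446 + b"B" * 64 + b"\x55\xaa"
new = b"C" * 446 + b"D" * 64 + b"\x55\xaa"
merged = restore_boot_code_only(old, new)
print(merged[:4], merged[446:450], len(merged))
```

Note that taking all 446 bytes also brings back the old disk signature, which may or may not be what you want.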


Note that if you didn't back up the MBR, tools like ms-sys let you write various MBRs (mostly windows).

Deleted file content removed only when file closed

It seems that in most filesystems, deleting an open file, in itself, removes only the directory entry.

The data is not freed on disk until the last process that has a handle to it closes that handle.


This can be useful to know in the context of large files and log files.

It can also be useful for temporary files. If you create, open, delete the file, no other process can open this file. You can still pass the file handle around. It's essentially an anonymous disk file that will be removed as soon as you close its handle.
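
The create, open, delete sequence looks roughly like this. A minimal sketch, assuming a POSIX system (on Windows you generally cannot delete a file that is still open); the variable names are made up for illustration:

```python
import os
import tempfile

fd, anon_path = tempfile.mkstemp()        # create and open a real file
with os.fdopen(fd, "w+b") as handle:
    os.unlink(anon_path)                  # remove the directory entry right away
    gone = not os.path.exists(anon_path)  # no other process can open it by name now

    handle.write(b"scratch data")         # ...but our handle still works
    handle.seek(0)
    data = handle.read()
    links = os.fstat(handle.fileno()).st_nlink   # typically 0 on local filesystems
# Closing the handle (end of the with block) is what frees the disk space.

print(gone, data, links)
```

Python's tempfile.TemporaryFile gives you this behaviour directly on POSIX systems.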

volume management using LVM

See LVM notes

md: software RAID

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

One argument in the hardware RAID versus software RAID discussion is that once a hardware controller fails, you may effectively have lost all your data. Not because the disks are damaged, but because you need a controller of the same brand and model to read them, since the actual storage methods used are proprietary, usually differ per company, and often also change over time.

Since it is easier to duplicate a software setup, that makes software RAID a safer, easier-to-restore option. This comes at the cost of doing all the data processing on the main CPU instead of on the RAID controller, introducing potential bottlenecks if you do it over PCI (bus saturation with RAID was pretty easy), and not really having the battery-backed flushing option (useful if you don't have a UPS).