Security notes / Hashing notes

From Helpful
Jump to navigation Jump to search
Security related stuff.


Theory / unsorted

how to do a login system badly
how to do encryption badly
encrypted connections
data-at-rest encryption
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Hash function

A hash function

  • takes an arbitrary amount of data (usually byte data), and
  • gives a output of shorter, typically fixed size.
this result value is referred to as the hash value of that data, often just 'the hash', or similar terms.

Usually takes byte input, usually creates output mangled in that it doesn't resemble the input, which is where we get to the point that different uses of hashing/hashes rely on different properties.

Hashes have a whole bunch of distinct uses, many of which not particularly related to security, and a few that very much are (which is where you have longer discussions about the details).

Properties you care about in most uses include:

  • given the same input, a hash function always generates the same output (hash) value
you pretty much always want that
  • given almost-equal input, the hash value is quite different
due largely to the Avalanche effect
not always necessary, but useful in many uses, and fairly easy to ensure
  • a hash is (typically much) smaller than the data it was calculated for
most hashes are fixed size and most somewhere between 4 and 20 bytes; the data they work on is often kilobytes to gigabytes
  • the original is not easily recognizable in the hash value.
Sometimes not a priority, but usually is. Also...
  • one-way function: it mangles in a way that is not easily reversed
finding the original is often a brute force search, so how hard it is to find depends mainly on search space / hash size, and CPU power
(but there are some practical attacks that don't have much to do with the hash itself. If privacy matters, basically get someone to do their best to break your system before you make it public)
  • a hash function will try to employ all the bits of the hash value equally, preferably regardless of the input data
(roughly to ensure that the amount of values the hash value could take actually reflects how many possibilities actually do get used)

Beyond that, requirements may start to vary. Consider for example:

  • very large keyspace
more useful in cryptography, because it helps make collision-like attacks much less feasible to run
many other applications don't really care.
  • fast - when supporting a low-latency use like network packet integrity, the faster you can say "very likely fine", the faster communication goes.
while for file integrity checks you wouldn't care about speed as much
  • slow - when supporting a login system, the more you make it impossible to verify a hash slowly, the harder it is is to brute-force.
e.g. very useful for password checks
also for certain types of signing
  • details in various sections below

Applications with simpler requirement

Check that data made it through a trusted channel okay: see Message integrity (doing so over an untrusted channel is much harder, and cannot be done with hashing alone)

Check that we're talking about the same data without communicating the data itself: see Message digest

In a bunch of data structures, you care only to generate a short number relatively unique from other such numbers, but always the same for the same original value.

Examples include:

  • keys in hash maps. The main use is in spreading things roughly evenly to many buckets
as a value always has the same hash, you will always look in the right place (you don't have to care about how it's hashed, it just matters it's the same)
The hash function can actually be relatively stupid (and sometimes is) and still spread fairly well.
  • speeding up bulks of (equality) comparisons of complex objects (e.g. text documents).
If you calculate a hash value representative for each object, you can then compare such values to quickly decide on inequality
Equal hash values do not mean equal values, but the statistics of collisions mean you can significantly narrow things down - you will have only a tiny amount of collisions to check by content

  • coloring - e.g. if you want to always color the same text the same way (anything that you will see repeatedly, e.g. username, category), but not have to remember what you assigned it, you can hash that text and pick bits out of that hash value
the value lies in that it always gives the same output, not even what that output is
  • identicons - an extension of the previous, but using the result for shapes as well

Uses with stronger requirements

Message authenticity

See Message authenticity

Password checking

The very simplest password system stores a password as-is.

This is obviously a bad idea to longer-term security, because if someone gets (local or remote) access to the computer that stores them, they could simply read off all passwords with zero effort.

Instead, you can store the hash of the password.

The basic reason to store password hashes instead of passwords is that stealing the hashes means an attacker has a lot more work to do.

Exactly how much guesswork depends a lot on setup - and on the passwords themselves.

This is basically data verification, in that matching passwords means matching password hashes.

However, it matters that the input is a very small bit of data.

Say, there's only so many six-letter passwords - basically so few that you could basically make a big list. The hash doesn't reveal what password it is directly -- but the shorter the password, the more likely it is that only one of all possible passwords of that size (or shorter) will match.

The amount of computation you need to do is bounded by password length.

This is the basic reason longer passwords are better. And also that very predictable passwords are bad even if longer (an attacker would start with them).

When implementing security like this, there is a checklist of things frequently done wrong, that you can avoid.

On reversibility and salt


On speed

General purpose hash functions (MD5, all of the SHA family, and a dozen others) are not the best for passwords.

Not because they're broken. They can check all the nice properties, but because of a more practical issue: They are fast.

The reason you would like a password hasher to be slow is, well, more of a stupidly pragmatic one than anything else:

A fast hash algorithm may allow million checks per second (microseconds per password).

If you use bcrypt, and configure so that it takes, say, ~100ms to check a password on current computers, people logging in won't even notice, while limiting even an offline brute force attack to the order of dozens per second per PC, rather than potentially millions for MD5/SHA1. (bcrypt is also tweakably slow - ten years on you can make new passwords slower, to counteract computers having become faster. Roughly by doing more loops of mangling - in a way that ideally can't be reduced much with cryptanalysis).)

Others interesting choices for password hashing (including other reasons) include Argon2, scrypt, and PBKDF2.

The 'password too short' nagging comes from very similar origins: even if it's slow, there are only so many passwords possible.

Assuming ~50 choices of character, then there are ~2500 two-character passwords, ~20 billion six-character, ..., 50k-billion eight-character.

That sounds like a bunch, but if indeed you can check millions per second, then you can exhaust 6-character passwords in minutes - but 8-character is still years.

(There are many footnotes to that. Not least of which being that most people don't use random passwords, and that this is perfectly divisible work so if you have huge interest in a specific password, you hire a server park and divide those CPU-years by however many CPUs you have. This is why you add a few more characters.)

bcrypt notes

Hashes with nonces

What "Hash function X is broken" means

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

The only thing it always means is that "something takes less time than we designed it to".

How much less, and which attacks (if any) it makes more feasibly (by how much) still varies wildly.

Collision attacks, pre-image attacks

More on collision attacks
More on pre-image attacks

Some real-world examples, with some real-world numbers

More technicalities

"cryptographic hash function"

Perfect hash function

Perfect hash functions are collisionless, by definition, because they map each input to an unique output.

Minimal perfect hash functions map to a contiguous integer set.

This isn't purely academis 'why not'.

Real-world uses include cases where a map is generated once and then only queried, whenever there is value in spending more time up front when it means a reduction later.

Even if that reduction is slight, if you're, say, compiling some real-time kernel stuff, a few seconds of extra compile time is easily worth it.

For data indexing, this is sometimes worth thinking about.

But yeah, it is not relevant to many everyday uses. For example, while low collision in hashmaps is a good property to have, it's usually quite cheap to lower collision a lot, but expensive to have it to have it at the absolutely-provalbly-minimal level. (Particularly when you have to regenerate that hash every resize/rebalance)

Locality sensitive hashing

Locality-sensitive hashing (LSH) attempts to sort similar items into the same bucket, with an often low amount of buckets.

Note that this is effectively intentional maximizing collision (unlike almost all other hash use), and meaningful collision.

Note that this is the opposite of what you want in e.g. a genera purpose hashmap, but may be a specific part of things like neighbour search, or clustering, and some types of similarity and fingerprinting.

Note that this bears similarities to dimensionality reduction instead.

Semantic hashing

Semantic hashing tries to generate hashes where the (Hamming) distance between codes is

Spectral hashing

Fuzzy hashing

hash types, significators, values you see in files, etc.

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Note that you can base a hash on a block cipher, and various of these are.

In places where more than one type of hash may appear, the entry may look like $identifier$<salt>$<hash>. This, and any of the identifiers, is primarily convention, or situation-specific de-facto standard, and not down to any sing standard.

For /etc/shadow, the identifiers include:

  • nothing: crypt() (one of various implementations, historically usually based on DES)
  • _: bsdi_crypt
  • $2$ - bcrypt
  • $2a$ - bcrypt that would accept unicode as UTF8
  • $2x$, $2y$ - BSD-only(verify) marker used to note hashes from a patched crypt_blowfish (fixing a bug in this PHP implementation of bcrypt)[1]
  • $2b$ - bcrypt (marking a fix to the BSD implementation of bcrypt)
  • $3$: bsd_nthash (only freeBSD)
  • $md5$, {md5,: MD5
  • %sha1$: SHA1
  • $5$: SHA256
  • $6$: SHA512

Entries take either the form




The above is the relatively basic set used by OSes. See also:

Note that any use of the crypt() function is not a specific implementation, so is not guaranteed to work cross-system (use a specific marked algorithm if you want that). Historically it was often DES, currently it's possibly MD5.

Application using passlib may also be using

$apr1$ - Apache's modified MD5
$argon2i$, $argon2d$</nowiki>}} argon2
$bcrypt-sha256$ Passlib-specific
$P$, $H$ PHPass-based
$pbkdf2$ pbkdf2_sha1, Passlib-specific
$pbkdf2-sha256$ Passlib-specific
$pbkdf2-sha512$ Passlib-specific
$scram$ Passlib-specific
$p5k2$ cta_pbkdf2_sha1 OR dlitz_pbkdf2_sha1
$scrypt$ Passlib-specific

(list from here)

Apparently in .htpasswd files have a similar style, but don't assume you get more than:

  • nothing: crypt
  • $2y$ - bcrypt
  • {SHA} - SHA1
  • $apr1$ - apache's modified MD5 (to ease system independence?(verify)) (also used by BSD?(verify))

See also:

Some hash(/Digester/Checksummer) methods

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

(this is likely to contain errors, and given to be incomplete. Sorted vaguely by how common they are.)


  • DES
    • Basic DES: (13 characters): 2 salt, 11 hash value. (56-bit DES)
    • Extended DES: (20 characters): 9 character salt, 11 hash value
  • MD family:
    • MD5 [2] (Salt size is variable(verify), commonly 8 or 12 characters(verify))
      • MD5
      • MD5 (Unix)
      • MD5 (APR), e.g. in "$apr1$salt$hash"
    • MD2 [3]
    • MD4 [4]
    • MD3
    • MD6

  • SHA [5]
    • SHA-0 (160-bit) - was never really adopted, because of a spec flaw that led to SHA-1
    • SHA-1 (160-bit) [6]
      • currently the most common variant
      • collisions can be found faster than is ideal. While not really an attack, it suggests that a stronger hash is a good idea
    • SHA-2 [7]
      • SHA-244 / SHA-256
      • SHA-384 / SHA-512 (faster on 64-bit)
    • SHA-3 [8]

  • MySQL PASSWORD() - a few different versions exist
    • MySQL before 4.1 generate a 128-bit value (what sort of hash?(verify))
    • 4.1.0 used a preliminary version of the below
    • 4.1.1 and later generate a 160-bit value, shown in hex with a prepended * (41 chars)
      • Added the old function under OLD_PASSWORD()
      • Apparently this is SHA1(SHA1(password)) (implementation note: the outer SHA1 reads the inner one's result in byte form)

  • NT hash (also less accurately known as NTLM hash)
    • 128 bits
    • MD4 of the little-endian UTF-16 version of the password

  • AES (Rijndael) [9]
  • Haval: [11]
    • Haval-128 (most common variant?(verify))
    • Haval-160
    • Haval-192
    • Haval-224
    • Haval-256
  • RipeMD [12] (Cryptographic)
    • RIPEMD-160
    • RIPEMD-128 (most common variant?(verify))
    • RIPEMD-256
    • RIPEMD-320
  • Tiger [14]
    • Tiger-128
    • Tiger-160
    • Tiger-192
  • Whirlpool [15] (Cryptographic)
  • Snefru [19]
    • Snefru (128-bit)
    • Snefru-256 (256-bit)

  • AICH [20]
    • 'Advanced Intelligent Corruption Handling', basically a hash tree setup

  • Jenkins [21]
    • non-cryptographic, intended for hash tables and such
    • Variants:
      • Jenkins One-At-A-Time (JOAAT)
      • lookup2
      • lookup3
      • SpookyHash
  • MurMur [22]
    • non-cryptographic

  • FNV (Fowler–Noll–Vo) [23]
    • non-cryptographic, intended for hash tables, checksums, and such
    • FNV-0 (deprecated)
    • FNV-1
    • FNV-1a
sizes between 32 and 1024 bits, but typically 32 or 64?(verify)


  • Bitzi bitprint - a concatentation of SHA-1 of the data (20 bytes) and the TTH of the data (24 bytes), usually in Base32], which makes for a 32-character string.

Non-cryptographic hashers

  • CRC ('Cyclic Redundancy Check')
    • Primarily used as a checksum function/value
    • Note that specific implementations can vary in with:
      • (most obviously) size, often 32 or 16, sometimes 8 or 64
      • choice of polynomial (note that if reported as a hex value rather than a polynomial's powers, it may be bit-reversed, so in a way there are two such values for each poly)
      • initial value (often all bits on)
      • when/how bits are shifted (verify)
      • possible other details, such as a bitmask in each step, or general details of bit ordering (verify)
      • such, checking a CRC from something you didn't design yourself, you often have to figure out its specific implementation.
  • Murmur, Murmur2, Murmur3
  • xxHash
Built for speed

Password hashers

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
made for passwords
parameters for required CPUtime
  • PBKDF2 [25] (Password-Based Key Derivation Function)
made for passwords
potentially a little more sensitive to time–memory trade-off than e.g. Argon2, scrypt(verify)
made for passwords
parameters for required CPUtime, memory requirements, and parallelism
also makes GPU attacks much less efficient
made for passwords

Block ciphers / cryptosystems

  • Blowfish
    • ($2$ or $2a$): 16 character salt
    • ($2a$07$): see IPSEC note below
  • Cast
    • Cast 128 (a.k.a. Cast5) [28]
    • Cast 256 (a.k.a. Cast6) [29]
  • RSA (Rivest, Shamir, Adleman) [32]

(a.k.a. rDSA(verify))

Identifiers may include IPSEC's ESP Transform Identifiers (see RFC 2407), and others. said transform identifiers include:

  • 1: DES_IV64
  • 2: DES
  • 3: 3DES
  • 4: RC5
  • 5: IDEA
  • 6: CAST
  • 8: 3IDEA
  • 9: DES_IV32
  • 10: RC4
  • 11: NULL

See also

Some notes on CRC

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Cyclic Redundancy Check was an early hash function, aimed mainly at content verification

Aside from different poly values, there are more variants out there than you might expect, in that they end up with a different value.

Reasons include:

  • Initial values. Often -1 (i.e. 0xffff, 0xffffffff, etc.), sometimes 0, sometimes something else
  • Some code is endian-specific (through some bitwise operations)
  • some may use a bit-reversed version of an otherwise identical poly value.
As such, the poly factors are more descriptive than the poly value(s) (the list below tries to mention both)
  • There is sometimes a final XOR with a specific value (verify)

So while you can cover the bulk with a few description, an exhaustive list is harder.

CRC32 (CCITT32, Ethernet, PKZip, FDDI, ITU-T, OSI/IEC, Autodin-II, others)

  • Polynomial: (0,1,2,4,5,7,8,10,11,12,16,22,23,26(,32)) (0xedb88320 / 0x04c11db7)
  • Initial Value: 0xffffffff

"CRC-16" (this variant is the one often meant when a non-specific CRC16 is mentioned)

  • Polynomial: x^16 + x^15 + x^2 + 1 (0xA001 / 0x8005)
  • Initial value: 0xffff
  • Also:

CRC16 CCITT, X-25, XModem, Kermit use the same poly but vary in details

  • Polynomial: x^16 + x^12 + x^5 + 1 (0x8408 / 0x1021)
  • Initial value: 0xffff (but there are variants with 0x0000 and 0x1d0f (verify))
  • output XOR mask:
CCITT: 0xffff except for the 1D0F variant
most others: 0x0000
  • Kermit bit-reverses each byte before use, and the result afterwards(verify), and swaps the two result bytes (verify)

CRC-16 IBM seems another variation on the above (apparently also used in bisync, usb, modbus?)

CRC-16 Polynomial: x^16 + x^12 + x^5 + 1 (0x1021)

  • Used by


  • Poly: x^16 + x^13 + x^12 + x^11 + x^10 + x^8 + x^6 + x^5 + x^2 + 1 (0xA6BC )
  • Initial value 0

CRC8 (Dallas/Maxim iButton) Polynomial: x^8 + x^5 + x^4 + 1 (0x8C)

Unsorted: CRC-16 SICK seems a faster but weaker variation (verify)

See also:

Checksum files

Can you treat a hash as unique identifier of content?