Hashing notes

From Helpful
(Redirected from Hashing)
Jump to: navigation, search
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Hash function

A hash function takes data (usually byte data), and gives a mangled output. The result is often referred to as the hash (of that data), a hash value, or such.

Various uses of hashing/hashes rely on different properties, which include:

  • always generates the same hash value for the same input
  • the original is not recognizable in the hash value. Sometimes not a priority, but usually is, and in fact much stronger than that:
  • one-way function: it mangles in a way that is not easily reversed
mainly brute force; how hard it is depends mainly on search space / hash size, and available CPU power
  • a hash is typically much smaller than the data it was calculated for
(most hashes are fixed size and most somewhere between 4 and 20 bytes; most data they work on is much larger)
  • a hash function will try to employ all the bits of the hash value equally, preferably regardless of the input data
  • the hash value is quite different for inputs that are nearly equal (due largely to the Avalanche effect)
so that that won't accidentally reveal things about input

Equal hash values do not imply the value you put in was the same, but statistically the likeliness of that happening by accident takes a nosedive with increasing hash sizes.

You can e.g. force this on a 16-bit hash because roughly one in every 65536 strings will get the same hash (because there are only that many possible outputs), but doing the same for a 160-bit take on the order of 1E48 attempts.

Simpler-requirement uses

Verification of data transmission

When you transfer a large chunk of data, you can check whether the data on the other end is identical by doing a hash on both ends, and comparing the hash values.

This is generally a very good indicator of both success on success, and of failure on failure, needs only a tiny amount of further transfer.

How good of an indication is determined by probability, mostly depending on the size of the hash.

The hash value doesn't need to be very large for the probability of error to be very low - even a CRC32 (4 byte) is an okay start, but it's often easy to use something like MD5 (16 byte) or SHA-1 (20 byte) and get a much better probability.

data structures

Hash map

Higher-requirement uses

Password checking

The very simplest password system stores a password as-is.

This is a bad idea to longer-term security, because if someone got access to the computer they could simply read off all passwords.

Instead, you can store the hash of the password.

This is basically data verification, in that matching passwords means matching password hashes. It's just of a very small bit of data.

The basic reason to store password hashes instead of passwords is that stealing the hashes means an attacker has more work to do.

Exactly how much guesswork depends a lot on setup - and on the passwords themselves.

When implementing security like this, there is a checklist of things frequently done wrong, that you can avoid.

On speed

General purpose hash functions (MD5, SHA1, and a dozen others) are not the best for passwords.

Not because they're broken, but because they're fast: hundreds of MB/sec or more is doable on a ten-year-old computer.

Allowing speedy content verification is great.

Allowing speedy offline brute force password cracking is not.

Things like bcrypt is great for passwords because it is tweakably slow. (as in, ten years on you can make new passwords slower, to counteract computers having become faster. Roughly by doing more loops of mangling - in a way that can't be reduced much with cryptanalysis).

The reason this is good is actually a very pragmatic, dumb one:

A fast hash may allow million checks per second (microseconds per password). If you configure bcrypt so that it takes, say, ~50ms to check a password on current computers, people logging in won't even notice, while limiting even an offline bute force attack to ~20 checks per second per PC, making an exhausive attack very costly (generally infeasible).

The 'password too short' nagging comes from very similar origins: even if it's slow, there are only so many passwords possible.

Assuming ~50 choices per character, then there are ~2500 two character passwords, ..., ~20 billion for 6-character, ..., 50k-billion for 8-character.

That sounds like a bunch, but if indeed you can check millions per second, then 6-character is still only minutes. Whereas 8-character is years.

(There are many footnotes to that. Not least of which being that most people don't use random passwords, and that this is perfectly divisible work so if you have huge interest in a specific password, you hire a server park and divide those CPU-years by however many CPUs you have. This is why you add a few more characters, or add some symbols.)


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A MAC (Message Authentication Code) is used to verify the authenticity (and content integrity) of a message.

Intuitively, this is content verification, but which stands up to malicious intent.

Variants of this include:

uses secret selection of the hash function.

based on a block cipher
  • OMAC (One-key MAC)
based on a block cipher
based on a block cipher
made for speed

  • DAA (Data Authentication Algorithm)
an outdated US government MAC.

See also:


A fairly common implementation of a MAC is the Keyed-Hash Message Authentication Code (HMAC or KHMAC).

It feeds two things into a hash function: the message data, and a shared secret.


Reproducable, relatively-unique output

Perfect hash function

Perfect hash functions are collisionless: they map each input to an unique output.

Minimal perfect hash functions map to a contiguous integer set.

Real-world uses include cases in which a map is generated once and then only queried, whenever there is value in spending more time up front when it means a(n often slight) reduction later.

For example in code compilation, some real-time kernel situations, sometimes data indexing, and such.

It is not relevant to many everyday uses. For example, while in hashmaps, low collision is a good property to have, choosing a good hash function goes most of the way already, while regenerating a perfect hash on every update is prohibitive.

hash types, significators, values you see in files, etc.

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Note that you can base a hash on a block cipher, and various of these are.

In places where more than one type of hash may appear, the entry may look like

This, and any of the identifiers, is primarily convention, or situation-specific de-facto standard, and not down to any sing standard.

For '/etc/shadow, the identifiers include:

  • nothing: crypt() (one of various implementations, historically usually based on DES)
  • _
    : bsdi_crypt
  • $2y$
    : bcrypt (based on blowfish)
  • %sha$1
    : SHA1
  • $md5$
    : MD5
  • $5$
    : SHA256
  • $6$
    : SHA256
  • $3$
    : bsd_nthash

The above is the relatively basic set used by OSes. See also:


Note that any use of the crypt() function is not a specific implementation, so is not guaranteed to work cross-system (use a specific marked algorithm if you want that). Historically it was often DES, currently it's possibly MD5.

Application using passlib may also be using things in this list

Apparently in .htpasswd files have a similar style, but don't assume you get more than:

  • nothing: crypt
  • $2y$
  • {SHA}
  • $apr1$
    - apache's modified MD5 (to ease system independence?(verify)) (also used by BSD?(verify))

See also: https://httpd.apache.org/docs/2.4/misc/password_encryptions.html

Some hash(/Digester/Checksummer) methods

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(this is likely to contain errors, and given to be incomplete. Sorted vaguely by how common they are.)

made for passwords
made for passwords

  • DES
    • Basic DES: (13 characters): 2 salt, 11 hash value. (56-bit DES)
    • Extended DES: (20 characters): 9 character salt, 11 hash value
  • MD family:
    • MD5 [3] (Salt size is variable(verify), commonly 8 or 12 characters(verify))
      • MD5
      • MD5 (Unix)
      • MD5 (APR), e.g. in "$apr1$salt$hash"
    • MD2 [4]
    • MD4 [5]
    • MD3
    • MD6
  • CRC ('Cyclic Redundancy Check')
    • Primarily used as a checksum function/value
    • Note that specific implementations can vary in with:
      • (most obviously) size, often 32 or 16, sometimes 8 or 64
      • choice of polynomial (note that if reported as a hex value rather than a polynomial's powers, it may be bit-reversed, so in a way there are two such values for each poly)
      • initial value (often all bits on)
      • when/how bits are shifted (verify)
      • possible other details, such as a bitmask in each step, or general details of bit ordering (verify)
      • ...as such, checking a CRC from something you didn't design yourself, you often have to figure out its specific implementation.
  • SHA [6]
    • SHA-0 (160-bit) - was never really adopted, because of a spec flaw that led to SHA-1
    • SHA-1 (160-bit) [7]
      • currently the most common variant
      • collisions can be found faster than is ideal. While not really an attack, it suggests that a stronger hash is a good idea
    • SHA-2 [8]
      • SHA-244 / SHA-256
      • SHA-384 / SHA-512 (faster on 64-bit)
    • SHA-3 [9]

  • MySQL PASSWORD() - a few different versions exist
    • MySQL before 4.1 generate a 128-bit value (what sort of hash?(verify))
    • 4.1.0 used a preliminary version of the below
    • 4.1.1 and later generate a 160-bit value, shown in hex with a prepended * (41 chars)
      • Added the old function under OLD_PASSWORD()
      • Apparently this is SHA1(SHA1(password)) (implementation note: the outer SHA1 reads the inner one's result in byte form)

  • NT hash (also less accurately known as NTLM hash)
    • 128 bits
    • MD4 of the little-endian UTF-16 version of the password
  • AES (Rijndael) [10]
  • Haval: [12]
    • Haval-128 (most common variant?(verify))
    • Haval-160
    • Haval-192
    • Haval-224
    • Haval-256
  • RipeMD [13] (Cryptographic)
    • RIPEMD-160
    • RIPEMD-128 (most common variant?(verify))
    • RIPEMD-256
    • RIPEMD-320

  • Tiger [15]
    • Tiger-128
    • Tiger-160
    • Tiger-192
  • Whirlpool [16] (Cryptographic)
  • Panama, [17]
    • Panama
    • RadioGatĂșn [18]
  • Snefru [20]
    • Snefru (128-bit)
    • Snefru-256 (256-bit)

  • AICH [21]
    • 'Advanced Intelligent Corruption Handling', basically a hash tree setup


  • Bitzi bitprint - a concatentation of SHA-1 of the data (20 bytes) and the TTH of the data (24 bytes), usually in Base32], which makes for a 32-character string.

Built for speed, not cryptographic:

Block ciphers / cryptosystems

  • Blowfish
    • ($2$ or $2a$): 16 character salt
    • ($2a$07$): see IPSEC note below
  • Cast
    • Cast 128 (a.k.a. Cast5) [22]
    • Cast 256 (a.k.a. Cast6) [23]
  • RSA (Rivest, Shamir, Adleman) [26]

(a.k.a. rDSA(verify))

Identifiers may include IPSEC's ESP Transform Identifiers (see RFC 2407), and others. said transform identifiers include:

  • 1: DES_IV64
  • 2: DES
  • 3: 3DES
  • 4: RC5
  • 5: IDEA
  • 6: CAST
  • 8: 3IDEA
  • 9: DES_IV32
  • 10: RC4
  • 11: NULL

See also

More notes on

Varying requirements

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

What "Hash function X is broken" means


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Aside from different poly values, there are more variants out there than you might expect, in that they end up with a different value.

Reasons include:

  • Initial values. Often -1 (i.e. 0xffff, 0xffffffff, etc.), sometimes 0, sometimes something else
  • Some code is endian-specific (through some bitwise operations), and some may use a bit-reversed version of an otherwise identical poly value.
As such, the poly factors are more descriptive than the poly value(s) (the list below tries to mention both)
  • There is sometimes a final XOR with a specific value (verify)

So while you can cover the bulk with a few description, an exhaustive list is harder.

CRC32 (CCITT32, Ethernet, PKZip, FDDI, ITU-T, OSI/IEC, Autodin-II, others)

  • Polynomial: (0,1,2,4,5,7,8,10,11,12,16,22,23,26(,32)) (0xedb88320 / 0x04c11db7)
  • Initial Value: 0xffffffff

"CRC-16" (this variant is the one often meant when a non-specific CRC16 is mentioned)

  • Polynomial: x^16 + x^15 + x^2 + 1 (0xA001 / 0x8005)
  • Initial value: 0xffff
  • Also:

CRC16 CCITT, X-25, XModem, Kermit use the same poly but vary in details

  • Polynomial: x^16 + x^12 + x^5 + 1 (0x8408 / 0x1021)
  • Initial value: 0xffff (but there are variants with 0x0000 and 0x1d0f (verify))
  • output XOR mask:
CCITT: 0xffff except for the 1D0F variant
most others: 0x0000
  • Kermit bit-reverses each byte before use, and the result afterwards(verify), and swaps the two result bytes (verify)

CRC-16 IBM seems another variation on the above (apparently also used in bisync, usb, modbus?)

CRC-16 Polynomial: x^16 + x^12 + x^5 + 1 (0x1021)

  • Used by
    • BinHex (init val 0)? (verify)
    • Packit (init val 0)? (verify)


  • Poly: x^16 + x^13 + x^12 + x^11 + x^10 + x^8 + x^6 + x^5 + x^2 + 1 (0xA6BC )
  • Initial value 0

CRC8 (Dallas/Maxim iButton) Polynomial: x^8 + x^5 + x^4 + 1 (0x8C)

Unsorted: CRC-16 SICK seems a faster but weaker variation (verify)

See also:

Checksum files