Security notes / Hashing notes

From Helpful
Revision as of 18:18, 1 December 2021 by Helpful (Talk | contribs)

Jump to: navigation, search
Security related stuff.


Theory / unsorted

  • Hashing notes

how to do a login system badly
how to do encryption badly
encrypted connections
data-at-rest encryption
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, somay have incorrect bits. (Feel free to ignore, or tell me)

Hash function

A hash function takes data (usually byte data), and gives a mangled output.

The result value is often referred to as the hash (of that data), a hash value, or such.

And it has a lot of uses not related to security - and a bunch that are about security.

Different uses of hashing/hashes rely on different properties. Properties you may care about include:

  • given the same input, a hash function always generates the same output (hash) value
  • given almost-equal input, the hash value is quite different
due largely to the Avalanche effect
not always necessary, but useful in many uses, and fairly easy to ensure
  • a hash is (typically much) smaller than the data it was calculated for
most hashes are fixed size and most somewhere between 4 and 20 bytes; the data they work on is often much larger
  • the original is not easily recognizable in the hash value.
Sometimes not a priority, but usually is, and in fact much stronger than that:
  • one-way function: it mangles in a way that is not easily reversed
finding the original is often a brute force search, so how hard it is to find depends mainly on search space / hash size, and CPU power
  • a hash function will try to employ all the bits of the hash value equally, preferably regardless of the input data

Beyond that, requirements may start to vary. Consider for example:

  • very large keyspace
more useful in crypto, because it helps make collision-like attacks non-feasible to do
a bunch of other uses don't really care.
  • fast - when supporting a low-latency use like network packet integrity, the faster you can say "very likely fine", the faster communication goes.
while for file integrity checks you wouldn't care about speed as much
  • slow - the slower it is to calculate a single hash, the harder it is to brute-force.
e.g. very useful for password checks
also for certain types of signing

Simpler-requirement uses

Data integrity checks

You do a hash on the source side, and make the receiving side do the same hash, and compare them.

This is generally a very good indicator of both success on success, and of failure on failure, needs only a tiny amount of further transfer.


  • speed: as fast as possible
  • robustness: hash should have avalanche effect

Doesn't care much about:

  • keyspace size
  • being a secure hash (but it's sometimes a nice bonus)

Detecting unintentional corruption of transmitted data is useful to avoid some very basic misbehaviour, and doesn't have to be complex at all to significantly lower the chances of that happening.

Relatively simple hashes include:

four different methods, but all necessarily simple. Odd is arguably the best.
not actually used a lot - they give very little protection, and there is no automatic fix you're better off adding your own checking (and possible correction) at protocol level
  • GPS (specifically the NMEA 0183 text protocol)
per line, all characters are XORed together. The resulting byte is appended as hex
not very strong protection at all, but enough to ignore most garbled lines
  • ISBNs (used on books) have their last digit as a check digit, by some basic math
relevant to scanning their barcodes with a lot fewer errors (many barcodes have such checking)
  • IPv4 uses a 16-bit checksum (of the header only) that is basically just addition[1]
  • TCP over IPv4 uses a 16-bit checksum (of the header and payload) that is basically just addition[2]
  • TCP over IPv6 uses a 16-bit checksum (of header and payload) that is basically just addition[3]
  • XMODEM used a sum of bytes, a later variant used 16-bit CRC instead
  • YMODEM used a 16-bit CRC
  • ZMODEM used 16-bit or 32-bit CRC

Most of these are mainly meant to make fairly likely that a garbled transmission is detected as bad, but not

focusing on the best probability of that detection (CRC is decent, the rest is not)
error correction (without retransmission)
or being strong against ill intent (see cryptographic hash for that).

On keyspace size

When you transfer a large chunk of data, e.g. large downloads, keyspace size matters a little, in that how likely you are likely to get the right hash with wrong data is a probability that lowers with larger spaces.

...but lower very quickly, so while 16 bits probably isn't enough, and 32-bit is meh, there are simple and fast hashes like MD5 (128 bit) or SHA-1 (160 bit) that are arguably already overkill for this use.

On avalance effect

Not all hashes are robust to certain structural errors. For example, one of the simplest hash functions is a simple bytewise XOR, which is usually blindingly fast to execute, but.

But if, say, data was transmitted over a 7-bit serial line, and its net effect was that the eighth but was shaved off all the data and the checksum, things will probably still check out.

Sure, that's a somewhat pathological example, but hashes with the avalance effect are simple to make, and solve this particular issue.

That also means that cryptographic hashes work well.

Supporting data structures

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, somay have incorrect bits. (Feel free to ignore, or tell me)

Uses with stronger requirements - security, mostly

Password checking

The very simplest password system stores a password as-is.

This is obviously a bad idea to longer-term security, because if someone gets (local or remote) access to the computer that stores them, they could simply read off all passwords.

Instead, you can store the hash of the password.

The basic reason to store password hashes instead of passwords is that stealing the hashes means an attacker has more work to do.

Exactly how much guesswork depends a lot on setup - and on the passwords themselves.

This is basically data verification, in that matching passwords means matching password hashes.

However, it matters that the input is a very small bit of data.

Say, there's only so many six-letter passwords - basically so few that you could basically make a big list. The hash doesn't reveal what password it is directly -- but the shorter the password, the more likely it is that only one of all possible passwords of that size (or shorter) will match.

The amount of computation you need to do is bounded by password length.

This is the basic reason longer passwords are better. And also that very predictable passwords are bad even if longer (an attacker would start with them).

When implementing security like this, there is a checklist of things frequently done wrong, that you can avoid.

On reversibility and salt


On speed

General purpose hash functions (MD5, all of the SHA family, and a dozen others) are not the best for passwords.

Not because they're broken. They can check all the nice properties, but because of a more practical issue: They are fast.

The reason you would like a password hasher to be slow is, well, more of a stupidly pragmatic one than anything else:

A fast hash algorithm may allow million checks per second (microseconds per password).

If you use bcrypt, and configure so that it takes, say, ~100ms to check a password on current computers, people logging in won't even notice, while limiting even an offline brute force attack to the order of dozens per second per PC, rather than potentially millions for MD5/SHA1. (bcrypt is also tweakably slow - ten years on you can make new passwords slower, to counteract computers having become faster. Roughly by doing more loops of mangling - in a way that ideally can't be reduced much with cryptanalysis).)

Others interesting choices for password hashing (including other reasons) include Argon2, scrypt, and PBKDF2.

The 'password too short' nagging comes from very similar origins: even if it's slow, there are only so many passwords possible.

Assuming ~50 choices of character, then there are ~2500 two-character passwords, ~20 billion six-character, ..., 50k-billion eight-character.

That sounds like a bunch, but if indeed you can check millions per second, then you can exhaust 6-character passwords in minutes - but 8-character is still years.

(There are many footnotes to that. Not least of which being that most people don't use random passwords, and that this is perfectly divisible work so if you have huge interest in a specific password, you hire a server park and divide those CPU-years by however many CPUs you have. This is why you add a few more characters.)


Message digest

Message signing

See also:

Anonymizing data

Variants of challenge-response

Reproducable, relatively-unique output

What "Hash function X is broken" means

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, somay have incorrect bits. (Feel free to ignore, or tell me)

The only thing it always means is that "something takes less time than we designed it to".

How much less, what parts, and how much or little that implies to ill intent (or specific uses), varies.

Context - "What is this pre-image stuff? Why is it called that?"

It's a math thing, probably introduced by the cryptanalists.

In mathematics, the set of possible outputs of a function is called its image[4].

And the inverse image, or preimage, is the members of the input that will map(verify) to that image.

In math you use more precise wording than this, but in the context of hashing, that roughly means

"pre-image" are inputs to the hash function (the message, e.g. the password)
"image" means outputs of the hash function (the message's hash value, e.g. password's hash)

I'll use those terms below where it may be clearer.

Collision attacks

Pre-image attacks

Second pre-image attack

Some real-world examples, with some real-world numbers

"cryptographic hash"

More technicalities

Perfect hash function

Perfect hash functions are collisionless: they map each input to an unique output.

Minimal perfect hash functions map to a contiguous integer set.

This isn't pure theory. Real-world uses include cases where a map is generated once and then only queried, whenever there is value in spending more time up front when it means a(n often slight) reduction later.

For example in code compilation, some real-time kernel situations, sometimes data indexing, and such.

It is not relevant to many everyday uses. For example, while in hashmaps, low collision is a good property to have, choosing a good hash function goes most of the way already, while regenerating a perfect hash on every update is prohibitively expensive.

Locality sensitive hashing

Locality-sensitive hashing (LSH) attempts to sort similar items into the same bucket, with an often low amount of buckets.

Note that this is effectively intentional maximizing collision (unlike almost all other hash use), and meaningful collision.

Though also note that you could call this dimensionality reduction instead.

This usually as part of neighbour search, or clustering, and some types of similarity and fingerprinting

Semantic hashing

Semantic hashing tries to generate hashes where the (Hamming) distance between codes is

Spectral hashing

hash types, significators, values you see in files, etc.

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, somay have incorrect bits. (Feel free to ignore, or tell me)

Note that you can base a hash on a block cipher, and various of these are.

In places where more than one type of hash may appear, the entry may look like

This, and any of the identifiers, is primarily convention, or situation-specific de-facto standard, and not down to any sing standard.

For /etc/shadow, the identifiers include:

  • nothing: crypt() (one of various implementations, historically usually based on DES)
  • _
    : bsdi_crypt
  • $2$
    - bcrypt
  • $2a$
    - bcrypt that would accept unicode as UTF8
  • $2x$
    - BSD-only(verify) marker used to note hashes from a patched crypt_blowfish (fixing a bug in this PHP implementation of bcrypt)[5]
  • $2b$
    - bcrypt (marking a fix to the BSD implementation of bcrypt)
  • $3$
    : bsd_nthash (only freeBSD)
  • $md5$
    : MD5
  • %sha1$
    : SHA1
  • $5$
    : SHA256
  • $6$
    : SHA512

Entries take either the form




The above is the relatively basic set used by OSes. See also:

Note that any use of the crypt() function is not a specific implementation, so is not guaranteed to work cross-system (use a specific marked algorithm if you want that). Historically it was often DES, currently it's possibly MD5.

Application using passlib may also be using

- Apache's modified MD5
, $argon2d$</nowiki>}} argon2
pbkdf2_sha1, Passlib-specific
cta_pbkdf2_sha1 OR dlitz_pbkdf2_sha1

(list from here)

Apparently in .htpasswd files have a similar style, but don't assume you get more than:

  • nothing: crypt
  • $2y$
    - bcrypt
  • {SHA}
    - SHA1
  • $apr1$
    - apache's modified MD5 (to ease system independence?(verify)) (also used by BSD?(verify))

See also:

Some hash(/Digester/Checksummer) methods

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, somay have incorrect bits. (Feel free to ignore, or tell me)

(this is likely to contain errors, and given to be incomplete. Sorted vaguely by how common they are.)


  • DES
    • Basic DES: (13 characters): 2 salt, 11 hash value. (56-bit DES)
    • Extended DES: (20 characters): 9 character salt, 11 hash value
  • MD family:
    • MD5 [6] (Salt size is variable(verify), commonly 8 or 12 characters(verify))
      • MD5
      • MD5 (Unix)
      • MD5 (APR), e.g. in "$apr1$salt$hash"
    • MD2 [7]
    • MD4 [8]
    • MD3
    • MD6

  • SHA [9]
    • SHA-0 (160-bit) - was never really adopted, because of a spec flaw that led to SHA-1
    • SHA-1 (160-bit) [10]
      • currently the most common variant
      • collisions can be found faster than is ideal. While not really an attack, it suggests that a stronger hash is a good idea
    • SHA-2 [11]
      • SHA-244 / SHA-256
      • SHA-384 / SHA-512 (faster on 64-bit)
    • SHA-3 [12]

  • MySQL PASSWORD() - a few different versions exist
    • MySQL before 4.1 generate a 128-bit value (what sort of hash?(verify))
    • 4.1.0 used a preliminary version of the below
    • 4.1.1 and later generate a 160-bit value, shown in hex with a prepended * (41 chars)
      • Added the old function under OLD_PASSWORD()
      • Apparently this is SHA1(SHA1(password)) (implementation note: the outer SHA1 reads the inner one's result in byte form)

  • NT hash (also less accurately known as NTLM hash)
    • 128 bits
    • MD4 of the little-endian UTF-16 version of the password

  • AES (Rijndael) [13]
  • Haval: [15]
    • Haval-128 (most common variant?(verify))
    • Haval-160
    • Haval-192
    • Haval-224
    • Haval-256
  • RipeMD [16] (Cryptographic)
    • RIPEMD-160
    • RIPEMD-128 (most common variant?(verify))
    • RIPEMD-256
    • RIPEMD-320
  • Tiger [18]
    • Tiger-128
    • Tiger-160
    • Tiger-192
  • Whirlpool [19] (Cryptographic)
  • Snefru [23]
    • Snefru (128-bit)
    • Snefru-256 (256-bit)

  • AICH [24]
    • 'Advanced Intelligent Corruption Handling', basically a hash tree setup
  • FNV (Fowler–Noll–Vo) [25]
    • FNV-0 (deprecated)
    • FNV-1
    • FNV-1a
sizes between 32 and 1024 bits, but typically 32 or 64?(verify)
  • Jenkins [26]
    • Jenkins One-At-A-Time (JOAAT)
    • lookup2
    • lookup3
    • SpookyHash


  • Bitzi bitprint - a concatentation of SHA-1 of the data (20 bytes) and the TTH of the data (24 bytes), usually in Base32], which makes for a 32-character string.

Non-cryptographic hashers

  • CRC ('Cyclic Redundancy Check')
    • Primarily used as a checksum function/value
    • Note that specific implementations can vary in with:
      • (most obviously) size, often 32 or 16, sometimes 8 or 64
      • choice of polynomial (note that if reported as a hex value rather than a polynomial's powers, it may be bit-reversed, so in a way there are two such values for each poly)
      • initial value (often all bits on)
      • when/how bits are shifted (verify)
      • possible other details, such as a bitmask in each step, or general details of bit ordering (verify)
      • such, checking a CRC from something you didn't design yourself, you often have to figure out its specific implementation.
  • Murmur, Murmur2, Murmur3
  • xxHash
Built for speed

Password hashers

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, somay have incorrect bits. (Feel free to ignore, or tell me)
made for passwords
parameters for required CPUtime
  • PBKDF2 [28] (Password-Based Key Derivation Function)
made for passwords
potentially a little more sensitive to time–memory trade-off than e.g. Argon2, scrypt(verify)
made for passwords
parameters for required CPUtime, memory requirements, and parallelism
also makes GPU attacks much less efficient
made for passwords

Block ciphers / cryptosystems

  • Blowfish
    • ($2$ or $2a$): 16 character salt
    • ($2a$07$): see IPSEC note below
  • Cast
    • Cast 128 (a.k.a. Cast5) [31]
    • Cast 256 (a.k.a. Cast6) [32]
  • RSA (Rivest, Shamir, Adleman) [35]

(a.k.a. rDSA(verify))

Identifiers may include IPSEC's ESP Transform Identifiers (see RFC 2407), and others. said transform identifiers include:

  • 1: DES_IV64
  • 2: DES
  • 3: 3DES
  • 4: RC5
  • 5: IDEA
  • 6: CAST
  • 8: 3IDEA
  • 9: DES_IV32
  • 10: RC4
  • 11: NULL

See also

Some notes on CRC

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, somay have incorrect bits. (Feel free to ignore, or tell me)

Cyclic Redundancy Check was an early hash function, aimed mainly at content verification

Aside from different poly values, there are more variants out there than you might expect, in that they end up with a different value.

Reasons include:

  • Initial values. Often -1 (i.e. 0xffff, 0xffffffff, etc.), sometimes 0, sometimes something else
  • Some code is endian-specific (through some bitwise operations)
  • some may use a bit-reversed version of an otherwise identical poly value.
As such, the poly factors are more descriptive than the poly value(s) (the list below tries to mention both)
  • There is sometimes a final XOR with a specific value (verify)

So while you can cover the bulk with a few description, an exhaustive list is harder.

CRC32 (CCITT32, Ethernet, PKZip, FDDI, ITU-T, OSI/IEC, Autodin-II, others)

  • Polynomial: (0,1,2,4,5,7,8,10,11,12,16,22,23,26(,32)) (0xedb88320 / 0x04c11db7)
  • Initial Value: 0xffffffff

"CRC-16" (this variant is the one often meant when a non-specific CRC16 is mentioned)

  • Polynomial: x^16 + x^15 + x^2 + 1 (0xA001 / 0x8005)
  • Initial value: 0xffff
  • Also:

CRC16 CCITT, X-25, XModem, Kermit use the same poly but vary in details

  • Polynomial: x^16 + x^12 + x^5 + 1 (0x8408 / 0x1021)
  • Initial value: 0xffff (but there are variants with 0x0000 and 0x1d0f (verify))
  • output XOR mask:
CCITT: 0xffff except for the 1D0F variant
most others: 0x0000
  • Kermit bit-reverses each byte before use, and the result afterwards(verify), and swaps the two result bytes (verify)

CRC-16 IBM seems another variation on the above (apparently also used in bisync, usb, modbus?)

CRC-16 Polynomial: x^16 + x^12 + x^5 + 1 (0x1021)

  • Used by
    • BinHex (init val 0)? (verify)
    • Packit (init val 0)? (verify)


  • Poly: x^16 + x^13 + x^12 + x^11 + x^10 + x^8 + x^6 + x^5 + x^2 + 1 (0xA6BC )
  • Initial value 0

CRC8 (Dallas/Maxim iButton) Polynomial: x^8 + x^5 + x^4 + 1 (0x8C)

Unsorted: CRC-16 SICK seems a faster but weaker variation (verify)

See also:

Checksum files