Difference between revisions of "Security notes / Hashing notes"

'''[https://en.wikipedia.org/wiki/Data_Encryption_Standard DES]''' is broken not even because of cleverness, but because its keyspace is small enough (56 bits) that thirty years after its design, a fully exhaustive search became feasible in a small amount of time with what is now only a moderate amount of CPU/power.

But yeah, with the DES flaws that are now known, that search is a little shorter yet. [https://www.schneier.com/blog/archives/2004/10/the_legacy_of_d.html There's interesting history to this one.]

'''"But doesn't that mean MD5 and SHA1 should no longer be used for passwords?"'''

Yes and no.

Neither should ''ever'' have been used for passwords at all, regardless of whether it's broken, because these two have a much more important property: they are very quick to calculate.

By using very fast hashes, you have ''knowingly'' chosen that brute force attacks are much faster than they need to be.

 
'''A fast hash function is always a bad password hash function ''no matter what other properties that function has''.'''


At a practical level, it only matters a ''moderate'' amount.

Finding passwords based on stolen hashes is a preimage attack, and preimage attacks are still unbroken for MD5 and SHA1 (as of 2020). Exhaustive search is largely infeasible anyway.

If the number of seconds you need for ''fully'' exhaustive search has 40 zeroes on the end, then even a hash that is a million times faster than it needs to be just shaves six of those zeroes off -- and 34 zeroes' worth of seconds is still more time than anyone has.

Again, passwords are a bad example here. This does mean ''strong'' passwords are not broken within reasonable time -- but most passwords are not strong passwords.

verifying your downloads (assuming of course you can transfer the hashes safely).

And a fast hash, like either of these, is then mildly ''preferable'', just for speed.

In part because the explanation involves a lot of "it depends on what you're doing".

You can see people's eyes glaze over while you explain the intricacies, or you can just recommend something else instead. After at most a dozen times of trying the former, you do the latter.

...though you shouldn't recommend something just because it's newer -- something fresh is also something less scrutinized. Keep in mind it took ~15 years from MD5's inception before that attack was known.

Not equally so. It's useful to put some numbers to it:

* DES would take 2<sup>56</sup> in theory
: collision attacks can run in ~2<sup>40</sup>, or ~2<sup>20</sup> if you accept a limited success rate
: pre-image attacks: {{verify}}

* MD5 would take 2<sup>128</sup> in theory
: collision attacks: 2<sup>30</sup> ...ish [https://en.wikipedia.org/wiki/MD5#Security]
: pre-image attacks: 2<sup>123</sup> [https://link.springer.com/chapter/10.1007%2F978-3-642-01001-9_8]

* MD4 would take 2<sup>128</sup> in theory
: collision attacks: somewhat around 1991, completely around 2004
: pre-image attacks: 2<sup>102</sup> {{search|G Leurent 2008 MD4 is Not One-Way|in 2008}}, 2<sup>99.7</sup> {{search|Advanced Meet-in-the-Middle Preimage Attacks: First Results on Full Tiger, and Improved Results on MD4 and SHA-2|in 2010}}
These bit figures are actually very approximate, because attacks are usually specific, and may not apply at all to any one specific use.

Still, up to 2<sup>40</sup> is often moderately feasible for someone willing to invest time/money,

Latest revision as of 16:32, 19 April 2022

Security related stuff.

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Hash function

A hash function takes an arbitrary amount of (usually byte) data, and gives a mangled output of shorter, typically fixed size.


The result value is often referred to as the hash (of that data), a hash value, or such.



Different uses of hashing/hashes rely on different properties.


Hashes have a whole bunch of uses, many of which are not particularly related to security.

And a bunch that are about security - which is where you have longer discussions about the details.


Properties you usually care about include:

  • given the same input, a hash function always generates the same output (hash) value
you pretty much always want that
  • given almost-equal input, the hash value is quite different
due largely to the Avalanche effect
not always necessary, but useful in many uses, and fairly easy to ensure
  • a hash is (typically much) smaller than the data it was calculated for
most hash values are fixed-size, typically somewhere between 4 and 20 bytes; the data they are calculated over is often kilobytes to gigabytes
  • the original is not easily recognizable in the hash value.
Sometimes not a priority, but usually is. Also...
  • one-way function: it mangles in a way that is not easily reversed
finding the original is often a brute force search, so how hard it is to find depends mainly on search space / hash size, and CPU power
(but there are some practical attacks that don't have much to do with the hash itself. If privacy matters, basically get someone to do their best to break your system before you make it public)
  • a hash function will try to employ all the bits of the hash value equally, preferably regardless of the input data
(roughly to ensure that the amount of values the hash value could take actually reflects how many possibilities actually do get used)
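The first two properties are easy to see with any cryptographic hash, e.g. SHA-256 from Python's standard hashlib:

```python
import hashlib

# Same input always gives the same hash value
a = hashlib.sha256(b"hello world").hexdigest()
b = hashlib.sha256(b"hello world").hexdigest()
assert a == b

# Almost-equal input gives a very different hash (avalanche effect)
c = hashlib.sha256(b"hello worle").hexdigest()  # one character changed
print(a)
print(c)

# Both outputs are the same fixed size (32 bytes / 64 hex digits)...
assert len(a) == len(c) == 64

# ...but differ in roughly half of their 256 bits
diff = bin(int(a, 16) ^ int(c, 16)).count("1")
print(diff, "of 256 bits differ")
```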



Beyond that, requirements may start to vary. Consider for example:

  • very large keyspace
more useful in cryptography, because it helps make collision-like attacks much less feasible to do
many other applications don't really care.
  • fast - when supporting a low-latency use like network packet integrity, the faster you can say "very likely fine", the faster communication goes.
while for file integrity checks you wouldn't care about speed as much
  • slow - the slower it is to calculate a single hash, the harder it is to brute-force.
e.g. very useful for password checks
also for certain types of signing

Simpler-requirement uses

Check that data made it through a trusted channel okay: see Message integrity


Check that we're talking about the same data: see Message digest


In a bunch of data structures, you care only to generate a short number relatively unique from other such numbers, but always the same for the same original value.

Examples include:

  • keys in hash maps. The main use is in bucketing, so all that matters here is hashing the same data always yields the same hash - the hash function can actually be relatively stupid (and sometimes is).
  • speeding up bulks of (equality) comparisons of complex objects.
If you calculate a hash value representative for each object, you can use those values to quickly decide on inequality.
Equal hash values do not mean equal values, but the statistics of collisions mean you can significantly narrow things down before you start more complex/expensive comparisons.
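Both uses can be sketched in a few lines of Python (the bucket count and the deliberately naive hash are made up for illustration):

```python
# Bucketing: a hash maps any key to one of a fixed number of buckets.
def bucket_for(key, nbuckets=8):
    # A deliberately simple hash: sum of byte values. Stupid, but
    # deterministic, which is all that bucketing strictly requires.
    return sum(key.encode()) % nbuckets

print(bucket_for("alice"), bucket_for("bob"))

# Equality pre-check: compare cheap hash values before doing an
# expensive full comparison.
big_a = "x" * 1_000_000
big_b = "x" * 999_999 + "y"

# Unequal hashes prove inequality; equal hashes still require
# a full comparison, because collisions are possible.
if hash(big_a) != hash(big_b):
    print("definitely different, no full comparison needed")
elif big_a == big_b:
    print("equal")
```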

Uses with stronger requirements

Message authenticity

See Message authenticity


Password checking

The very simplest password system stores a password as-is.

This is obviously a bad idea for longer-term security, because if someone gets (local or remote) access to the computer that stores them, they could simply read off all passwords with zero effort.


Instead, you can store the hash of the password.

The basic reason to store password hashes instead of passwords is that stealing the hashes means an attacker has a lot more work to do.

Exactly how much guesswork depends a lot on setup - and on the passwords themselves.


This is basically data verification, in that matching passwords means matching password hashes.

However, it matters that the input is a very small bit of data.

Say, there are only so many six-letter passwords - so few that you could basically make a big list. The hash doesn't reveal the password directly -- but the shorter the password, the more likely it is that only one of all possible passwords of that size (or shorter) will match.

The amount of computation you need to do is bounded by password length.

This is the basic reason longer passwords are better. And also that very predictable passwords are bad even if longer (an attacker would start with them).

When implementing security like this, there is a checklist of things frequently done wrong, that you can avoid.


On reversibility and salt

TODO
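The core idea can be sketched: a fresh random salt per password means identical passwords hash differently, and precomputed (rainbow) tables become useless. A minimal sketch in Python -- plain SHA-256 stands in here for a proper slow password hash; real systems should use something like bcrypt or PBKDF2, as the "On speed" section discusses:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    # A fresh random salt per password means identical passwords
    # get different hashes, and defeats precomputed tables.
    if salt is None:
        salt = os.urandom(16)  # 16 bytes is an arbitrary but common choice
    digest = hashlib.sha256(salt + password.encode()).digest()
    return salt, digest

def check_password(password, salt, expected):
    candidate = hashlib.sha256(salt + password.encode()).digest()
    # compare_digest avoids timing side channels in the comparison
    return hmac.compare_digest(candidate, expected)

salt, stored = hash_password("hunter2")
print(check_password("hunter2", salt, stored))  # correct password
print(check_password("hunter3", salt, stored))  # wrong password
```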


On speed

General purpose hash functions (MD5, all of the SHA family, and a dozen others) are not the best for passwords.


Not because they're broken - they may tick all the nice-property boxes - but because of a more practical issue: they are fast.

The reason you would like a password hasher to be slow is, well, more of a stupidly pragmatic one than anything else:

A fast hash algorithm may allow millions of checks per second (microseconds per password).

If you use bcrypt, and configure it so that it takes, say, ~100ms to check a password on current computers, people logging in won't even notice, while even an offline brute force attack is limited to the order of dozens of checks per second per PC, rather than potentially millions for MD5/SHA1. (bcrypt is also tweakably slow - ten years on you can make new passwords slower, to counteract computers having become faster. Roughly by doing more loops of mangling - in a way that ideally can't be reduced much with cryptanalysis.)

Other interesting choices for password hashing (for these and other reasons) include Argon2, scrypt, and PBKDF2.
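As an illustration, PBKDF2 is available in Python's standard library; the iteration count is the tunable work factor (200,000 here is an arbitrary illustrative value -- tune it so a single check takes tens of milliseconds on your own hardware):

```python
import hashlib
import os

password = b"correct horse battery staple"
salt = os.urandom(16)

# The iteration count is the knob: more iterations, slower hash,
# slower brute force. Raise it as hardware gets faster.
dk = hashlib.pbkdf2_hmac("sha256", password, salt, 200_000)
print(dk.hex())

# One plain SHA-256 call is, by contrast, a single iteration --
# which is exactly why it is a poor password hash on its own.
```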



The 'password too short' nagging comes from very similar origins: even if it's slow, there are only so many passwords possible.

Assuming ~50 choices of character, there are ~2,500 two-character passwords, ~15 billion six-character ones, and ~40 trillion eight-character ones.

That sounds like a lot, but if you can indeed check millions per second, you can exhaust 6-character passwords in hours - while 8-character still takes on the order of years.

(There are many footnotes to that. Not least of which being that most people don't use random passwords, and that this is perfectly divisible work, so if you have a huge interest in a specific password, you can hire a server park and divide those CPU-years by however many CPUs you have. This is why you add a few more characters.)
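Putting rough numbers on that (the ~50-character alphabet is the assumption from the text; the checking rate is illustrative):

```python
alphabet = 50
rate = 1_000_000  # checks per second, plausible for a fast hash like MD5

for length in (2, 6, 8):
    keyspace = alphabet ** length
    seconds = keyspace / rate
    print(f"{length}-char: {keyspace:.2e} passwords, "
          f"~{seconds / 86400:.1f} days at {rate:,}/s")

# A slow hash (say, 100 ms per check = 10 checks/s) multiplies all of
# these times by 100,000 -- that is the entire point of bcrypt and friends.
```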



bcrypt notes

Hashes with nonces

What "Hash function X is broken" means

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The only thing it always means is that "something takes less time than we designed it to".

How much less, and which attacks (if any) it makes more feasible (and by how much), still varies wildly.


Collision attacks, pre-image attacks

More on collision attacks
More on pre-image attacks

Some real-world examples, with some real-world numbers

More technicalities

"cryptographic hash function"

Perfect hash function

Perfect hash functions are collisionless, by definition, because they map each input to an unique output.

Minimal perfect hash functions map to a contiguous integer set.


This isn't a purely academic 'why not'.

Real-world uses include cases where a map is generated once and then only queried, whenever there is value in spending more time up front when it means a reduction later.

Even if that reduction is slight, if you're, say, compiling some real-time kernel stuff, a few seconds of extra compile time is easily worth it.


For data indexing, this is sometimes worth thinking about.

But yeah, it is not relevant to many everyday uses. For example, while low collision in hashmaps is a good property to have, it's usually quite cheap to lower collisions a lot, but expensive to get them to the absolutely-provably-minimal level (particularly when you have to regenerate that hash on every resize/rebalance).

Locality sensitive hashing

Locality-sensitive hashing (LSH) attempts to sort similar items into the same bucket, often with a relatively low number of buckets.


Note that this intentionally maximizes collisions (unlike almost all other hash use), and makes those collisions meaningful.


Note that this is the opposite of what you want in e.g. a general-purpose hashmap, but may be a specific part of things like neighbour search, clustering, and some types of similarity and fingerprinting.


Note that this bears similarities to dimensionality reduction.


https://en.wikipedia.org/wiki/Locality-sensitive_hashing
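One common flavor is random-hyperplane hashing: each hyperplane contributes one signature bit (which side of it the vector falls on), so vectors pointing in similar directions tend to share bits. A toy sketch in Python -- the dimensions, plane count, and vectors are all made up for illustration:

```python
import random

def lsh_signature(vec, planes):
    # One bit per hyperplane: which side of the plane the vector is on.
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0)
                 for plane in planes)

def hamming(x, y):
    return sum(p != q for p, q in zip(x, y))

random.seed(42)  # fixed seed so the example is repeatable
dims, nbits = 8, 16
planes = [[random.gauss(0, 1) for _ in range(dims)] for _ in range(nbits)]

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [1.1, 2.0, 3.1, 4.0, 5.1, 6.0, 7.1, 8.0]      # nearly the same as a
c = [-5.0, 3.0, -2.0, 8.0, -1.0, 0.5, -7.0, 2.0]  # unrelated to a

sig_a, sig_b, sig_c = (lsh_signature(v, planes) for v in (a, b, c))
# similar vectors end up with (near-)identical signatures,
# unrelated ones differ in roughly half their bits
print(hamming(sig_a, sig_b), "vs", hamming(sig_a, sig_c))
```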

Semantic hashing

Semantic hashing tries to generate hashes where the (Hamming) distance between codes reflects the semantic similarity of the inputs.


Spectral hashing

hash types, significators, values you see in files, etc.

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Note that you can base a hash function on a block cipher, and various of them are built that way.


In places where more than one type of hash may appear, the entry may look like:
$identifier$<salt>$<hash>

This, and any of the identifiers, is primarily convention, or a situation-specific de-facto standard, and not down to any single standard.


For /etc/shadow, the identifiers include:

  • nothing: crypt() (one of various implementations, historically usually based on DES)
  • _ - bsdi_crypt
  • $2$ - bcrypt
  • $2a$ - bcrypt that would accept unicode as UTF8
  • $2x$, $2y$ - BSD-only(verify) marker used to note hashes from a patched crypt_blowfish (fixing a bug in this PHP implementation of bcrypt)[1]
  • $2b$ - bcrypt (marking a fix to the BSD implementation of bcrypt)
  • $3$ - bsd_nthash (FreeBSD only)
  • $md5$, {md5, - MD5
  • $sha1$ - SHA1
  • $5$ - SHA256
  • $6$ - SHA512


Entries take either the form

$id$hash

or

$id$salt$saltedhash
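Splitting such an entry is plain string handling. A rough sketch in Python (the field names are my own, and note that many schemes embed extra parameters -- e.g. bcrypt's cost factor, or rounds=... -- that this does not handle):

```python
def parse_crypt_entry(entry):
    # Entries look like $id$salt$hash or $id$hash; legacy DES crypt
    # entries have no $-prefixed identifier at all.
    if not entry.startswith("$"):
        return {"id": "crypt-des", "salt": entry[:2], "hash": entry[2:]}
    parts = entry.split("$")[1:]  # drop the empty string before the first $
    if len(parts) == 3:
        return {"id": parts[0], "salt": parts[1], "hash": parts[2]}
    return {"id": parts[0], "salt": None, "hash": parts[-1]}

# "mysalt" and the hash body are placeholders, not a real SHA512-crypt value
example = "$6$mysalt$aVeryLongSha512CryptHashWouldGoHere"
print(parse_crypt_entry(example))
```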


The above is the relatively basic set used by OSes. See also:

https://passlib.readthedocs.io/en/stable/modular_crypt_format.html#os-defined-hashes
http://passlib.readthedocs.io/en/stable/lib/passlib.hash.html

Note that crypt() does not name a specific algorithm, so its output is not guaranteed to be portable across systems (use a specifically marked algorithm if you want that). Historically it was often DES-based; currently it is possibly MD5.


Applications using passlib may also be using:

  • $apr1$ - Apache's modified MD5
  • $argon2i$, $argon2d$ - argon2
  • $bcrypt-sha256$ - Passlib-specific
  • $P$, $H$ - PHPass-based
  • $pbkdf2$ - pbkdf2_sha1, Passlib-specific
  • $pbkdf2-sha256$ - Passlib-specific
  • $pbkdf2-sha512$ - Passlib-specific
  • $scram$ - Passlib-specific
  • $p5k2$ - cta_pbkdf2_sha1 or dlitz_pbkdf2_sha1
  • $scrypt$ - Passlib-specific

(list from here)


Apparently .htpasswd files have a similar style, but don't assume you get more than:

  • nothing: crypt
  • $2y$ - bcrypt
  • {SHA} - SHA1
  • $apr1$ - apache's modified MD5 (to ease system independence?(verify)) (also used by BSD?(verify))

See also: https://httpd.apache.org/docs/2.4/misc/password_encryptions.html

Some hash(/Digester/Checksummer) methods

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(this is likely to contain errors, and given to be incomplete. Sorted vaguely by how common they are.)

Unsorted

  • DES
    • Basic DES: (13 characters): 2 salt, 11 hash value. (56-bit DES)
    • Extended DES: (20 characters): 9 character salt, 11 hash value
  • MD family:
    • MD5 [2] (Salt size is variable(verify), commonly 8 or 12 characters(verify))
      • MD5
      • MD5 (Unix)
      • MD5 (APR), e.g. in "$apr1$salt$hash"
    • MD2 [3]
    • MD4 [4]
    • MD3
    • MD6


  • SHA [5]
    • SHA-0 (160-bit) - was never really adopted, because of a spec flaw that led to SHA-1
    • SHA-1 (160-bit) [6]
      • currently the most common variant
      • collisions can be found faster than is ideal. While not really an attack, it suggests that a stronger hash is a good idea
    • SHA-2 [7]
      • SHA-224 / SHA-256
      • SHA-384 / SHA-512 (faster on 64-bit)
    • SHA-3 [8]


  • MySQL PASSWORD() - a few different versions exist
    • MySQL before 4.1 generated a 128-bit value (what sort of hash?(verify))
    • 4.1.0 used a preliminary version of the below
    • 4.1.1 and later generate a 160-bit value, shown in hex with a prepended * (41 chars)
      • Added the old function under OLD_PASSWORD()
      • Apparently this is SHA1(SHA1(password)) (implementation note: the outer SHA1 reads the inner one's result in byte form)
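That double-SHA1 construction is easy to reproduce; a sketch in Python (note the outer SHA1 is fed the inner SHA1's raw bytes, not its hex string):

```python
import hashlib

def mysql41_password(password):
    # inner SHA1 as raw 20 bytes, outer SHA1 shown as uppercase hex
    inner = hashlib.sha1(password.encode()).digest()
    outer = hashlib.sha1(inner).hexdigest().upper()
    # the leading '*' marks the 4.1+ format: 41 characters total
    return "*" + outer

print(mysql41_password("password"))
```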


  • NT hash (also less accurately known as NTLM hash)
    • 128 bits
    • MD4 of the little-endian UTF-16 version of the password


  • AES (Rijndael) [9]
  • Haval: [11]
    • Haval-128 (most common variant?(verify))
    • Haval-160
    • Haval-192
    • Haval-224
    • Haval-256
  • RipeMD [12] (Cryptographic)
    • RIPEMD-160
    • RIPEMD-128 (most common variant?(verify))
    • RIPEMD-256
    • RIPEMD-320
  • Tiger [14]
    • Tiger-128
    • Tiger-160
    • Tiger-192
  • Whirlpool [15] (Cryptographic)
  • Snefru [19]
    • Snefru (128-bit)
    • Snefru-256 (256-bit)


  • AICH [20]
    • 'Advanced Intelligent Corruption Handling', basically a hash tree setup
  • FNV (Fowler–Noll–Vo) [21]
    • FNV-0 (deprecated)
    • FNV-1
    • FNV-1a
sizes between 32 and 1024 bits, but typically 32 or 64?(verify)
  • Jenkins [22]
    • Jenkins One-At-A-Time (JOAAT)
    • lookup2
    • lookup3
    • SpookyHash


Also:

  • Bitzi bitprint - a concatenation of the SHA-1 of the data (20 bytes, which is a 32-character string in Base32) and the TTH of the data (24 bytes), usually in Base32.


Non-cryptographic hashers

  • CRC ('Cyclic Redundancy Check')
    • Primarily used as a checksum function/value
    • Note that specific implementations can vary in:
      • (most obviously) size, often 32 or 16, sometimes 8 or 64
      • choice of polynomial (note that if reported as a hex value rather than a polynomial's powers, it may be bit-reversed, so in a way there are two such values for each poly)
      • initial value (often all bits on)
      • when/how bits are shifted (verify)
      • possible other details, such as a bitmask in each step, or general details of bit ordering (verify)
      • ...as such, checking a CRC from something you didn't design yourself, you often have to figure out its specific implementation.
  • Murmur, Murmur2, Murmur3
http://code.google.com/p/smhasher/
  • xxHash
Built for speed


http://floodyberry.com/noncryptohashzoo/


Password hashers

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
  • bcrypt
made for passwords
parameters for required CPU time
  • PBKDF2 [24] (Password-Based Key Derivation Function)
made for passwords
potentially a little more sensitive to time–memory trade-off than e.g. Argon2, scrypt(verify)
  • Argon2
made for passwords
parameters for required CPU time, memory requirements, and parallelism
also makes GPU attacks much less efficient
  • scrypt
made for passwords



Block ciphers / cryptosystems

  • Blowfish
    • ($2$ or $2a$): 16 character salt
    • ($2a$07$): see IPSEC note below
  • Cast
    • Cast 128 (a.k.a. Cast5) [27]
    • Cast 256 (a.k.a. Cast6) [28]
  • RSA (Rivest, Shamir, Adleman) [31] (a.k.a. rDSA(verify))



Identifiers may include IPSEC's ESP Transform Identifiers (see RFC 2407), and others. Said transform identifiers include:

  • 1: DES_IV64
  • 2: DES
  • 3: 3DES
  • 4: RC5
  • 5: IDEA
  • 6: CAST
  • 7: BLOWFISH
  • 8: 3IDEA
  • 9: DES_IV32
  • 10: RC4
  • 11: NULL

See also

Some notes on CRC

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Cyclic Redundancy Check was an early type of hash function, aimed mainly at content verification.


Aside from different poly values, there are more variants out there than you might expect, in that they end up with a different value.

Reasons include:

  • Initial values. Often -1 (i.e. 0xffff, 0xffffffff, etc.), sometimes 0, sometimes something else
  • Some code is endian-specific (through some bitwise operations)
  • some may use a bit-reversed version of an otherwise identical poly value.
As such, the poly factors are more descriptive than the poly value(s) (the list below tries to mention both)
  • There is sometimes a final XOR with a specific value (verify)

So while you can cover the bulk with a few descriptions, an exhaustive list is harder.


CRC32 (CCITT32, Ethernet, PKZip, FDDI, ITU-T, OSI/IEC, Autodin-II, others)

  • Polynomial: (0,1,2,4,5,7,8,10,11,12,16,22,23,26(,32)) (0xedb88320 / 0x04c11db7)
  • Initial Value: 0xffffffff
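As it happens, this is the CRC-32 that Python's zlib.crc32 computes, which makes it easy to check the parameters above against a known value:

```python
import zlib

# The conventional check string for CRC algorithms is "123456789";
# for this CRC-32 the expected result is 0xCBF43926.
print(hex(zlib.crc32(b"123456789")))

# CRCs can also be computed incrementally, passing the running
# value back in -- handy for streamed data.
crc = zlib.crc32(b"12345")
crc = zlib.crc32(b"6789", crc)
assert crc == zlib.crc32(b"123456789")
```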


"CRC-16" (this variant is the one often meant when a non-specific CRC16 is mentioned)

  • Polynomial: x^16 + x^15 + x^2 + 1 (0xA001 / 0x8005)
  • Initial value: 0xffff


CRC16 CCITT, X-25, XModem, Kermit use the same poly but vary in details

  • Polynomial: x^16 + x^12 + x^5 + 1 (0x8408 / 0x1021)
  • Initial value: 0xffff (but there are variants with 0x0000 and 0x1d0f (verify))
  • output XOR mask:
CCITT: 0xffff except for the 1D0F variant
most others: 0x0000
  • Kermit bit-reverses each byte before use, and the result afterwards(verify), and swaps the two result bytes (verify)
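To make the init-value point concrete, here is a minimal MSB-first, bit-at-a-time implementation of the 0x1021 polynomial. With init 0xFFFF it matches what is often called CCITT-FALSE; with init 0x0000, the XModem variant (the check values for the conventional test string "123456789" are 0x29B1 and 0x31C3 respectively):

```python
def crc16_1021(data, init=0xFFFF):
    # Straightforward, slow, bit-at-a-time CRC; real implementations
    # usually use a 256-entry lookup table instead.
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

check = b"123456789"
print(hex(crc16_1021(check)))          # init 0xFFFF: "CCITT-FALSE"
print(hex(crc16_1021(check, 0x0000)))  # init 0x0000: XModem
```

Note that the Kermit variant described above needs the extra bit-reversal steps on top of this, so it produces yet another value from the same polynomial.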

CRC-16 IBM seems another variation on the above (apparently also used in bisync, usb, modbus?)

CRC-16 Polynomial: x^16 + x^12 + x^5 + 1 (0x1021)

  • Used by
    • BinHex (init val 0)? (verify)
    • Packit (init val 0)? (verify)


CRC-16 DNP

  • Poly: x^16 + x^13 + x^12 + x^11 + x^10 + x^8 + x^6 + x^5 + x^2 + 1 (0xA6BC )
  • Initial value 0


CRC8 (Dallas/Maxim iButton) Polynomial: x^8 + x^5 + x^4 + 1 (0x8C)


Unsorted: CRC-16 SICK seems a faster but weaker variation (verify)




See also:

Checksum files