Computer / Speed notes


Low level

This article/section is a stub: probably a pile of half-sorted notes, probably a first version, not well-checked, so it may have incorrect bits. (Feel free to ignore, or tell me)

The following is not precise; it is only meant to give a sense of the orders of magnitude involved as you move further away from the CPU.

CPU cycle                                  0.5  ns     (at least 2GHz has been common for a while)
L1 cache reference                           1  ns     
simpler instructions                         1  ns     
branch mispredict                            5  ns    
function call                                5  ns   
floating point overhead                      7  ns    
L2 cache reference                           7  ns   
L3 cache reference                          15  ns   
NUMA shared-L3 cache hit                    30  ns   
floating point overhead                     20  ns   
L4 cache reference                          50  ns   
NUMA transfer overhead, more ideal          70  ns   
RAM access overhead (/ NUMA own memory)    100  ns   
context switch, more ideal                 100  ns   (verify)
memory allocation or free, more ideal      150  ns   (verify)
NUMA transfer overhead, less ideal         300  ns   
kernel call / syscall                    1,500  ns   (1.5us)
context switch, less ideal              10,000  ns   (verify)
1MB of PCIe transfer                   100,000  ns   (verify)
1MB of sequential RAM read             100,000  ns   (verify)
SSD read overhead                      150,000  ns   (verify), also varies with access method   
network RTT within LAN/datacenter      500,000  ns   (0.5 ms)  (a deliberately non-optimistic figure; it is often lower)
1MB SATA SSD read                    1,000,000  ns   (  1 ms)
1MB read over 1 Gbps network         8,000,000  ns   (  8 ms)  (the 8 ms is just the bandwidth division; real transfers add overhead on top)
platter seek overhead               10,000,000  ns   ( 10 ms)  (maybe half that for some pricier/server models)
broadband connection latency        10,000,000  ns   ( 10 ms)  (typical for DOCSIS and ADSL)
1MB sequential platter read         20,000,000  ns   ( 20 ms)
network RTT within country          25,000,000  ns   ( 25 ms)  (usually within 15ms..40ms, varying with infrastructure, country size)
network RTT between continents     150,000,000  ns   (150 ms)  (note that RTT is roughly twice the one-way latency, because Round-Trip)
congested network retransmit     2,000,000,000  ns   (  2 s)
tape access time                50,000,000,000  ns   ( 50 seconds, ballpark)
reboot                         300,000,000,000  ns   (  5 min)
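
The bandwidth-derived entries above (1MB over the network, sequential reads) are mostly just division. Below is a minimal sketch of that arithmetic in C; the bandwidth figures in it are rough round-number assumptions for illustration, so its results land in the same ballpark as the table rather than exactly on it.

 /* back-of-envelope transfer times for 1 MB at assumed bandwidths;
    the figures below are rough round numbers, not measurements */
 #include <stdio.h>
 
 int main(void) {
     const double megabyte = 1e6;   /* bytes; the table is loose about MB vs MiB anyway */
     const struct { const char *name; double bytes_per_sec; } media[] = {
         { "RAM, sequential (assume ~10 GB/s)",       10e9    },
         { "1 Gbps network",                          1e9 / 8 },
         { "SATA SSD, sequential (assume ~500 MB/s)", 500e6   },
         { "platter, sequential (assume ~100 MB/s)",  100e6   },
     };
     for (unsigned i = 0; i < sizeof media / sizeof media[0]; i++) {
         double seconds = megabyte / media[i].bytes_per_sec;
         printf("%-42s %12.0f ns  (%g ms)\n",
                media[i].name, seconds * 1e9, seconds * 1e3);
     }
     return 0;
 }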


Notes:

  • the lower-level ones have been relatively stable (within nanoseconds) over time
though the effective cost shifts with positive and negative effects like pipelining, vectorization, branches and their prediction, TLBs, "L0" caches, etc. (a rough branch-prediction demo is sketched below these bullets)
  • interconnects are still improving, and vary between architectures (and with needs); assume these figures can vary by a small factor either way
  • distant and shared media (network, disk) can vary wildly with actual use
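
A rough sketch of the branch-prediction point: the same loop over the same data, once with an unpredictable branch and once with a predictable one. It assumes a POSIX system, and that the compiler keeps the branch; the data size is an arbitrary choice.

 /* rough branch-prediction demo: sum the "big" values in an array,
    first over random (unpredictable) data, then over sorted (predictable) data.
    compile with something like: cc -O1 branch.c
    (higher -O levels may turn the branch into branch-free code and hide the effect) */
 #define _POSIX_C_SOURCE 199309L
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 
 #define N (1 << 22)
 
 static double now_sec(void) {
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return ts.tv_sec + ts.tv_nsec * 1e-9;
 }
 
 static int cmp_int(const void *a, const void *b) {
     return *(const int *)a - *(const int *)b;
 }
 
 static long sum_big(const int *v, int n) {
     long sum = 0;
     for (int i = 0; i < n; i++)
         if (v[i] >= 128)          /* this branch is the interesting part */
             sum += v[i];
     return sum;
 }
 
 int main(void) {
     int *v = malloc(N * sizeof *v);
     for (int i = 0; i < N; i++)
         v[i] = rand() % 256;
 
     double t0 = now_sec();
     long s1 = sum_big(v, N);      /* random data: the branch mispredicts often */
     double t1 = now_sec();
 
     qsort(v, N, sizeof *v, cmp_int);
     double t2 = now_sec();
     long s2 = sum_big(v, N);      /* sorted data: the branch is predictable */
     double t3 = now_sec();
 
     /* both sums should be identical; only the timing differs */
     printf("unsorted: %.1f ms (sum %ld)\n", (t1 - t0) * 1e3, s1);
     printf("sorted:   %.1f ms (sum %ld)\n", (t3 - t2) * 1e3, s2);
     free(v);
     return 0;
 }

On a typical desktop the unsorted pass is often a few times slower, though the exact factor depends on CPU and compiler.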


  • one implication is that in the real world, there may always be some reason for an overhead one or two orders of magnitude larger than you planned for
...that you can do nothing about
it is also a major reason throughputs are tricky (see the toy calculation below)
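
A toy calculation of how a rare slow path drags down average cost and throughput; the fractions and costs in it are made-up illustration values, not measurements.

 /* toy arithmetic: a rare slow path dominates the average.
    the fraction and costs below are made-up illustration values. */
 #include <stdio.h>
 
 int main(void) {
     double fast_ns       = 100.0;  /* e.g. a RAM access          */
     double slow_ns       = 10e6;   /* e.g. a platter seek, 10 ms */
     double slow_fraction = 0.01;   /* assume 1% of accesses hit the slow path */
 
     double avg_ns = (1.0 - slow_fraction) * fast_ns + slow_fraction * slow_ns;
     printf("average cost: %.0f ns  (~%.0fx the fast path)\n",
            avg_ns, avg_ns / fast_ns);
     printf("throughput:   %.0f ops/s instead of %.0f ops/s\n",
            1e9 / avg_ns, 1e9 / fast_ns);
     return 0;
 }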


  • writes are more complex than reads at every scale
in part because of caches
  • things that fit in caches can be much faster
there are various footnotes to that
  • Caches are purely hardware-managed, which is one reason assembly is less relevant these days - you don't have as much control as you once had.
Though you can still sometimes control the size (and layout) of what you access.
  • smaller code and fewer branches can matter a lot (and the two may be at odds)
if more of everything you need to access sits closer to the CPU, there is a lot less waiting
...for parts that work in a tight loop. If the work touches disparate, unpredictable parts, such micro-optimizations won't help much
  • A memory access being served from L1, from RAM, from swap on SSD, or from swap on platter costs orders of magnitude more at each step (see the access-pattern sketch below)
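
A rough sketch of the locality point: the same amount of pointer chasing over the same large array, once laid out sequentially and once as a random cycle. It assumes a POSIX system; the 128 MB array size and the use of Sattolo's shuffle are arbitrary choices for illustration.

 /* rough memory-locality demo: chase N pointers through a 128 MB array,
    once in sequential order and once in a shuffled order. the work is
    identical; only the access pattern changes.
    compile with e.g.: cc -O2 chase.c */
 #define _POSIX_C_SOURCE 199309L
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 
 #define N (1u << 24)   /* 16M entries x 8 bytes = 128 MB, much bigger than cache */
 
 static double now_sec(void) {
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return ts.tv_sec + ts.tv_nsec * 1e-9;
 }
 
 /* follow next[] for N hops; return the end index so the loop is not optimized away */
 static size_t chase(const size_t *next) {
     size_t i = 0;
     for (size_t hop = 0; hop < N; hop++)
         i = next[i];
     return i;
 }
 
 int main(void) {
     size_t *next = malloc(N * sizeof *next);
     if (!next) return 1;
 
     /* sequential layout: entry i points at i+1 (prefetcher-friendly) */
     for (size_t i = 0; i < N; i++)
         next[i] = (i + 1) % N;
     double t0 = now_sec();
     size_t end1 = chase(next);
     double t1 = now_sec();
 
     /* random layout: Sattolo's shuffle turns next[] into one big random cycle,
        so almost every hop is a cache miss served from RAM */
     for (size_t i = 0; i < N; i++)
         next[i] = i;
     for (size_t i = N - 1; i > 0; i--) {
         size_t j = (size_t)rand() % i;   /* j in [0, i-1] */
         size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
     }
     double t2 = now_sec();
     size_t end2 = chase(next);
     double t3 = now_sec();
 
     printf("sequential: %7.1f ms, ~%5.1f ns/hop (end %zu)\n",
            (t1 - t0) * 1e3, (t1 - t0) * 1e9 / N, end1);
     printf("shuffled:   %7.1f ms, ~%5.1f ns/hop (end %zu)\n",
            (t3 - t2) * 1e3, (t3 - t2) * 1e9 / N, end2);
     free(next);
     return 0;
 }

On a typical desktop the shuffled walk tends to cost on the order of a RAM access per hop, many times slower than the sequential walk, though the exact ratio varies with hardware.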


  • per-access overhead hides some size effects
for media with a large random-access overhead, that overhead may be most of the total time, with little effect from the size of the access
  • floating point operations aren't detailed above, as pipelining and vectorization can make factors of difference
but yes, e.g. division is slower than addition (a rough demo is sketched below these bullets)
  • GPUs, being chunks of vector processors, internally roughly follow local/NUMA transfer latencies(verify)
but from the CPU's point of view they are a coprocessor on a bus; transfers on and off the card are a separate and often significant cost
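
A rough sketch of the division-versus-addition point, using dependent chains so pipelining cannot hide the latency. It assumes a POSIX system; the iteration count and constants are arbitrary.

 /* rough demo: a dependent chain of double additions vs divisions.
    dependent operations expose instruction latency (pipelining cannot
    overlap them), which is where division loses badly.
    compile with e.g.: cc -O2 fp.c */
 #define _POSIX_C_SOURCE 199309L
 #include <stdio.h>
 #include <time.h>
 
 #define ITER 100000000   /* 1e8 iterations per chain */
 
 static double now_sec(void) {
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return ts.tv_sec + ts.tv_nsec * 1e-9;
 }
 
 int main(void) {
     volatile double seed = 1.000000001;  /* volatile: keep the compiler from pre-computing */
     double x, t0, t1, t2;
 
     x = seed;
     t0 = now_sec();
     for (long i = 0; i < ITER; i++)
         x = x + 1e-9;                /* each addition depends on the previous result */
     t1 = now_sec();
     double add_x = x;
 
     x = seed;
     for (long i = 0; i < ITER; i++)
         x = 1.0 / (x + 1e-9);        /* each division depends on the previous result */
     t2 = now_sec();
 
     printf("add chain: %.2f ns/op (x=%g)\n", (t1 - t0) * 1e9 / ITER, add_x);
     printf("div chain: %.2f ns/op (x=%g)\n", (t2 - t1) * 1e9 / ITER, x);
     return 0;
 }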




Parts taken from places like:

https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory
https://gist.github.com/understeer/4d8ea07c18752989f6989deeb769b778
https://gist.github.com/jboner/2841832
http://norvig.com/21-days.html#answers
https://twitter.com/rzezeski/status/398306728263315456/photo/1
http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/


https://computers-are-fast.github.io/