Computer / Speed notes
This article/section is a stub: probably a pile of half-sorted notes, probably a first version, not well checked, so it may have incorrect bits. (Feel free to ignore it, or to tell me.)
The following is not precise; it is only meant to give a feel for the orders of magnitude involved as you move further away from the center of things.
CPU cycle                                         0.5 ns   (at least 2 GHz has been common for a while)
L1 cache reference                                  1 ns
simpler instructions                                1 ns
branch mispredict                                   5 ns
function call                                       5 ns
floating point overhead                             7 ns
L2 cache reference                                  7 ns
L3 cache reference                                 15 ns
floating point overhead                            20 ns
NUMA shared-L3 cache hit                           30 ns
L4 cache reference                                 50 ns
NUMA transfer overhead, more ideal                 70 ns
RAM access overhead (/ NUMA own memory)           100 ns
context switch, more ideal                        100 ns   (verify)
memory allocation or free, more ideal             150 ns   (verify)
NUMA transfer overhead, less ideal                300 ns
kernel call / syscall                           1,500 ns   (1.5 µs)
context switch, less ideal                     10,000 ns   (verify)
1MB of PCIe transfer                          100,000 ns   (verify)
1MB of sequential RAM read                    100,000 ns   (verify)
SSD read overhead                             150,000 ns   (verify; also varies with access method)
network RTT within LAN/datacenter             500,000 ns   (0.5 ms) (not being overly optimistic, but may well be lower)
1MB SATA SSD read                           1,000,000 ns   (1 ms)
1MB read over 1 Gbps network                8,000,000 ns   (8 ms) (8 Mbit at 1 Gbps is 8 ms of raw transfer alone; real use adds overhead on top)
platter seek overhead                      10,000,000 ns   (10 ms) (maybe half that for some pricier/server models)
broadband connection latency               10,000,000 ns   (10 ms) (typical for DOCSIS and ADSL)
1MB sequential platter read                20,000,000 ns   (20 ms)
network RTT within country                 25,000,000 ns   (25 ms) (usually within 15..40 ms, varying with infrastructure and country size)
network RTT between continents            150,000,000 ns   (150 ms) (note that RTT is roughly twice the one-way latency, because Round Trip)
congested network retransmit            2,000,000,000 ns   (2 s)
tape access time                       50,000,000,000 ns   (50 s, ballpark)
reboot                                300,000,000,000 ns   (5 min)
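If you want to sanity-check a couple of these rows on your own machine, here is a minimal C sketch, assuming Linux and gcc (compile with gcc -O2). syscall(SYS_getpid) is used to force a real kernel entry, since a plain getpid() may be answered from userspace; the 64 MB buffer is just a size safely larger than most L3 caches. Only the order of magnitude is the point.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    static double now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void) {
        /* syscall overhead: average over many forced kernel entries */
        const int N = 100000;
        double t0 = now_ns();
        for (int i = 0; i < N; i++)
            syscall(SYS_getpid);
        double t1 = now_ns();
        printf("syscall overhead:    ~%.0f ns\n", (t1 - t0) / N);

        /* sequential RAM read: stream through a buffer much larger than L3,
           then report the time per 1MB */
        size_t mb = 64, sz = mb << 20;
        unsigned long long *buf = malloc(sz);
        memset(buf, 1, sz);                     /* touch pages up front */
        unsigned long long sum = 0;
        t0 = now_ns();
        for (size_t i = 0; i < sz / sizeof *buf; i++)
            sum += buf[i];
        t1 = now_ns();
        printf("1MB sequential read: ~%.0f ns (sum=%llu)\n", (t1 - t0) / mb, sum);
        free(buf);
        return 0;
    }

The printed numbers should land in the same ballpark as the corresponding rows above; printing the sum is there to keep the compiler from optimizing the read loop away.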
Notes:
- the lower-level ones have been relatively stable (within nanoseconds) over time
- though some of that stability is the net result of positive and negative effects like pipelining, vectorization, branch prediction, TLBs, "L0" caches, etc.
- interconnects are still improving, and vary between architectures (and with needs); assume these figures can vary by a factor of a few
- distant media and shared media (network, disk) can vary wildly with actual use
- one implication is that in the real world, there may always be some reason for an overhead one or two orders of magnitude larger...
- ...that you can do nothing about
- this is also a major reason throughput figures are tricky
- writes are more complex than reads at every scale
- in part because of caches
- things that fit in caches can be much faster
- there are various footnotes to that
- caches are purely hardware-managed, which is part of why assembly is less relevant these days: you don't have as much control as you once had
- though you can still sometimes control size better
- smaller code and fewer branches can matter a lot (and the two can be at odds); see the branch sketch after these notes
- if more of what you need to access sits closer to the CPU, there is a lot less waiting
- ...at least for parts that work in a tight loop; if it's disparate, unpredictable accesses, such micro-optimizations won't help (see the locality sketch after these notes)
- a memory access served from L1, from RAM, from swap on SSD, or from swap on platter differs by one or more orders of magnitude at each step
- which can hide smaller overhead effects
- for things that have per-access overhead on random access, that overhead may be most of the total time, with the size of the access mattering relatively little
- floating point operations aren't detailed above, as pipelining and vectorization can make factors of difference
- but yes, e.g. division is slower than addition; see the add-versus-divide sketch after these notes
- GPUs, being chunks of vector processors, internally roughly follow local/NUMA transfer latencies (verify)
- but from the CPU's view a GPU is just a coprocessor on a bus, so transfers on and off are a separate and often significant cost
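To make the cache/locality notes concrete, a sketch in the same spirit (assumptions: C on a Unix-ish system, gcc -O2; the 64 MB working set and the fixed seed are arbitrary choices). It sums the same array twice, once in sequential order and once in a shuffled order, so the arithmetic is identical and only the access pattern differs.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void) {
        size_t n = 1 << 24;                     /* 16M ints = 64 MB, well past L3 */
        int *data = malloc(n * sizeof *data);
        size_t *order = malloc(n * sizeof *order);
        for (size_t i = 0; i < n; i++) { data[i] = 1; order[i] = i; }

        /* Fisher-Yates shuffle to get a random visiting order */
        srand(42);
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        long long sum = 0;
        double t0 = now_ns();
        for (size_t i = 0; i < n; i++) sum += data[i];          /* sequential */
        double t1 = now_ns();
        for (size_t i = 0; i < n; i++) sum += data[order[i]];   /* shuffled */
        double t2 = now_ns();
        printf("sequential: %.2f ns/element\n", (t1 - t0) / n);
        printf("shuffled:   %.2f ns/element (sum=%lld)\n", (t2 - t1) / n, sum);
        return 0;
    }

The shuffled pass reads the order array sequentially but the data array in effectively random order, so most of its accesses miss cache and defeat the prefetcher; expect it to be many times slower per element, even though it does the same additions.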
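The "fewer branches" point, sketched the classic way (same assumptions as above; this is illustrative rather than a careful benchmark, and some compilers emit branchless code at higher optimization levels, which flattens the difference). It counts bytes over the same data unsorted and then sorted, so the branch goes from unpredictable to almost perfectly predictable.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    static int cmp_uchar(const void *a, const void *b) {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    int main(void) {
        size_t n = 1 << 24;
        unsigned char *a = malloc(n);
        srand(1);
        for (size_t i = 0; i < n; i++) a[i] = rand() & 0xff;

        long count = 0;
        double t0 = now_ns();
        for (size_t i = 0; i < n; i++)
            if (a[i] >= 128) count++;           /* unpredictable branch */
        double t1 = now_ns();

        qsort(a, n, 1, cmp_uchar);              /* small values now come first */

        double t2 = now_ns();
        for (size_t i = 0; i < n; i++)
            if (a[i] >= 128) count++;           /* predictable branch */
        double t3 = now_ns();

        printf("unsorted: %.2f ns/element\n", (t1 - t0) / n);
        printf("sorted:   %.2f ns/element (count=%ld)\n", (t3 - t2) / n, count);
        return 0;
    }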
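And the add-versus-divide note as a dependent-chain sketch: deliberately serial, so it measures latency rather than pipelined throughput, and x is chosen very close to 1.0 so repeated division doesn't drift toward denormals (which have their own slowness).

    #include <stdio.h>
    #include <time.h>

    static double now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void) {
        const long N = 100000000;
        double x = 1.0000000001, acc = 1.0;

        double t0 = now_ns();
        for (long i = 0; i < N; i++) acc += x;   /* each add depends on the last */
        double t1 = now_ns();
        for (long i = 0; i < N; i++) acc /= x;   /* each divide depends on the last */
        double t2 = now_ns();

        printf("add: %.2f ns/op, div: %.2f ns/op (acc=%g)\n",
               (t1 - t0) / N, (t2 - t1) / N, acc);
        return 0;
    }

Expect divides to come out a handful of times slower per operation; with vectorization and independent operations both the gap and the absolute numbers look different, which is why the table above doesn't try to pin floating point down to one figure.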
Parts taken from places like:
- https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory
- https://gist.github.com/understeer/4d8ea07c18752989f6989deeb769b778
- https://gist.github.com/jboner/2841832
- http://norvig.com/21-days.html#answers
- https://twitter.com/rzezeski/status/398306728263315456/photo/1
- http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/
- https://computers-are-fast.github.io/