Cache and proxy notes

From Helpful

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Proxy (network)

In the dictionary definition and most technical contexts, a proxy is

an entity that does something on your behalf, and/or which you do something through


Proxy server (concept)

A proxy server forwards requests on behalf of something else. That server figures out where to get them answered, and makes sure the response ends up with the original requester.

Any one proxy server tends to specialize in one type of resource or service - TCP connections, HTTP requests for webpages, file requests


That's a pretty fuzzy definition, though. By that definition, your IPv4 broadband modem could be called an IP-level proxy,

but that's twisting the term,
and/or the function is so common that we use terms like gateway instead,
and/or it points out that we usually don't call something a proxy when forwarding is its most basic designed function.


For example, we usually point at proxies on the web, mostly for HTTP, because there are many of them and they augment what is already there.


VPNs also proxy our requests, and for VPN-as-a-product that is one of their selling points.


Perhaps it's more useful to point at our goals - we often use proxies for one or more of the following reasons:

  • identification/anonymization: an end server will think the request came from the proxy.
    • you may have a "please log into this to access company services" proxy
    • ...or an often-more-transparent "you are on the university network and can access full text articles we have subscriptions for"
  • filtering, statistics: since the proxy sees all data, this is one easy place (but not the only) to count content for statistics, block content, transform content, eavesdrop, etc.
  • caching: the web has a lot of transparent caches that let content load from something closer than the origin server, which improves reaction time and (often more importantly) spreads network traffic.
Your business/university may have one of these sitting on its internet connection, caching the common content and saving on bandwidth
  • connection sharing
    • In theory, you can set up an HTTP proxy so that various LAN hosts share the same connection for web browsing and more
    • ...which was more common when just one person had a modem. This is now easier to do at IP level ("just plug a cable into the house") than with the historically HTTP-specific proxies.

Transparent proxy (concept/quality)

A transparent proxy, which is often the same as an intercepting proxy, is a proxy that acts without clients having to be configured for it or even knowing it's there, enabling it to automatically proxy certain connections.

(This is in contrast with proxies that you have to specifically configure to use at all. To this day you can point your browser at a specific HTTP/HTTPS or SOCKS proxy, but most networks weren't set up to need this, because transparent handling is easier for most people involved)


Reverse proxy (concept)

Forward and reverse proxying are both fetching on behalf of something else.

Forward proxying is done close to the client, and configured on the client side.


Reverse proxying refers to various things done closer to the server end, because it serves more specific goals that primarily those servers care about.


The point of reverse proxying is often one or more of...

  • offloading some work from the backing web servers to said proxy server, e.g.
    • making it do the encryption work (SSL/TLS termination)
    • compressing compressible content
    • caching static content
    • buffering uploads or downloads, so that on slow transfers the proxy rather than the backend stays occupied
  • load balancing: the reverse proxy can distribute jobs to various backing web servers (round-robin, based on load, or whatnot)
  • some attacks (DoS or other) are easier to mitigate in the proxy layer than at each server, because it is conceptually more central to control
  • may be a useful extra layer of "what servers/services are exposed outside?"
Depending on the case, this may even be more convenient than doing it at network level
  • it may be easier to gather some statistics in the proxy layer (more central than collecting from each backend and merging)
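For a rough idea of what several of those points look like in practice, here is a minimal nginx reverse-proxy configuration sketch - TLS termination, compression, and round-robin load balancing over two backends. All hostnames, paths, and addresses are made up for illustration:

```nginx
# Hypothetical minimal reverse-proxy config; names and paths are illustrative.
upstream app_backends {
    server 10.0.0.11:8080;   # nginx round-robins between these by default
    server 10.0.0.12:8080;
}

server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate     /etc/ssl/example.com.crt;   # TLS terminated here,
    ssl_certificate_key /etc/ssl/example.com.key;   # offloaded from backends

    gzip on;                 # compress compressible content at the proxy

    location / {
        proxy_pass http://app_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

The backends then speak plain HTTP on the internal network and never face the outside directly.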

Edge and service proxies (purpose)

Proxy types (largely for browsers)

While both of the following can actually proxy arbitrary TCP, they are mostly used by browsers (and other UAs) for HTTP.

...with the notable difference that SOCKS (especially SOCKS5) is associated with doing DNS requests through the proxy, rather than at the client side.


SOCKS proxy


A SOCKS proxy is a network service that you can access via TCP (by default on port 1080) and ask (primarily) to relay TCP/IP connections elsewhere.

To servers, you look like a different endpoint than before.

Apparently the idea that it stands for 'Secure SOCKet' came later?(verify) -- as it's an application layer protocol, not itself about sockets at all.


Notes:

  • It was initially an informal standard, with SOCKS5 becoming specified in RFC 1928 (verify)
  • There are roughly three versions
    • SOCKS4
    • SOCKS4a allowed specifying a name rather than an IP address to connect to.
    • SOCKS5
added optional authentication (user+pass or GSSAPI(verify))
added more official support for UDP and for IPv6,
...that UDP also being usable for DNS
  • Even though you could use it for arbitrary connections, it was possibly primarily used by browsers(verify)
(while most browsers support all, IE supported only SOCKS version 4)(verify)
  • SOCKS is protocol-agnostic
Some summaries mention support for HTTP, HTTPS, POP3, SMTP and FTP. The suggestion that it is specific to them in any way seems to be recent misinformation from rewording and/or copy-pasting.(verify)
  • some suggest that SOCKS can handle anything while HTTP proxies can only handle HTTP(S)
but for both, the outgoing TCP connection is protocol-agnostic(verify)
when used from browsers, both are primarily used for HTTP(verify)
it is true that for an HTTP proxy, the initial exchange must be basic HTTP (and the proxy may well be a more complex HTTP server)
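As a concrete illustration of how SOCKS5 lets the proxy do the DNS lookup: the CONNECT request's address field may be a raw domain name rather than an IP address (per RFC 1928). A minimal sketch of building that request - the function name is mine, not from any library:

```python
import struct

# Greeting the client sends first: VER=5, one auth method offered, method 0 (no auth)
GREETING = b"\x05\x01\x00"

def socks5_connect_request(host: str, port: int) -> bytes:
    """Build a SOCKS5 CONNECT request using ATYP=0x03 (domain name),
    so the proxy, not the client, resolves the hostname."""
    name = host.encode("idna")
    return (b"\x05\x01\x00"              # VER=5, CMD=1 (CONNECT), RSV=0
            + b"\x03"                    # ATYP=3: length-prefixed domain name
            + bytes([len(name)]) + name
            + struct.pack(">H", port))   # destination port, network byte order
```

A real client would send GREETING, read the two-byte method reply, send this request, and read the server's reply before relaying data over the same connection.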


See also:


HTTP/HTTPS tunneling (CONNECT style)

HTTP/HTTPS tunneling (other styles)

Further browser-related notes

As mentioned, SOCKS clients often do DNS requests through the proxy, but there are exceptions.


When mechanical UAs (e.g. fetching libraries) say they talk to proxies, they may mean HTTP proxies before they mean SOCKS. This too varies, and it's not always well documented which.


Software

nginx

envoy

See also

Cache types (conceptual/local)

📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.


Write-through versus write-back

A write-through cache means writes go to both cache and storage.

This is simpler and more reliable, but a little slower.
(saying 'buffer' may better indicate that what goes to storage is the real data, not an optional copy)

A write-back cache means the cache is updated now, and the store is updated later.

This is often done to batch those writes, which is lower-overhead than write-through and preferable if there is high write volume.


When this refers to CPU and cache, this also relates to memory accesses, possible blocking, and the complexity of cache coherency protocol.
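A toy sketch of the distinction (class names are mine; a real implementation would also worry about eviction, reads, and concurrency):

```python
class WriteThroughCache:
    """Every write goes to both the cache and the backing store immediately."""
    def __init__(self, store):
        self.store = store        # backing store: any dict-like object
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value   # store is always current (simpler, slower)


class WriteBackCache:
    """Writes land in the cache now; the store is updated later, in a batch."""
    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.dirty = set()        # keys written but not yet flushed

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)       # cheap: no store access yet

    def flush(self):
        for key in self.dirty:    # one batch, lower overhead per write
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

The write-back variant is faster per write, but until flush() runs, the store and the cache disagree - which is exactly the coherency problem the CPU note above alludes to.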

Size limitations, logic related to items entering and leaving

FIFO cache

Least Recently Used (LRU)

Somewhat like a FIFO cache, in that it has a limited size and housekeeping is generally minimal.

Instead of throwing away items created the longest ago (as in a basic FIFO cache), it throws away items that were last accessed the longest ago.


Used when you expect a distribution of accesses where a few items are probably accessed a lot.

Basic implementations are fairly basic linked list / queue dealies.


Because items could otherwise stay in the cache indefinitely, there is often also some timeout logic, so that common items will leave at all - usually to be refreshed, as they will likely immediately be created and cached again.
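A minimal sketch of the LRU idea, using an ordered dict as the recency queue (Python's functools.lru_cache does roughly this for function calls; the class here is my own toy version, without the timeout logic):

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: evicts the item accessed longest ago,
    not the item created longest ago (which would be FIFO)."""
    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self.data = OrderedDict()   # iteration order doubles as recency order

    def get(self, key):
        value = self.data[key]
        self.data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # evict the least recently used
```

Both operations are O(1), which is why basic LRU housekeeping stays cheap.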



Real-world caches (local and/or network)

CPU caches

The inner workings of CPUs move their own pieces of information with delays of a nanosecond (order of magnitude[1]).

RAM access easily takes 100 nanoseconds (order of magnitude).

So there is value in sticking a small piece of expensive memory between the two, one that tries to keep commonly accessed things close to the CPU.

...so that when we hit that cache, that access can be served in perhaps 1, 5, or 15 ns for L1, L2, and L3 respectively (order of magnitude) instead.


That's roughly the idea.

The details are a lot more complex.

(...and not something you have to think about; it's transparent and you basically cannot control what CPU caches do)


OS caches


Page cache refers to the fact that a lot of VMMs that do paging in any sense (which is most of them) have chosen to add caches at that level. So 'page cache' can refer broadly to any cache implemented using the OS's existing memory-paging logic.


...but usually we mean the one that keeps recently read disk contents (metadata and data),

which speeds up disk access to that data and metadata
...without violating any semantics
...and with the ability to free this space for memory allocation at the drop of a hat (note that entirely unused memory is basically wasted memory; its value is in potential use)


It is a tradeoff in the end - freeing up page cache increases allocation latency a little, but often by so little that the increase in disk access speed is more than worth it. To the point that various services actively count on the page cache.


For example, linux's OS-level cache is mostly:

  • the page cache - part of general OS memory logic, also caches filesystem data
  • the inode cache - filesystem related
  • the dentry cache - filesystem related, specifically for directory entries



While there is rarely reason to, you can flush these caches (since 2.6 kernels? earlier?(verify)). One reason would be to do IO benchmarking and similar tests. (note: the inode and dentry caches may not flush completely because they are part of OS filesystem code?(verify); mmap()ed+mlock()ed memory won't go away or swap)

According to the kernel source's Documentation/filesystems/proc.txt:

To free pages:

sync; echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:

sync; echo 2 > /proc/sys/vm/drop_caches

To free pages, dentries and inodes:

sync; echo 3 > /proc/sys/vm/drop_caches

The sync isn't really necessary, but can be a little more thorough for tests involving writes (without it, more dirty objects stay in the cache/buffers).


You can also tweak the way the system uses the inode and dentry caches -- which is sometimes handy for large file servers, to avoid oom_kill-related problems, and such. Do some searches for the various setting names you get from ls /proc/sys/vm or sysctl -a


See also:


Inspecting

There are some tools that e.g. allow you to query how much of a given file is cached in memory.

Most of that seems to be per-file / per-file-descriptor querying (being based on mincore() / fincore()),

so you can't quickly list all files that are in memory,
and checking a directory of 10K+ files will mean a lot of syscalls (though no IO, so hey)


vmtouch

  • https://hoytech.com/vmtouch/
  • with only file arguments, it will tell you how much of each is in memory (counts the pages)
  • with options you can load, evict, lock, and such.

linux-ftools

Avoiding
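One common way to get file data out of (or keep churny data from crowding) the page cache on Linux is posix_fadvise with POSIX_FADV_DONTNEED. A hedged sketch - the helper name is mine, and the call is advisory, not a guarantee (dirty pages may survive until written back):

```python
import os

def drop_from_page_cache(path):
    """Hint the kernel to evict this file's cached pages (Linux).
    Advisory only; hypothetical helper for illustration."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset 0, length 0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

(The other classic approach is opening with O_DIRECT, which bypasses the page cache entirely but comes with alignment requirements.)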

memcache


'memcache' usually refers to a program that acts as a temporary but fast store, often in main memory for speed, and often of fixed size so as not to cause swapping.


A memcache makes sense for data that

  • is needed with high-ish frequency
  • stays current for a while, or is fine if it's a few minutes old
  • and/or was non-trivial to fetch or calculate (which, relative to fetching it from a memcache, is true for probably any structured storage)


It is also often useful that

  • you can serve it with very minimal latency
  • you can avoid disk access, or just effectively rate-limit it regardless of query load.


Keeping such information in a memcache will probably save disk IO, CPU, and possibly network resources.


It often refers to more than a convenient hashmap - often a service that uses unswappable memory, and is network-accessible so it can scale up (without much duplication).

See e.g. memcached.
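The idea in miniature - a bounded in-process store with a per-item TTL. The class name is mine and this is only a sketch; real memcached is a separate network service with proper LRU eviction and unswappable memory:

```python
import time

class TinyMemcache:
    """Toy sketch of the memcache idea: bounded size, entries expire after a TTL."""
    def __init__(self, maxitems=1024):
        self.maxitems = maxitems
        self.data = {}               # key -> (expires_at, value)

    def set(self, key, value, ttl=60):
        if len(self.data) >= self.maxitems and key not in self.data:
            self.data.pop(next(iter(self.data)))  # crude eviction to stay bounded
        self.data[key] = (time.monotonic() + ttl, value)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None              # a miss: caller recomputes and calls set()
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self.data[key]       # lazily expire on read
            return None
        return value
```

The usage pattern is the usual cache-aside one: try get(), and on a miss do the expensive fetch or computation and set() the result.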

On the web

HTTP caching logic

See Webpage_performance_notes#Caching

Transparent caches

For example, Squid is often used as a transparent proxy that does nothing else but cache content that can be cached.

Also useful for companies, ISPs, home LANs with a dozen users, to save bandwidth (and lower latency on some items) by placing cacheable content closer to its eventual consumers.
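For a rough idea, a minimal squid.conf fragment for an intercepting cache - paths, sizes, and addresses are illustrative, not a recommendation:

```
# Hypothetical minimal fragment; assumes a firewall rule redirects port-80
# traffic to 3128 (that redirect is what makes it transparent).
http_port 3128 intercept
cache_mem 256 MB                              # hot objects kept in RAM
cache_dir ufs /var/spool/squid 1000 16 256    # ~1000 MB of on-disk cache
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all
```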

Web server caches



On access times