Cache and proxy notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Proxy

In the dictionary definition and most technical contexts, a proxy is

an entity that does something on your behalf, and/or which you do something through


Proxy server

A proxy server forwards requests (file, service, connection, web page, etc., depending on the type of proxy) elsewhere, and makes sure that the response ends up where it should.

Depending on context, the term 'proxy server' can refer to proxying just HTTP (for basic web surfing), to handling non-HTTP services (additionally, or even only), or to proxying any network connection.


Proxies are usually used for one or more of the following reasons:

  • caching: the web has a lot of transparent caches that let content load from something closer than the origin server, which improves response time and (often more importantly) spreads network traffic.
Your business/university may have one of these sitting on its internet connection, caching the common content and saving on bandwidth.
  • filtering/statistics: since the proxy sees all data, this is one easy place (but not the only) to block content, transform content, eavesdrop for statistics, etc.
  • identification/anonymization: an end server will think the request came from the proxy.
    • if you set up a proxy for anonymous use (and don't log use, and also the clients don't identify themselves), the end server's logs can only show that requests came from the proxy, not which of that proxy's users made them
    • (this has some implications on anything by IP - rate limiting, banning, and more)
    • can also e.g. be used to make sure only students+staff use a university's licensed content
      • often a web-based proxy that you have to log in to, which does all HTTP requests on your behalf
  • connection sharing
    • In theory, you can set up an HTTP proxy so that various LAN hosts use the same connection for web browsing and more
    • ...however, this is now uncommon, because it is often easier to share a connection at the IP routing level than through the historically HTTP-specific proxies.


Transparent proxy

A transparent proxy (a.k.a. intercepting proxy) is a proxy that acts as a network gateway, which lets it automatically proxy certain connections (meaning the end server sees the proxy, not you, as the client) without the client's knowledge.

Transparent proxies are often employed by ISPs for the decreased bandwidth use and increased speed. Large organizations regularly also do this, whether they use a private IP network or a public one (as various universities do), as it:

  • makes it easy to add a caching web proxy
  • is easy on administration (upsides of a proxy without any necessary client configuration)
  • makes it easy to check user (employee) use/abuse, and block content if necessary

Reverse proxy

A reverse proxy is named in contrast to what you would normally consider a proxy: the direction is reversed, in that it acts on behalf of servers rather than on behalf of clients.

It often refers to reverse proxying for HTTP servers: Browsers talk to the reverse proxy, the reverse proxy talks to the internal web servers that do the actual work for the request.

The point of doing this is...

  • to selectively offload some work from the backing web servers, e.g.
    • caching static content
    • making it do encryption work (SSL/TLS)
    • compressing compressible content
  • to scale and load-balance: the reverse proxy can distribute jobs to various backing web servers (round-robin, based on load, or whatnot)
  • to avoid keeping connections to dynamic-content-generation servers open longer than necessary. This matters because an open connection ties up resources on such a server for a bit, which is one factor limiting the request rate
    • for clients with slow downloads (which effectively accept responses slowly), the proxy can act as a short-term buffer for the response, freeing the dynamic web server earlier.
    • for clients with slow uploads, the proxy can accept the request and its data, and postpone talking to the dynamic part until it has the complete request.
  • some attacks are easier to mitigate at a proxy than at each web-server
  • may give useful control over which specific internal web servers/services are used and/or exposed. Depending on case this may be more convenient than doing this at network level
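As an illustration of the load-balancing point above, here is a minimal sketch of round-robin backend selection, the kind of decision a reverse proxy makes for each request. The class name, the method names, and the backend addresses are all made up for this example; real reverse proxies do this internally and also handle health checks, weights, and so on.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backend addresses in rotation, skipping any marked as down.

    Hypothetical helper for illustration, not any particular proxy's API.
    """

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        self._backends = list(backends)
        self._down = set()
        self._cycle = itertools.cycle(self._backends)

    def mark_down(self, backend):
        self._down.add(backend)

    def mark_up(self, backend):
        self._down.discard(backend)

    def next_backend(self):
        # Look at each backend at most once per call, so that we fail
        # instead of looping forever when everything is down.
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if candidate not in self._down:
                return candidate
        raise RuntimeError("no backends available")


# Example addresses are placeholders.
balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
picks = [balancer.next_backend() for _ in range(4)]

balancer.mark_down("10.0.0.2:8080")
picks_after = [balancer.next_backend() for _ in range(3)]
```

Load-based selection would replace the rotation with a choice based on some per-backend measure (open connections, response time), but the surrounding logic stays much the same.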



(Mostly abstract) cache types

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

Size limitations, logic related to items entering and leaving

FIFO cache

Least Recently Used (LRU)

Somewhat like a FIFO cache, in that it has a limited size and housekeeping is minimal. Instead of throwing away items created the longest ago (as in a basic FIFO cache), it throws away items that were last accessed the longest ago.

Useful when you expect a skewed access distribution, in which a few often-accessed items will stay in the cache.

Basic implementations are fairly easy linked list / queue deals.


Because items could otherwise stay in the cache indefinitely, there is often also some timeout logic, so that even commonly accessed items will leave at some point. In practice this usually amounts to a refresh, as they will likely be created and cached again immediately.
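A minimal sketch of the above: an LRU cache with an optional maximum age. It uses Python's OrderedDict instead of a hand-written linked list. The class name and the max_items, max_age and clock parameters are made up for this example.

```python
import time
from collections import OrderedDict


class LRUCache:
    """Size-limited cache that evicts the least recently accessed item.

    If max_age (seconds) is given, items also expire that long after they
    were stored, no matter how often they are accessed.
    """

    def __init__(self, max_items, max_age=None, clock=time.monotonic):
        self.max_items = max_items
        self.max_age = max_age
        self._clock = clock            # injectable, which makes testing easy
        self._items = OrderedDict()    # key -> (stored_at, value), least recent first

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        stored_at, value = entry
        if self.max_age is not None and self._clock() - stored_at > self.max_age:
            del self._items[key]       # too old: drop it so the caller re-creates it
            return default
        self._items.move_to_end(key)   # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)   # evict the least recently used item


cache = LRUCache(max_items=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touches "a", so "b" is now the least recently used
cache.put("c", 3)    # over capacity: "b" is evicted, not "a"
```

A FIFO cache would be the same thing without the move_to_end() call in get(), so that eviction order depends only on insertion order.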



Real-world caches

OS caches

These are a bunch of quick jots worth noting down but not complete in any way (and probably won't make it up to well-written text and possibly not even stub status).

Page cache

Page cache can refer broadly to any cache implemented by using the OS's existing memory paging logic.

It often refers specifically to the common use of keeping around recently read disk contents (metadata and data), to transparently speed up disk access (without violating any semantics).


For example, Linux's OS-level cache is mostly:

  • the page cache - part of general OS memory logic, also caches filesystem data
  • the inode cache - filesystem related
  • the dentry cache - filesystem related, specifically for directory entries

You can flush these (since kernel 2.6.16), which can be useful for IO benchmarking and similar tests. (The inode and dentry caches may not flush completely because they are part of OS filesystem code?(verify) Also, mmap()ped+mlock()ed memory won't go away or swap.)

According to the kernel source's Documentation/filesystems/proc.txt

To free pages:

sync; echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:

sync; echo 2 > /proc/sys/vm/drop_caches

To free pages, dentries and inodes:

sync; echo 3 > /proc/sys/vm/drop_caches

The sync isn't really necessary, but can be a little more thorough for tests involving writes (without it, dirty objects stay in the cache, as this is non-destructive cache clearing).
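For benchmark scripts it can be handy to wrap this in a function. The following is a small sketch in Python that does the same as the commands above. The function name and the path parameter are made up for this example (the parameter exists mostly so the function can be tried against an ordinary file). Writing to the real path requires root.

```python
import os

DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def drop_caches(mode=3, path=DROP_CACHES_PATH):
    """Ask the kernel to drop clean caches.

    mode: 1 = page cache, 2 = dentries and inodes, 3 = both.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()    # write out dirty data first, so that more of the cache can be dropped
    with open(path, "w") as f:
        f.write("%d\n" % mode)
```

In an IO benchmark you would call drop_caches() before each timed run, so that every run starts with a cold cache.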


You can also tweak the way the system uses the inode and dentry cache, which is sometimes handy for large file servers, to avoid oom_kill related problems, and such. Do some searches for the various setting names you get from either of:
ls /proc/sys/vm
sysctl -a




Inspecting

There are some tools that e.g. allow you to query how much of a given file is cached in memory.

Most of that seems to be a per-file / per-file-descriptor query (being based on mincore(), as e.g. fincore is), so you can't quickly list all files that are in memory, and checking a directory of 10K+ files will mean a lot of syscalls (though no IO, so hey).


vmtouch

  • https://hoytech.com/vmtouch/
  • with only file arguments will tell you how much of them are in memory (counts the pages)
  • with options you can load, evict, lock, and such.

linux-ftools

Avoiding

memcache


'memcache' usually refers to a program that acts as a temporary but fast store (often of fixed size) in main memory.


A memcache makes sense for data that

  • is needed with high-ish frequency
  • stays valid for a while
  • was non-trivial to fetch or calculate (which, relative to fetching it from a memcache, is true for probably any structured storage)
  • should be available with minimal latency
  • should not involve disk

Keeping such information in a memcache will probably save disk IO, CPU, and possibly network resources.


The term usually means more than a convenient in-process hashmap: it often refers to a service that uses unswappable memory, and is network-connected so that it can scale up (without much duplication).

See e.g. memcached.
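The typical usage pattern is: look in the cache first, and only on a miss do the expensive work and store the result. Here is a minimal sketch of that, in which DictStore is a made-up in-process stand-in with a memcache-like get/set interface (a real setup would talk to a network service instead), and cached_fetch and slow_lookup are likewise hypothetical names.

```python
class DictStore:
    """In-process stand-in for a memcache-like service (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)    # None means "not cached"

    def set(self, key, value):
        self._data[key] = value


def cached_fetch(store, key, compute):
    """Return the cached value for key, calling compute() only on a miss."""
    value = store.get(key)
    if value is None:                 # miss; note that None itself cannot be cached this way
        value = compute()
        store.set(key, value)
    return value


calls = []

def slow_lookup():
    # Stands in for a database query or other non-trivial fetch.
    calls.append(1)
    return {"name": "example"}


store = DictStore()
first = cached_fetch(store, "user:42", slow_lookup)    # miss: runs slow_lookup
second = cached_fetch(store, "user:42", slow_lookup)   # hit: served from the store
```

This sketch leaves out expiry and invalidation, which matter as soon as the underlying data can change.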

On the web

HTTP caching logic

See Webpage_performance_notes#Caching

Transparent caches

For example, Squid is often used as a transparent proxy that does nothing else but cache content that can be cached.

This is also useful for companies, ISPs, and home LANs with a dozen users, to save bandwidth (and lower latency on some items) by placing cacheable content closer to its eventual consumers.

Web server caches



On access times