Postgresql notes

From Helpful

Database related

More theoretical - thinking about databases:

Everyday-use notes

"I installed the package, now what?"

Let's get you to the point where a client/app can get to the database it needs.

If your package makes you care: Clusters

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
🛈 You may skim or skip this at first
Clusters are a great organization tool if your job description is 'database admin for many people'. On your first install, poking a stick at a database, you mostly won't care.


What the SQL standard calls a catalog cluster, postgres makes a little more specific, and calls it a database cluster, which it defines something like "a collection of databases served by a postmaster instance".

Each will have its own config, run on its own port, etc.

Much of the point is that

  • clusters don't know about each other at all,
  • that they can be stored in different places
  • people can't access each other even if you mess up role permissions, because it's not even the same process (set)
  • lets you run different versions of postgres side by side.
It theoretically also eases upgrades between versions[1], which is part of why you are allowed to supply an existing config to base a new cluster on (verify)

All this is great when you are a database admin with distinct users/clients who have distinct needs

...but until then, you probably don't need them, and it's just extra work. Even if your package makes you care about clusters, you can just create a single one and dump all your project databases into it.

As such, the thing that installed postgres might

  • leave it entirely up to you, e.g. on the assumption that you want control over that, may be mid-upgrade, or whatnot.
meaning you still need to create a cluster yourself, and then you can skip reading the rest of this section
  • not care for multiple clusters, but still have created one for you to get going (most won't mention 'cluster' at all)
meaning you can skip this section
  • let you manage clusters (initdb, pg_ctl)
and might have created a default one for you to get going
  • help manage clusters
e.g. debian/ubuntu adds pg_lsclusters, pg_createcluster, pg_ctlcluster (wrappers around initdb and pg_ctl)
and uses those wrappers to automatically fold clusters into their service management (others leave that part up to you)

More practically

On redhat

RHEL seems to not do cluster management, nor create a cluster for you after installing the postgres package (verify)

So setup comes down to

  • run initdb
  • enable the service
  • start the service
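As a sketch (exact command and unit names vary per version and packaging; postgresql-setup comes with the RHEL packaging, older packages used service postgresql initdb instead):

sudo postgresql-setup --initdb
sudo systemctl enable postgresql
sudo systemctl start postgresql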

On debian/derivatives

If a cluster does not yet exist, you'll see a lot of messages like "No PostgreSQL clusters exist", and pg_lsclusters will show none.


A cluster has a name, and a version. For example, after installing the postgresql-9.3 package you might run

pg_createcluster 9.3 main --start

which creates a set of config files inside


and which you can start like

pg_ctlcluster 9.3 main start

(...though that part is usually handled via services)

See also:

Check that you can log in

In most cases, after basic installation you have

  • a host user called postgres with admin rights, and a postgres role within the database
  • a database called postgres
  • a pg_hba that trusts localhost

In those cases, the easiest check is to become the postgres user, and run psql (a command line tool).

Assuming you have sudo rights, that amounts to:

sudo -u postgres psql postgres

If it gives you an SQL prompt (postgres=#), then there is a running postgres instance, and you're logged into it as the local admin.

If it gives an error, search for what that error means.

(Note that if you have more than one cluster, then typically each will be on its own port(verify), and you will need to choose the cluster by handing the correct port into psql -- and everything else you use to connect to the database)
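For example, if a second cluster sits on port 5433 (a made-up port here; pg_lsclusters will tell you the real ones on debian), that would look like:

sudo -u postgres psql -p 5433 postgres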

Optional: Set password

The postgres role within postgresql is basically its superuser.

You'll likely use it for major admin (e.g. creating databases).

Postgres is probably set to trust local admins (see #On_peer_auth_.28and_ident.29). This is fine on your own box / VM / instance, while on shared-user systems there are cases where it's not as secure, so if the data is sensitive then consider setting a password.

(Also, it's sometimes useful to allow remote login, e.g. for tools like pgadmin3, and you'd probably allow that via password auth -- though note you can SSH-tunnel to avoid needing this)

To set its password, connect to the database, e.g.

sudo -u postgres  psql postgres

Once you're at its prompt:

\password postgres

Optional: creating roles, setting access privileges, and client auth settings

I typically do this later (and frankly am too lazy to do it on my home server), but it's a thing to consider.

By default you are:

  • doing everything as the database superuser (postgres role)
  • only allowed to do so on localhost (because peer auth, see below)

This is fine for initial mucking about, but in production you may want e.g.:

  • admin can do everything (but no regular users can authenticate as them)
  • backup can only read everything
  • each app can read/write, but only its specific database
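A sketch of what that could look like (role, database, and password names are all made up here; details like default privileges on future tables need more care than this):

-- a role for one app, owning only its own database
CREATE ROLE myapp LOGIN PASSWORD 'changeme';
CREATE DATABASE myappdb OWNER myapp;

-- a read-only role (run the GRANTs while connected to myappdb)
CREATE ROLE backup LOGIN PASSWORD 'changeme';
GRANT CONNECT ON DATABASE myappdb TO backup;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO backup;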

Access privileges are combinations of

  • actions (SELECT, INSERT, UPDATE, DELETE, etc.)
  • on specific things (DATABASE, TABLESPACE, TABLE, etc.)
  • to roles (users or groups)

Roles are postgres's internal user management (separate from the OS's users - though note that you can tie things to host accounts in pg_hba).

It is preferable to read up on common practice - and also to keep this as simple as makes sense to you, in that the easier you can remember/explain it, the more that random implications won't sneak up on you.

Optional and sometimes useful: client authentication (pg_hba)

pg_hba is part of config, and is rough-grained control of which sorts of connections get a chance to log in at all.

Often filtering out by type of connection, type of authentication, and such.

People often seem to use it as a firewall-like whitelist. The defaults are usually decent for that.

Note that this is about connections, and unrelated to role privileges.
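For reference, pg_hba.conf lines look something like the following (the defaults vary per packaging; myappdb/myapp are made-up names):

# TYPE  DATABASE  USER      ADDRESS        METHOD
local   all       postgres                 peer
host    all       all       127.0.0.1/32   md5
host    myappdb   myapp     10.0.0.0/8     md5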

See #pg_hba

Optional: performance tweaks

There are some conservative defaults that you can increase, e.g. a few relating to memory use.

And on platter disks it helps when not everything (tablespace, WAL, logs) goes to the same platter(s).

Note that if you're unsure where the config file is, or if there may be several and you don't know which one, psql -U postgres -c 'SHOW config_file' will tell you which file the running database is using

Create database

Your new project will need its own database. That would be a CREATE DATABASE (or the createdb shell wrapper around it).


The previous sections mean you've probably thought about who owns it and can access it.

Which, for first experiments, is often "I'm using that postgres admin user that can change everything everywhere, because for now I'm alone in this install and if I do anything wrong it's my own damn fault."
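As SQL that might look like (myappdb/myapp being made-up names):

CREATE DATABASE myappdb;
-- or, with a dedicated owner role:
CREATE DATABASE myappdb OWNER myapp;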

Use notes


The .psqlrc file is psql's config. See the man page, but example:

# see how fast things are
\timing on

# Setting PROMPT1 to show the server you're on
\set PROMPT1 '%n@%m:%/ \#'
# colors are handled but don't make things more readable
\set PROMPT1 '%[%033[90m%]%n%[%033[0m%]@%m:%[%033[92m%]%/%[%033[0m%]\# '
# See also

# one of:
#\pset pager off
\pset pager on       # won't trigger when output has fewer than ~25 lines
#\pset pager always

# many-column tables can be annoying to read 
#   (also depending on the PAGER itself, see below)
# expanded formatting shows records at a time, vertically.
# to get this only when it's wider than the screen:
\x auto

# See nulls as distinct from empty string. 
\pset null '(NULL)'
# You can also make it distinct from literal whatever-this-string-is 
# by using some unicode character here, like ¤ or ∅,
# as psql will show unicode data escaped (configurable maybe?)
\pset null '¤'

Your PAGER is a shell thing, that psql also chooses to use. You may want to, in general, tweak it as you like (e.g. if using less, I want -S, to crop rather than wrap lines, as that makes scrolling to the right a lot less confusing), e.g.

export PAGER="less -SaXz-5j3R"
# or
export PAGER="/usr/local/bin/gvim -f -R -"
# or look at things like

You can also consider putting some canned queries in there, see e.g.

Note that if you administer the database via another user (e.g. postgres), then sudo -u postgres psql will not pick up the .psqlrc file in that account.

The simple solution is to switch to that user (or ssh as that user) for a normal login.

Another possible fix is something like

alias psql="sudo -u postgres sh -c 'cd ~postgres; . ~postgres/.bashrc; psql'"


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Upsert refers to an INSERT that, if a matching (=conflicting) row already exists, falls back to UPDATEing that matching row.

This is not a core SQL feature, but often very useful.

Upserts are a weirder thing than you'd think they should be, in part because the condition is actually more like "would an insert cause a duplicate in a UNIQUE index", and that can race with another such case. MVCC makes it even more interesting - perhaps simpler in the generic case, but harder in the contested case.

So ideally you want the backend to be able to handle this, because doing it in generic SQL needs multiple statements, plus thinking about edge cases and fallback logic - basically a mini-transaction that requires reasonable isolation. This is not efficient if you want to do it a lot.

It's still not going to be the most efficient thing ever (and you want to know about the tradeoff to avoid it if not necessary) but it may be good enough for most cases.

Postgres grew an "INSERT ... ON CONFLICT ... DO UPDATE" in 9.5.
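Which looks something like (table and column names made up here; EXCLUDED refers to the row that was proposed for insertion):

INSERT INTO counters (name, hits)
VALUES ('frontpage', 1)
ON CONFLICT (name)
DO UPDATE SET hits = counters.hits + EXCLUDED.hits;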

Before then, rules could be abused for the same.

For both and more, see General_RDBMS_use_and_speed_tweaking_notes#Upsert

See also:


Mass inserts

temporary tables

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Occasionally you say "hey, I'm creating some costly-to-calculate intermediate data. It could be a subquery, but I'd like to use it more than once, so can I keep this somewhere - and have it be its own thing that automatically goes away when the connection ends?"

Basically, create tables using [2]: CREATE TEMPORARY.

There are two things that TEMPORARY guarantees:

  • That they will be dropped at the end of a session.
Or, if you add ON COMMIT DROP, at the end of the current transaction
  • That these tables exist in their own schema
meaning table names won't conflict with existing tables
though when not using fully qualified names, you can mask existing tables in your queries
won't conflict between sessions (because that temporary schema's name is picked per session(verify)).
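A minimal sketch (the orders table is made up here):

-- dropped at the end of this session
CREATE TEMPORARY TABLE intermediate AS
  SELECT id, sum(amount) AS total FROM orders GROUP BY id;

-- or scoped to the transaction:
BEGIN;
CREATE TEMPORARY TABLE intermediate2 (id int, total numeric) ON COMMIT DROP;
COMMIT;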


  • will be in RAM only for smaller data -- once it grows larger than temp_buffers it will spill over onto disk
so if you want to rely on these, consider raising temp_buffers
  • it said 'session' up there, not connection. The difference can be important e.g. around connection pooling
so prefer the transaction variant when convenient
otherwise drop it yourself
  • "The autovacuum daemon cannot access and therefore cannot vacuum or analyze temporary tables. For this reason, appropriate vacuum and analyze operations should be performed via session SQL commands. For example, if a temporary table is going to be used in complex queries, it is wise to run ANALYZE on the temporary table after it is populated."[3]

See also:

Memory tables

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

tl;dr: doesn't really exist, but you can get a good imitation if you must

The use case is typically a sort of memcache: shared, transaction-safe, guarantee of no disk IO.

A global temporary table is what you'd want, and is considered/planned, but not implemented as of 9.4.

You can get something very close in a few ways:

  • An UNLOGGED table means one that does not use the WAL
truncated after a crash (because they're not crash-safe)
but is still transactioned, and visible between sessions
handled largely in RAM, will still go to disk if storing a lot of data
arguably a feature for anything you can't limit the size of
which all makes a bunch of sense for caches (except for the bit where it can occupy disk)
  • A temporary table stays in memory until it's too large for it
apparently because it's a more specific type of unlogged table
but you can't share these, so generally not what you want
  • Add a tablespace on a RAMdisk, then create tables on that specific tablespace.
If you want these tables to be predictably there (around reboots and such), you would do this creation as part of OS/service startup.
Keep in mind that if the WAL is on regular disk, this won't give you much faster alterations, only faster selects.
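Sketches of the first and third option (table names and the ramdisk path are made up; the path should be a mounted tmpfs writable by the postgres user):

-- unlogged: shared and transactional, but truncated after a crash
CREATE UNLOGGED TABLE cache (k text PRIMARY KEY, v text);

-- tablespace on a ramdisk
CREATE TABLESPACE ramspace LOCATION '/mnt/pgram';
CREATE TABLE cache2 (k text PRIMARY KEY, v text) TABLESPACE ramspace;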

There are some extensions that do similar things.

Date math

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

On timezones

There are two basic date-and-time types:

  • timestamp, a.k.a. timestamp without time zone
  • timestamptz, a.k.a. timestamp with time zone

There is no difference in storage values, both take 8 bytes.

The difference is that the latter type implies conversion to and from the client's timezone (based on what they have set), and the former skips any such conversion.


  • timestamptz always stores UTC
the server converts from the timezone the client has set
client timezone must be correct, or this becomes messy
You may want to demand clients always set it explicitly, because it may be inherited(verify)
practical when you interact with more than one timezone
  • timestamp you store exactly what you send
useful when
you have existing data without timezone info, and just want to store and report it as you got it
you consider the server timezone authoritative (e.g. entered now(), difference to now() but absolutely always based on server logic)
you consider the client timezone authoritative (e.g. only ever reporting what they sent back to them, doing absolutely no time math server side or comparison between clients)
...or want to deal with all timezone logic yourself (hint: you generally don't)

You can compare timestamp with timestamptz, because they're both numbers.

But you don't want to do this, because you generally cannot know it is correct, unless you happen to accidentally have all timestamp values in the same timezone (UTC).

Note that if you have server-side logic involving time, e.g. comparisons like NOW() - INTERVAL '7 days', then you probably want to store timestamptz.

Because otherwise you'ld be mixing the two types and have to think very hard about every such mix.

For example, the docs say,

"Conversions between timestamp without time zone and timestamp with time zone  normally assume 
 that the timestamp without time zone value should be taken or given as timezone local time.
 A different time zone can be specified for the conversion using AT TIME ZONE."


# set timezone TO 'Europe/Amsterdam';
# select now();
2019-07-22 17:52:49.146931+02
# select now()::timestamp;
2019-07-22 17:52:49.146931
# set timezone TO 'GMT';
# select now()
2019-07-22 15:52:49.146931+00
# select now()::timestamp;
2019-07-22 15:52:49.146931

The main points to be made here are probably that

  • the two now() values refer to the same moment in absolute time, while the two now()::timestamp values do not.
  • it does communicate timezone information on the way out (based on where you say you are)

The connection's timezone can be viewed with

SELECT current_setting('TIMEZONE');

Or the less standard but easier to remember:

show timezone;


"Show timestamps from the last week" can look something like:

SELECT * from files where entered > NOW() - INTERVAL '7 days'

"Are time functions volatile?"

In postgres, NOW() and CURRENT_TIMESTAMP are aliases for TRANSACTION_TIMESTAMP(), which is defined as STABLE within a transaction.

There are other functions, e.g. clock_timestamp(), that are volatile.
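You can see the difference within a single transaction:

BEGIN;
SELECT now(), clock_timestamp();  -- now() stays fixed for the whole transaction,
SELECT now(), clock_timestamp();  -- clock_timestamp() keeps moving
COMMIT;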

See also:

data types

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

serial, bigserial notes

As the documentation notes:

CREATE TABLE tablename (
    colname SERIAL
);

is equivalent to:

CREATE SEQUENCE tablename_colname_seq;
CREATE TABLE tablename (
    colname integer DEFAULT nextval('tablename_colname_seq') NOT NULL
);
Bigserial and smallserial are the same idea, but use bigint (int64) and smallint (int16) instead of integer (int32)

UNIQUE is not implied on the column. You may want it.

nextval() always increments, and never returns the same value twice. Uses from transactions that are rolled back create gaps in the sequence.

See also General_RDBMS_use_and_speed_tweaking_notes#consider_not_using_auto_increment.2Fserial

Text and character notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
  • varchar and text map to the same internal types, but varchar implies length check when entered, and text has no limit(verify).
  • char(n) will be padded with spaces
  • char is short for char(1) (verify)

An E before a string literal makes it a string that interprets escapes (a non-standard extension of SQL).

See also:

JSON notes


  • if you do operations on the data, in the database, JSONB's lower access time is worth the extra space
  • if it's actually structured data (or the important core is), use columns
  • the query planner cannot help you on unstructured data
but well-targeted indices work as well as any
  • If it's store-and-return and only the app cares JSON is smaller
...or use TEXT, if you don't want to handle insert/update time validation errors

JSONB is a binary serialization, and

  • takes some shortcuts that are typically fine
e.g. doesn't maintain key ordering, and doesn't allow duplicate keys (rarely used, but JSON technically doesn't rule them out; last key wins)
  • slightly faster at extracting values, because it's already parsed
  • has more operations defined
  • sometimes a little larger to store(verify)
  • you can index on fields

JSON is stored as text, and only validated

will preserve key ordering, whitespace, and duplicate keys. Only if you need any of these should you prefer JSON.
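For example (the docs table and its structure are made up here):

-- extract a field as text
SELECT data->>'name' FROM docs;

-- index to speed up containment queries like: data @> '{"tag":"x"}'
CREATE INDEX docs_data_idx ON docs USING GIN (data jsonb_path_ops);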

See also:

Date and time notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Tests like "in last half hour":

WHERE entered > ( NOW() - INTERVAL '30 minute' )  

Casts to integer should be done via extract. For example, to get a timezone-less(verify) since-epoch time for the current time, use something like (date_part('epoch'::text, now()))::integer.


Basically, read:

To change a column to an enumerated type you'll need some explicit casting. See e.g.
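A sketch of creating an enumerated type and converting an existing text column to it (mood, person, and current_mood are made-up names):

CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');

ALTER TABLE person
  ALTER COLUMN current_mood TYPE mood
  USING current_mood::mood;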

Inspecting an enumerated type

You can

  • For an overview as data, look at pg_catalog.pg_enum. A query you may like (from [4]):
 select n.nspname as enum_schema,  
        t.typname as enum_name,  
        e.enumlabel as enum_value
 from pg_type t 
    join pg_enum e on t.oid = e.enumtypid  
    join pg_catalog.pg_namespace n ON n.oid = t.typnamespace
  • For a specific one: select enum_range(null::mytype)
  • use \dT+ in psql

On large binary/text data; LO versus TOAST


  • Pre-TOAST, storing larger things (in bytea/text) meant overhead in the main table.
arguably usually okayish for objects up to a few hundred kilobytes; beyond that it would really affect accesses on that table, so you probably preferred LO (Large Objects)
  • LO is essentially an enumeration into a postgres-managed file store.
Which you still manage yourself in your app, but LO saves you a bunch of work and permission related headaches and mistakes.

  • Now that we have TOAST, it mostly just does what you want: large values transparently go to what internally is a separate table
meaning the main table contains only the non-TOASTed columns/variables, which is more compact and faster to work on

Large Objects versus TOASTed bytea:

  • the text type is actually wrapped around bytea.
(so yes, large text is TOASTed by default as well)
The main difference between bytea and text is that text has text codings, bytea does not
e.g. bytea allows a 0x00 byte, many text codings do not
  • TOASTed bytea is often easiest for binary blobs: mostly transparent, LO is more work to use
  • bytea cleanup is automatic, LO's is not (as they act as references to files)

  • LO provides interfaces to files, so can stream, seek, and modify in-place, bytea are only handled as a whole
so there are still cases where LO is preferable

  • both bytea and LO are limited to a few GB (depends on config?) (verify)

  • a few ORMs will always fetch all columns and cannot be told not to (yes, that's really a bug)
for TOAST columns that may be very large values - while for LO that's basically just a number you happen not to use


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


  • TOAST (The Oversized-Attribute Storage Technique) avoids storing large values in the main table
by putting them into a large-stuff-side-table instead (sort of a very basic key-value store for blob-like values)
no functional difference to you: values are placed into and fetched from that external table transparently when you use them
  • TOAST only applies to types that can be large.
which internally means things based on varlena. In terms of the types you most often use, this mostly means
arbitrary-length text types (including VARCHAR (larger or all?(verify)) and text), and
things like JSON/JSONB
  • by default only actively used when things are large (generally useful behaviour; see below)
becomes likely to be applied for values over ~2K, a given if over ~8K (because page size)
(technical reason: postgres uses a fixed page size (commonly 8 kB), and does not allow tuples (=rows) to span multiple pages)
will be compressed and/or stored outside of the table itself. The preference for which is tweakable via column settings
  • can make sense for columns that you don't fetch in most of your queries.
which is generally most large columns
so that the columns in the main table itself stay much smaller (cache better)

Consider text.

Short text values like words and text identifiers are likely to be things you select on, commonly fetch, and sometimes transform while selecting - so you would prefer to avoid the extra work of decompression or the extra fetch, and they are best kept in the table.

Few-kilobyte-large text values are likelier to be documents, or other things kept more for archiving reasons, e.g. the unparsed version of something or an article abstract, never filtered on, not fetched in most queries on the table, and generally only fetched in a "present this single item fully" style query.

Meaning 90% of the time, keeping it in the main table would only mean more data you typically skip past. Putting it elsewhere makes it likelier that that main table's more-commonly-used data will be (and stay) in RAM.

If it's something you indeed rarely fetch (or never in typical use, e.g. unparsed version) then compression also saves disk (and memory).

Sometimes you know more about how values will be used (than the above assumptions), so you can control the TOAST strategies (see SET STORAGE), mainly to disable the compression and/or out-of-lineness.

  • PLAIN (also describes non-TOASTable types)
prevents compression
prevents out-of-line storage (i.e. forces it to be in the main table)
Puts a limit on the row's combined size, due to the page size - crossing that will make inserts/updates fail.
still a good idea for things you commonly handle in queries and know will always be short (keep in mind this is about the size of the entire row rather than a specific text field)
  • EXTERNAL
allows out-of-line storage
prevents compression
  • EXTENDED (default for most TOASTable types)
allows out-of-line storage
allows compression
compression will be attempted first. If row is still too big, use out-of-line storage
  • MAIN
allows compression
out-of-line storage is only used if there is no way to fit the row in a page (so effectively the threshold is 8K instead of 2K(verify))
The difference to EXTENDED is priority: Basically, within a row tuple, MAIN columns stay in until it's clear that pushing out EXTENDEDs won't make things fit.

To change it, it's something like
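Presumably something like (table/column names made up):

ALTER TABLE mytable ALTER COLUMN mycolumn SET STORAGE EXTERNAL;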


To see current STORAGE type in psql:

\d+ tablename

(TODO: find SQL equivalent)
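One SQL-side way seems to be pg_attribute's attstorage column (where p=PLAIN, e=EXTERNAL, m=MAIN, x=EXTENDED):

SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'tablename'::regclass
  AND attnum > 0 AND NOT attisdropped;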


  • you cannot currently force TOAST-style compression for small text values
  • you cannot force things out of the table always
(both seem to be partly the design decision that the small gains aren't worth the overhead)
  • TOAST implementation details are the reason single values are limited to at most 1GB.
(it steals two bits from the varlena 32-bit length value)
  • Limits: Each table gets its own TOAST table, and it uses OIDs.
so this is the basic 2^32 OIDs limit with the standard "degraded performance if you get anywhere close to that"
yet that won't matter until you have got billions of rows and at least (because ~2K threshold) terabytes of data in a single table.
so it's something to keep in mind if you plan to put all of your big data directly in postgres (note that sharding to more tables may be a good idea anyway)
  • On storing files in the database:
pro: you get transactional integrity (with path-to-filestore and LO this is more your responsibility)
pro/con: replication and backup will include them. This will often be large. But also makes it complete, and won't go out of sync (as a path-to-filestore and LO might)
con(/pro): doesn't do files over ~1GB (LO does 2GB). If you need larger, you should really want the filesystem solution
con: goes via memory, which means large files will clobber your caches(verify) (for contrast: LO and path-to-filestore can be streamed)
con: varlena in most uses(verify) is escaped on the wire, meaning they're larger
note: take a look at the Foreign Data interface, it's sometimes a better fit. Particularly if you want to use a filesystem style solution but specifically a distributed one
  • the 2K threshold mentioned above is TOAST_TUPLE_THRESHOLD, and its value is settled at compile time

See also:


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


  • when to use one depends on
    • how you use the columns
    • whether the index overhead is worth it (whether there are more selects or more index-updates)
    • table size. Tiny tables have no use for indices. Huge tables can still benefit from avoiding tablescans even if you have constant index work
  • a btree index is automatically created to support any uniqueness constraints (UNIQUE, PRIMARY KEY)
  • types:
    • btree is fine for most basic types, for equality, range, and sorting needs
    • GIN and GIST are mostly for fancier types, fancier operations
think array logic, 2D geometry, full-text indexing (various types)
tend to be larger, and therefore slower to build
    • hash usually isn't worth it
  • tweaking GiST can help. Takes some study, though.
  • Also:
    • partial indexes make sense when you typically search only for a few specific values (when the work of keeping and updating the rest is pointless)

Index types

  • btree (balanced tree. The default, and usually what you want)
works on numeric types and text types, and deals with NULL
fast for equality and range queries
supports basic inequality too (=, <, >, <=, >=)
amount of work to find is relatively constant (B is for balanced)
optimized for case where a single value maps to a single row
note: a btree index is implied when you add a uniqueness constraint (UNIQUE / PRIMARY KEY)

  • GiST (Generalized Search Tree)
balanced tree style
beyond that a framework more than an implementation - allows
choice of operators
choice of index type (e.g. text with equality, or text with pattern search(verify))
custom data types
made it easier to implement geometric types (more flexible with its operators)
more than equality/range operations; can support various geometric operators
lossy, in that it's fine with keeping false positives for some time, and has to check the table.
See also
  • SP-GiST is a variant of GiST that supports not-necessarily-equally-space-partitioned data (think quad-trees, k-d trees, and suffix trees)
  • GIN (Generalized Inverted Indexes)
can map many values to one row
can work with custom data types
can deal with array values
easily much larger (often means slower), so use only when needed
  • BRIN (Block Range Index)
  • hash
supports operations: =
not transaction safe, meaning (?)
not WAL-logged, meaning you should rebuild these after a crash to be sure
comparable speed to btree - hash may be slightly faster (perhaps due to being smaller?)
  • rtree
supports more operators than btree
not transaction safe
no upsides of using this over GiST (which came later)
rtree now removed, effectively merged into GiST (since 8.2)

Variations for specific uses

Sorted indices

Consider a time/date field ('created' or 'lastupdated' or so), and the common query on that being 'select the most recent hundred' , i.e. ORDERed and with a small LIMIT. An index kept in DESCending order (and NULLS LAST) will find enough entries at a small clump at the start. (The direction of sorting probably doesn't make a huge amount of difference (verify))

Partial indexes

if you know you'll always select on a column with specific values, and never others, you can create an index missing those values.

Say you have a system storing images, and want to detect for all of them whether they're animated or not. That column will be TRUE, FALSE, or NULL meaning 'not checked yet'. The background task would typically select matching NULL, and those would typically be a relatively small number among the whole, in which case you can go with an index supporting basically just that query, with

CREATE INDEX ani_unknown  ON images (animated)  WHERE (animated is null);

Expression Indexes

Index the result of an expression on the value, rather than the value itself.

Usually only has a point when supporting a very specific select, and there is no point to also storing that transformed value in the table.


  • have a datetime column, index it with just the date() (without the time)
for a "get all stuff within this day" query, without needing a separate date column (or transform / filter the datetime)
  • Store names case-sensitive, index them case-insensitive, by lower(columname)
and querying it that way too (but consider queries that this might not support!)
  • indexes on specific parts when using hstore, json and such, see e.g. [5]
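For the case-insensitive example, that looks like (users/name being made-up names):

CREATE INDEX users_name_lower_idx ON users ( lower(name) );

-- only queries that use the same expression can use it:
SELECT * FROM users WHERE lower(name) = lower('Bob');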

multi-column indexes

create indexes on multiple columns. Only makes sense when you query on that same combination, and in the same order.

Unique Indexes

supports fast checking of unique constraints, and is otherwise a bit lighter

Other notes

TODO: look at

Keep in mind:

  • postgres will always use an index for UNIQUE columns, so adding an index manually would only mean two indexes and double work.
  • row visibility/currentness (see MVCC) is not stored in indices, and has to be fetched from table data

  • Indices that see a lot of alterations have the same empty-space problem tables do
space is reclaimed in a page-based way, meaning that pathological cases can still lead to index bloat,
which degrades performance if the table sees very large amounts of writes and/or deletes.
the simplest fix is to just accept this
if you want to clean up:
a REINDEX is possible, but requires an exclusive lock on the table
you can DROP and CREATE the index separately, but you often want no time without the index (for uniqueness and/or query-planning/performance reasons)
a plain CREATE INDEX (e.g. in combination with a DROP INDEX in a transaction) still needs a write lock, so you probably want something involving CREATE INDEX CONCURRENTLY, which builds slower but needs no such lock
and then e.g. rename it into place once it's done

  • Creating indices will by default imply a write lock.
try CONCURRENTLY - it'll build slower, but won't lock the table against what is probably regular use
some caveats apply and it can fail (see [6]), meaning you need some extra handling around it

  • VACUUM FULL often means a lot of work not only to the table data, but also the index
Before 9.0, this would be work that altered the existing index, so likely increased index size and fragmentation (there are some ways to cheat this a little)
Since then, it basically builds a new table and indexes, so is equivalent to a reindex (verify)
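The lock-light rebuild with CONCURRENTLY mentioned above can be sketched as follows (index/table names are made up; note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block):

```sql
-- Build a replacement index without taking a write lock (slower, and can fail,
-- leaving an INVALID index to drop and retry)
CREATE INDEX CONCURRENTLY items_created_new ON items (created);

-- Swap it into place
DROP INDEX CONCURRENTLY items_created;
ALTER INDEX items_created_new RENAME TO items_created;

-- Since PostgreSQL 12, a single statement does roughly this for you:
-- REINDEX INDEX CONCURRENTLY items_created;
```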

Understanding EXPLAIN / queries to rewrite for indices

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

You should be passingly familiar with the major decisions the query optimizer makes:

  • table scan method
    • sequential scan
    • index scan
    • bitmap heap scan
  • table join method
    • nested loop with inner index
    • nested loop without inner index
    • hash join
    • merge join
  • join order, in that
    • which to use as the right relation can matter, e.g. in hash join's amount of keys, index use in nested loop.

...there are variants - e.g. backwards index scans (which typically make a max() much faster), hash anti joins. Most are specific-case optimizations you can generally read as variations on their basis. (Note that an anti-join is a "where a match does not exist in the other table" query)

Sequential scans

Mean looking through the entire table.

The larger the table, the slower this is - linearly so.

If this is a frequent query that returns a small part of the table, ask yourself whether there is a possible index on the column(s) you're filtering on most.

An example where you wouldn't care is an infrequent maintenance thing that is intended to look at all data.

See more on index choice below.

Sequential scan because of a lack of join

Joining on an indexed thing is usually fast, so you want that to happen when possible.

The simplest case is where you explicitly join.

A more interesting case is where a query doesn't explicitly join on an indexed thing. (Note: related to the (un)correlated subquery problem)

Consider "I want tags for files I no longer have" like:

select distinct guid from tags_by_guid
  where guid not in (select guid from files)


Unique  (cost=1000.58..71216562338.86 rows=558360 width=43)
  ->  Gather Merge  (cost=1000.58..71216556610.58 rows=2291314 width=43)
        Workers Planned: 2
        ->  Parallel Index Only Scan using pkey on tags_by_guid  (cost=0.56..71216291136.07 rows=954714 width=43)
              Filter: (NOT (SubPlan 1))
              SubPlan 1
                ->  Materialize  (cost=0.00..73031.15 rows=625077 width=41)
                      ->  Seq Scan on files  (cost=0.00..64411.77 rows=625077 width=41)

If you somehow joined on the condition that they match, you'd be telling the database to compare based on column values (using what it knows about them, and allowing use of their indices).

The fact that they come from specific matching columns is not explicit. We find this obvious, more so because they have the same name, but what you've said is "have one produce strings, then compare that against another set of strings". Which it'll do, typically by materializing the inner query (= execute and keep its results around), then do a scan over the outer one comparing values to that.

The crux is often figuring out how to express this join (in this case an anti-join, meaning a "does not exist"-style query). The best syntax / performance can vary a little with specific case.

One (slightly funny looking) alternative for this case is:

select distinct guid from tags_by_guid tbg
 where not exists (select 1 from files where files.guid=tbg.guid)


Unique  (cost=299434.23..299434.23 rows=1 width=43)
  ->  Sort  (cost=299434.23..299434.23 rows=1 width=43)
        Sort Key: tbg.guid
        ->  Gather  (cost=78719.23..299434.22 rows=1 width=43)
              Workers Planned: 2
              ->  Hash Anti Join  (cost=77719.23..298434.11 rows=1 width=43)
                    Hash Cond: (tbg.guid = files.guid)
                    ->  Parallel Seq Scan on tags_by_guid tbg  (cost=0.00..76636.29 rows=1909429 width=43)
                    ->  Hash  (cost=64411.77..64411.77 rows=625077 width=41)
                          ->  Seq Scan on files  (cost=0.00..64411.77 rows=625077 width=41)

Things to note:

  • Ignore the outermost sort and unique, it's just for the DISTINCT, and adds little cost
  • in the first, the cost of the outer query is much larger than the inner cost
...roughly just their table sizes multiplied
suggesting a dumb 'compare everything to everything one by one' plan.
  • the cost of the inner query is the same in both cases.
Makes sense, it's doing a sequential scan selecting everything (files is a smaller table than tags_by_guid).

See also:

external merge disk

...often suggesting an intermediate is very large.

Sometimes a slightly larger work_mem helps (if the work will easily fit in RAM if you only tell it to be less conservative), but more usually, some redesign is the better solution, because a large intermediate result is often slow and avoidable.
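If EXPLAIN ANALYZE shows "external merge Disk", you can test whether a per-session bump helps before changing anything server-wide (value is an example; table name from the earlier example):

```sql
SET work_mem = '64MB';  -- only affects this session
EXPLAIN ANALYZE SELECT guid FROM tags_by_guid ORDER BY guid;
-- check whether the sort became an in-memory quicksort
RESET work_mem;
```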

When/why is my index not used?

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Many possible cases, which can vary with your query, your data, and your table definitions.

  • If postgres needs to transform the stored data, it can't use the index on that column.
including some reasonable things you might not initially expect, e.g. differing collations between text columns
  • because the planner figures that doing so is more expensive than an alternative.
And it's usually right.
For example:
doing an index lookup before having to read almost all of the same table's data anyway is more work than just reading all the table data.
a condition that an index cannot assist, such as a larger-than condition on an index type that does not support that (or possibly one for which the planner can guess the result is 'most rows'? (verify))
having only condition(s) on non-indexed columns means no narrowing down can be done before those conditions are checked, so it has to go through all data
  • Anything that by nature doesn't scale very well, e.g.
GROUP BY on a lot of data
e.g. various uses of DISTINCT
ORDER BY on a lot of data
...though a well-placed WHERE can do wonders
  • sometimes use of a LIMIT can make it prefer an index over a tablescan
...though ORDER BY before LIMIT means all data has to be evaluated
  • in some cases nullability matters(verify)

Estimating index usefulness

The query optimizer estimates the cost of various applicable ways to satisfy a query, and that number in EXPLAIN can be useful in an order-of-magnitude way.

It's pretty good at choosing the best given the situation -- but you can influence that context

  • being smart in your queries can help more
  • adding indices can sometimes make for much faster plans
  • tweaking indices can make them cheaper to use

When it's net-positive:

  • Every column you commonly select on may be useful to index
...when you typically select small sets, and the table is large
  • Every column you commonly join on may be useful to index
...but not always, and not necessarily both of the compared columns (read up on the join types)


  • unused indices just slow you down a little, because they must be kept up to date
there sometimes is gray area, e.g. ones that are rarely used, written more often than read.

When an index is huge relative to the table, it won't help much

...and maintaining it is probably also considerable work.
There are some clever alleviations, most requiring you to know more about your data:
  • when a condition usually implies most data needs to be fetched anyway, there is no point in the extra work of figuring that out via an index
  • sometimes writing queries differently can help the optimizer a lot
TODO: many examples :)

"When is an index better?" depends on a few different things, including:

  • the nature of your data
...including the amount of rows, and the amount of distinct values
  • the nature of the queries you (most commonly) run
(e.g. what you select on, what you join on, sometimes what columns you use in conditions)
  • the query planner/optimizer
  • the amount of work done to maintain the indices

It varies per case which of those weighs most heavily.

Text indices and matching

See also (indexes)


The rule system is a rewrite system, which conditionally does something instead of, or before, a given query.

Rules can be used as augmented queries, exceptions, stored queries or very simple stored procedures.

Creating them takes the form:

CREATE RULE  rule_name 
      AS ON  event 
         TO  object
[WHERE       rule_qualification]
DO [INSTEAD] [ action | (actions) | NOTHING ];

See CREATE RULE docs and DROP RULE docs.


You can use this to:

  • implement views (e.g. adding some useful calculated columns) by making the event SELECT ON tablename
  • implement view updates (INSERT/UPDATE) by making rules do the proper changes to the actual table
  • implement stored queries (on some given conditions)
  • do (very) basic data validation
  • protect certain INSERTs, UPDATEs and/or DELETEs from doing anything, via a DO INSTEAD NOTHING.
    • Note this will allow no changes from any user, which is probably not what you want
    • ...unless you made a table purely for rules, in which case you can manage grants for that view (or rather 'view table') and the actual table separately, which can be useful (verify)
  • record how many queries the rule was triggered for

When rules' conditions do selects, or when you rewrite into more complex queries, they can make things slow in less-avoidable ways, because you made it overrule yourself :) Use rules with due consideration.



  • WHERE qualification absent: the rule action is done unconditionally - before the user's query
  • WHERE present and it applies: done instead (or, pedantically: the action is done, and the user's query isn't)


  • (to write)

You may need some complex code to make rules apply differently to wildly different insert types, but on properly normalized databases that should not be a problem.

Example: Word counter

The following imitates a simple use of what would be INSERT (...) ON DUPLICATE KEY UPDATE (...) in MySQL (see the respective notes): Assuming a table:

CREATE TABLE wordcount (word text PRIMARY KEY, count int DEFAULT 1);  -- the primary key already creates an index; DEFAULT 1 so a plain insert starts the count at 1

Adding the rule:

CREATE RULE  counter 
AS ON        INSERT TO wordcount
WHERE        (NEW.word) IN (SELECT word FROM wordcount)
DO INSTEAD   UPDATE wordcount
             SET count = count+1
             WHERE word=NEW.word;

...will allow you to do:

INSERT INTO wordcount (word) VALUES ('word');
INSERT INTO wordcount (word) VALUES ('word');
INSERT INTO wordcount (word) VALUES ('word');

This would normally complain about a duplicate key, but the rule tests on every insert, and falls back to incrementing the count when necessary. (Notes: a separate index isn't necessary - the primary key already provides one, which avoids a sequential scan for every insert/count. Also give count a sane starting value, e.g. via a DEFAULT, or the increments operate on NULL.)
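Since 9.5, postgres also has native upsert, which is usually the better tool for this particular pattern than a rule:

```sql
INSERT INTO wordcount (word, count) VALUES ('word', 1)
ON CONFLICT (word) DO UPDATE SET count = wordcount.count + 1;
```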


using stale statistics instead of current ones because stats collector is not responding

postgres keeps table statistics to help the planning. These statistics are managed by a specific process, and kept in a file.

The message means that the collector did not respond to a "give me a recent statistics snapshot" request very fast.

Which usually means your IO is currently overloaded.

checkpoints are occurring too frequently

Usually means WAL volume is triggering checkpoints closer together than checkpoint_warning. The common fix is to allow more WAL between checkpoints, i.e. raise max_wal_size (checkpoint_segments before 9.5).

WARNING: there is already a transaction in progress

Seems to mean it sees a BEGIN when a transaction was already started.

This often means either

  • a missing commit/rollback, or
  • an explicit BEGIN when a transaction was already automatically started (e.g. when autocommit is off).


for SELECT DISTINCT, ORDER BY expressions must appear in select list

ERROR: current transaction is aborted, commands ignored until end of transaction block

In general the first error in that transaction should make the transaction fail as a whole. Postgres, like many databases, expects your code to handle errors in transactions, including an explicit rollback.

In practice:

  • You are probably interested in the error that happened just before this one. Check your logs if you have them, go the verbose-debug way if you only have your code.
  • If the SQL can fail for a good (e.g. data-related) reason, it is also a smart idea to make your client-side logic more fine-grained, e.g. deal with the possibility of failure of individual statements in a transaction.
  • How to handle the error path is up to your preference of correctness, and of necessity.
in some cases you'd want to retry, in others giving up is fine,
in some cases it's a code bug, in others it's database state
In many situations, having an exception path that just rolls back whatever you were doing is good enough.
you could do a partial commit, when parts of a transaction are logically unrelated (e.g. you chose to put it in the same transaction for ease, or perhaps for WAL related reasons)
  • You may wish to make your transactions a bit more local and explicit
e.g. when working with psycopg, it will implicitly start a transaction on the first command after the last commit/rollback (or initial connection).
  • It can be a good habit to rollback any connections you're not using after the latest select (also to avoid "idle in transaction" connections)

Note that this error can also be caused by code out of your hands. For example, django being out of sync with its database.

Error: Cannot stat /var/run/postgresql / No PostgreSQL clusters exist; see "man pg_createcluster"

...when you try to start an existing installation.

My issue -- quite possibly specific to ubuntu or the service script for a particular pg version, or maybe some assumption that breaks under systemd -- is that the /var/run/postgresql directory is assumed to exist, and it is not created automatically.

Which means you need to create it manually and restart postgres, in my case

mkdir /var/run/postgresql ; chown postgres: /var/run/postgresql ; pg_ctlcluster 10 main start

You could also hack this into the service script, I just hope the power grid behaves decently - and forget the details by the time I need them ^^

Permission denied - Is the server running locally and accepting connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

Probably means that the file permissions on that unix socket don't allow the effective user to use it.

The simplest solution is often to add your effective user to the postgres group.

(Though this isn't always ideal. For example, if you add apache to it and your pg_hba trusts all localhost access, then scripts get somewhat wide access to your database by implication)

No PostgreSQL clusters exist; see "man pg_createcluster"


  • running ubuntu (14.something-specific perhaps?)
  • things were previously running fine
  • this happened after a reboot

...then it may mean the /var/run/postgresql/ directory isn't created, but postgres assumes it's there, and needs it for... let's call it management of its state. The error is not a very good indication of what's actually wrong.

Probably fixed in a later package update (cause is possibly that it's on tmpfs now?), but since I don't update, I initially mkdir'd it at reboots, and eventually tweaked the upstart script to do that.

could not connect to server: Cannot assign requested address

This is an OS error, specifically coming from a socket bind().

In case you're creating connections very quickly (e.g. hundreds per second), you're probably triggering the TIME_WAIT problem by running yourself out of sockets.

Consider reusing connections where possible.

ERROR: MultiXactId has not been created yet – apparent wraparound

invalid memory alloc request size

Replication, Backup

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Replication, Clustering, etc.

Basic options:

  • pg_dump
  • PITR
  • Replication as backup (With PITR, or by other means)
  • LVM

Clustering/load balancing/failover, etc: See [8] Slony?

pg_dump / pg_dumpall notes

The format you ask for implies the utility you need to restore it:

    • -F p  plain text, restore with psql (the default)
    • -F t  tar, restore with pg_restore
    • -F c  custom, restore with pg_restore


sudo -u postgres pg_dump dbname > dbname.sql

which you can restore

cat dbname.sql | sudo -u postgres psql dbname

...with footnotes as to whether you also dump the creation commands, and so whether the database has to exist and be empty, or not exist yet

(the sudo to do this as user postgres, because that's usually the database admin user that has enough permissions to do this. Can vary in specific setups)


  • Consider compressing right away, e.g.
sudo -u postgres pg_dump dbname | pigz > /backups/dbname__`date +\%Y-\%m-\%d`.sql.gz
  • Since v12, pg_dumpall takes an --exclude-database=dbname
  • you can often use SQL dumps-and-restore during upgrades to a newer postgresql version (but not as often to older ones).
Depending on the case, you may be able to work around details with a few edits (in plain text dumps anyway).
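A custom-format dump and restore could look like the following (db name is made up; the custom format is compressed, and lets pg_restore do selective or parallel restores):

```shell
# custom-format dump to a file
sudo -u postgres pg_dump -F c -f dbname.dump dbname

# restore into an existing (empty) database;
# adding -C would make pg_restore create it first (then -d should point
# at an existing database to connect through, e.g. postgres)
sudo -u postgres pg_restore -d dbname dbname.dump
```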

PITR notes

initial online copy of database + archiving all transactions logs


psql and transactions

psql is autocommit by default.

psycopg and transactions

psycopg defaults autocommit to off.

And the first command in a connection (and after a commit) will imply a START TRANSACTION (verify).

This means most of the time, you will be in a transaction, and (potentially idle) in transaction, even after a SELECT.

So get used to doing a rollback() or commit() as soon as sensible.

Ideally you also check the success of all transactions, because if there was an error you can only rollback()

If you like to do things in chunks and not quite fail as a whole, look at SAVEPOINT
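A SAVEPOINT sketch in plain SQL (the same statements work execute()d through psycopg; table name is made up):

```sql
BEGIN;
INSERT INTO items (name) VALUES ('a');  -- hypothetical table
SAVEPOINT chunk;
INSERT INTO items (name) VALUES ('b');  -- suppose this one fails...
ROLLBACK TO SAVEPOINT chunk;            -- ...the transaction is usable again
COMMIT;                                 -- 'a' is still committed
```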


note that you can use with (since psycopg 2.5) to automatically commit (if no exceptions happen) or rollback (if one does) within that code block

See also:


Notes on performance and innards

The defaults are conservative so it'll run on ten-year-old hardware comfortably, and you need some tweaking to be more efficient on modern systems (e.g. more RAM), or for specific types of hardware.

Memory related

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

You may notice that postgres processes have little memory mapped, perhaps 10-20M when idle (and certain work).

This is unlike some other database engines, but quite normal for postgres, because it assumes it is on a relatively dedicated database host, where the system caches will be doing most of the work of keeping the most useful things in memory.

See also things like:

The main interesting variables seem to be the following few.

Memory use

Most memory is taken by the backend processes (and their bgworkers), because they process queries.

The rest tends to fall away in order of magnitude, unless you've configured certain things unusually.

As shared_buffers is by design shared between the workers, most of the memory allocation comes from (the values of)

max_connections * (temp_buffers + 3*work_mem)
the 3 is arbitrary, pointing out that much of this is allocated per operation that needs it, not per query or session.
Actually, if you write particularly nasty queries you can completely blow out performance
so yes, if you notice only specific queries go crazy on RAM, you could rewrite them

Note that tools like top, ps, and htop include SHM in memory use, so count it many times, which is not the most informative. You might like to take a look at smem, which reports SHM use spread among its participating processes - which is fake, but at least adds up correctly, and is arguably a more reasonable representation.


shared_buffers is memory shared between all running workers.

It is a holding-area cache for table data. It seems(verify) all data to and from disk is managed via shared_buffers.

Data sticks around for quick reuse of high-demand data; it has a usage count on each page.

Defaults have been 8MB or 32MB.

Should be scaled up somewhat with amount of allowed workers/connections, in that each worker pins some data.

Larger shared_buffers means more is immediately available, yet this only has measurable value for the highest-demand data.

In particular, postgres is designed to rely on the OS page cache, and shared_buffers is effectively duplicating that, so overly large shared_buffer size can actually be counterproductive.

If there are other memory-intensive or IO-intensive apps on the same host (which is not the best design, but you do you), there is value in taking some for postgres this way.


  • There are some queries you can do to inspect how much of the contents are popular versus transient
which you can use as an estimate of how useful your current buffer cache size is

  • if all your access patterns mean this cache won't help much at all (e.g. very random), keep it small
  • Setting it very high is counterproductive, in that it lessens the amount of RAM usable for other things, in particular work_mem (the buffers per connection to do actual querying work in) and the system's page cache (which postgres is designed to take advantage of - which is why a low shared_buffers is generally not so harmful)
  • Even for serious servers where it's allowed to use gigabytes for this, you may see relatively little actual shared use

  • On unices, shared_buffers is implemented using SHM-style shared memory
your OS's SHM settings may be more conservative than what you set in postgres, so you may have to tweak kernel settings to take full advantage
Read e.g. (and use a live sysctl and/or edit /etc/sysctl.conf)
Quick check on linux (details can be OS-specific):
sysctl -a | grep shm
Most important is SHMMAX (the other SHM defaults are decent). For example:
kernel.shmmax = 33554432
kernel.shmall = 2097152
...means the individual shm size limit is 32 megabytes. You probably want to raise that.
The total limit of shared memory is in SHMALL, and it is in pages, so the 2097152 here, times 4096-byte pages, is 8GB.

See also:


work_mem is the amount of data a worker can handle in RAM for certain operations (mostly sorts, and some hash-based operations). An operation crossing this limit starts to do its work on disk.

Defaults to something on the order of 1MB or 4MB.

Note that this is per operation that needs it, not per connection/session/query. Some queries may need none. Some may need a bunch.

For some unusually complex queries (some of which you want to rewrite), higher values have a measurably positive effect, up to maybe 32MB.

Beyond that it becomes more of a liability: this amount, times a handful per query, times max_connections, pushes against shared memory and the page cache, and can eventually be a cause for swapping.


maintenance_work_mem (default 64MB, previously 16MB): like work_mem, but specifically for VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY ( mostly background maintenance and loading database dumps).

At most one of these will run per session(verify), and these operations are fairly rare anyway, so it's fairly safe (but also often not too pointful) to set this a bunch higher than work_mem.

That said, on very minimal containers/VMs you might even want to lower it.


temp_buffers (per session, default 8MB): maximum amount of memory used for temporary tables.

The overhead to setting this high when they are typically unused is negligible,
but like work_mem, serving many connections can imply unreasonable memory use

max_connections - keep in mind that memory use and other things scale up with this.

For example, if (this times work_mem) approaches the order of usable RAM, it'll be fighting with other things at peak use (see also thrashing).

effective_cache_size: your estimation as to how much of the OS page cache will store postgres table and index data (so approx how much memory you expect to typically be free beyond postgres and other uses).

This is a hint given to the query planner. It may make cleverer decisions like estimating that loading in a common index may be a good idea, or decide that right now a table scan is less disruptive.

It is counterproductive to overestimate this for your setup (and note that if you don't have a dedicated DB server, that is a tendency over time).

A pragmatic value could be half the size of your OS cache when the system is set up and running everything it will in production (including the database server), preferably when things that use noticeable and/or variable amounts of memory have also run for a while.
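As a very rough postgresql.conf sketch of where the above tends to land - assumed numbers for a host with, say, 16GB RAM mostly dedicated to postgres; starting points to measure from, not recommendations:

```
shared_buffers        = 4GB     # high-demand data cache; the OS page cache does the rest
work_mem              = 16MB    # per sort/hash operation, so scales with connections
maintenance_work_mem  = 512MB   # VACUUM, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY
effective_cache_size  = 10GB    # planner hint only; allocates nothing
max_connections       = 100
```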


  • The values of memory size variables
can be KB/MB-type sizes,
or the older(verify), more confusing way, in segments that are internally 8 kilobytes (BLCKSZ?(verify)), so 1000 is 8MB, 4000 is 32MB, 20000 is 160MB, 50000 is 400MB, etc.


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Short summary:

  • Ensure that a basic VACUUM runs regularly - and autovacuum has been a thing for ages now
  • Don't run VACUUM FULL unless you have good reason to
because it locks the table, and is usually not necessary
  • This was a little more complex before 8.4, because of FSM details
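For reference, the variants (table name made up):

```sql
VACUUM;                -- reclaim space for reuse, no exclusive lock; what autovacuum runs
VACUUM ANALYZE items;  -- also refresh planner statistics for one table
VACUUM FULL items;     -- rewrite the table compactly; takes an exclusive lock
```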


Postgres's MVCC allows concurrent access to different versions of each row, meaning that in some types of use, it can avoid row-level locking when multiple transactions touch what conceptually is the same data (it also makes rollbacks fast, and allows hot backups).

Resolving which row is current is usually simple and fast.

This also means that old rows are not immediately considered to be free for new data, for some mildly complex reasons, which also means that if you do any alterations (other than one initial series of inserts), your table will always become at least a little larger than the data considered current. More so when more transactions interact, and/or for tables that change a lot.

The FSM makes this a little more interesting - more so in the past.

After updates and deletes remove tuples from pages, those unused parts of pages can eventually get reused rather than staying holes. The thing that assists that is the Free Space Map (FSM), which keeps track how much space there is in each (8K) page, free to be used for new data. (...for tables. For indices it only keeps track of fully empty pages)

While various operations interact with the FSM directly (e.g. inserts), others (e.g. update, delete, truncate) have more sluggish effect (and the visibility map is also relevant), and the FSM isn't kept immediately up to date, to avoid some amount of overhead.

A basic VACUUM (among other things) ensures the FSM is up to date with the table data, and because autovacuum is automatic, you typically don't even have to think about it (unless you have unusual loads).

Before 8.4, the FSM sat in memory (for reasons of speed and simplicity) and had a fixed, configurable size. This was nicely resource-conservative, but also meant that under some (heavy, unusual) loads it might not be able to store reference to all free space, so not give out that space until many VACUUMs later.

The table might still grow over time as a result, and/or you wanted to configure a larger FSM to avoid that.

Since 8.4, it's a file on disk, and complete rather than limited in size(verify)

A VACUUM FULL basically rewrites the entire table before removing the old copy.

This was one fix to "halp my table grew for FSM related reasons, maybe around a bulk import or something else unusual".

It also, in general, helps defragment a table, though this is usually not reason enough.

In particular, one reason you want to avoid a VACUUM FULL is that it requires an exclusive table lock for as long as it takes to read and write all data. This is a Bad Thing in a production system.


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Auto-vacuuming means a VACUUM will happen automatically every now and then. You can enable/disable this as you wish, tweak how often it happens, and how quickly it selects tables to vacuum.

(Autovacuuming used to be done by a separate process. It was integrated in the daemon in 8.1)


  • having it enabled: autovacuum=on
  • having stats_start_collector and stats_row_level both enabled (true in the config), for the ability to see which tables changed a lot.

When it happens:

  • once every autovacuum_naptime time, postgres will choose one database and vacuum the tables in it that meet certain conditions (to avoid unnecessary work), mostly those that have seen many changes (this can be tweaked)
  • it will also happen every ~2 billion transactions regardless of whether it was otherwise enabled (to avoid potential problems related to internal wraparound. It's actually hard to cause this issue, but it can't hurt to avoid it)

See also:

Cost-Based Vacuum Delay

VACUUM is in itself a lot of read operations, which implies that it easily becomes a major part of disk scheduling, which can slow production operations somewhat.

You can make VACUUM back off whenever there are database operations waiting. This will make maintenance take longer, but will lessen the impact of this maintenance on user response and, unless the database never sees any quiet time, will do most of the work in the times between real work.

This is disabled by default.

To enable it, set vacuum_cost_delay to something larger than zero, namely the amount of time, in milliseconds, that the vacuuming process should sleep each time it has accumulated vacuum_cost_limit worth of work (so that waiting database operations get a chance in between).

Through the nature of the backing OS's timing, the resolution of this sleep may be no better than 10ms (on Windows, mostly). That resolution then also implies the effective minimum.

Apparently 10 or 20 is also usually enough(verify) to balance with user responsiveness.

The other vacuum_cost_* settings are set to decent defaults for the average case, but you may have specific cases:

  • if you decide to vacuum during off hours, you may want to tweak it to finish fast, while
  • in a 24/7 system you may want to tweak it for minimal impact, even if that means it'll take longer.
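In postgresql.conf terms (example values; the delay defaults to 0, i.e. disabled, for manual VACUUM):

```
vacuum_cost_delay = 10    # ms to sleep once the cost limit is hit; 0 disables the backoff
vacuum_cost_limit = 200   # accumulated cost that triggers the sleep (the default)
```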


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(B-tree) indices are spread on disk, so in tables that see a lot of alterations they will easily suffer from an empty-space problem, and may degrade - within weeks if the table is doing very large amounts of writes and/or deletes.

This applied more before 7.4 than it does now. Space is reclaimed, but in a page-based way, meaning that pathological cases can still lead to index bloat.

In such cases, you may want to rebuild the indices every now and then. A REINDEX is possible, but requires an exclusive lock.

A 'hot' reindex can be done by doing a DROP INDEX and CREATE INDEX within a transaction(verify) (which will still require a write lock).
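
A sketch of that hot reindex, assuming a hypothetical index mydocs_body_idx on mydocs(body):

 BEGIN;
 DROP INDEX mydocs_body_idx;
 CREATE INDEX mydocs_body_idx ON mydocs (body);
 COMMIT;

Also note CREATE INDEX CONCURRENTLY, which avoids blocking writes (but cannot run inside a transaction), and, since postgres 12, REINDEX CONCURRENTLY.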

TODO: read

Table fill factor

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

A fill factor below the default 100 means that INSERTs will leave space in each page, so that a later UPDATE can place a new row version in the same page rather than elsewhere.

This helps update-heavy loads, because it means writes can often keep going to the same page rather than touching several.

If you expect heavy UPDATE load, consider lowering fill factor.

If not, you will just lower the efficiency of your storage. Though not by much.

For write-once, append-only tables, 100 makes sense:

Less space is used, less disk reading is done.
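
For example, with a hypothetical update-heavy counter table:

 CREATE TABLE hitcounts (page text, hits int) WITH (fillfactor = 70);
 -- or, on an existing table (affects newly written pages only):
 ALTER TABLE hitcounts SET (fillfactor = 70);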

Index fill factor

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Again, the concept is how full to pack pages.

The details are more interesting because of the way each index type works.

For example, for B-tree indices the default is 90, and lower values make for fewer page splits in heavily updated pages.
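
Setting it looks much the same as for tables; for example, on a hypothetical table hitcounts:

 CREATE INDEX hitcounts_page_idx ON hitcounts (page) WITH (fillfactor = 70);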

Architecture, and disk related details

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Process-wise, you have roughly: ((verify) whether this has since changed)

Postmaster process (one)

  • the thing on port 5432
  • mostly does authentication and handing off a client to a backend process

Utility process (one)

  • WAL writer
  • WAL sender/receiver (in pg9)
  • bgwriter
  • stats collection
  • syslogging

Backend processes (one per active connection/client, limited by max_connections)

  • various details and tweaks apply per backend, including...
    • work_mem
    • temp_buffer
    • maintenance_work_mem
    • catalog cache
    • optimizer/executor
  • Backends share a bunch of things (all via SHM?), including...
    • shared buffers
    • wal buffers
    • locks
    • clog buffers
    • other buffers

Disk details: WAL, fsync, logging

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
For context

The Write-ahead log (WAL) is journaling of transactions on their way to the actual database blocks.

The WAL is used for recovery at transaction granularity, and can be used for replication. Some details relate to the MVCC nature of postgres.

Once a session commits a transaction, the data is stored in the WAL buffers in RAM, then soon written into a WAL segment on disk (segments are particular files in your pg_xlog directory, each 16MB).

WAL checkpoint interval

Checkpoints are points in the WAL where postgres guarantees up to that point is in the database blocks (heap and index).

How often you want checkpoints to happen is a tradeoff between runtime performance and how much time will be spent in recovery (recovery checks the xlog for the last checkpoint record, essentially the point at which everything on disk was known to be consistent, and works the journal from there).

Checkpoints will happen, with default settings, as soon as either of the following two things happens:

  • at most five minutes (checkpoint_timeout=300)
  • at most three segments apart (checkpoint_segments=3)
    • where a checkpoint segment is 16MB, so 3 means 48MB of data
    • People have suggested that this can be considerably larger, at least for high-write-volume cases. A setting of 10 (160MB) isn't crazy on a modern setup, and a few factors more can make sense on a few setups.

People may increase the checkpoint interval when doing a lot of small transactions, because it can help performance and/or improve IO load patterns.


  • this is about how fast to put data on disk, and has no effect on what data you will see.
  • many people decide that recovery is rare enough that that slightly better performance is worth extra time taken during recovery
note that recovery time is less significant on SSD, which may affect how you think of this tradeoff
  • Checkpoint writing is always done with some backoff, that avoids dominating IO. This too is tweakable.
  • If checkpoints are happening very quickly (default is faster than every 30s), postgres will complain in its logs.
if the cause is regular use (rather than a huge import, or many manual CHECKPOINTs) you may wish to tweak to put them further apart, to spare IO a bit. Or consider improving your storage system.
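
A postgresql.conf sketch for spacing checkpoints further apart (values illustrative):

 checkpoint_timeout = 15min
 checkpoint_segments = 10            # pre-9.5; newer versions size this via max_wal_size instead
 checkpoint_completion_target = 0.9  # spread checkpoint writes over more of the interval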

WAL size

On wal_buffers size:

It is useful to make it large enough to hold the data from any single transaction (and possibly a few).
It used to default to 64KB - which is very conservative, because when it's full it will block.
Since postgres 9 it defaults to 1/32nd (~3%) of shared_buffers (up to the WAL segment size, 16MB), which is also a good suggestion for earlier versions[9].
More than a few MB doesn't help speed (because it's written to disk fairly quickly anyway), though it can rarely hurt - but do check your kernel's SHM limits, as there is no point in stealing limited SHM from other useful things. More than 16MB is often nonsense.
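
In postgresql.conf that might look like (value illustrative):

 wal_buffers = 8MB    # or -1 (the default since 9.1) to auto-size from shared_buffers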

WAL and IO

There is some additional tweaking possible. Consider that if there are a bunch of related transactions that will all commit very soon (because they are short-lived), then waiting a little time might allow us to write all these transactions with fewer fsyncs. The two settings:

  • commit_siblings (default 5): the amount of transactions that must be currently running before we consider this behaviour
  • commit_delay (default is 0, meaning it is disabled): The amount of time (in microseconds) to wait for another commit/WAL record. This can be useful e.g. when applications consider autocommit sensible.
    • Note that setting this too high will make the WAL writer sit around pointlessly. Unless you expect this to help for your workload, this feature is not that useful(verify). Example setting is 1000 (1000us, which is 1ms).


You can, for any database on platter disks, expect commit rate to be at most one per disk revolution. This may be slightly better with synchronous_commit (or fsync) off.

On synchronous_commit

This setting controls whether to block while WAL flushes data to disk. Default is on.

If off, the data might be written to disk a few hundred ms (3*wal_writer_delay) after success is reported.

It poses no risk of corrupt tables, but a crash will lose whatever WAL data was not flushed at that time.


  • When looking for IO relief and not caring about the last few seconds of data, disabling this is a much better idea than disabling fsync.
  • Can also be disabled per transaction, which is a more fine grained way of making non-critical commits back off a bit.
SET synchronous_commit TO OFF
note that some things are always synchronous, such as DROP TABLE and things like two-phase transactions[10]

See also

Database fsync

First, see fsync in general.

For most database engines, recovery after a crash is up to the last verifiably correct part, which is the last bit of data+journaling that was fully written to disk.

Usually that is very recent state, because we fsync often - the most robust way is to follow each transaction commit with an fsync that waits for disk to say "yes, it's on there".

If any part of this plays any looser, data committed at SQL level may not in fact be on disk yet. Do that in a controlled way and, roughly speaking, verifiable recovery may be up to a minute or two ago rather than seconds ago.

Do it in a less controlled way, e.g. fsync=off, which basically hands behaviour over to your OS (and your tweaking of it), and in relatively pathological cases this may even lead to unrecoverable states. Your tweaking could bring those out, so when in any doubt, leave it on.

...if only because you were probably doing that for speed, and there are safer ways to get most of that. Aside from the obvious "Use an SSD".

Most OSes implement fsync variants that mean "accept and collect data for a very short amount of time, to use disk rotations more efficiently" meaning they'll get to it very soon instead of right now.

This gets you comparable speed increases with less risk and recovery up to more recent state.
And may already be the default for you.

If you are willing to make the tradeoff to "losing a few minutes ago is fine", set synchronous_commit=off -- see above. Roughly speaking this also delays the write, but in a much more controlled way that should always be recoverable.

Separating different IO streams

You may have:

  • system drive (syslogs and such)
  • system swap partition/file
  • transaction log (xlog)
  • postgres WAL
  • postgres data
  • postgres indexes (verify)
  • postgres temporary data (verify)

...all on the same disk.

If feasible, you can consider putting at least the major ones on different physical drives (JBOD style). It'll help latency in general, and when heavy writes happen it can avoid doing some things effectively twice on the same disk (WAL+tables).

How much depends entirely on the kind of load.

  • If your load is almost entirely reading, it matters less:
    • there's no waiting on writes
    • for small enough databases, reads are served by RAM anyway
  • If read-heavy and limited by read throughput, then more platters in your RAID is more valuable
  • in general it can matter how easily things are paged in or out (if you have other heavy RAM/disk users)
  • how often joins are very large
  • how often the system log is fsynced (often is better for debug, not ideal for IO)
  • ...and so on.

This is mostly about when things start being bound by disk, and write-heavy - in which case you also want to look at SSDs.

Most postgres parts can be moved by symlinking the directory it's in to a location on another drive. Make sure the permissions are correct, of course.

for table data you can do it a little more fine-grained by using CREATE TABLESPACE (creates a new storage location) and handing its name to CREATE TABLE and ALTER TABLE (verify)
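
A sketch of that, with hypothetical names and paths:

 CREATE TABLESPACE fastspace LOCATION '/mnt/ssd/pgdata';
 CREATE TABLE bigtable (id int) TABLESPACE fastspace;
 ALTER TABLE mydocs SET TABLESPACE fastspace;   -- moves existing data (takes an exclusive lock)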

Note that WAL is shared by one 'cluster', but if you have several, e.g. for different users, you may wish to put all of these on SSD.

WAL Replication

copying the xlogs

pg9 added streaming replication, which is basically a networked communication of WAL records.

The wal_level setting controls how much information ends up in the WAL:

  • minimal - enough for crash recovery
  • archive - helps archiving
  • hot_standby - lets you create read-only slaves

Note that while the logs record all data changes and they can be used for replication, only so much past data is kept, meaning these logs are not useful as backups.
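
In postgresql.conf this looks like (pre-9.6 value names; 9.6 merged archive and hot_standby into 'replica'):

 wal_level = hot_standby
 max_wal_senders = 3    # how many streaming-replication connections to allow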

other disk-related details

planner costs

This relates to postgres considering your system when planning reads, writes, use of indices, and other aspects of query execution. Most of these settings take a little reading to learn to use.

One example is effective_cache_size, your estimate of how much postgres data sits in the OS cache. When larger, postgres will assume that it is more likely that an index it would read from disk is already in OS cache, and figure that using an index is more likely to pay off than a sequential scan. The setting should not be overestimated, and you should be aware of the difference it makes whether the server is dedicated or not.

While there are reasons to change other values (e.g. random_page_cost), the most interesting value here is probably effective_cache_size.

TODO: read:


Generally you will either want it to go to syslog or, if that does not allow enough level filtering control, postgres's own logging.

log_destination (defaults to stderr) - one of

stderr - postgres default (distro packages may alter this)
csvlog - parseable comma-separated format (e.g. to then store in a database), applies only when using logging_collector
syslog - hand off to the system logger
eventlog (windows only)

logging_collector (off by default in postgres itself, though distro packages often enable it) - captures each postgres process's stderr and sends it to a log file, stored in...

log_directory (defaults to 'pg_log')

You probably also care about

  • log_min_messages (default is WARNING) - log level to log at all
you can configure more details (see the docs)
  • log_min_error_statement (default is ERROR) - log level to log with more contextual information (e.g. current query), meant for debugging
  • client_min_messages - (default is INFO) - log level filter towards connected clients
  • log_min_duration_statement (defaults to -1, which is off) - slow query logging
a positive value is in milliseconds.
note you can do this per connection with e.g. SET log_min_duration_statement TO 10000
can make sense for maintenance / bg stuff you know takes longer
some permission blah?(verify)
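
Pulled together, a postgresql.conf logging sketch (values illustrative):

 log_destination = 'csvlog'
 logging_collector = on
 log_directory = 'pg_log'
 log_min_messages = warning
 log_min_duration_statement = 2000   # log statements taking longer than 2 seconds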
There are some tools you may like, see e.g.

See also:

Full text search notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Since 8.3, postgres has fulltext search.

The mechanics are a fairly basic inverted index, and because it's indexable, it's quite performant for exact search (and fuzziness via normalization), and still decent at fuzzy searches.

My personal take is that if your use case is served by a single database (you don't need sharding), avoiding a whole separate service just to add search can mean less development and potentially more efficiency.

Basic guts

tsvector type

  • contains document in parsed form
  • basically a list of (lexeme,position,weight) (verify)
  • you often want to use to_tsvector to create a (normalized) one from text

tsquery type

  • contains query in parsed form
  • may combine multiple terms using AND, OR, and NOT operators
  • you often want to use either to_tsquery() or plainto_tsquery()

@@ operator (from the docs):

@@ is the match operator, which returns whether a tsvector (document) matches a tsquery (query)
e.g. the following returns true and false respectively
select 'quick brown fox'::tsvector @@ 'quick'::tsquery,
       'quick brown fox'::tsvector @@ 'bear'::tsquery;

(these are not the parsing/conversions you would use in practice, they're just shorter to write)

ts_rank() / ts_rank_cd() will rank results in a set,

using weighting that involves both the data and the query - see details below
note this is not TF/IDF, though there are extensions that add that, and you could store IDF separately yourself.

Practical details - indices, columns

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

To see how matching might work (without any table alterations or index creation), find an existing text column, and do something like:

SELECT textid  FROM mydocs     WHERE to_tsvector(body) @@ to_tsquery('fork');

That works, but is slow.

In practice you only ever want to run queries against a tsvector index.

So, create an index, choosing either GiST or GIN. As in general with those two:

  • GIN searches faster, but updates slower, so are probably preferred on relatively static tables.
Indexes are larger than GiST's (so mean more memory/IO)
  • GiST updates faster, but searches may be slower
Handier for often-updated data (particularly if it has a smallish amount of lexemes)

...but that skips a step.

Your job has become "feed tsvectors into the indexing" but where do those come from?

tsvector column or not?

You have a choice between:

  • given a text column, add an additional tsvectors column, and add an index on that column
easier to massage your text outside of postgres
easier to experiment/debug, in particular inspecting what the parsing does (making it useful in your first experiments with this)
because you never actually search that tsvector column (you search its index), this takes more space than strictly required
ALTER TABLE mydocs  ADD COLUMN searchvector tsvector;
CREATE INDEX mydocs_vector_column_ft_index ON mydocs  USING gin(searchvector);
  • given a text column, not having a tsvector column and instead telling the database to update the index directly
takes less space, which can be good in production
index updates are handled for you
a little more black-box, in that you don't get to see it or do much with it, not great for debug in development
CREATE INDEX mydocs_body_ft_index          ON mydocs  USING gin(to_tsvector('english', body));

If you've gone for the former (tsvector-in-a-column) approach, then you have to set the tsvector to get the index to update.

  • You could update the tsvector column yourself in every INSERT and UPDATE, something like
UPDATE mydocs set body='foo bar', searchvector=to_tsvector('simple','foo bar') where id=%s
  • ...or have it done for you via a trigger whenever the text field changes, like:
CREATE TRIGGER searchvectorupdate
 BEFORE INSERT OR UPDATE ON mydocs
 FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(searchvector, 'pg_catalog.english', body);
That built-in tsvector_update_trigger() function mostly just does a to_tsvector().

For simple cases there isn't much difference, but you might want to evaluate what practice means for you when you do things like composing a tsvector from multiple columns, or adding weights.

The latter choice then means you probably want a custom function (plpgsql probably covers your needs) to do something like (see docs for full example)

 setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
 setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');

regconfigs, dictionary, thesaurus

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Quoth the documentation [11][12]

"A text search configuration specifies all options necessary to transform a document into a tsvector: 
the parser to use to break text into tokens, and the dictionaries to use to transform each token into a lexeme.
Every call of to_tsvector or to_tsquery needs a text search configuration to perform its processing."

In other words, it's processing that includes

a parser to tokens,
a dictionary to apply a series of rules (called templates) to these tokens, mostly used to
remove stopwords
map synonyms to the same tokens and/or add tokens (via a thesaurus)
normalize tokens into a canonical form

That normalization is often intended to reduce things to a canonical form, e.g. based on

  • ispell - lowercases, checks stopwords, stems
  • snowball - lowercases, checks stopwords, stems
  • simple lowercases (no stopword removal)
select to_tsquery('english', 'The & Fat & Rats'),
       to_tsquery('simple',  'The & Fat & Rats');

Is respectively:

"'fat' & 'rat'"
"'the' & 'fat' & 'rats'"

You can also alter and create new dictionaries to vary this behaviour, e.g. have english stemming, no stopword removal, add your own synonyms, etc.


  • You want to use the same processing on the tsvector and the tsquery, or you will get few and/or weird matches.
In general, don't mix configs unless you know what you're doing.
if you do mix them (e.g. handle documents according to their own language), consider storing the config you used alongside in a column
  • Each query will use the default search configuration (default_text_search_config), unless told otherwise
so either set that consistently, or hardcode your regconfig
  • The default configs and dictionaries are mostly for stemming for a bunch of languages, and simple for not doing that.
The configs I have (as \dF told me in psql; \dFd lists dictionaries):
arabic,   danish,   dutch,    english,  finnish,  french,   german,   hungarian, 
indonesian, irish,    italian,    lithuanian,   nepali,  norwegian, portuguese,   
romanian,  russian,    spanish,    swedish,  tamil,    turkish
  • maybe poke around your tsearch_data directory for a few minutes

See also:

Search syntax

Search syntax

When you use to_tsquery(), special characters are (, ), |, &, !, :, *, whitespace, and null.

  • &, |, and ! are your typical boolean operators. Usable with (, ) parentheses.
  • :* indicates prefix matching, like 'string':*
  • Single-quoted strings are necessary when adding phrases as such, and can trigger the dictionary.
Terms themselves are single words, or single-quoted words/phrases.

Some examples (without SQL's quoting/escaping for readability):

supernova & !crab
'supernova' & 'sun' 
cat:* &!'catsuit'

plainto_tsquery() does not parse these operators - it takes plain text and constructs a basic AND search, e.g.

plainto_tsquery('search term')

is equivalent to

'search & term'::tsquery

to_tsquery() will parse these operators.

Note that if the syntax is invalid, you'll get an ERROR: syntax error in tsquery.
Because of that, and because the syntax is somewhat awkward for people, you probably want to generate queries from code, and possibly via a simpler query syntax.

Proximity matching, weighed terms, and ranking

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The @@ operator just gives you a match-or-not, as a boolean.

You can additionally assign scores to the things that such filtering passes.

what it's based on

tsvector always stores the position of each token, e.g. 'quick:1 brown:2 fox:3'::tsvector,

This can be used both for matching based on proximity, and on scoring based on proximity matches.

Optionally, tsvector elements can have one of four categories, which you can use as a coarse weighing-in scheme, to construct things that e.g. amount to "match in any of title, keywords, body, but the score is strongly based on appearance in keywords, with a middling effect from the title, and little effect from matches in the body"

This can be used for scoring, and for matching.

So your options for scoring are roughly:

  • frequency of matching lexemes only
  • option: adding proximity
  • option: have some parts in a document respond more strongly (via categories used as weights)
you can't have parts of a query weigh differently, though

Proximity scoring

For example

select ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('quick & fox')),
       ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('brown & fox')),
       ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('quick & brown & fox'));

...returns 0.05, 1.0, 1.0, based purely on the distance.

Read the docs if you want more control.

Proximity matching

Note that if you want to filter strictly-adjacent phrase matching, older versions would force you to do your own inspection on @@'d results, e.g.

SELECT * FROM articles WHERE fulltext @@ 'word1 & word2'::tsquery AND fulltext LIKE '%word1 word2%'

...which is also not the fastest for large result sets.

Since 9.6 there is <-> style syntax, usable within to_tsquery

<-> is strict proximity, <2> for within 2 words, <3> for within 3, etc.
note that this is ordered, so is not symmetric

tsvector categories

Each token in a tsvector can have one of four categories, A, B, C, or D. The default is D, and D weighing is omitted when printing these.

These often come from pasting together a tsvector from tsvector parts, e.g.

UPDATE tablename SET searchvector = setweight(to_tsvector(coalesce(title,'')),    'A') ||
                                    setweight(to_tsvector(coalesce(keyword,'')),  'B') ||
                                    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
                                    setweight(to_tsvector(coalesce(body,'')),     'D');

You could also specify it yourself, e.g.

'fat:2,4C cat:3B rat:5A'::tsvector

represents "fat cat fat rat" with some manually added categories.

tsvector categories as weights

Weighing means different terms contribute differently (the default weights are A=1.0, B=0.4, C=0.2, and D=0.1).

For example, the following:

select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector,  to_tsquery('fat'))
select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector,  to_tsquery('fat & cat')) 
select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector,  to_tsquery('fat & rat'))

work out as approximately 0.2, 0.4, and 0.6.


  • The || above acts as if the text was concatenated (position indexes of concatenated tsvectors are shifted up in the result)
coalesce ("returns first non-NULL argument") is used here to avoid cases where NULL elements make the whole thing NULL

  • normalization allows you to take document length into account, in particular to avoid long documents scoring higher merely because they contain more words.
Default is not to do that; nonzero values select a few different strategies.

tsvector categories matching

When you use categories in tsqueries, e.g. rat:AB, it will match tokens only if they are in one of the mentioned categories.

For example, 'cat:2A,2B'::tsvector @@ 'cat:C'::tsquery is false.

This seems intended as a "you can sort of have parts in a tsvector even after it's concatenated".

See also:

Some practicalities

Inspecting tsvectors for very common things (that you may want to remove)

SELECT * FROM ts_stat('SELECT searchvector FROM mydocs')
ORDER BY nentry DESC,  ndoc DESC, word        LIMIT 250;

See also


template databases

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
  • template1 is what every CREATE DATABASE copies, effectively a skeleton database
This lets you add site defaults, or site-default modifications
  • template0 is an alternative you can forcibly use
its main point is verbatim restores, which are only truly portable when site defaults haven't already been added
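
For example, restoring a dump into a database guaranteed to start out pristine:

 CREATE DATABASE restoretarget TEMPLATE template0;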

See also:


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

host-based authentication (hence the filename), is mostly about figuring what kind of authentication applies, depending on how you're connecting (and sometimes to not allow a connection at all - this part has some functional overlap with network-level firewalling).

This is a layer that applies partly before we even talk authentication or privileges.

...except that trust/peer/ident can effectively bypass auth in specific conditions.

This config file is possibly easiest to find via sudo locate pg_hba.conf (locate since database installs aren't always in the same place, sudo because your typical login user may not be allowed to read it).

Note that after editing, you can reload without a restart using SIGHUP (often killall -HUP postmaster) [13]

More verbosely

Line syntax is roughly:

CONNTYPE   DATABASE  USER  [ADDRESS] METHOD  [OPTIONS]


CONNTYPE can be:

  • local for unix domain socket
  • hostnossl for TCP/IP without SSL
  • hostssl for TCP/IP with SSL
  • host for TCP/IP (with or without SSL)

DATABASE can be:

  • database name (multiple need to be comma-separated)
  • all
  • samerole for database names matching role name
  • replication (note: all does not include this, so you often want separate all and replication lines)

USER can be

  • all
  • a username
  • a group name (prefixed with +)
  • a comma-separated list of users/groups


ADDRESS can be:

  • samehost for any currently configured IP on the local host
  • samenet for anyone on the networks currently configured on the local host
  • an IPv4/IPv6 address
  • an IPv4/IPv6 address with CIDR-style mask (/numbits)
  • a hostname
  • a domain name (start with a dot; it will match everything under there)

METHOD can be:

  • trust unconditionally
  • reject unconditionally
  • peer - for unix domain sockets, use host auth via the kernel, with optional username mapping (see more below)
  • ident - like peer, but over TCP by querying using the ident protocol (think RFC 1413).
...which, note, is not an auth protocol
  • md5 password
  • password (plain text, please avoid)
  • cert - SSL client certificates
  • External auth:
    • pam
    • ldap
    • radius
    • gss GSSAPI
    • sspi SSPI
    • krb5

OPTIONS are further NAME=VALUE style options to the auth module

See also:

On peer auth (and ident)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Peer auth applies only when connecting via local domain sockets.

It asks the (*nix) kernel for the connecting user's username (via getpeereid).

When said username matches database username(verify), it allows login (possibly passwordless).

(note: for libpq, domain sockets apply when you fill in nothing for the host, rather than 'localhost'. Other things may have their own connstring logic)

It's e.g. quite convenient for local db admin tools.

Securitywise it has its upsides and downsides.

For example, it reduces attack surface when it means the database doesn't need to listen on IP at all,

and there is no password to steal.

At the same time, this shifts auth to the ability to become a particular user on that particular host (have its password, or appropriate sudo rights).

Also some practical sides:

If your app is fine running on the same host, you could give it only peer access.
On the other hand, it means your auth config is now in several places (harder to oversee), and this is not portable to e.g. windows.

Default pg_hba often has peer for all databases, though usually, after a plain install the only user that this applies to is postgres (verify)

local   all             all                                     peer
host    all             all             127.0.0.1/32            md5
host    all             all             ::1/128                 md5

Ident auth is roughly the same idea, but asks via identd instead, meaning it will work for TCP connections (...given you've set up identd).

You want this only when you understand the security implications, in particular that you are moving parts of your auth off your host (while not being auth as such). In some tightly controlled mutually-trusting hosts this may be convenient; in anything that doesn't live in such a bubble it's an unnecessary attack vector. Avoid it unless you understand that and are fine with it.

FATAL: Peer authentication failed for user "postgres"

Since peer only applies to domain sockets, this comes down to that your effective username doesn't match the database username.

When you wanted peer auth, then you're probably running as the wrong user.

When you actually wanted to log in with a password, the issue may well be that your database client library / connector assumed you want a domain socket when you didn't -- which means the wrong pg_hba line applies.

The fix is to specify an IP address from your client, so that it won't choose to use a domain socket. (note that when you don't specify a hostname to libpq, it assumes you want domain socket)

FATAL: Ident authentication failed for user "postgres"

This often means your OS has installed a local(-only?) identd setup.

It seems some distros do this, specifically for postgres, to imitate peer (possibly figuring it's more portable than the syscall that peer uses). For example Redhat's initdb defaults to ident auth [14][15]

In that case you can learn how RH/other wants you to use this, but I prefer to use md5 (or maybe trust or peer) as I figure appropriate.

(i.e. not using ident at all, it annoys me)

pgadmin notes and errors

Note that you can imply domain socket by leaving address field blank. (is this actually a libpq thing?(verify))

On prewarming caches with indices and/or data
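
One option (since 9.4) is the contrib extension pg_prewarm; a sketch, assuming a hypothetical table mydocs with an index mydocs_body_idx:

 CREATE EXTENSION pg_prewarm;
 SELECT pg_prewarm('mydocs');            -- read the table into shared buffers
 SELECT pg_prewarm('mydocs_body_idx');   -- same for an index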

Seeing what your server is up to

If you log in via shells, it can be convenient to use


I've used the following to expose details via a web page:

SELECT count(*), query  FROM pg_stat_activity  GROUP BY query;

Also working on a CLI utility that displays more from that table, more usefully.

"idle" and "idle in transaction"

Idle are those connected, not currently talking to the database, and not in the middle of a transaction.

They won't block other queries from progressing, but they do count towards the connection limit.

If the client just forgot about the connection, then these open-but-unused connections may add up over time, and lead to blocking new connections even though nothing is being done. So it's good practice to close connections as soon as you can.

Note that using a connection pooler, you would expect a fairly-constant handful of connections to always be there and idle, ready to be given out.

Idle in transaction means a connection has started a transaction, has not committed it or rolled it back, and (if you see these, most probably) hasn't talked to the database for a while.

Idle in transaction can lead to blocking other queries, depending on the read-or-write nature of the operations involved, and implicit and explicit locks in that transaction.

Usually, idle in transaction means the app does a few seconds of work between parts of its transaction.

Within a single maintenance script this is often nothing to worry about.

As this is lazy (arguably buggy) code that might block other things for a little while (so become a performance bottleneck), it's something you want to iron out of your regular code.


  • try to figure out the entire transaction before starting it
  • (if that's too annoying,) commit changes as soon as sensible
  • if you reuse a connection a few times, rollback after selects
  • Handle errors - at least avoid "ignore the state of a now-dangling connection and hope the problem goes away via a timeout or something"
the brute-force fix is "rollback or close immediately after errors"
  • keep implicit transactions in mind
  • Use auto-commit where sensible (it avoids transactions)
You could try it as a default, and use transactions only where you need them.

...whichever of these apply. If you can't, timeouts (e.g. idle_in_transaction_session_timeout) may help a bit.
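Much of the above boils down to one habit. A minimal sketch with a generic DB-API connection (the helper name is made up; `conn` would come from e.g. psycopg2.connect()):

```python
def run_safely(conn, sql, params=()):
    """Execute one statement, committing on success and rolling back on
    failure, so the connection never lingers 'idle in transaction'."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        conn.commit()      # commit as soon as sensible
        return cur
    except Exception:
        conn.rollback()    # the brute-force fix: roll back right after errors
        raise
```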

On implicit transactions

Postgres itself: [16]

"PostgreSQL actually treats every SQL statement as being executed within a transaction. 
If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN 
and (if successful) COMMIT wrapped around it."

(which you could call autocommit)

Clients may easily add some behaviour on top of this. psql follows the above; psycopg2, on the other hand [17]:

By default, Psycopg opens a transaction before executing the first command: 
if commit() is not called, the effect of any data manipulation will be lost.
The connection can be also set in “autocommit” mode: no transaction is automatically open,
commands have immediate effect. See Transactions control for details.

Other libraries may add yet more behaviour around that.

Finding the offending code

(Chances are you'll see something like an error case that doesn't lead to a rollback)

The most recent query on the idle-in-transaction connection is usually a strong hint. I sometimes add comments to queries to help.

pg_top shows the state but not the most recent query.

You could still use it if you add something greppable to the logging, probably the PID. Look at log_line_prefix, e.g.:

log_line_prefix = '[%p] %u@%d %r'
# ...particularly if you want to build some of your own reporting around this

It's probably simpler to run something like the following in psql

select usename, datname, query from pg_stat_activity where state like 'idle in%';
\watch 1

See also:

Estimating disk size of things


databases size

  • pg_database_size

table and index size

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Keep in mind that

  • pg_relation_size(relname)
    • counts direct table size
    • does not count TOAST
    • does not count indices
  • pg_table_size(regclass)
    • counts direct table size
    • counts TOAST
    • does not count indices
  • pg_total_relation_size(regclass)
    • counts direct table size
    • counts TOAST
    • counts indices
    • (equivalent to pg_table_size() + pg_indexes_size()) (verify)

See also:

row sizes

You can ask for the stored size of a value, and thereby estimate the size of parts of a row.

  • individual row size can be fetched with something like:
select pg_column_size(tablename.*) from tablename
  • average row size can be had by adding an avg() to the above,
...or by dividing table size by the row count, which will also count tuple headers, padding, etc. (verify)
remember how NULLs (don't) count.

  • keep in mind you're not counting indices. To consider those too, you may be more interested in something like:
select pg_total_relation_size('relname')::real / count(*) from tablename

column sizes

  • pg_column_size(value) (since 8.4) says how much space it takes to store a value (and you can avg() or sum() them for a table)
When TOAST applies...
a column may take four bytes more (the reference into the TOAST storage) (verify)
when compression happened, it reports the compressed size.
  • octet_length(value) gives the uncompressed size -- of text/byte fields only (verify)
  • keep in mind you're not counting indices.
when e.g. calculating the percentage of storage used by a column, you likely want to include index size

It's trickier to compare all a table's columns individually within a single SQL statement, because pg_column_size takes values, not column names. You can write a server-side function (makes sense if you want regular monitoring), but it may be easier to cheat and do it from the client side, e.g.

def colsizesums(tablename, curs):
    ''' For the given table name, in the currently selected database, returns each column's:
         - name
         - type
         - total size       (bytes,  e.g. 995782220)
         - total size       (string, e.g. 950 MB)
         - average row size (bytes, rounded a bit, e.g.  588.7)
        Note that this does a tablescan for each column, so isn't fast.
          To do all at once you need a little more cheating.
        Note also that it's not injection-safe (second select),
          don't expose this without hardcoding table names.
    '''
    import re
    ret = []
    tablename = re.sub('[^A-Za-z0-9_]','',tablename) # basic sanitization
    curs.execute("""SELECT column_name,data_type  FROM information_schema.columns
                    WHERE table_name = %s  ORDER BY ordinal_position""", (tablename,))
    for colname,coltype in curs.fetchall():
        # yes, that's interpolated. Should be safe _enough_,
        #  assuming table and column names are not insane.
        curs.execute("""SELECT sum(pg_column_size(%s)),
                               pg_size_pretty(sum(pg_column_size(%s))),
                               avg(pg_column_size(%s))
                        FROM %s""" % (colname, colname, colname, tablename))
        row = curs.fetchone()
        ret.append( (colname, coltype, int(row[0]), row[1], round(float(row[2]),1)) )
    return ret

Which gives output like:

[('id',           'integer',                       531360, '519 kB',   4.0),
 ('path',         'text',                        18399078,  '18 MB', 138.5),
 ('ext',          'character varying',             537550, '525 kB',   4.0),
 ('datetime',     'timestamp without time zone',   949232, '927 kB',   8.0),
 ('stillpresent', 'boolean',                       132840, '130 kB',   1.0),
 ('data',         'text',                        33423962,  '32 MB', 251.6)]


  • NULL values barely add storage
on top of constant row overhead, anyway - NULLs are not stored as values; instead there is a per-tuple bitmap of which columns are present in this tuple
  • you can hand in literals.
rows/tuples can be handed into pg_column_size (but not octet_length). Asking for rows or tuples will also count the 24-byte tuple header, so:
select pg_column_size( 1::real ) + pg_column_size( 1::real ) = 8
select pg_column_size( (1::real,1::real) ) = 32
  • you can estimate TOAST compression in a (text/byte) column like:
SELECT avg(pg_column_size(colname)::real / octet_length(colname)::real) from tablename;
note that there may be very little compression when TOAST decides a value fits in the table itself.
Counting only actually-compressed things is a little harder

Fast count from a table

If you need an accurate count:

In general, if you do:

select count(*) from mytable

you'll find this is slow for very large tables, more or less regardless of anything.

This is because, to give a fully accurate answer, it has to scan a lot.

The reason lies in MVCC: row visibility is decided per transaction, so an accurate count cannot simply be read off from metadata. You could say this counting operation is considered rare, and less important than MVCC's tendency to relieve locking for more typical queries.

Since 9.2, an index-only scan can occasionally satisfy a subset of counting queries, which is typically faster than a table scan. It comes with some restrictions, though, so evaluate whether you really need this.

If you need a fast and accurate count

Keep a counter, keep it updated. Using triggers on table alterations may be the easiest way. It's extra work on most alterations, though, so only do this when it's important.
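A sketch of the trigger approach (untested; table, function, and trigger names are made up, and this ignores TRUNCATE):

```sql
-- single-row side table holding the current count
CREATE TABLE mytable_rowcount (n bigint NOT NULL);
INSERT INTO mytable_rowcount SELECT count(*) FROM mytable;

CREATE FUNCTION mytable_rowcount_maintain() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE mytable_rowcount SET n = n + 1;
    RETURN NEW;
  ELSE  -- DELETE
    UPDATE mytable_rowcount SET n = n - 1;
    RETURN OLD;
  END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_rowcount_trig
AFTER INSERT OR DELETE ON mytable
FOR EACH ROW EXECUTE FUNCTION mytable_rowcount_maintain();
```

Note that the single-row UPDATE serializes concurrent writers (part of why this costs something), and that versions before 11 spell the last line EXECUTE PROCEDURE.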

If a recent-ish estimate is good enough


SELECT reltuples::bigint AS estimate FROM pg_class where oid='myschema.mytable'::regclass;

The accuracy of this result varies with how often you ANALYZE (with autovacuum in place you're typically good).

Alternatively, have a background job do a count every minute or two, store it somewhere, and have everything else fetch that stored count. Note that since this still does sequential scans, it stops scaling once the data doesn't fit in RAM.
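A sketch of that caching idea, with a hypothetical `fetch` callable that runs the expensive count:

```python
import time

class CachedCount:
    """Serve a remembered count, re-running the expensive query at most
    once per `ttl` seconds (a sketch; in practice you'd likely store the
    value somewhere shared, so all clients see the same number)."""
    def __init__(self, fetch, ttl=120):
        self.fetch, self.ttl = fetch, ttl
        self.value, self.when = None, None
    def get(self, now=None):
        now = time.time() if now is None else now
        if self.when is None or now - self.when >= self.ttl:
            self.value, self.when = self.fetch(), now
        return self.value
```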

See also:


Read the docs for details, but a tablespace roughly means putting some of your databases in a different directory on the filesystem.

The metadata is still in the main database (cluster) it belongs to, so you don't want to separate the two if you care about the health of either part.

It seems mainly meant to let you store some of a cluster's databases on distinct storage without symlinking things yourself (or requiring that you set up a distinct cluster).


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

A database is created with a specific character encoding - initdb picks these up from template1, which will typically mean UTF-8 a.k.a. UNICODE.

(An old default may also be SQL_ASCII, which is not great if you ever store anything but ASCII. Similarly, postgres supports codepages, but you usually don't want to use them (unless you know you really do).)

The encoding is applied to its text types in text operations(verify).

The database encoding is essentially a default, in that in many operations you can force others.

The client connection also has an encoding - which the database knows about, so it can convert between database and client encoding (when necessary and possible). As a user you only need to worry about the connection encoding.

You can set the client coding with SET CLIENT_ENCODING TO 'value';, or the more standard (though slightly wider) SET NAMES 'value';

Exactly what your client-side library can do, and defaults to do, will depend on the programming language.

Both because of the text types in your programming language of choice, and because libraries may be configured to do (or not do) extra conversion. What comes in is still a byte coding, and if your language has a unicode string type, it can be very convenient to have it converted to that automatically.

Personally, I dislike implicit convenience conversions. I like to use UTF8 as the database and client encoding, and explicitly convert on the client side when necessary. Yes, that makes for longer code, but it's also clearer (also because not all text needs to be decoded. A case I run into frequently is the *nix filesystem API, which is essentially just arbitrary bytes (other than NUL and slash) - it just happens that most people use utf8 by convention, not least because their LANG is usually that)
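A sketch of that explicit-conversion habit: treat what comes off the wire as bytes in a known client encoding, and decode deliberately rather than relying on the driver (the sample value is made up):

```python
# With client_encoding set to UTF8, text arrives as utf8 bytes;
# decode explicitly rather than letting the driver guess.
raw = b'caf\xc3\xa9'        # bytes as received from the connection
text = raw.decode('utf8')   # deliberate, visible conversion
```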

Client details are up to the client, and API.

It seems that e.g. psycopg2:

  • client encoding defaults to the database encoding
  • outgoing: unicode is encoded according to the client encoding
    • python2 str is not encoded, so has to already be correct for the server encoding(verify)
    • (so both may work as you wish)
  • incoming: you get a str
    • in python2: str means bytestrings, so it hands things back in the client encoding(verify)
    • in python3: it decodes to unicode (using the client encoding)
  • arguably the most controlled way is to both set the client encoding and hardcode the decode
  • arguably the more convenient way is to register a typecaster to always get unicode, on a cursor or right after import

See also: