Postgresql notes

From Helpful
Jump to: navigation, search

For other database related things, see:


"I installed the package, now what?"

The following steps should get you to the point where an client/app can get to the database it needs.

(If your packager didn't:) create postgres database

Postgres is made to allow multiple entirely-separate instances (distinct configs, distinct storage directories), and some distros are set up to take advantage of this.

If so, then their package managers may choose to not automatically create a blank setup, assuming you wanted control over that, and/or may be upgrading.

As such, do one of

  • create a config
when starting from scratch
initdb at the core, and some have pg_createcluster (which wraps initdb) to make this easier, see below
  • supply a config
mostly when upgrading an existing one

e.g. on RHEL, it's basically "run initdb, enable the service, start the service". [1]

Some distros: clusters


While database clusters are a postgresql concept ("a collection of databases served by a postmaster instance"; SQL standard calls it catalog cluster) various distros create just the one - they seem to assume that if you need more than one, you know your stuff anyway.

e.g. debian/ubuntu modeled clusters into their service management and package management, so you can install distinct clusters (of disctinct versions) side by side.

It theoretically also eases updates between them[2]).


You want database clusters e.g. when managing databases to distinct clients.

For simple test setups, just one cluster for all your databases is fine.

Assuming it created a single database cluster (package management usually does), you're already done.

On debian/derivatives

If a cluster does not yet exist (you'll see a lot of messages like "No PostgreSQL clusters exist"), and


will show none.

A cluster has a name and a version. For example, after installing the postgresql-9.3 package you might run

pg_createcluster 9.3 main --start

which creates a set of config files inside


And you can start like

pg_ctlcluster 9.3 main start

(...though that part is usually done via services)

See also:

Check that you can log in

In most cases, after basic installation you have

a host user called postgres with admin rights, and a postgres role within the database
a database called postgres
a pg_hba that trusts localhost

That means the easiest check is to become the postgres user, and run psql (a command line tool).

Assuming you have sudo rights, that amounts to:

sudo -u postgres psql postgres
If it gives you an SQL prompt (
), then there is a running postgres instance, and you're logged into it as the local admin.

If it gives an error, search for the error.

Optional: Set password

That is, the
role within postgresql is basically its superuser.

You'll likely use it for major admin (e.g. creating databases),

It's probably set to trust local admins (see #On_peer_auth_.28and_ident.29). This is fine on your own box / instance, while on shared-user systems there are cases where it's not as secure, so if the data is sensitive then consider setting a password.

(Also, it's sometimes useful to allow remote login, e.g. for tools like pgadmin3, and you'ld probably allow that via password auth -- though note you can SSH-tunnel to avoid needing this)

To set its password, connect to the database, e.g.

sudo -u postgres  psql postgres

Once you're at its prompt:

\password postgres

Optional: creating roles, setting access privileges, and client auth settings

I typically do this later (and frankly am too lazy to do it on my home server), but it's a thing to consider.

By default you are:

  • doing everything as the database superuser (postgres role)
  • only allowed to so so on localhost (because peer auth, see below)

This is fine for initial mucking about, but in production you may want e.g.:

  • admin can do everything (but no regular users can authenticate as them)
  • backup can only read everything
  • each app can read/write, but only its specific database

Access privileges are combinations of

  • actions (SELECT, INSERT, UPDATE, DELETE, ec.)
  • on specific things (DATABASE, TABLESPACE, TABLE, etc.)
  • to roles (users or groups)

Roles are postgres's internal user management (separate from the OS's users - though note that you can tie things to host accounts in pg_hba).

It is preferable to read up on common practice - and also to keep this as simple as makes sense to you, in that the easier you can remember/explain it, the more that random implications won't sneak up on you.

Optional and sometimes useful: client authentication (pg_hba)

pg_hba is part of config, and is rough-grained control of which sorts of connections get a chance to log in at all.

Often filtering out by type of connection, type of authentication, and such.

People often seem to use it as a firewall-like whitelist. The defaults are usually decent for that.

Note that this is is about connections, and unrelated to role privileges.

See #pg_hba

Optional: performance tweaks

There are some conservative defaults that you can increase, e.g. a few relating to memory use.

And on platter disks it helps when not everything (tablespace, WAL, logs) goes to the same platter(s).

Note that if you're unsure where the config file is, or if there may be several and you don't know which one,
psql -U postgres -c 'SHOW config_file'
will tell you which file the running database is using

See also

Use notes


The .psqlrc file is psql's config. See the man page, but example:

# see how fast things are
\timing on
# Setting PROMPT1 to show the server you're on
\set PROMPT1 '%n@%m:%/ \#'
# colors are handled but don't make things more readable
\set PROMPT1 '%[%033[90m%]%n%[%033[0m%]@%m:%[%033[92m%]%/%[%033[0m%]\# '
# See also
# one of:
#\pset pager off
\pset pager on       # won't trigger when output has fewer than ~25 lines
#\pset pager always
# many-column tables can be annoying to read 
#   (also depending on the PAGER itself, see below)
# expanded formatting shows records at a time, vertically.
# to get this only when it's wider than the screen:
\x auto
# See nulls as distinct from empty string. 
\pset null '(NULL)'
# You can also make it distinct from literal whatever-this-string-is 
# by using some unicode character here, like ¤ or ∅,
# as psql will show unicode ''data'' escaped (configurable maybe?)
\pset null '¤'

Your PAGER is a shell thing, that psql also chooses to use. You may want to, in general, tweak it as you like (e.g. if using less, I want -S, to crop rather than wrap lines, as that makes scrolling to the right a lot less confusing), e.g.

export PAGER="less -SaXz-5j3R"
# or
export PAGER="/usr/local/bin/gvim -f -R -"
# or look at things like

You can also consider putting some canned queries in there, see e.g.

Note that if you adminster the database via another user (e.g. postgres),

sudo -u postgres psql
will not pick up the .psql file in that account.

The simple solution is to switch to that user (or ssh as that user) for a normal login.

Another possible fix is something like

alias psql="sudo -u postgres sh -c 'cd ~postgres; . ~postgres/.bashrc; psql'"


Upserts refers to an INSERT where, if a matching(=conflicting) row exists already, falls back to an UPDATE. (see also Relational database and SQL notes#Upsert)

This is not a core SQL feature, but useful.

Postgres will have a "INSERT ... ON CONFLICT UPDATE" in 9.5

Upserts are a weirder thing than you think / it should be, in part because the condition is actually a "would an insert cause a duplicate in a UNIQUE index", and that can race with another such case.

The fallback logic must necessarily happen in a mini-transaction(verify). This makes it interesting to implement efficiently, MVCC makes it even more interesting, as does doing it correctly according to the set isolation mode, and being well defined in terms of the possible errors.

See also:


Mass inserts

temporary tables

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Occasionally you say "hey, I'm creating some costly-to-calculate intemediate data. It could be a subquery, but I'd like to use it more than once, so can I keep this somewhere - and have it be its own thing that auotmatically goes away when the connection ends?"

Basically, create tables using [3]:

There are two things that TEMPORARY guarantees:

  • That they will be dropped at the end of a session.
Or, if you add ON COMMIT DROP, at the end of the current transaction
  • That these tables exist in their own schema
meaning table names won't conflict with existing tables
though when not using fully qualified names, can mask them in your queries
won't conflict between sessions (because that temporary schema's name is picked per session(verify)).


  • will be in RAM for smaller data -- but once the session becomes larger than temp_buffers it will wpill over onto disk
so you may want to increase temp_buffers, or at least keep this in mind
  • it said 'session' up there, not connection. The difference can be important e.g. around connection pooling
so prefer the transaction variant when convenient
otherwise drop it yourself
  • "The autovacuum daemon cannot access and therefore cannot vacuum or analyze temporary tables. For this reason, appropriate vacuum and analyze operations should be performed via session SQL commands. For example, if a temporary table is going to be used in complex queries, it is wise to run ANALYZE on the temporary table after it is populated."

See also:

Memory tables

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

tl;dr: doesn't really exist, but you can get a good imitation if you must

The use case is typically a sort of memcache: shared, transaction-safe, guarantee of no disk IO.

A global temporary table is what you'ld want, and is considered/planned, but not implemented as of 9.4.

You can get something very close in a few ways:

  • An UNLOGGED table means one that does not use the WAL
truncated after a crash (because they're not crash-safe)
but is still transactioned, and visible between sessions
handled largely in RAM, will still go to disk if storing a lot of data
arguably a feature for anything you can't limit the size of
which all makes a bunch of sense for caches (except for the bit where it can occupy disk)
  • A temporary table stays in memory until it's too large for it
apparently because it's a more specific type of unlogged table
but you can't share these, so generally not what you want
  • Add a tablespace on a RAMdisk, then create tables on that specific tablespace.
If you want these tables to be predictably there (around reboots and such), you would do this creation as part of OS/service startup.
Keep in mind that if the WAL is on regular disk, this won't give you much faster alterations, only faster selects.

There are some extensions that do similar things.

Date math

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

On timezones

There are two basic date-and-time types:

  • timestamp
    , a.k.a.
    timestamp without time zone
  • timestamptz
    , a.k.a.
    timestamp with time zone

There is no difference in storage, both take 8 bytes.

The difference is that the latter does conversion to and from the client's timezone, and the former does no conversion.


  • timestamptz always stores UTC
the server converts from the timezone the client has set
client timezone must be correct or this becomes messy
You may want to demand clients always set it explicitly, because it may be inherited(verify)
practical when you interact directly with more than one timezone
  • timestamp you store exactly what you send
useful when
you have existing data without timezone info, and just want to store and report it without coversions anywhere
you consider the server timezone authoritative (e.g. entered now(), difference to now() but absolutely always based on server logic)
you consider the client timezone authoritative (e.g. just reporting what they sent back to them)
...or want to deal with all timezone logic yourself (hint: you generally don't)

You can compare timestamp with timestamptz, because they're both numbers, but you don't want to.

because you generally cannot know it is correct, unless you happen to accidentally have all timestamp values in the same timezone (UTC).

Note that if you have server-side logic involving time, e.g. comparisons like NOW() - INTERVAL '7 days', then you probably want to store timestamptz.

Because otherwise you'ld be mixing the two types and have to think very hard about every such mix.

For example, the docs say,

"Conversions between timestamp without time zone and timestamp with time zone  normally assume 
 that the timestamp without time zone value should be taken or given as timezone local time.
 A different time zone can be specified for the conversion using AT TIME ZONE."


# set timezone TO 'Europe/Amsterdam';
# select now();
2019-07-22 17:52:49.146931+02
# select now()::timestamp;
2019-07-22 17:52:49.146931
# set timezone TO 'GMT';
# select now()
2019-07-22 15:52:49.146931+00
# select now()::timestamp;
2019-07-22 15:52:49.146931

The main points to be made here are probably that

  • the two now() values refer to the same moment in absolute time, and the two now()::timestamp values are not.
  • it does communicate timezone information on the way out (based on where you say you are)

The connection's timezone can be viewed with

SELECT current_setting('TIMEZONE');

Or the less standard but easier to remember:

show timezone;


"Show timestamps from the last week" can look something like:

SELECT * from files where entered > NOW() - INTERVAL '7 days'

"Are time functions volatile?"

In postgres, NOW() and CURRENT_TIMESTAMP() are aliases for TRANSACTION_TIMESTAMP(), which is defined STABLE within a transaction.

There are other functions, e.g. clock_timestamp(), that are volatile.

See also:

data types

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

serial, bigserial notes

As the documentation notes:

CREATE TABLE tablename (
    colname SERIAL

is equivalent to:

CREATE SEQUENCE tablename_colname_seq;
CREATE TABLE tablename (
    colname integer DEFAULT nextval('tablename_colname_seq') NOT NULL

Bigserial and smallserial are the same idea, but use bigint (int64) and smallint (int16) instead of integer (int32)

UNIQUE is not implied on the column. You may want it.

nextval() always increment, and calls never returns the same value. Use from transactions that are rolled back create gaps in the sequence.

See also Relational_database_and_SQL_notes#avoid_auto_increment.2Fserial_if_that_isn.27t_inconvenient

Text and character notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
  • varchar
    map to the same internal types, but varchar implies length check when entered, and text has no limit(verify).
  • char(n)
    will be padded with spaces
  • char
    is short for char(1) (verify)

An E before a string is a string including escapes (a non-standard extension of SQL).

See also:

JSON notes


  • if you do operations on the data, in the database, JSONB's lower access time is worth the extra space
  • if it's actually structured data (or the important core is), use columns
  • the query planner cannot help you on unstructured data
but well-targeted indices work as well as any
  • If it's store-and-return and only the app cares JSON is smaller
...or use TEXT, if you don't want to handle insert/update time validation errors

JSONB is a binary serialization, and

takes some shortcuts that are typically fine (doesn't maintain key ordering, duplicate keys (rarely used but JSON technically doesn't rule out) (last key wins))
slightly faster at extracting values, because it's already parsed
has more operations defined
sometimes little larger to store(verify)

JSON is stored as text, and only validated

will preserve key ordering, whitespace, duplicate keys. (Only?) If you need any of these should your prefer JSON

See also:

Date and time notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Tests like "in last half hour":

WHERE entered > ( NOW() - INTERVAL '30 minute' )  

Casts to integer should be done via extract.

For example, to get a timezone-less(verify) since-epoch time for the current time, use something like
(date_part('epoch'::text, now()))::integer


Basically, read:

To change a column to an enumerated type you'll need some explicit casting. See e.g.

Inspecting an enumerated type

You can

  • For an overview as data, look at pg_catalog.pg_enum. A query you may like (from [4]):
SELECT n.nspname AS enum_schema,  
        t.typname AS enum_name,  
        e.enumlabel AS enum_value
 FROM pg_type t 
    JOIN pg_enum e ON t.oid = e.enumtypid  
    JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
  • For a specific one:
    select enum_range(null::mytype)
  • use
    in psql

On large binary/text data; LO versus TOAST


  • Pre-TOAST, storing larger things (in bytea/text) meant overhead in the main table.
usually okayish for objects up to a few hundred KBbytes, beyond would really affect accesses on that table, so you probably preferred LO (Large Objects)
  • LO is essentially an enumeration into a postgres-managed file store. Which you manage yourself at app level, but LO saves you a bunch work and permission related headaches and mistakes.

  • Now that we have TOAST, it mostly just does what you want: large values transparently go to a separate table
meaning queries on all other columns work on a compact table

Large Objects versus TOASTed bytea:

  • the text type is actually wrapped around bytea.
(so yes, large text is TOASTed by default)
The only real difference is that text has text codings, bytea does not. e.g. bytea allows a 0x00 byte, many text codings do not
  • TOASTed bytea is often easiest for binary blobs: mostly transparent, LO is more work to use
  • bytea cleanup is automatic, LO's is not (as they act as references to files)

  • LO can stream, seek, and modify in-place, bytea are only handled as a whole
Relevant for some use cases on very large data, where LO can be preferable

  • both bytea and LO are limited to a few GB (depends on config?) (verify)

  • a few ORMs will always fetch all columns and cannot be told not to, which for TOASTed fields may be a lot.
the same typically wouldn't fetch data from LO


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


  • TOAST (The Oversized-Attribute Storage Technique) avoids storing large values in the main table
by putting them into a large-stuff-side-table instead.
no functional difference to you: values are placed into and fetched from that external table transparently when you use them
  • TOAST only applies to types that can be large.
which internally means things based on varlena
and in terms of types you use mostly means arbitrary-length text types (including VARCHAR (larger or all?(verify)), text, bytea, also things like JSON/JSONB.
  • by default only actively used when things are large (generally useful behaviour; see below)
becomes likely to be applied for values over ~2K, a given if over ~8K (because page size)
(technical reason: postgres uses a fixed page size (commonly 8 kB), and does not allow tuples (=rows) to span multiple pages)
will be compressed and/or stored outside of the table itself. The preference for which is tweakable via column settings
  • can make sense for columns that you don't fetch in most of your queries.
which is generally most large columns
that that the columns in the main table itself stay much smaller (cache better)

Consider text.

Short text values like words and text identifiers are likely to be things you select on, commonly fetch, sometimes transform while selecting, so you would prefer to avoid the extra work from decompression or the extra fetch, so are best kept in the table.

Few-kilobyte-large text values are likelier to be documents, or other things kept more for archiving reasons, e.g. the unparsed version of something or an article abstract, never filtered on, not fetched in most queries on the table, and generally only fetched in a "present this single item fully" style query.

Meaning 90% of the time, keeping it in the main table would only mean more data you typically skip past. Putting it elsewhere makes it likelier that that main table's more-commonly-used data will be (and stay) in RAM.

If it's something you indeed rarely fetch (or never in typical use, e.g. unparsed version) then compression also saves disk (and memory).

Sometimes you know more about how values will be used (than the above assumptions), so you can control the TOAST strategies (see SET STORAGE), mainly to disable the compression and/or out-of-lineness.

  • PLAIN (also describes non-TOASTable types)
prevents compression
prevents out-of-line storage (i.e. forces it to be in the main table)
Puts a limit on the row's combined size, due to the page size - crossing that will make inserts/updates fail.
still a good idea for things you commonly handle in queries and know will always be short (keep in mind this is about the size of the entire row rather than a specific text field)
allows out-of-line storage
prevents compression
  • EXTENDED (default for most TOASTable types)
allows out-of-line storage
allows compression
compression will be attempted first. If row is still too big, use out-of-line storage
  • MAIN
allows compression
out-of-line storage is only used if there is no way to fit the row in a page (so effectively the threshold is 8K instead of 2K(verify))
The difference to EXTENDED is priority: Basically, within a row tuple, MAIN columns stay in until it's clear that pushing out EXTENDEDs won't make things fit.

To change it, it's something like


To see current STORAGE type in psql:

\d+ tablename

(TODO: find SQL equivalent)


  • you cannot currently force TOAST-style compression for small text values
  • you cannot force things out of the table always
(both seem to partly the design decision that the small gains aren't worth the overhead)
  • TOAST implementation details are the reason values single values are limited to at most 1GB.
(it steals two bits from the varlena 32-bit length value)
  • Limits: Each table gets its own TOAST table, and it uses OIDs.
so this is the basic 232 OIDs limit with the standard "degraded performance if you get anywhere close to that"
yet that won't matter until you have got billions of rows and at least (because ~2K threshold) terabytes of data in a single table.
so it's something to keep in mind if you plan to put all of your big data directly in postgres (note that sharding to more tables may be a good idea anyway)
  • On storing files in the database:
pro: you get transactional integrity (with path-to-filestore and LO this is more your responsibility)
pro/con: replication and backup will include them. This will often be large. But also makes it complete, and won't go out of sync (as a path-to-filestore and LO might)
con(/pro): doesn't do files over ~1GB (LO does 2GB If you need larger, you should really want the filesystem solution)
con: goes via memory, which means large files will clobber your caches(verify) (for contrast: LO and path-to-filestore can be streamed)
con: valena in most uses(verify) is escaped on the wire, meaning they're larger
note: take a look at the Foreign Data interface, it's sometimes a better fit. Particularly if you want to use a filsystem style solution but specifically a distributed one
  • the 2K threshold mentioned above is TOAST_TUPLE_THRESHOLD, and its value is settled at compile time

See also:


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


  • when to use one depends on
    • how you use the columns
    • whether the index overhead is worth it (whether there are more selects or more index-updates)
    • table size. Tiny tables have no use for indices. Huge tables can still benefit from avoiding tablescans even if you have constant index work)
  • a btree index is automatically created to support any uniqueness constraints (UNIQUE, PRIMARY KEY)
  • types:
    • btree is fine for most basic types, for equality, range, and sorting needs
    • GIN and GIST are mostly for fancier types, fancier operations
think array logic, 2D geometry, full-text indexing (various types)
tend to be larger, and therefore slower to build
    • hash usually isn't worth it
  • tweaking GiST can help. Takes some study, though.
  • Also:
    • partial indexes make sense when you typically search only for a few specific values (when the work of keeping and updating the rest is pointless)

Index types

  • btree (balanced tree. The default, and usually what you want)
works on numeric types and text types, and deals with NULL
fast for equality and range queries
supports basic inequality too (=, <, >, <=, >=)
amount of work to find is relatively constant (B is for balanced)
optimized for case where a single value maps to a single row
note: a btree index is implied when you add a uniqueness constraint (UNIQUE / PRIMARY KEY)

  • GiST (Generalized Search Tree)
balanced tree style
beyond that a framework more than an implementation - allows
choice of operators
choice of index type (e.g. text with equality, or text with pattern search(verify))
custom data types
made it easier to implement geometric types (more flexible with its operators)
more than equality/range operations; can support various geometric operators
lossy, in that it's fine with keeping false positives for some time, and has to check the table.
See also
  • SP-GiST is a variant of GiST that supports not-necessarily-equally-space-partitioned data (think quad-trees, k-d trees, and suffix trees)
  • GIN (Generalized Inverted Indexes)
can map many values to one row
can work with custom data types
can deal with array values
easily much larger (often means slower), so use only when needed
  • BRIN (Block Range Index)
  • hash
supports operations: =
not transaction safe, meaning (?)
not WAL-logged, meaning you should rebuild these after a crash to be sure
comparable speed to btree - hash may be slightly faster (perhaps due to being smaller?)
  • rtree
supports more operators than btree
not transaction safe
no upsides of using this over GiST (which came later)
rtree now removed, effectively merged into GiST (since 8.2)

Variations for specific uses

Sorted indices

Consider a time/date field ('created' or 'lastupdated' or so), and the common query on that being 'select the most recent hundred' , i.e. ORDERed and with a small LIMIT. An index kept in DESCending order (and NULLS LAST) will find enough entries at a small clump at the start. (The direction of sorting probably doesn't make a huge amount of difference (verify))

Partial indexes

if you know you'll always select on a column with specific values, and never others, you can create an index missing those values.

Say you have a system storing images, and want to detect for all if they're animated or not. That column will be TRUE, FALSE, or NULL meaning 'not checked yet'. The background task would typically select matching NULL, and it would typically be relatively small number among the whole, in which case you can go with a index supporting basically just that query, with

CREATE INDEX ani_unknown  ON images (animated)  WHERE (animated is null);

Expression Indexes

Index the result of an expression on the value, rather than the value itself.

Usually only has a point when supporting a very specific select, and there is no point to also storing that transformed value in the table.


  • have a datetime column, index it with just the date() (without the time)
for a "get all stuff within this day" query, without needing a separate date column (or transform / filter the datetime)
  • Store names case-sensitive, index them case-insensitive, by lower(columname)
and querying it that way too (but consider queries that this might not support!)
  • indexes on specific parts when using hstore, json and such, see e.g. [5]

multi-column indexes

create indexes on multiple columns. Only makes sense when you query on that same combination, and in the same order.

Unique Indexes

supports fast checking of unique restraints, otherwise a bit lighter

Other notes

TODO: look at

Keep in mind:

  • postgres will always use an index for UNIQUE columns, so adding an index manually would only mean two indexes and double work.
  • row visibility/currentness (see MVCC) is not stored in indices, and has to be fetched from table data

  • Indices that see a lot of alterations have the same empty space problem tables do
space is reclaimed in a page-based way, meaning that pathological cases can still lead to index bloat.
which degrades performance if the table is doing very large amounts of writes and/or deletes.
the simplest fix is to just accept this
if you want to clean up
A REINDEX is possible, but requires an exclusive lock on the table
you can DROP and CREATE it separately, but often want no time without index (for uniqueness and/or query-planning/performance reasons)
a CREATE INDEX (e.g. in combination with a DROP INDEX in a transaction) still needs a write lock you probably want something involving CREATE INDEX CONCURRENTLY, which builds slower but needs no locks
and e.g. probably rename it into place (in a transaction) once it's done

  • Creating indices will by default imply a write lock.
try CONCURRENTLY - it'll build slower, but won't lock out the table from what is probably regular use
some caveats and can fail (see [6]), meaning some extra handling around it

  • VACUUM FULL often means a lot of work not only to the table data, but also the index
Before 8.4, this would be work that altered the existing index, so likely increased index size and fragmentation (there are some ways to cheat this a little)
Since then, it basically builds a new index, so is equivalent to a reindex (verify)

Understanding EXPLAIN / queries to rewrite for indices

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

You should be passingly familiar with the major decisions the query optimizer makes:

  • table scan method
    • sequential scan
    • index scan
    • bitmap heap scan
  • table join method
    • nested loop with inner index
    • nested loop without inner index
    • hash join
    • merge join
  • join order, in that
    • which to use as the right relation can matter, e.g. in hash join's amount of keys, index use in nested loop.

...there are variants - e.g. backwards index scans (typically makes a max() much faster), hash anti join. Most are specific-case optimizations you can generally read as their basis. (Note that an anti-join is a "where a match does not exist in the other table)

Sequential scans

Mean looking through the entire table. The larger the table, the linearly slower this is to do,

If this is a frequent query that returns a small part of the table, then consider adding and index on the column(s) you're filtering on most.

An example where you wouldn't care is an infrequent maintenance thing that is intended to look at all data.

See more on index choice below.

Sequential scan because of a lack of join

Joining on an indexed thing is usually fast, so you want that to happen when possible.

The simplest case is where you explicitly join.

A more interesting case is where a query doesn't explicitly join on an indexed thing. (Note: related to the (un)correlated subquery problem)

Consider "I want images that are untagged" like:

select guid from tags_by_guid
  where guid not in (select guid from files)


HashAggregate  (cost=418235428718.38..418235431725.66 rows=300729 width=41)
  ->  Seq Scan on tags_by_guid  (cost=0.00..418235423257.31 rows=2184425 width=41)
        Filter: (NOT (SubPlan 1))
        SubPlan 1
          ->  Materialize  (cost=0.00..187840.80 rows=1448653 width=41)
                ->  Seq Scan on files f  (cost=0.00..167864.53 rows=1448653 width=41)

If you somehow joined on the condition that they match, you'ld be telling the database to look compare based on column values (using what it knows about them, and allowing use of their indices).

The fact that they come from specific matching columns is not explicit. We find this obvious, more so because they have the same name, but what you've said is "have one produce strings, then compare that against another set of strings". Which it'll do, typically by materializing the inner query (= execute and keep its results around), then do a scan over the outer one comparing values to that.

The crux is often figuring out how to express this join (in this case an anti-join, meaning a "does not exist"-style query). The best syntax / performance can vary a little with specific case.

One (slightly funny looking) alternative for this case is:

select distinct guid from tags_by_guid tbg
 where not exists (select 1 from files where files.guid=tbg.guid)


HashAggregate  (cost=431068.70..431068.71 rows=1 width=41)
  ->  Hash Anti Join  (cost=198705.69..431068.69 rows=1 width=41)
        Hash Cond: (tbg.guid = files.guid)
        ->  Seq Scan on tags_by_guid tbg  (cost=0.00..99143.50 rows=4368850 width=41)
        ->  Hash  (cost=167864.53..167864.53 rows=1448653 width=41)
              ->  Seq Scan on files  (cost=0.00..167864.53 rows=1448653 width=41)

Things to note:

  • in the first, the outer query is much larger than the inner cost (roughly just their table sizes multiplied)
indicating a dumb 'compare everything to everything one by one' plan.
  • the cost of the inner query is the same in both cases.
Makes sense, it's a sequential scan selecting everything.

See also:

external merge disk

...often suggesting an intermediate is very large.

Sometimes indicates a slightly larger larger work_mem helps, but more usually, some redesign is the better solution, because a large intermediate result is often slow and avoidable anyway.

When/why is my index not used?

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Many possible cases, which can vary with your query, your data, and your table definitions.

  • If postgres needs to transform the stored data, it can't use the index on that column.
e.g. differing collations for text columns
  • because the planner figures that doing so is more expensive than an alternative.
And it's usually right.
For example:
index lookup before having to read almost all table data is more work than reading all table data.
a condition that an index cannot assist (easily or at all), such a larger-than condition on an index type that does not support that (or possibly for which the planner can guess the result is 'most'?(verify))
Having only condition(s) on non-indexed columns means no narrowing down can be done before that condition can be checked, so it has to go to all data
  • Anything that by nature doesn't scale very well, e.g.
GROUP BY on a lot of data
e.g. various uses of DISTINCT
ORDER BY on a lot of data
...though a well-placed WHERE can do wonders
  • sometimes use of a LIMIT can make it prefer an index over a tablescan
...though ORDER BY before LIMIT means all data has to be evaluated
  • in some cases nullability matters(verify)

Estimating index usefulness

The query optimizer estimates the cost of various applicable ways to satisfy a query, and that number in EXPLAIN can be useful in an order-of-magnitude way.

It's pretty good at choosing the best given the situation -- but you can influence that context

being smart in your queries can help more
adding indices can sometimes make for much faster plans
tweaking indices can make them cheaper to use

When it's net-positive:

  • Every column you commonly select on may be useful to index
...when you typically select small sets, and the table is large
  • Every column you commonly join on may be useful to index
...but not always, and not necessarily both of he compared columns (read up on the join types)


  • unused indices just slow you down a little, because they must be kept up to date
there sometimes is gray area, e.g. ones that are rarely used, written more often than read.

When an index is huge relative to the table, it won't help much

...and maintaining it is probably also considerable work.
There are some clever alleviations, most when you know more about your data
  • when a condition usually implies most data needs to be fetched anyway, there is no point in the extra work of figuring that out via an index
  • sometimes writing queries differently can help the optimizer a lot
TODO: many examples :)

"When is an index better?" depends on a few different things, including:

  • the nature of your data
...including the amount of rows, and the amount of distinct values
  • the nature of the queries you (most commonly) run
(e.g. what you select on, what you join on, sometimes what columns you use in conditions)
  • the query planner/optimizer
  • the amount of work done to maintain the indices

It varies per case which one of those weight strongest.

Text indices and matching

See also


The rule system is a rewrite system, which conditonally does something instead of OR before a given query.

Rules can be used as augmented queries, exceptions, stored queries or very simple stored procedures.

Creating them takes the form:

CREATE RULE  rule_name 
      AS ON  event 
         TO  object
[WHERE       rule_qualification]
DO [INSTEAD] [ action | (actions) | NOTHING ];

See CREATE RULE docs and DROP RULE docs.


You can use this to:

  • implement views (e.g. adding some useful calculated columns) by making the event SELECT ON tablename
  • implement view updates (INSERT/UPDATE) by making rules do the proper changes to the actual table
  • implement stored queries (on some given conditions)
  • do (very) basic data validation
  • protect certain INSERTs, UPDATEs and/or DELETEs from doing anything, via a DO INSTEAD NOTHING.
    • Note this will allow no changes from any user, which is probably not what you want
    • ...unless you made a table purely for rules, in which case you can manage grants for that view (or rather 'view table') and the actual table separately, which can be useful (verify)
  • record how many queries the rule was triggered for

When rules' conditions do selects, or when you rewrite into more complex queries, then they can make things slow in less-avoidable ways because you made it overrule yourself :) Use rules with due consideration.



  • Absent: the rule action is done unconditionally - before the user's query
  • Present, and WHERE applies: Done instead (or, pedantically, before the user query isn't done)


  • (to write)

You may need some complex code to make rules apply differently to wildly different insert types, but on properly normalized databases that should not be a problem.

Example: Word counter

The following imitates a simple use of what would be
in MySQL (see the respective notes):

Assuming a table:

CREATE TABLE wordcount (word text primary key, count int);
CREATE INDEX word_index on wordcount(word);

Adding the rule:

CREATE RULE  counter 
AS ON        INSERT TO wordcount
WHERE        (NEW.word) IN (SELECT word FROM wordcount)
             SET count = count+1
             WHERE word=NEW.word;

...will allow you to do:

INSERT INTO wordcount (word) VALUES ('word');
INSERT INTO wordcount (word) VALUES ('word');
INSERT INTO wordcount (word) VALUES ('word');

This would normally complain about a duplicate key, but the rule always tests and falls back to incrementing the count when necessary. (Note: the index isn't necessary, but avoids a sequential scan every for every insert/count)


using stale statistics instead of current ones because stats collector is not responding

postgres keeps table statistics to help the planning. These statistics are manged by a specific process, in a file

The message means that the collector did not respond to a "give me a recent statistics snapshot" request very fast.

Which usually means your IO is currently overloaded.

checkpoints are occurring too frequently

WARNING: there is already a transaction in progress

Means you BEGIN a transaction when one has already begun.

This often means either a missing commit/rollback, or an explicit BEGIN when a transaction was automatically started (e.g. when autocommit is off).


for SELECT DISTINCT, ORDER BY expressions must appear in select list

ERROR: current transaction is aborted, commands ignored until end of transaction block

In general the first error in that transaction should make the transaction fail as a whole. Postgres, like many databases, expects your code to handle errors in transactions, including an explicit rollback.

In practice:

  • You are probably interested in the error that happened just before this one. Check your logs if you have them, go the verbose-debug way if you only have your code.
  • If the SQL can fail for a good (e.g. data-related) reason, it is also a smart idea to make your client-side logic more fine-grained, e.g. deal with the possibility of failure of individual statements in a transaction.
  • How to handle the error path is up to your preference of correctness, and of necessity.
in some cases you'ld want to retry, in others giving up is fine,
in some cases it's a code bug, in others it's database state
In many situations, having an exception path that just rolls back whatever you were doing is good enough.
you could do a partial commit, when parts of a transaction are logically unrelated (e.g. you chose to put it in the same transaction for ease, or perhaps for WAL related reasons)
  • You may wish to make your transactions a bit more local and explicit
e.g. when working with psycopg, it will implicitly start a transaction on the first command after the last commit/rollback (or initial connection).
  • It can be a good habit to rollback any connections you're not using after the latest select (also to avoid "idle in transaction" connections)

Note that this error can also be caused by code out of your hands. For example, django being out of sync with its database.

Error: Cannot stat /var/run/postgresql / No PostgreSQL clusters exist; see "man pg_createcluster"

...when you try to start an existing installation.

My issue -- quite possibly specific to ubuntu or the service script for a particular pg version, or maybe some assumption that breaks under systemd -- is that the /var/run/postgresql directory is assumed to exist, and it is not created automatically.

Which means you need to create it manually and restart postgres, in my case

mkdir /var/run/postgresql ; chown postgres: /var/run/postgresql ; pg_ctlcluster 10 main start

You could also hack this into the service script, I just hope the power grid behaves decently - and forget the details by the time I need them ^^

Permission denied - Is the server running locally and accepting connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

Probably means that the file permissions on that unix socket don't allow the effective user to use it.

The simplest solution is often to add your effective user to the postgres group.

(Though this isn't always ideal. For example, if you add apache to it and your pg_hba trusts all localhost access, then scripts get somewhat wide access to your database by implication)

No PostgreSQL clusters exist; see "man pg_createcluster"


  • running ubuntu (14.something-specific perhaps?)
  • things were previously running fine
  • this happened after a reboot

...then it may mean the /var/run/postgresql/ directory isn't created, but postgres assumes it's there, and needs it for... let's call it management of its state. The error is not a very good indication of what's actually wrong.

Probably fixed in a later package update (cause is possibly that it's on tmpfs now?), but since I don't update, I initially mkdir'd it at reboots, and eventually tweaked the upstart script to do that.

could not connect to server: Cannot assign requested address

This is an OS error, specifically coming from a socket bind().

In case you're creating connections very quickly (e.g. hundreds per second), you're probably triggering the TIME_WAIT problem by running yourself out of sockets.

Consider reusing connections where possible.

ERROR: MultiXactId has not been created yet – apparent wraparound

Replication, Backup

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Replication, Clustering, etc.

Basic options:

  • pg_dump
  • PITR
  • Replication as backup (With PITR, or by other means)
  • LVM

Clustering/load balancing/failover, etc: See [8] Slony?

pg_dump / pg_dumpall notes

The format you ask for implies the utility you need to restore Format and implied restore utility:

    • -f p
      plain text, use to restore with psql (default)
    • -f t
      tar, use to restore with pg_restore
    • -f c
      custom, usually for pg_restore


sudo -u postgres pg_dump dbname > dbname.sql

which you can restore (into an already-created and presumably still empty database named dbname) with

cat dbname.sql | sudo -u postgres psql dbname

(the sudo to do this as user postgres, because that's usually the database admin user that has enough permissions to do this. Can vary in specific setups)

PITR notes

initial online copy of database + archiving all transactions logs


psql and transactions

psql is autocommit by default.

psycopg and transactions

psycopg defaults autocommit to off.

And the first first command in a connection (and after a commit) will imply a START TRANSACTION ((verify).

This means most of the time, you will be in a transaction, and (potentially idle) in transaction, even after a SELECT.

So get used to doing a rollback() or commit() as soon as sensible.

Ideally you also check the success of all transactions, because if there was an error you can only rollback()

If you like to do things in chunks and not quite fail as a whole, look at SAVEPOINT


note you can use
(since psycopg 2.5) to automatically commit (if no exceptions) or rollback (if exceptions) happen within that code block

See also:


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

"Pgpool-II maintains established connections to the PostgreSQL servers, and reuses them whenever a new connection with the same properties (i.e. user name, database, protocol version, and other connection parameters if any) comes in. It reduces the connection overhead, and improves system's overall throughput. "

It can also do load balancing, and some other useful things

Notes on performance and innards

The defaults are conservative so it'll run on ten-year-old hardware comfortably, and you need some tweaking to be more efficient on modern systems (e.g. more RAM), or for specific types of hardware.

Memory related

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

You may notice that postgres processes have little memory mapped, perhaps 10-20M when idle (and certain work).

This is unlike some different database engines, but quite normal for postgres, because it assumes to be on a relatively dedicated database host, where the system caches will be doing most of the work of keeping the most useful things in memory.

See also things like:

The main interesting variables seem to be the following few.

Memory use

Most memory is taken in the background processes (and their bgworkers), because they process queries.

The rest tends to fall away in order of magnitude, unless you've configured certain things unusually.

As shared_buffers is by design shared between the workers, most of the memory allocation comes from (the values of)

max_connections * (temp_buffers + 3*work_mem)
the 3 is arbitrary, and pointing out that much is allocation per operation that needs it, not per query or session.
Actually, if you write particularly nasty queries you can completely blow out performance
so yes, if you notice only specific queries go crazy on RAM, you could rewrite them

Note that tools like top, ps, htop include SHM in memory use so count it many times, which is not the most informative. You mighy like to take a look at smem, which spreads SHM use among its users, which is fake but at least that adds up correctly, and arguably a more reasonable representation.


Memory shared between all running workers.

Is a holding-area cache for table data. It seems(verify) all data to and from disk is managed via shared_buffers

Data sticks around for quick reuse of high-demand data; it has a usage count on each page.

Defaults have been 8MB or 32MB.

Should be scaled up somewhat with amount of allowed workers/connections, in that each worker pins some data.

Larger shared_buffers means more is immediately available, yet this only has measurable value for the highest-demand data.

In particular, postgres is designed to rely on the OS page cache, and shared_buffers is effectively duplicating that, so overly large shared_buffer size can actually be counterproductive.

If there are other memory-intensive or IO-intensive apps on the same host (which is not the best design, but you do you), there is value in taking some for postgres this way.


  • There are some queries you can do to inspect how much of the contents are popular versus transient
which you can use as an estimate of how useful your current buffer cache size is

  • if all your access patterns mean this cache won't help much at all (e.g. very random), keep it small
  • Setting it very high is counterproductive, in that it lessens the amount of RAM usable for other things, in particular work_mem (the buffers per connection to do actual querying work in) and the system's page cache (which postgres is designed to take advantage of - which is why a low shared_buffers is generally not so harmful)
  • Even for serious servers where it's allowed to use gigabytes for this, you may see relatively little actual shared use

  • On unices, shared_buffers is implemented using SHM-style shared memory
your OS's SHM settings may be more conservative than what you set in postgres, so you may have to tweak kernel settings to take full advantage
Read e.g. (and use a live sysctl and/or edit /etc/sysctl.conf)
Quick check on linux (details can be OS-specific):
sysctl -a | grep shm
Most important is SHMMAX (the other SHM defaults are decent). For example:
kernel.shmmax = 33554432
kernel.shmall = 2097152
...means the individual shm size limit is 32 megabytes. You probably want to raise that.
The total limit of shared memory is in SHMALL, and it is in pages, so the 2097152 here, times 4096-byte pages, is 8GB.

See also:


work_mem is the amount of data a worker can handle in RAM for certain operations (mostly mostly sorts, some hash-based operations), An operation crossing this starts to do it on disk.

Defaults to something on the order of 1MB or 4MB.

Note that this is per operation that needs it, not per connection/session/query. Some queries may none. Some may have a bunch.

For some unusually complex queries (some of which you want to rewrite), higher values have a measurably positive effect, up to maybe 32MB.

Beyond that it becomes more of a liability, via (this amount * a handful per query * max_connections pushing against shared memory and the page cache, and eventually can be a cause for swapping)


maintenance_work_mem (default 16MB): like work_mem, but specifically for VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY ( mostly background maintenance and loading database dumps).

At most one of these will run per session(verify), and these operations are fairly rare anyway, so it's fairly safe to set this quite a bunch higher than work_mem.


temp_buffers (per session, default 8MB): maximum amount of memory used for temporary tables.

The overhead to setting this high when they are typically unused is negligible,
but like work_mem, serving many connections can imply unreasonable memory use

max_connections - keep in mind that memory use and other things scale up with this.

For example, if (this times work_mem) approaches the order of usable RAM, it'll be fighting with other things at peak use.

effective_cache_size: your estimation as to how much of the OS page cache will store postgres table and index data (so approx how much memory you expect to typically be free beyond postgres and other uses).

This is a hint given to the query planner. It may make cleverer decisions like estimating that loading in a common index may be a good idea, or decide that right now a table scan is less disruptive.

It is counterproductive to overestimate this for your setup (and note that if you don't have a dedicated DB server, that is a tendency over time).

A pragmatic value could be half the size of your OS cache when the system is set up and running everything it will in production (including the database server), preferably when things that use noticeable and/or variable amounts of memory have also run for a while.


  • The values of memory size variables
can be KB/MB-type sizes,
or the older(verify), more confusing way, in segments (that are internally) of 8 kilobyte (BLCKSZ?(verify)), so 1000 is 8MB, 4000 is 32MB, 20000 is 160MB, 50000 is 400MB, etc.


Short summary:

  • Periodically run a basic VACUUM, which can easily automated with auto-vacuuming.
  • Set up the FSM generously enough so that you will rarely need VACUUM FULL
  • Don't run VACUUM FULL unless you know you really need to (it locks the table -- and in a well-tuned up system it is fairly unnecessary)


Postgres' Multiversion Concurrency Control (MVCC) allows concurrent access to different versions of each row, meaning that it can avoid row-level locking when multiple transactions access the same data in various types of use (it also makes rollbacks fast, and allows hot backups).

Resolving which row is current is usually simple and fast.

Old rows are not immediately reclaimed, for some mildly complex reasons, which also means tables are somewhat larger than the current data. And this can be more true when more transactions interact. And be more noticeable for tables that change a lot.

It also means free space may be in various places, so postgres records pages that become unused in the Free Space Map (FSM), a bitmap that lists the empty pages within postgres' data files, which is used when placing new data. The FSM is

  • loaded in memory (for reasons of speed and simplicity) so has a fixed, configurable size.
  • not continuously up to date, so it needs to be updated now and then by looking at the table on disk and recording what we now know is old and reclaimable.

The last is what a basic VACUUM does, and why (auto)vacuuming aggressively enough is important.

It also means that the size of the FSM should be higher if your transaction rate is higher, because if the FSM is too small to refer to all the reclaimable pages, then these pages will not be cleaned up until a VACUUM FULL (which under properly tuned system you can do rarely or never), and the table's disk size will grow over time.

Now, autovacuum usually means that FSM-marking vacuum happens often enough, and you don't really need to think about it.

But probably you're reading this after your table bloated and you want to know how to fix and/or prevent it.

Your options:

  • Rewrite the table completely with a
    • Guarantees all old space is reclaimed, also effectively defragments it
    • ...but holds an exclusive table lock for as long as it takes to read and write all data. This is a Bad Thing in a production system

That last detail means you usually want to do both of the following:

  • do a basic VACUUM more often
    • you could do it manually, but it's easier to ensure auto-vacuuming is doing its job (and possibly tweak it), since that usually means you rarely have to look at it
    • scans the entire table to mark the empty space
    • Takes time and IO, but does not lock the table, so the worst effect is a lower request handling speed while it's running. You can tweak it to back off while hard work is being done.
    • This makes it a decent option to give predictable/constant performance while you also re-use space.
    • exactly how often this should happen depends on your workload - how quickly rows go out of style
  • make the FSM appropriately large, large enough to (comfortably) cover all reuse
    • ...which depends on the table size, but more on the use patterns and the VACUUM interval.
    • Takes more RAM. On servers there's usually plenty of it and the benefit is often worth it.
    • ...but a ridiculously large FSM will only mean RAM that won't be used for this and can't be used for anything else (like caching of table data).

If you do the latter two right, you may find all unused space is always reclaimed (except perhaps after unpredicted usage spikes). You'll still have tables fragment over time, though.

One way to determine a decent configuration (which is both a science and an art in that it's a case-specific cost-benefit tradeoff) is simply running the system for the time it usually takes between tables to be vacuumed, do a "VACUUM VERBOSE ANALYSE" to get a detailed report, look on how much there was to mark in the FSM, and base your FSM settings on that plus a fat safety margin.

The most important FSM setting is
, the maximum amount of pages we expect to find reclaimable (the maximum that can be marked that way).
  • default 10000 - or chosen by initdb, up to perhaps 200000
  • Setting this too low for your database use will mean pages will not get reclaimed and table files will grow.
  • Setting this extremely high will just mean use of RAM that will never actually be used.
  • Recommended to be set based on analysis of real use.
  • When you have some very busy tables, you may want to up this more than the somewhat conservative documentation would suggest you do.
  • Each page takes 6 bytes, so you get ~174K pages per MB of memory use
There is also
, the maximum amount of tables/indexes that will be vacuumed.
  • default is 1000, and unless you're serving a lot of tables/databases from a single installation, that's a very generous number.
  • Overhead is approx. 70 bytes each, so setting this high is safe, though setting it ridiculously high is also pointless.


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Auto-vacuuming means a VACUUM will happen automatically every now and then. You can enable/disable this as you wish, tweak how often it happens, and how quickly it selects tables to vacuum.

(Autovacuuming used to be done by a separate process. It was integrated in the daemon in 8.1)


  • having it enabled: autovacuum=on
  • having stats_start_collector and stats_row_level both enabled (true in the config), for the ability to see which tables changed a lot.

When it happens:

  • once every autovacuum_naptime time, postgres will choose one database and vacuum the tables in it that meet certain conditions (to avoid unnecessary work), mostly those that have seen many changes (this can be tweaked)
  • it will also happen every ~2 billion transactions regardless of whether it was otherwise enabled (to avoid potential problems related to internal wraparound. It's actually hard to cause this issue, but it can't hurt to avoid it)

See also:

Cost-Based Vacuum Delay

VACUUM is in itself a lot of read operations, which implies that it easily becomes a major part of disk scheduling, which can slow production operations somewhat.

You can make VACUUM back off whenever there are database operations waiting. This will make maintenance take longer, but will lessen the impact of this maintenance on user response and, unless the database never sees any quiet time, will do most of the work in the times between real work.

This is disabled by default. To enable it, set vacuum_cost_delay to something larger than zero, namely the amount of time, in milliseconds, that the vacuuming process should sleep after the last real operation (the assumption is that others will come in soon after the last).

Through the nature of the backing OS's timing, the accuracy/resolution of this sleep may be no better than 10ms, which also implies this is the effective minimum. Apparently 10 or 20 is also usually enough(verify) to balance with user responsiveness.

The other vacuum_cost_* settings are set to decent defaults for the average case, but you may have specific cases:

  • if you decide to vacuum during off hours, you may want to tweak it to finish fast, while
  • in a 24/7 system you may want to tweak it for minimal impact, even if that means it'll take longer.

TODO: check out suggestion that since VACUUM holds a write lock, these delays can sometimes be counterproductive.(verify)


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(B-tree) indices are spread on disk, so in tables that see a lot of alterations they will easily suffer from the same empty space problem, and may degrade - within weeks if the table is doing very large amounts of writes and/or deletes. This applied more before 7.4 than it does now. Space is reclaimed, but in a page-based way, meaning that pathological cases can still lead to index bloat.

In such cases, you may want to rebuild the indices every now and then. A REINDEX is possible, but requires an exclusive lock. A 'hot' reindex can be done by doing a DROP INDEX and CREATE INDEX within a transaction(verify) (which will still require a write lock).

TODO: read

Table fill factor

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Fill factor below the default 100 means that INSERTs will leaving space in each page, so that a later UPDATE can place a new copy in the same page rather than elsewhere.

This because it typically helps performance when multiple versions of the same row are in stored the same page. (if the new row version goes to a new page, we involve that new page and update the index. If it goes to the same, we write to just that one page. The improvement may be less on busy pages as page writes are aggregated)


  • If you expect heavy UPDATE load, consider lowering fill factor.
If not, you will just lower the efficiency of your storage - and lower your performance

  • For write-once tables, 100 makes sense. Less space is used, less disk reading is done.
  • if a row is ≥4k you won't get more than
it can be hard to estimate how large a row is, with
  • leaves space in a page so that UPDATE can go to the same page
  • Will we UPDATE or just DELETE entries

Index fill factor

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Again, the concept is how full to pack pages.

The details are more interesting because of the way each index type works.

For example, for B-tree indices the default is 90, and lower values make for fewer page splits in heavily updated pages.

Architecture, and disk related details

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Process-wise, you have roughly: ((verify) whether this has since changed)

Postmaster process (one)

  • the thing on port 5432
  • mostly does authentication and handing off a client to a backend process

Utility process (one)

  • WAL writer
  • WAL sender/receiver (in pg9)
  • autovacuum
  • bgwriter
  • stats collection
  • syslogging

Backend processes (one per active connection/client, limited by max_connections)

  • various details and tweaks apply per backend, including...
    • work_mem
    • temp_buffer
    • maintenance_work_mem
    • catalog cache
    • optimizer/executor
  • Backends share a bunch of things (all via SHM?), including...
    • shared buffers
    • wal buffers
    • locks
    • clog buffers
    • other buffers

Disk details: WAL, fsync, logging

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
For context

The Write-ahead log (WAL) is journaling of transactions on their way to the actual database blocks, which are the table data that the database might hand to users. The WAL is used for recovery, can be used for replication. Some details relates to the MVCC nature of postgres.

Once a session commits a transaction, the data is stored in the WAL buffers in RAM, then soon written into a WAL segment on disk (segments are particular files in your pg_xlog directory, each 16MB).

WAL checkpoint interval

Checkpoints are points in the WAL where postgres guarantees up to that point is in the database blocks (heap and index).

How often checkpoints are done is largely a tradeoff in the amount of time involved in recovery (which checks the xlog for the last checkpoint record, essentially the time/state in the at which everything on disk was known to be consistent, and works the journal from there).

Checkpoints will happen, with default settings, the soonest of the following twot things happen:

  • at most five minutes (checkpoint_timeout=300)
  • at most three segments apart (checkpoint_segments=3)
    • where a checkpoint segment is 16MB, so 3 means 48MB of data
    • People have suggested that this can be considerably larger, at least for high-write-volume cases. A setting of 10 (160MB) isn't crazy on a modern setup, and a few factors more can make sense on a few setups.

People may increase the checkpoint interval when doing a lot of small transactions, because it can help performance and/or improve IO load patterns.


  • this is about how fast to put data on disk, and has no effect on what data you will see.
  • many people decide that recovery is rare enough that that slightly better performance is worth extra time taken during recovery
note that recovery time is less significant on SSD, which may affect how you think of this tradeoff
  • Checkpoint writing is always done with some backoff, that avoids dominating IO. This too is tweakable.
  • If checkpoints are happening very quickly (default is faster than every 30s), postgres will complain in its logs.
if the cause is regular use (rathe than a huge import, or many manual CHECKPOINTs) you may wish to tweak to put them further apart, to spare IO a bit. Or consider improving your storage system.

WAL size

On wal_buffers size:

Useful to make large enough to hold data from any single transaction (and possibly a few).
Used to default to 64KB - which is very conservative, because when it's full it will block.
Since postgres 9 it defaults to 1/32nd (~3%) of shared_buffers (up to the WAL segment size, 16MB), which is also a good suggestion for earlier versions[9].
More than a few MB doesn't help speed (because it's written to disk fairly quickly anyway), though can rarely hurt (though you want to check your kernel's SHM limits (no point in stealing limited SHM from other useful things). More than 16MB is often nonsense.

WAL and IO

There is some additional tweaking possible. Consider that if there are a bunch of related transactions that will all commit very soon (because they are short-lived), then waiting for a little time might allows us to write all these transactions with fewer fsyncs. The two settings:

  • commit_siblings (default 5): the amount of transactions that must be currently running before we consider this behaviour
  • commit_delay (default is 0, meaning it is disabled): The amount of time (in microseconds) to wait for another commit/WAL record. This can be useful e.g. when applications consider autocommit sensible.
    • Note that setting this too high will make the WAL writer sit around pointlessly. Unless you expect this to help for your workload, this feature is not that useful(verify). Example setting is 1000 (1000us, which is 1ms).


You can, for any database on platter disks, expect commit rate to be at most one per disk revolution. This may be slightly better with synchronous_commit (or fsync) off.

On synchronous_commit

This setting controls whether to block while WAL flushes data to disk. Default is on.

If off, the data might be written to disk a few hundred ms (3*wal_writer_delay) after success is reported.

It poses no risk of corrupt tables, but a crash will lose whatever WAL data not flushed at that time.


  • When looking for IO relief and not caring about the last few seconds of data, disabling this is a much better idea than disabling fsync.
  • Can also be disabled per transaction, which is a more fine grained way of making non-critical commits back off a bit.
SET synchronous_commit TO OFF
note that some things are always synchronous, such as DROP TABLE and things like two-phase transactions[10]

See also

Database fsync

First, see fsync in general.

For most database engines, recovery after a crash is up to the last verifiably correct part, which is the last bit of data+journaling that was fully written to disk.

Usually that is very recent state, because we fsync often - the most robust way is to follow each transaction commit with an fsync that waits for disk to say "yes, it's on there".

If any part of this plays any looser, that means data committed at SQL level may not in fact not on disk. Do that in a controlled way and, roughly speaking, verifiable recovery may be up to a minute or two rather than seconds ago.

Do it in a less controlled way, e.g. fsync=off which basically hands behaviour over to your OS (and your tweaking of it), and in relatively pathological cases, this may even lead to unrecoverable states. But you tweaking could bring it out, so when in any doubt, leave it on.

...if only because you were probably doing that for speed, and there are safer ways to get most of that. Aside from the obvious "Use an SSD".

Most OSes implement fsync variants that mean "accept and collect data for a very short amount of time, to use disk rotations more efficiently" meaning they'll get to it very soon instead of right now.

This gets you comparable speed increases with less risk and recovery up to more recent state.
And may already be the default for you.

If you are willing to make the tradeoff to "losing a few minutes ago is fine", set synchronous_commit=off -- see above. Roughly speaking this also delays the write, but in a much more controlled way that should always be recoverable.

Separating different IO streams

You may have:

  • system drive (syslogs and such)
  • system swap partition/file
  • transaction log (xlog)
  • postgres WAL
  • postgres data
  • postgres indexes (verify)
  • postgres temporary data (verify)

...all on the same disk.

If feasible, you can consider putting at least the major ones on different physical drives (JBOD style). It'll help latency in general, and when heavy writes happen it can avoid doing some things effectively twice on the same disk (WAL+tables).

How much depends entirely on the kind of load.

If your load is almost entirely reading, it matters less
there's no waiting on writes
for small enough databases, reads are served by RAM anyway
If read-heavy and limited by read throughput, then more platters in your RAID is more valuable
in general it can matter how easily things are paged in or out (if you have other heavy RAM/disk users)
how often joins are very large
how often the system log is fsynced (often is better for debug, not ideal for IO)
...and so on.

This is mostly about when things start being bound by disk, and write-heavy - in which case you also want to look SSDs.

Most postgres parts can be moved by symlinking the directory it's in to a location on another drive. Make sure the permissions are correct, of course.

for table data you can do it a little more fine-grained by using CREATE TABLESPACE (creates a new storage location) and handing its name to CREATE TABLE and ALTER TABLE (verify)

Note that WAL is shared by one 'cluster', but if you have several, e.g. for different users, you may wish to put all of these on SSD.

WAL Replication

copying the xlogs

pg9 added streaming replication, which is basically a networked communication of WAL records.


  • minimal - enough for crash recovery
  • archive - helps archiving
  • hot_standby - lets you create read-only slaves

Note that while the logs record all data changes and they can be used for replication, only so much past data is kept, meaning these logs are not useful as backups.

other disk-related details

planner costs

This relates to postgres considering your system when planning reads, writes, use of indices, and other aspects of query execution. Most of these settings take a little reading to learn to use.

One example is effective_cache_size, your estimate of much postgres data sits in the OS cache. When larger, postgres will assume that it is more likely that an index it should read from disk is already in OS cache, and figure using an index is more likely to pay off than a sequential scan. The setting should not be overestimated, and you should be aware of the difference a server beding dedicated or not makes.

While there are reasons to change other values (e.g. random_page_cost), the most interesting value here is probably

TODO: read:


Generally you will either want it to go to syslog or, if that does not allow enough level filtering control, postgres's own logging.

log_destination (defaults to stderr) - one of

stderr - postgres default (distro packages may alter this)
csvlog - parseable comma-separated format (e.g. to then store in a database), applies only when using logging_collector
eventlog (windows only)

logging_collector (defaults to on) - captures each postgres process's stdout and sends it to log file, stored in...

log_directory (defaults to 'pg_log')

You probably also care about

  • log_min_messages (default is WARNING) - log level to log at all
you can configure more details (see the docs)
  • log_min_error_statement (default is ERROR) - log level to log with more contextual information (e.g. current query), meant for debugging
  • client_min_messages - (default is INFO) - log level filter towards connected clients
  • log_min_duration_statement (defaults to -1, which is off) - slow query logging
a positive values in milliseconds.
note you can do this per connection with e.g.
SET log_min_duration_statement TO 10000
can make sense for maintenance / bg stuff you know takes longer
some permission blah?(verify)
There are some tools you may like, see e.g.

See also:

Full text search notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Since 8.3, postgres has fulltext search.

The mechanics are fairly basic, and because it's indexable, it can scale well for exact search, and fuzzy-via-normalization.

Basic guts

@@ operator (from the docs):

Full text searching in PostgreSQL is based on the match operator @@,
which returns true if a tsvector (document) matches a tsquery (query)

tsvector type

  • contains document in parsed form
  • basically a list of (lexeme,position,weight) (verify)
  • you often want to use to_tsvector to create a (normalized) one from text

tsquery type

  • contains query in parsed form
  • may combine multiple terms using AND, OR, and NOT operators
  • you often want to use either to_tsquery() or plainto_tsquery()

Note that lexemes must be normalized the same way (between data/index and query) or you'll probably get very few results. While falling back on defaults generally works, it's probably more robust to hardcode regconfigs in an app.

ts_rank() / ts_rank_cd() will rank results using weights (in the query and/or data; see details below).

It works on a whole result set, so at scale you probably do not want to run this on your entire set. You could choose to first filter with @@, or to run it on a LIMITed set.

Practical details - indices, columns, regconfigs

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

You can play with this without alterations - find find an existing text column (no need for extra columns or indices) and do something like:

SELECT textid  FROM mydocs
WHERE to_tsvector(body) @@ to_tsquery('term');

In practice you only ever want to run searches against a tsvector index.

So, create an index, choosing either GiST or GIN. As in general with those two:

  • GIN searches faster, but updates slower, so are probably preferred on relatively static tables.
Indexes are larger than GiST's (so mean more memory/IO)
  • GiST updates faster, but searches may be slower
Handier for often-updated data (particularly if it has a smallish amount of lexemes)

tsvector column or not?

You have a choice between:

  • given a text column, storing tsvectors in a separate column, and add an index on that column
takes more space than strictly required, but...
take from multiple text columns, add weights, and do ranking in a way that was not completely predetermined
makes it easier to massage your text outside of postgres
makes it easier to exeriment/debug, in particular inspecting what the parsing does
ALTER TABLE mydocs  ADD COLUMN searchvector tsvector;
CREATE INDEX mydocs_ftextidx  ON mydocs  USING gin(searchvector);
  • given a text column, telling the database to update the index directly
less space than the previous version, because there's no generally-unused tsvector column
is a little more black-box, in that you don't get to see it or do much with it
CREATE INDEX mydocs_body_index  ON mydocs  USING gin(to_tsvector('english', body));

If you've gone for the former (tsvector-in-a-column) approach, then you could update the tsvector column yourself, something like

UPDATE mydocs set title='foo', searchvector=to_tsvector('simple','foo') where id=%s

...or have it done for you via a trigger, like:

CREATE TRIGGER searchvectorupdate 
 ON mydocs FOR EACH ROW 
 EXECUTE PROCEDURE tsvector_update_trigger(searchvector, 'pg_catalog.english', body);

The built-in tsvector_update_trigger() function mostly just does a to_tsvector(). If you want slightly fancier handling, e.g. parsing from multiple fields, adding weights, then you may want a custom function (probably plpgsql) to do something like (see docs for full example)

 setweight(to_tsvector('pg_catalog.english', coalesce(new.title,)), 'A') ||
 setweight(to_tsvector('pg_catalog.english', coalesce(new.body,)), 'D');

On regconfigs

  • Mix configs at your own risk.
You usually want the config on your ts_query and ts_vector to be the same,
It won't bork out - but different configs may stem in different ways, so reduce to different lexemes, meaning you will get fewer or no matches
  • It's also a good idea to use these explicitly, to avoid a default throwing you off.
  • You can tell various parts that convert to a tsvector to take the config name from another column.
but see notes above
  • You may like to normalize in ways you understand
e.g. use 'simple' instead of any language,
use extensions like unaccent
...or your own normalization, in your own code (and either store the result in a column, or embed that code into postgres)

Inspecting tsvectors for very common things (that you may want to remove)

SELECT * FROM ts_stat('SELECT searchvector FROM mydocs')
ORDER BY nentry DESC,  ndoc DESC, word        LIMIT 250;

Search syntax

plainto_tsquery() expects plain text and constructs a basic AND search, e.g.

plainto_tsquery('search term')


'search & term'::tsquery

When you want complex queries, rewrite into postgres's syntax.

You have to to_tsquery() is used when you want operators - but you have to ensure the syntax is valid (or you'll get a ERROR: syntax error in tsquery).

This gets interesting in combination.

Special characters are
, whitespace, and null.
  • &
    , and
    are your typical boolean operators. Usable with
  • :*
    indicates prefix matching, like
  • Single quoted strings are necessary when adding phrases as such, and can trigger the dictionary.
Terms themselves are single words, or single-quoted words/phrases (which can trigger the dictionary?)

Some examples (removing SQL's quoting/escaping for readability):

supernova & !crab
'supernova' & 'sun' 
cat:* &!'catsuit'

Proximity matching, weighed terms, and ranking

Positions are always stored, e.g.

'quick:1 brown:2 fox:3'::tsvector

Which is enough to give higher value to words that happen in proximity.

Use ts_rank_cd (over ts_rank). For the following it would give 0.05, 1.0, 1.0, purely because there is less distance in the latter two

select ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('quick & fox')),
       ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('brown & fox')),
       ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('quick & brown & fox'));

Read the docs if you want more control.

Note that if you want strictly-adjacent phrase matching, older versions would force you to do your own inspection on @@'d results, e.g.

SELECT * FROM articles WHERE fulltext @@ 'word1 & word2'::tsquery AND fulltext LIKE '%word1 word2%'
Since 9.6 there is have
usable within to_tsquery.
<-> is strict proximity, <2> for within 2 words, <3> for within 3, etc.
note that this is ordered, so not symmetric

Optionally, the tsvector elements you search have weights, in four categories, because it is primarily meant as a coarse weighing-in scheme, e.g. to be able to say

article's keywords should contribute a lot
title words should contribute a bunch
body words should contribute a little

When you put together a tsvector from parts you can overwrite the weight for parts, allowing you to re-weigh your indexed tsvector according (or even re-weigh during a query).

For example, "fat cat fat rat" with some weighing that prefers nouns and particularly rats might become:

'fat:2,4C cat:3B rat:5A'::tsvector

The letters are the four weight categories. The default is D, and D weighing is omitted when printing these.

Weighing means different terms contribute differently (the way tsvector elements are weighed by default A=1.0, B=0.4, C=0.2, and D=0.1). The values for the following

select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector,  to_tsquery('fat'))
select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector,  to_tsquery('fat & cat')) 
select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector,  to_tsquery('fat & rat'))

work out as approximately 0.2, 0.4, and 0.6.


  • altering a tsvector with weights involved looks something like:
UPDATE tablename SET searchvector = setweight(to_tsvector(coalesce(title,  )), 'A')  ||
                                    setweight(to_tsvector(coalesce(keyword,)), 'B');
this acts as if the text was concatenated (position indexes of concatenated tsvectors are higher in the result)
coalesce ("returns first non-NULL argument") is used here to avoid cases of 'something || NULL' as the result of that is defined as NULL

  • you can't weigh queries
Yes, you can put weights in a tsquery, and yes,
tsvector @@ tsquery
with such a query has weight is valid expression
but it does something rather specific: it matches that term only when the tsvector has it in the same weight(-class-thingie).
For example,
'cat:2A,2B'::tsvector @@ 'cat:C'::tsquery
is false.
ts_rank() with weights in the tsquery seems to treat them as if those weights were not present(verify))}}
  • normalization allows to consider the document length, in particular to have long documents score higher mostly because it has more words. Default is not to do that, values specify a few different strategies.

See also:

configurations / regconfigs, a.k.a. parsing behaviour

Default parsing is good enough for tests and, if you know its limitations, production.

Each query will use the default search configuration, unless told otherwise.

Go to psql, do
for configs (or
for dictionaries).

I have:

danish, dutch, english, finnish, french, german, hungarian,
italian, norwegian, portuguese, romanian, russian

For fiddling, see

Keep in mind that you may not want to rely on default config, since not everything does the same thing. For example:

select to_tsquery('english', 'The & Fat & Rats'),
       to_tsquery('simple','The & Fat & Rats');

Is respectively:

"'fat' & 'rat'"
"'the' & 'fat' & 'rats'"

See also


template databases

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
  • template1 is what every CREATE DATABASE copies, effectively a skeleton database
This lets you add site defaults, or site-default modifications
  • template0 is an alternative you can forcibly use
its main point is verbatim restores, which are portable only without first adding site defaults

See also:


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

host-based authentication (hence the filename), is mostly about figuring what kind of authentication applies, depending on how you're connecting (and sometimes to not allow a connection at all - this part has some functional overlap with network-level firewalling).

This is a layer that applies partly before we even talk authentication or privileges.

...except that trust/peer/ident can effectively bypass auth in specific conditions.

This config file is possibly easiest to find via
sudo locate pg_hba.conf
(locate since database installs aren't always in the same place, sudo because your typical login user may not be allowed to read it). Note that after editing, you can reload without a restart using SIGHUP (often
killall -HUP postmaster
) [11]

More verbosely

Line syntax is roughly:

CONNTYPE   DATABASE  USER  [ADDRESS] METHOD  [OPTIONS]                                                                                             



  • local
    for unix domain socket
  • hostnossl
    for TCP/IP without SSL
  • hostssl
    for TCP/IP with SSL
  • host
    for TCP/IP (with or without SSL)

DATABASE can be:

  • database names (multiple need to be comma-separated)
  • all
  • samerole
    for database names matching role name
  • replication
    (note: all does not imply this, you want two lines if you want that)

USER can be

  • all
  • a username
  • a group name (prefixed with +)
  • a comma-separated list of users/groups


  • samehost
    for any currently configured IP on the local host
  • samenet
    for anyone on the networks currently configured on the local host
  • a IPv4/IPv6 address
  • a IPv4/IPv6 address with CIDR-style mask (amount of bits)
  • a hostname
  • a domain name (start wit a dot, it will matche everything under there)

METHOD can be:

  • trust
  • reject
only when connecting via unix domain sockets
asks the (*nix) kernel for the connecting user's username (via getpeereid),
applies optional username mapping
when said username matches database username(verify), it allows (passwordless) login
((note: for libpq, domain sockets apply when you fill in nothing for the host, rather than 'localhost'. Other things may have their own connstring logic)
  • ident
    - like peer, but over TCP by querying using the ident protocol (think RFC 1413).
...which, note, is not an auth protocol
  • md5
  • password
    (plain text, please avoid)
  • cert
    - SSL client certificates
  • External auth:
    • pam
    • ldap
    • radius
    • gss
    • sspi
    • krb5

OPTIONS are further NAME=VALUE style options to the auth module

See also:

On peer auth (and ident)

Peer auth, to recap the above description, applies only when connecting via local domain sockets, and asks the (*nix) kernel for the connecting user's username (via getpeereid), and allows passwordless trust when local username matches the role name.

Note that this shifts auth to the ability to become a particular user on that particular host (have its password, or appropriate sudo rights).

It's e.g. quite convenient for local db admin tools.

Securitywise it has its upsides and downsides.

It reduces attack surface when that the database doesn't need to listen to IP, as in that case there is no password to steal.

If your app is fine running on the same host, you could give it only peer access.

On the other hand, it means your auth config is now in several places (harder to oversee), and this is not portable to e.g. windows.

Default pg_hba often has peer for all databases, though usually, after a plain install the only user that this applies to is
local   all             all                                     peer
host    all             all               md5
host    all             all             ::1/128                 md5

Ident auth is roughly the same idea, but asks via identd instead, meaning it will work for TCP connections (...given you've set up identd).

You want this only when you understand the security implications, in particular that you are moving parts of your auth off your host (while not being auth as such). In some tightly controlled mutually-trusting hosts this may be convenient, in anything that doesn't live in such a bubble it's an unnecessary attack vector. Avoid use unless you understood that and are fine with it.

FATAL: Peer authentication failed for user "postgres"

Since peer only applies to domain sockets, this comes down to that your effective username doesn't match the database username.

when you wanted peer, then you're probably running as the wrong user.

When you actually wanted to log in with a password the issue is probably that your database client library / connector assuming you want a domain socket when you didn't -- which implies that the wrong pg_hba line applies.

The fix is to specify an IP address from your client, so that it won't choose to use a domain socket. (note that when you don't specify a hostname to libpq, it assumes you want domain socket)

FATAL: Ident authentication failed for user "postgres"

This often means your OS has installed a local(-only?) identd setup.

It seems some distros do this, specifically for postgres, to imitate peer (possibly figuring it's more portable than the syscall that peer uses). For example Redhat's initdb defaults to ident auth [13][14]

In that case you can learn how RH/other wants you to use this, but I prefer to use md5 (or maybe trust or peer) as I figure appropriate.

(i.e. not using ident at all, it annoys me)

pgadmin notes and errors

Note that you can imply domain socket by leaving address field blank. (is this actually a libpq thing?(verify))

On prewarming caches with indices and/or data

Seeing what your server is up to

If you log in via shells, it can be convenient to use


I've used the following to expose details via a web page:

SELECT count(*), query  FROM pg_stat_activity  GROUP BY query;

Also working on a CLI utility that display more from that table more usefully.

"idle" and "idle in transaction"

Idle are those connected, not currently talking to the database, and not in the middle of a transaction.

They won't block other queries from progressing, but they do count towards the connection limit.

If the client just forgot about the connection, then these open-but-unused connections may add up over time, and lead to blocking new connections even though nothing is being done. So it's good practice to close connections as soon as you can.

Note that using a connection pooler, you would expect a fairly-constant handful of connections to always be there and idle, ready to be given out.

Idle in transaction means a connection has started a transaction, has not committed it or rolled it back, and (if you see these, most probably) hasn't talked to the database for a while.

Idle in transaction can lead to blocking other queries, depending on the read-or-write nature of the operations involved, and implicit and explicit locks in that transaction.

Usually, idle in transaction means the app does a few seconds of work between parts of its transaction.

Within a single maintenance script this is often nothing to worry about.

As this is lazy (arguably buggy) code that might block other things for a little while (so become a performance bottleneck), it's something you want to iron out of your regular code.


  • try to figure out the entire transaction before starting it
  • (if that's too annoying,) commit changes as soon as sensible
  • if you reuse a connection a few times, rollback after selects
  • Handle errors - at least the cases of "ignore state of a now dangling connection and hope the problem goes away via a timeout or something"
the brute force fix is "rollback or close immediately after errors"
  • keep implicit transactions in mind
  • Use auto-commit where sensible (it avoids transactions)
You could try it as a default, and use transactions only where you need them.

...whichever apply. If you can't, timeouts may help a bit.

On implicit transactions

Postgres itself: [15]

"PostgreSQL actually treats every SQL statement as being executed within a transaction. 
If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN 
and (if successful) COMMIT wrapped around it."

(which you could call autocommit)

Client may easily add some behaviour to this. While psql follows the above, e.g. [16]:

By default, Psycopg opens a transaction before executing the first command: 
if commit() is not called, the effect of any data manipulation will be lost.
The connection can be also set in “autocommit” mode: no transaction is automatically open,
commands have immediate effect. See Transactions control for details.

Some libraries may add some more behaviour around that

Finding the offending code

(Chances are you'll see something like an error case that doesn't lead to a rollback)

The most recent query on the idle-in-transaction connection is usually a strong hint. I sometimes add comments to queries to help.

pg_top shows the state but not the most recent query.

You could still use it, when you add something to logging you can grep for, probably the PID. Look at log_line_prefix , e.g.:

log_line_prefix [%p] %u@%d  %r
# ...particularly if you want to build some of your own reporting around this

It's probably simpler to run something like the following in psql

select usename, datname, query from pg_stat_activity where state like 'idle in%';
\watch 1

See also:

Estimating disk size of things


databases size

  • pg_database_size

table and index size

Keep in mind that

  • pg_relation_size(relname)
    • counts direct table size
    • does not count TOAST
    • does not count indices
  • pg_table_size(regclass)
    • counts direct table size
    • counts TOAST
    • does not count indices
  • pg_total_relation_size(regclass)
    • counts direct table size
    • counts TOAST
    • counts indices
    • ('s equivalent to pg_table_size() + pg_indexes_size())

See also:

row sizes

You can ask for the stored size of a value, and thereby estimate the size of parts of a row.

  • individual row size can be fetched with something like:
select pg_column_size(tablename.*) from tablename
  • average row size can be done with adding an avg() to the above,
...or by dividing table size by the row count, which will also count tuple headers, padding, etc. (verify)
remember how NULLs (don't) count.

  • keep in mind you're not counting indices. To consider those too, you may be more interested in something like:
select pg_total_relation_size('relname')::real / count(*) from tablename

column sizes

  • pg_column_size(value)
    (since 8.4) says how much space it takes to store a value (and you can avg() or sum() them for a table)
When TOAST applies...
a columns may take four bytes more (the reference into the toast storage) (verify)
when compresses happened, it's the compressed size.
  • octet_length(value)
    gives the uncompressed size -- of text/byte fields only (verify)
  • keep in mind you're not counting indices.
when e.g. calculating the percentage of storage used by a column, you likely want to include index size

It's trickier to compare all a table's columns individually, within a single SQL statement, because pg_column_size takes values, not column names. You can write a server-side function (makes sense if you want regular monitoring), but it may be easier cheat and do it from the client side, e.g.

def colsizesums(tablename, curs):
   ''' For the given table name, in the currently selected database, returns each column's:
        - name
        - type
        - total size       (bytes,  e.g. 995782220)
        - total size       (string, e.g. 950 MB)
        - average row size (bytes, rounded a bit, e.g.  588.7)
        Note that this does a tablescan for each column, so isn't fast. 
          To do all at once you need a little more cheating.
        Note also that it's not injection-safe (second select), 
 don't expose this without hardcoding table names.
    import re
    tablename = re.sub('[^A-Za-z0-9_]','',tablename) # basic sanitization
    curs.execute("""SELECT column_name,data_type  FROM information_schema.columns
                    WHERE table_name = %s  ORDER BY ordinal_position""", (tablename,))
    for colname,coltype in curs.fetchall():
        # yes, that's interpolated. Should be safe _enough_,
        #  assuming table and column names are not insane.
        curs.execute("""SELECT sum(pg_column_size(%s)),
                        FROM %s"""%(colname,colname,colname, tablename))
        ret.append( (colname, coltype, int(row[0]), row[1], round(row[2],1)) )
    return ret

Which gives output like:

[('id',           'integer',                       531360, '519 kB',   4.0),
 ('path',         'text',                        18399078,  '18 MB', 138.5),
 ('ext',          'character varying',             537550, '525 kB',   4.0),
 ('datetime',     'timestamp without time zone',   949232, '927 kB',   8.0),
 ('stillpresent', 'boolean',                       132840, '130 kB',   1.0),
 ('data',         'text',                        33423962,  '32 MB', 251.6)]


  • NULL values barely add storage
on top of constant row overhead, anyway - NULL is not stored as values, instead there is a per-tuple bitmap of which columns are present in this tuple
  • you can hand in literals.
rows can be handed into pg_column_size (and not octet_length, directly). Asking for rows, or tuples, will also count the 24-byte header, so:
select pg_column_size( 1::real ) + pg_column_size( 1::real ) = 8
select pg_column_size( (1::real,1::real) ) = 32
  • you can estimate TOAST compression in a (text/byte) column like:
SELECT avg(pg_column_size(colname)::real / octet_length(colname)::real) from tablename;
note that this may be very little when TOAST decides it fits in the table itelf.
Counting only actually-compressed things is a little harder

Fast count from a table

If you need an accurate count:

In general, if you do:

select count(*) from mytable

you'll find this is slow for very large tables, more or less regardless of anything.

This because to give a fully accurate answer, it has to scan a lot.

The reason lies in MVCC. You could say this counting operation is considered rare, and less important than MVCC's tendency to relieve locking for more typical queries.

Since 9.2 you can occasionally satisfy a subset of counting queries with an index, which is typically faster than a table scan. It comes with some restrictions, though, so evaluate whether you really need this.

If you need a fast and accurate count

Keep a counter, keep it updated. Using triggers on table alterations may be the easiest way. It's extra work on most alterations, though, so only do this when it's important.

If a recent-ish estimate is good enough


SELECT reltuples::bigint AS estimate FROM pg_class where oid='myschema.mytable'::regclass;

The accuracy of this result varies with how often you ANALYSE (with autovacuum in place you're typically good).

Alternatively, have a background job do a count every minute or two, store it somewhere, and have everything else fetch that count. Note that when this still applies sequential scans this still stops scaling when the data doesn't fit RAM.

See also:


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A database is created with a specific character encoding (typically defaults to UTF-8 a.k.a. UNICODE; initdb picks these up from template1).

An older default may also be SQL_ASCII, which is not great if if you ever store anything but ASCII. Similarly, postgres supports codepages, but you usually don't want to use them (unless you know you really do).

The encoding is applied to its text types in text operations(verify).

The database encoding is essentially a default, in that in many operations you can force others.

The client connection also has an encoding - which the database knows about, to be able to convert between database and client encoding (when necessary and possible), so as a user you only need to worry about the connection coding.

You can set the client coding with
, or the more standard (though slightly wider)
SET NAMES 'value';

On top on the types in your programming language of choice, some clients may be configured for extra conversion. What comes in is still a byte coding, and if your language has a unicode string type, it can be very convenient to have it converted to that automatically.

Personally, I dislike implicit convenience conversions. I like to use UTF8 as the database and client encoding, and explicitly convert on the client side when necessary. Yes, that makes for longer code, but it's also clearer (also because not all text needs to be decoded. A case I run into frequently is the *nix filesystem API, which is essentially just arbitrary bytes (other than NUL and slash) - it just happens that most people use utf8 by convention, not least because their LANG is usually that)

Client details are up to the client, and API.

It seems that e.g. psycopg2

client encoding defaults to database encoding
unicode is encoded according to client encoding
python2 str is not encoded, so has to be correct according to the server coding(verify)
(so both may work as you wish)
incoming: you get a str
in python2: str means bytestrings, so hands things back in client encoding(verify)
arguably the most controlled way is to both set client encoding, and hardcode the decode
arguably the more convenient way is to register a typecaster to get it in unicode always. On a cursor or right after import
in python3: decodes to unicode (using the client encoding)

See also: