For other database related things, see:
- 1 "I installed the package, now what?"
- 1.1 (If your packager didn't:) create postgres database
- 1.2 Check that you can log in
- 1.3 Optional: Set password
- 1.4 Optional: creating roles, setting access privileges, and client auth settings
- 1.5 Optional: performance tweaks
- 1.6 See also
- 2 Use notes
- 2.1 psql
- 2.2 Upserts
- 2.3 SAVEPOINT
- 2.4 temporary tables
- 2.5 Memory tables
- 2.6 Date math
- 2.7 data types
- 2.8 On large binary / text data
- 2.9 Indexes
- 2.9.1 Index types
- 2.9.2 Variations for specific uses
- 2.9.3 Other notes
- 2.9.4 Understanding EXPLAIN / queries to rewrite for indices
- 2.10 Rules
- 2.11 Warnings
- 2.12 Errors
- 2.12.1 ERROR: current transaction is aborted, commands ignored until end of transaction block
- 2.12.2 Error: Cannot stat /var/run/postgresql / No PostgreSQL clusters exist; see "man pg_createcluster"
- 2.12.3 Permission denied - Is the server running locally and accepting connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
- 2.12.4 No PostgreSQL clusters exist; see "man pg_createcluster"
- 2.12.5 could not connect to server: Cannot assign requested address
- 2.12.6 ERROR: MultiXactId has not been created yet – apparent wraparound
- 3 Replication, Backup
- 4 client-specific
- 5 Notes on performance and innards
- 5.1 Memory related
- 5.2 VACUUM, VACUUM FULL (MVCC and the FSM)
- 5.3 Indices
- 5.4 Table fill factor
- 5.5 Index fill factor
- 5.6 Architecture, and disk related details
- 5.7 Disk details: WAL, fsync, logging
- 5.8 other disk-related details
- 5.9 planner costs
- 5.10 logging
- 6 Full text search notes
- 7 Semi-sorted
- 7.1 template databases
- 7.2 pg_hba
- 7.3 pgadmin notes and errors
- 7.4 On prewarming caches with indices and/or data
- 7.5 Seeing what your server is up to
- 7.6 Estimating disk size of things
- 7.7 Fast count from a table
- 7.8 Encodings
- 8 Unsorted
"I installed the package, now what?"
The following steps should get you to the point where an client/app can get to the database it needs.
(If your packager didn't:) create postgres database
Postgres is made to allow multiple entirely-separate instances (distinct configs, distinct storage directories), and some distros are set up to take advantage of this.
If so, then their package managers may choose to not automatically create a blank setup, assuming you wanted control over that, and/or may be upgrading.
As such, do one of
- create a config
- when starting from scratch
- initdb at the core, and some have pg_createcluster (which wraps initdb) to make this easier, see below
- supply a config
- mostly when upgrading an existing one
e.g. on RHEL, it's basically "run initdb, enable the service, start the service". 
Debian/ubuntu only: clusters
While database clusters are a postgresql concept ("a collection of databases served by a postmaster instance"; SQL standard calls it catalog cluster) various distros create just the one - they seem to assume that if you need more than one, you know your stuff anyway.
Debian/ubuntu manage distinct clusters, so that service management and package management can properly run multiple database clusters, and multiple postgres versions, side by side.
It theoretically also eases updates between them).
You want database clusters e.g. when managing databases to distinct clients.
For simple test setups, just one for all your stuff is fine.
Assuming it created a database cluster (package management usually does), you're already done.
If a cluster does not yet exist (you'll see a lot of messages like "No PostgreSQL clusters exist"):
And on debian/derivatives,
will show none.
A cluster has a name and a version. For example, after installing the postgresql-9.3 package you might run
pg_createcluster 9.3 main --start
which creates a set of config files inside
Check that you can log in
In most cases, after basic installation you have
- a host user called postgres with admin rights, and a postgres role within the database
- a database called postgres
- a pg_hba that trusts localhost
That means the easiest check is to become the postgres user, and run psql (a command line tool).
Assuming you have sudo rights, that amounts to:
sudo -u postgres psql postgresIf it gives you an SQL prompt ( ), then there is a running postgres instance, and you're logged into it as the local admin.
If it gives an error, search for the error.
Optional: Set passwordThat is, the role within postgresql is basically its superuser.
You'll likely use it for major admin (e.g. creating databases),
It's probably set to trust local admins (see #On_peer_auth_.28and_ident.29). This is fine on your own box / instance, while on shared-user systems there are cases where it's not as secure, so if the data is sensitive then consider setting a password.
(Also, it's sometimes useful to allow remote login, e.g. for tools like pgadmin3, and you'ld probably allow that via password auth -- though note you can SSH-tunnel to avoid needing this)
To set its password, connect to the database, e.g.
sudo -u postgres psql postgres
Once you're at its prompt:
Optional: creating roles, setting access privileges, and client auth settings
I typically do this later (and frankly am too lazy to do it on my home server), but it's a thing to consider.
By default you are:
- doing everything as the database superuser (postgres role)
- only allowed to so so on localhost (because peer auth, see below)
This is fine for initial mucking about, but in production you may want e.g.:
- admin can do everything (but no regular users can authenticate as them)
- backup can only read everything
- each app can read/write, but only its specific database
Access privileges are combinations of
- actions (SELECT, INSERT, UPDATE, DELETE, ec.)
- on specific things (DATABASE, TABLESPACE, TABLE, etc.)
- to roles (users or groups)
Roles are postgres's internal user management (separate from the OS's users - though note that you can tie things to host accounts in pg_hba).
It is preferable to read up on common practice - and also to keep this as simple as makes sense to you, in that the easier you can remember/explain it, the more that random implications won't sneak up on you.
Optional and sometimes useful: client authentication (pg_hba)
pg_hba is part of config, and is rough-grained control of which sorts of connections get a chance to log in at all.
Often filtering out by type of connection, type of authentication, and such.
People often seem to use it as a firewall-like whitelist. The defaults are usually decent for that.
Note that this is is about connections, and unrelated to role privileges.
Optional: performance tweaks
There are some conservative defaults that you can increase, e.g. a few relating to memory use.
And on platter disks it helps when not everything (tablespace, WAL, logs) goes to the same platter(s).
Note that if you're unsure where the config file is, or if there may be several and you don't know which one, will tell you which file the running database is using
The .psqlrc file is psql's config. See the man page, but example:
# see how fast things are \timing on # Setting PROMPT1 to show the server you're on \set PROMPT1 '%n@%m:%/ \#' # colors are handled but don't make things more readable \set PROMPT1 '%[%033[90m%]%n%[%033[0m%]@%m:%[%033[92m%]%/%[%033[0m%]\# ' # See also https://www.postgresql.org/docs/9.6/static/app-psql.html#APP-PSQL-PROMPTING # one of: #\pset pager off \pset pager on # won't trigger when output has fewer than ~25 lines #\pset pager always # many-column tables can be annoying to read # (also depending on the PAGER itself, see below) # expanded formatting shows records at a time, vertically. # to get this only when it's wider than the screen: \x auto # See nulls as distinct from empty string. \pset null '(NULL)' # You can also make it distinct from literal whatever-this-string-is # by using some unicode character here, # as psql will show unicode ''data'' escaped (configurable maybe?) \pset null '¤'
Your PAGER is a shell thing, that psql also chooses to use. You may want to, in general, tweak it as you like (e.g. if using less, I want -S, to crop rather than wrap lines, as that makes scrolling to the right a lot less confusing), e.g.
export PAGER="less -SaXz-5j3R" # or export PAGER="/usr/local/bin/gvim -f -R -" # or look at things like https://github.com/okbob/pspg
You can also consider putting some canned queries in there, see e.g. https://opensourcedbms.com/dbms/psqlrc-psql-startup-file-for-postgres/
then will not pick up the .psql file in that account.
Note that if you adminster the database via another user (e.g. postgres),
The simple solution is to switch to that user (or ssh as that user) for a normal login.
Another possible fix is something like
alias psql="sudo -u postgres sh -c 'cd ~postgres; . ~postgres/.bashrc; psql'"
Upserts refers to an INSERT where, if a matching(=conflicting) row exists already, falls back to an UPDATE. (see also Relational database and SQL notes#Upsert)
This is not a core SQL feature, but useful.
Postgres will have a "INSERT ... ON CONFLICT UPDATE" in 9.5
Upserts are a weirder thing than you think / it should be, in part because the condition is actually a "would an insert cause a duplicate in a UNIQUE index", and that can race with another such case.
The fallback logic must necessarily happen in a mini-transaction(verify). This makes it interesting to implement efficiently, MVCC makes it even more interesting, as does doing it correctly according to the set isolation mode, and being well defined in terms of the possible errors.
|This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)|
Occasionally you say "hey, I'm creating some costly-to-calculate intemediate data. It could be a subquery, but I'd like to use it more than once, so can I keep this somewhere - and have it be its own thing that auotmatically goes away when the connection ends?"
Basically, create tables using : .
There are two things that TEMPORARY guarantees:
- That they will be dropped at the end of a session.
- Or, if you add ON COMMIT DROP, at the end of the current transaction
- That these tables exist in their own schema
- meaning table names won't conflict with existing tables
- though when not using fully qualified names, can mask them in your queries
- won't conflict between sessions (because that temporary schema's name is picked per session(verify)).
- will be in RAM for smaller data -- but once the session becomes larger than temp_buffers it will wpill over onto disk
- so you may want to increase temp_buffers, or at least keep this in mind
- it said 'session' up there, not connection. The difference can be important e.g. around connection pooling
- so prefer the transaction variant when convenient
- otherwise drop it yourself
- "The autovacuum daemon cannot access and therefore cannot vacuum or analyze temporary tables. For this reason, appropriate vacuum and analyze operations should be performed via session SQL commands. For example, if a temporary table is going to be used in complex queries, it is wise to run ANALYZE on the temporary table after it is populated."
|This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)|
tl;dr: doesn't really exist, but you can get a good imitation if you must
The use case is typically a sort of memcache: shared, transaction-safe, guarantee of no disk IO.
A global temporary table is what you'ld want, and is considered/planned, but not implemented as of 9.4.
You can get something very close in a few ways:
- An UNLOGGED table means one that does not use the WAL
- truncated after a crash (because they're not crash-safe)
- but is still transactioned, and visible between sessions
- handled largely in RAM, will still go to disk if storing a lot of data
- arguably a feature for anything you can't limit the size of
- which all makes a bunch of sense for caches (except for the bit where it can occupy disk)
- A temporary table stays in memory until it's too large for it
- apparently because it's a more specific type of unlogged table
- but you can't share these, so generally not what you want
- Add a tablespace on a RAMdisk, then create tables on that specific tablespace.
- If you want these tables to be predictably there (around reboots and such), you would do this creation as part of OS/service startup.
- Keep in mind that if the WAL is on regular disk, this won't give you much faster alterations, only faster selects.
There are some extensions that do similar things.
|This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)|
There are two basic date-and-time types:
- timestamp, a.k.a. timestamp without time zone
- timestamptz, a.k.a. timestamp with time zone
There is no difference in storage (both take 8 bytes).
Because the semantics mean it doesn't need to store the actual timezone.
It does, however, convert them based on where you say you are.
- timestamptz always stores UTC
- the server converts from the timezone the client has set
- client timezone must be correct, obviously
- You may want to set it explicitly because it may be inherited(verify)
- preferable when you interact directly with more than one timezone
- timestamp you store exactly what you send
- useful when
- you have data without timezone info, you don't want to start pretending now so want to store it as-is
- you consider the client timezone authoritative
- ...or want to deal with all timezone logic (if any applies) yourself (hint: you generally don't)
Note that if you have server-side logic, including comparisons with things like NOW() - INTERVAL '7 days', then you probably want timestamptz.
Because otherwise you'ld be mixing the two types and have to think very hard about every such mix. For example, the docs say,
"Conversions between timestamp without time zone and timestamp with time zone normally assume that the timestamp without time zone value should be taken or given as timezone local time. A different time zone can be specified for the conversion using AT TIME ZONE."
# set timezone TO 'Europe/Amsterdam'; # select now(); 2019-07-22 17:52:49.146931+02 # select now()::timestamp; 2019-07-22 17:52:49.146931 # set timezone TO 'GMT'; # select now() 2019-07-22 15:52:49.146931+00 # select now()::timestamp; 2019-07-22 15:52:49.146931
The mainly points to be made here are probably that
- the two now() values are identical (refer to the same absolute time), and the two now()::timestamp values are not.
- it does communicate timezone information on the way out (based on where you say you are)
The connection's timezone can be viewed with
Or the less standard but easier to remember:
"Show timestamps from the last week" can look something like:
SELECT * from files where entered > NOW() - INTERVAL '7 days'
"Are time functions volatile?"
In postgres, NOW() and CURRENT_TIMESTAMP() are aliases for TRANSACTION_TIMESTAMP(), which is defined STABLE within a transaction.
There are other functions, e.g. clock_timestamp(), that are volatile.
serial, bigserial notes
As the documentation notes:
CREATE TABLE tablename ( colname SERIAL );
is equivalent to:
CREATE SEQUENCE tablename_colname_seq; CREATE TABLE tablename ( colname integer DEFAULT nextval('tablename_colname_seq') NOT NULL );
Bigserial and smallserial are the same idea, but use bigint (int64) and smallint (int16) instead of integer (int32)
UNIQUE is not implied on the column. You may want it.
nextval() always increment, and calls never returns the same value. Use from transactions that are rolled back create gaps in the sequence.
Text and character notes
- verify). and map to the same internal types, but varchar implies length check when entered, and text has no limit(
- will be padded with spaces
- verify) is short for char(1) (
An E before a string is a string including escapes (a non-standard extension of SQL).
- if you do operations on the data, in the database, JSONB's lower access time is worth the extra space
- if it's actually structured data (or the important core is), use columns
- the query planner cannot help you on unstructured data
- but well-targeted indices work as well as any
- If it's store-and-return and only the app cares JSON is smaller
- ...or use TEXT, if you don't want to handle insert/update time validation errors
JSONB is a binary serialization, and
- takes some shortcuts that are typically fine (doesn't maintain key ordering, duplicate keys (rarely used but JSON technically doesn't rule out) (last key wins))
- slightly faster at extracting values, because it's already parsed
- has more operations defined
- sometimes little larger to store(verify)
JSON is stored as text, and only validated
- will preserve key ordering, whitespace, duplicate keys. (Only?) If you need any of these should your prefer JSON
Date and time notes
Tests like "in last half hour":
WHERE entered > ( NOW() - INTERVAL '30 minute' )
Casts to integer should be done via extract.For example, to get a timezone-less(verify) since-epoch time for the current time, use something like .
Basically, read: https://www.postgresql.org/docs/9.1/static/datatype-enum.html
To change a column to an enumerated type you'll need some explicit casting. See e.g. http://stackoverflow.com/questions/15655820/upgrading-a-varchar-column-to-enum-type-in-postgresql
Inspecting an enumerated type
- For an overview as data, look at pg_catalog.pg_enum. A query you may like (from ):
SELECT n.nspname AS enum_schema, t.typname AS enum_name, e.enumlabel AS enum_value FROM pg_type t JOIN pg_enum e ON t.oid = e.enumtypid JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
- For a specific one:
- use in psql
On large binary / text data
- Pre-TOAST, storing larger things (in bytea/text) mean overhead in the main table.
- usually okayish for objects up to a few megabytes, beyond that you probably needed LO (Large Objects)
- LO is essentially an enumeration into a postgres-managed file store. Which you could do yourself, but LO saves you a bunch of work and permission blah and possible mistakes.
- Now that we have TOAST, it mostly just does what you want: large transparently go to a separate table
Large Objects versus TOASTed bytea:
- TOASTed bytea is often easiest for binary blobs: mostly transparent, LO is more work to use
- LO can stream, seek, and modify in-place, bytea are only handled as a whole
- Relevant for some use cases on very large data
- both bytea and LO are limited to a few GB (depends on config?) (verify)
- the text type is actually wrapped around bytea.
- The only real difference is that text has text codings, bytea does not. e.g. bytea allows a 0x00 byte, many text codings do not
- bytea cleanup is automatic, LO's is not (as they act as references to files)
- a few ORMs will always fetch all columns and cannot be told not to, which for TOASTed fields may be a lot.
- the same typically wouldn't fetch data from LO
- TOAST (The Oversized-Attribute Storage Technique) avoids storing large values in the main table
- by putting them into a large-stuff-side-table instead.
- no functional difference to you: values are placed into and fetched from that external table transparently when you use them
- TOAST only applies to types that can be large.
- which internally means things based on varlena
- which mostly means arbitrary-length text types (including larger VARCHAR), text, binary columns, also things like JSON/JSONB.
- by default only actively used when things are large (generally useful behaviour; see below)
- becomes likely to be applied for values over ~2K, a given if over ~8K (because page size)
- (technical reason: postgres uses a fixed page size (commonly 8 kB), and does not allow tuples (=rows) to span multiple pages)
- will be compressed and/or stored outside of the table itself. The preference for which is tweakable via column settings
- can make sense for columns that you don't fetch in most of your queries.
- that that the columns in the main table itself stay much smaller (cache better)
Short text values like words and text identifiers are likely to be things you select on, commonly fetch, sometimes transform while selecting, so you would prefer to avoid the extra work from decompression or the extra fetch, so are best kept in the table.
Few-kilobyte-large text values are likelier to be kept more for archiving reasons, e.g. the unparsed version of something or an article abstract, never filtered on, not fetched in most queries on the table, and generally only fetched in a "present this single item fully" style query.
Meaning 90% of the time, keeping it in the main table would only mean more data you typically skip past. Putting it elsewhere makes it likelier that that main table's more-commonly-used data will be and stay in RAM.
If it's something you indeed rarely fetch (or never in typical use, e.g. unparsed version) then compression also saves disk (and memory).
In some cases you know more about how these values will be used than the above rough assumptions, so you can control the TOAST strategies (see SET STORAGE), mainly to disable the compression and/or out-of-lineness.
- PLAIN (also describes non-TOASTable types)
- prevents compression
- prevents out-of-line storage (i.e. forces it to be in the main table)
- Puts a limit on the row's combined length due to the page size - crossing that will make inserts/updates fail.
- still a good idea for things you commonly handle in queries and know will always be short (keep in mind this is about the size of the entire row rather than a specific text field)
- allows out-of-line storage
- prevents compression
- EXTENDED (default for most TOASTable types)
- allows out-of-line storage
- allows compression
- compression will be attempted first. If row is still too big, use out-of-line storage
- allows compression
- out-of-line storage is only used if there is no way to fit the row in a page (so effectively the threshold is 8K instead of 2K(verify))
- The difference to EXTENDED is priority: Basically, within a row tuple, MAIN columns stay in until it's clear that pushing out EXTENDEDs won't make things fit.
To change it, it's something like
ALTER TABLE tablename ALTER columnname SET STORAGE EXTERNAL
To see current STORAGE type in psql:
(TODO: find SQL equivalent)
- you cannot currently force TOAST-style compression for small text values
- you cannot force things out of the table always
- (both seem to partly the design decision that the small gains aren't worth the overhead)
- TOAST implementation details are the reason values single values are limited to at most 1GB.
- (it steals two bits from the varlena 32-bit length value)
- Limits: Each table gets its own TOAST table, and it uses OIDs.
- so this is the basic 232 OIDs limit with the standard "degraded performance if you get anywhere close to that"
- yet that won't matter until you have got billions of rows and at least (because ~2K threshold) terabytes of data in a single table.
- so it's something to keep in mind if you plan to put all of your big data directly in postgres (note that sharding to more tables may be a good idea anyway)
- On storing files in the database:
- pro: you get transactional integrity (with path-to-filestore and LO this is more your responsibility)
- pro/con: replication and backup will include them. This will often be large. But also makes it complete, and won't go out of sync (as a path-to-filestore and LO might)
- con(/pro): doesn't do files over ~1GB (LO does 2GB If you need larger, you should really want the filesystem solution)
- con: goes via memory, which means large files will clobber your caches(verify) (for contrast: LO and path-to-filestore can be streamed)
- con: valena in most uses(verify) is escaped on the wire, meaning they're larger
- note: take a look at the Foreign Data interface, it's sometimes a better fit. Particularly if you want to use a filsystem style solution but specifically a distributed one
- the 2K threshold mentioned above is TOAST_TUPLE_THRESHOLD, and its value is settled at compile time
- when to use one depends on
- how you use the columns
- whether the index overhead is worth it (whether there are more selects or more index-updates)
- table size. Tiny tables have no use for indices. Huge tables can still benefit from avoiding tablescans even if you have constant index work)
- a btree index is automatically created to support any uniqueness constraints (UNIQUE, PRIMARY KEY)
- btree is fine for most basic types, for equality, range, and sorting needs
- GIN and GIST are mostly for fancier types, fancier operations
- think array logic, 2D geometry, full-text indexing (various types)
- tend to be larger, and therefore slower to build
- hash usually isn't worth it
- tweaking GiST can help. Takes some study, though.
- partial indexes make sense when you typically search only for a few specific values (when the work of keeping and updating the rest is pointless)
- btree (balanced tree. The default, and usually what you want)
- works on numeric types and text types, and deals with NULL
- fast for equality and range queries
- supports basic inequality too (=, <, >, <=, >=)
- amount of work to find is relatively constant (B is for balanced)
- optimized for case where a single value maps to a single row
- note: a btree index is implied when you add a uniqueness constraint (UNIQUE / PRIMARY KEY)
- GiST (Generalized Search Tree)
- balanced tree style
- beyond that a framework more than an implementation - allows
- choice of operators
- choice of index type (e.g. text with equality, or text with pattern search(verify))
- custom data types
- made it easier to implement geometric types (more flexible with its operators)
- more than equality/range operations; can support various geometric operators
- lossy, in that it's fine with keeping false positives for some time, and has to check the table.
- See also http://gist.cs.berkeley.edu/
- SP-GiST is a variant of GiST that supports not-necessarily-equally-space-partitioned data (think quad-trees, k-d trees, and suffix trees)
- GIN (Generalized Inverted Indexes)
- can map many values to one row
- can work with custom data types
- can deal with array values
- easily much larger (often means slower), so use only when needed
- BRIN (Block Range Index)
- supports operations: =
- not transaction safe, meaning (?)
- not WAL-logged, meaning you should rebuild these after a crash to be sure
- comparable speed to btree - hash may be slightly faster (perhaps due to being smaller?)
- supports more operators than btree
- not transaction safe
- no upsides of using this over GiST (which came later)
- rtree now removed, effectively merged into GiST (since 8.2)
Variations for specific uses
Consider a time/date field ('created' or 'lastupdated' or so), and the common query on that being 'select the most recent hundred' , i.e. ORDERed and with a small LIMIT. An index kept in DESCending order (and NULLS LAST) will find enough entries at a small clump at the start. (The direction of sorting probably doesn't make a huge amount of difference (verify))
if you know you'll always select on a column with specific values, and never others, you can create an index missing those values.
For example, say you have a system storing images, and want to detect for all if they're animated or not. That column will be TRUE, FALSE, or NULL, and you'll typically select it only for NULL. You can have a tiny index for just that query with
CREATE INDEX ani_unknown ON images (animated) WHERE (animated is null);
Index the result of an expression on the value, rather than the value itself.
Usually only has a point when supporting a very specific select, and there is zero point to also storing that transformed value in the table.
- have a datetime column, index it with just the date() (without the time)
- for a "get all stuff within this day" query, without needing a separate date column (or transform / filter the datetime)
- Store names case-sensitive, index them case-insensitive, by lower(columname)
- and querying it that way too (but consider queries that this might not support!)
- indexes on specific parts when using hstore, json and such, see e.g. 
create indexes on multiple columns. Only makes sense when you query on that same combination, and in the same order.
supports fast checking of unique restraints, otherwise a bit lighter
TODO: look at https://github.com/reorg/pg_repack
Keep in mind:
- postgres will always use an index for UNIQUE columns, so adding an index manually would only mean two indexes and double work.
- row visibility/currentness (see MVCC) is not stored in indices, and has to be fetched from table data
- Indices that see a lot of alterations have the same empty space problem tables do
- space is reclaimed in a page-based way, meaning that pathological cases can still lead to index bloat.
- which degrades performance if the table is doing very large amounts of writes and/or deletes.
- the simplest fix is to just accept this
- if you want to clean up
- A REINDEX is possible, but requires an exclusive lock on the table
- you can DROP and CREATE it separately, but often want no time without index (for uniqueness and/or query-planning/performance reasons)
- a CREATE INDEX (e.g. in combination with a DROP INDEX in a transaction) still needs a write lock
- ...so you probably want something involving CREATE INDEX CONCURRENTLY, which builds slower but needs no locks
- and e.g. probably rename it into place (in a transaction) once it's done
- Creating indices will by default imply a write lock.
- try CONCURRENTLY - it'll build slower, but won't lock out the table from what is probably regular use
- some caveats and can fail (see ), meaning some extra handling around it
- VACUUM FULL often means a lot of work not only to the table data, but also the index
- Before 8.4, this would be work that altered the existing index, so likely increased index size and fragmentation (there are some ways to cheat this a little)
- Since then, it basically builds a new index, so is equivalent to a reindex (verify)
Understanding EXPLAIN / queries to rewrite for indices
You should be passingly familiar with the major decisions the query optimizer makes:
- table scan method
- sequential scan
- index scan
- bitmap heap scan
- table join method
- nested loop with inner index
- nested loop without inner index
- hash join
- merge join
- join order, in that
- which to use as the right relation can matter, e.g. in hash join's amount of keys, index use in nested loop.
...there are variants - e.g. backwards index scans (typically makes a max() much faster), hash anti join. Most are specific-case optimizations you can generally read as their basis. (Note that an anti-join is a "where a match does not exist in the other table)
Mean looking through the entire table. The larger the table, the linearly slower this is to do,
If this is a frequent query that returns a small part of the table, then consider adding and index on the column(s) you're filtering on most.
An example where you wouldn't care is an infrequent maintenance thing that is intended to look at all data.
See more on index choice below.
external merge disk
...often suggesting an intermediate is very large.
Sometimes indicates a slightly larger larger work_mem helps, but more usually, some redesign is the better solution, because a large intermediate result is often slow and avoidable anyway.
Sequential scan because of a lack of join
Joining on an indexed thing is usually fast, so you want that to happen when possible.
The simplest case is where you explicitly join.
A more interesting case is where a query doesn't explicitly join on an indexed thing. (Note: related to the (un)correlated subquery problem)
Consider "I want images that are untagged" like:
select guid from tags_by_guid where guid not in (select guid from files)
If you somehow joined on the condition that they match, you'ld be telling the database to look compare based on column values (using what it knows about them, and allowing use of their indices).
The fact that they come from specific matching columns is not explicit. We find this obvious, more so because they have the same name, but what you've said is "have one produce strings, then compare that against another set of strings". Which it'll do, typically by materializing the inner query (= execute and keep its results around), then do a scan over the outer one comparing values to that.
The crux is often figuring out how to express this join (in this case an anti-join, meaning a "does not exist"-style query). The best syntax / performance can vary a little with specific case.
One (slightly funny looking) option for this one is:
select distinct guid from tags_by_guid tbg where not exists (select 1 from files where files.guid=tbg.guid)
The EXPLAIN output for the two:
HashAggregate (cost=418235428718.38..418235431725.66 rows=300729 width=41) -> Seq Scan on tags_by_guid (cost=0.00..418235423257.31 rows=2184425 width=41) Filter: (NOT (SubPlan 1)) SubPlan 1 -> Materialize (cost=0.00..187840.80 rows=1448653 width=41) -> Seq Scan on files f (cost=0.00..167864.53 rows=1448653 width=41)
HashAggregate (cost=431068.70..431068.71 rows=1 width=41) -> Hash Anti Join (cost=198705.69..431068.69 rows=1 width=41) Hash Cond: (tbg.guid = files.guid) -> Seq Scan on tags_by_guid tbg (cost=0.00..99143.50 rows=4368850 width=41) -> Hash (cost=167864.53..167864.53 rows=1448653 width=41) -> Seq Scan on files (cost=0.00..167864.53 rows=1448653 width=41)
Things to note:
- in the first, the outer query is much larger than the inner cost (roughly just their table sizes multiplied)
- indicating a dumb 'compare everything to everything one by one' plan.
- the cost of the inner query is the same in both cases.
- Makes sense, it's a sequential scan selecting everything.
When/why is my index not used?
Many possible cases, which can vary with your query, your data, and your table definitions.
- If postgres needs to transform the stored data, it can't use the index on that column.
- e.g. differing collations for text columns
- because the planner figures that doing so is more expensive than an alternative.
- And it's usually right.
- For example:
- index lookup before having to read almost all table data is more work than reading all table data.
- a condition that an index cannot easily assist, such a larger-than condition on an index that does not support that (or possibly for which the planner can guess the result is 'most'?(verify))
- Having only condition(s) on non-indexed columns means no narrowing down can be done before that condition can be checked, so it has to go to all data
- in some cases nullability matters(verify)
- Anything that by nature doesn't scale very well, e.g.
- GROUP BY on a lot of data
- e.g. various uses of DISTINCT
- ORDER BY on a lot of data
- ...though a well-placed WHERE can do wonders
- sometimes use of a LIMIT can make it prefer an index over a tablescan
- ...though ORDER BY before LIMIT means all data has to be evaluated
Estimating index usefulness
The query optimizer estimates the cost of various applicable ways to satisfy a query, and that number in EXPLAIN can be useful in an order-of-magnitude way.
It's pretty good at choosing the best given the situation -- but you can influence that context
- being smart in your queries can help more
- adding indices can sometimes make for much faster plans
- tweaking indices can make them cheaper to use
When it's net-positive:
- Every column you commonly select on may be useful to index
- ...when you typically select small sets, and the table is large
- Every column you commonly join on may be useful to index
- ...but not always, and not necessarily both of he compared columns (read up on the join types)
- unused indices just slow you down a little, because they must be kept up to date
- there sometimes is gray area, e.g. ones that are rarely used, written more often than read.
When an index is huge relative to the table, it won't help much
- ...and maintaining it is probably also considerable work.
- There are some clever alleviations, most when you know more about your data
- when a condition usually implies most data needs to be fetched anyway, there is no point in the extra work of figuring that out via an index
- sometimes writing queries differently can help the optimizer a lot
- TODO: many examples :)
"When is an index better?" depends on a few different things, including:
- the nature of your data
- ...including the amount of rows, and the amount of distinct values
- the nature of the queries you (most commonly) run
- (e.g. what you select on, what you join on, sometimes what columns you use in conditions)
- the query planner/optimizer
- the amount of work done to maintain the indices
It varies per case which one of those weight strongest.
Text indices and matching
The rule system is a rewrite system, which conditonally does something instead of OR before a given query.
Rules can be used as augmented queries, exceptions, stored queries or very simple stored procedures.
Creating them takes the form:
CREATE RULE rule_name AS ON event TO object [WHERE rule_qualification] DO [INSTEAD] [ action | (actions) | NOTHING ];
You can use this to:
- implement views (e.g. adding some useful calculated columns) by making the event SELECT ON tablename
- implement view updates (INSERT/UPDATE) by making rules do the proper changes to the actual table
- implement stored queries (on some given conditions)
- do (very) basic data validation
- protect certain INSERTs, UPDATEs and/or DELETEs from doing anything, via a DO INSTEAD NOTHING.
- Note this will allow no changes from any user, which is probably not what you want
- ...unless you made a table purely for rules, in which case you can manage grants for that view (or rather 'view table') and the actual table separately, which can be useful (verify)
- record how many queries the rule was triggered for
When rules' conditions do selects, or when you rewrite into more complex queries, then they can make things slow in less-avoidable ways because you made it overrule yourself :) Use rules with due consideration.
- Absent: the rule action is done unconditionally - before the user's query
- Present, and WHERE applies: Done instead (or, pedantically, before the user query isn't done)
- (to write)
You may need some complex code to make rules apply differently to wildly different insert types, but on properly normalized databases that should not be a problem.
Example: Word counterThe following imitates a simple use of what would be in MySQL (see the respective notes):
Assuming a table:
CREATE TABLE wordcount (word text primary key, count int); CREATE INDEX word_index on wordcount(word);
Adding the rule:
CREATE RULE counter AS ON INSERT TO wordcount WHERE (NEW.word) IN (SELECT word FROM wordcount) DO INSTEAD UPDATE wordcount SET count = count+1 WHERE word=NEW.word;
...will allow you to do:
INSERT INTO wordcount (word) VALUES ('word'); INSERT INTO wordcount (word) VALUES ('word'); INSERT INTO wordcount (word) VALUES ('word');
This would normally complain about a duplicate key, but the rule always tests and falls back to incrementing the count when necessary. (Note: the index isn't necessary, but avoids a sequential scan every for every insert/count)
using stale statistics instead of current ones because stats collector is not responding
postgres keeps table statistics to help the planning. These statistics are manged by a specific process, in a file
The message means that the collector did not respond to a "give me a recent statistics snapshot" request very fast.
Which usually means your IO is currently overloaded.
checkpoints are occurring too frequently
WARNING: there is already a transaction in progress
Means you BEGIN a transaction when one has already begun.
This often means either a missing commit/rollback, or an explicit BEGIN when a transaction was automatically started (e.g. when autocommit is off).
ERROR: current transaction is aborted, commands ignored until end of transaction block
In general the first error in that transaction should make the transaction fail as a whole. Postgres, like many databases, expects your code to handle errors in transactions, including an explicit rollback.
- You are probably interested in the error that happened just before this one. Check your logs if you have them, go the verbose-debug way if you only have your code.
- If the SQL can fail for a good (e.g. data-related) reason, it is also a smart idea to make your client-side logic more fine-grained, e.g. deal with the possibility of failure of individual statements in a transaction.
- How to handle the error path is up to your preference of correctness, and of necessity.
- in some cases you'ld want to retry, in others giving up is fine,
- in some cases it's a code bug, in others it's database state
- In many situations, having an exception path that just rolls back whatever you were doing is good enough.
- you could do a partial commit, when parts of a transaction are logically unrelated (e.g. you chose to put it in the same transaction for ease, or perhaps for WAL related reasons)
- You may wish to make your transactions a bit more local and explicit
- e.g. when working with psycopg, it will implicitly start a transaction on the first command after the last commit/rollback (or initial connection).
- It can be a good habit to rollback any connections you're not using after the latest select (also to avoid "idle in transaction" connections)
Note that this error can also be caused by code out of your hands. For example, django being out of sync with its database.
Error: Cannot stat /var/run/postgresql / No PostgreSQL clusters exist; see "man pg_createcluster"
...when you try to start an existing installation.
My issue -- quite possibly specific to ubuntu and 9.3 -- is that the /var/run/postgresql directory is assumed to exist, and it is not created automatically.
Which means you need to create it manually and restart postgres. You could trivially hack this into the service script, I just hope the power grid behaves decently - and forget the details by the time I need them ^^
Permission denied - Is the server running locally and accepting connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
Probably means that the file permissions on that unix socket don't allow the effective user to use it.
The simplest solution is often to add your effective user to the postgres group.
(Though this isn't always ideal. For example, if you add apache to it and your pg_hba trusts all localhost access, then scripts get somewhat wide access to your database by implication)
No PostgreSQL clusters exist; see "man pg_createcluster"
- running ubuntu (14.something-specific perhaps?)
- things were previously running fine
- this happened after a reboot
...then it may mean the /var/run/postgresql/ directory isn't created, but postgres assumes it's there, and needs it for... let's call it management of its state. The error is not a very good indication of what's actually wrong.
Probably fixed in a later package update (cause is possibly that it's on tmpfs now?), but since I don't update, I initially mkdir'd it at reboots, and eventually tweaked the upstart script to do that.
could not connect to server: Cannot assign requested address
This is an OS error, specifically coming from a socket bind().
In case you're creating connections very quickly (e.g. hundreds per second), you're probably triggering the TIME_WAIT problem by running yourself out of sockets.
Consider reusing connections where possible.
ERROR: MultiXactId has not been created yet – apparent wraparound
Replication, Clustering, etc.
- Replication as backup (With PITR, or by other means)
Clustering/load balancing/failover, etc: See  Slony?
pg_dump / pg_dumpall notes
The format you ask for implies the utility you need to restore Format and implied restore utility:
- plain text, use to restore with psql (default)
- tar, use to restore with pg_restore
- custom, usually for pg_restore
sudo -u postgres pg_dump dbname > dbname.sql
which you can restore (into an already-created and presumably still empty database named dbname) with
cat dbname.sql | sudo -u postgres psql dbname
(the sudo to do this as user postgres, because that's usually the database admin user that has enough permissions to do this. Can vary in specific setups)
initial online copy of database + archiving all transactions logs
psql and transactions
psql is autocommit by default.
psycopg and transactions
psycopg defaults autocommit to off.
And the first first command in a connection (and after a commit) will imply a START TRANSACTION ((verify).
This means most of the time, you will be in a transaction, and (potentially idle) in transaction, even after a SELECT.
So get used to doing a rollback() or commit() as soon as sensible.
Ideally you also check the success of all transactions, because if there was an error you can only rollback()
If you like to do things in chunks and not quite fail as a whole, look at SAVEPOINT
- note you can use (since psycopg 2.5) to automatically commit (if no exceptions) or rollback (if exceptions) happen within that code block
Notes on performance and innards
The defaults are conservative so it'll run on ten-year-old hardware comfortably, and you need some tweaking to be more efficient on modern systems (e.g. more RAM), or for specific types of hardware.
You may notice that postgres processes have little memory mapped, perhaps 10-20M when idle (and certain work).
This is unlike some different database engines, but quite normal for postgres, because it assumes to be on a relatively dedicated database host, where the system caches will be doing most of the work of keeping the most useful things in memory.
See also things like:
The main interesting variables seem to be the following few.
Most memory is taken in the background processes (and their bgworkers), because they process queries.
The rest tends to fall away in order of magnitude, unless you've configured certain things unusually.
As shared_buffers is by design shared between the workers, 90% of your use is covered by
- max_connections * (temp_buffers + 3*work_mem)
- the 3 is arbitrary, and pointing out that much is allocation per operation that needs it, not per query or session.
- Actually, if you write particularly nasty queries you can completely blow out performance
- so yes, if you notice only specific queries go crazy on RAM, you could rewrite them
Note that tools like top, ps, htop include SHM in memory use so count it many times, which is not the most informative. You mighy like to take a look at smem, which spreads SHM use among its users, which is fake but at least that adds up correctly.
Memory shared between all running workers.
Is a holding-area cache for table data. It seems(verify) all data to and from disk is managed via shared_buffers
Data sticks around for quick reuse of high-demand data; it has a usage count on each page.
Defaults have been 8MB or 32MB.
Should be scaled up somewhat with amount of allowed workers/connections, in that each worker pins some data.
Larger shared_buffers means more is immediately available, yet this only has measurable value for the highest-demand data.
In particular, postgres is designed to rely on the OS page cache, and shared_buffers is effectively duplicating that, so overly large shared_buffer size can actually be counterproductive.
If there are other memory-intensive or IO-intensive apps on the same host (which is not the best design, but you do you), there is value in taking some for postgres this way.
- There are some queries you can do to inspect how much of the contents are popular versus transient
- which you can use as an estimate of how useful your current buffer cache size is
- if all your access patterns mean this cache won't help much at all (e.g. very random), keep it small
- Setting it very high is counterproductive, in that it lessens the amount of RAM usable for other things, in particular work_mem (the buffers per connection to do actual querying work in) and the system's page cache (which postgres is designed to take advantage of - which is why a low shared_buffers is generally not so harmful)
- Even for serious servers where it's allowed to use gigabytes for this, you may see relatively little actual shared use
- On unices, shared_buffers is implemented using SHM-style shared memory
- your OS's SHM settings may be more conservative than what you set in postgres, so you may have to tweak kernel settings to take full advantage
- Read e.g. http://www.postgresql.org/docs/8.2/static/kernel-resources.html (and use a live sysctl and/or edit /etc/sysctl.conf)
- Quick check on linux (details can be OS-specific):
sysctl -a | grep shm
- Most important is SHMMAX (the other SHM defaults are decent). For example:
kernel.shmmax = 33554432 kernel.shmall = 2097152
- ...means the individual shm size limit is 32 megabytes. You probably want to raise that.
- The total limit of shared memory is in SHMALL, and it is in pages, so the 2097152 here, times 4096-byte pages, is 8GB.
work_mem is the amount of data a worker can handle in RAM for certain operations (mostly mostly sorts, some hash-based operations), An operation crossing this starts to do it on disk.
Defaults to something on the order of 1MB or 4MB.
Note that this is per operation that needs it, not per connection/session/query. Some queries may none. Some may have a bunch.
Values up to maybe 32MB or so have measurably effect for some particularly complex queries, though it usually tops out below that.
Beyond that it becomes more of an liability via (this amount * a handful per query * max_connections pushing against shared memory and the page cache, and eventually can be a cause for swapping)
maintenance_work_mem (default 16MB): like work_mem, but specifically for VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY (...so mostly background maintenance and loading database dumps).
At most one of these will run per session(verify), and these operations are fairly rare anyway, so it's fairly safe to set this quite a bunch higher than work_mem.
temp_buffers (per session, default 8MB): maximum amount of memory used for temporary tables.
- The overhead to setting this high when they are typically unused is negligible,
- but like work_mem, serving many connections can imply unreasonable memory use
max_connections - keep in mind that memory use and other things scale up with this.
- For example, if (this times work_mem) approaches the order of usable RAM, it'll be fighting with other things at peak use.
effective_cache_size: your estimation as to how much of the OS page cache will store postgres table and index data (so approx how much memory you expect to typically be free beyond postgres and other uses).
This is a hint given to the query planner. It may make cleverer decisions like when loading in a common index is good on average, or decide that right now a table scans is less disruptive.
It is counterproductive to overestimate this for your setup (and note that if you don't have a dedicated DB server, that tend to always become true over time).
A pragmatic value could be half the size of your OS cache when the system is set up and running everything it will in production (including the database server), preferably when things that use noticeable and/or variable amounts of memory have also run for a while.
- The values of memory size variables
VACUUM, VACUUM FULL (MVCC and the FSM)
- Periodically run a basic VACUUM, which can easily automated with auto-vacuuming.
- Set up the FSM generously enough so that you will rarely need VACUUM FULL
- Don't run VACUUM FULL unless you know you really need to (it locks the table -- and in a well-tuned up system it is fairly unnecessary)
Postgres' Multiversion Concurrency Control (MVCC) allows concurrent access to different versions of each row, meaning that it can avoid row-level locking when multiple transactions access the same data in various types of use (it also makes rollbacks fast, and allows hot backups).
Resolving which row is current is usually simple and fast.
Old rows are not immediately reclaimed, for some mildly complex reasons, which also means tables are somewhat larger than the current data. And this can be more true when more transactions interact. And be more noticeable for tables that change a lot.
It also means free space may be in various places, so postgres records pages that become unused in the Free Space Map (FSM), a bitmap that lists the empty pages within postgres' data files, which is used when placing new data. The FSM is
- loaded in memory (for reasons of speed and simplicity) so has a fixed, configurable size.
- not continuously up to date, so it needs to be updated now and then by looking at the table on disk and recording what we now know is old and reclaimable.
The last is what a basic VACUUM does, and why (auto)vacuuming aggressively enough is important.
It also means that the size of the FSM should be higher if your transaction rate is higher, because if the FSM is too small to refer to all the reclaimable pages, then these pages will not be cleaned up until a VACUUM FULL (which under properly tuned system you can do rarely or never), and the table's disk size will grow over time.
Now, autovacuum usually means that FSM-marking vacuum happens often enough, and you don't really need to think about it.
But probably you're reading this after your table bloated and you want to know how to fix and/or prevent it.
- Rewrite the table completely with a
- Guarantees all old space is reclaimed, also effectively defragments it
- ...but holds an exclusive table lock for as long as it takes to read and write all data. This is a Bad Thing in a production system
That last detail means you usually want to do both of the following:
- do a basic VACUUM more often
- you could do it manually, but it's easier to ensure auto-vacuuming is doing its job (and possibly tweak it), since that usually means you rarely have to look at it
- scans the entire table to mark the empty space
- Takes time and IO, but does not lock the table, so the worst effect is a lower request handling speed while it's running. You can tweak it to back off while hard work is being done.
- This makes it a decent option to give predictable/constant performance while you also re-use space.
- exactly how often this should happen depends on your workload - how quickly rows go out of style
- make the FSM appropriately large, large enough to (comfortably) cover all reuse
- ...which depends on the table size, but more on the use patterns and the VACUUM interval.
- Takes more RAM. On servers there's usually plenty of it and the benefit is often worth it.
- ...but a ridiculously large FSM will only mean RAM that won't be used for this and can't be used for anything else (like caching of table data).
If you do the latter two right, you may find all unused space is always reclaimed (except perhaps after unpredicted usage spikes). You'll still have tables fragment over time, though.
One way to determine a decent configuration (which is both a science and an art in that it's a case-specific cost-benefit tradeoff) is simply running the system for the time it usually takes between tables to be vacuumed, do a "VACUUM VERBOSE ANALYSE" to get a detailed report, look on how much there was to mark in the FSM, and base your FSM settings on that plus a fat safety margin.
The most important FSM setting is , the maximum amount of pages we expect to find reclaimable (the maximum that can be marked that way).
- default 10000 - or chosen by initdb, up to perhaps 200000
- Setting this too low for your database use will mean pages will not get reclaimed and table files will grow.
- Setting this extremely high will just mean use of RAM that will never actually be used.
- Recommended to be set based on analysis of real use.
- When you have some very busy tables, you may want to up this more than the somewhat conservative documentation would suggest you do.
- Each page takes 6 bytes, so you get ~174K pages per MB of memory use
- default is 1000, and unless you're serving a lot of tables/databases from a single installation, that's a very generous number.
- Overhead is approx. 70 bytes each, so setting this high is safe, though setting it ridiculously high is also pointless.
Auto-vacuuming means a VACUUM will happen automatically every now and then. You can enable/disable this as you wish, tweak how often it happens, and how quickly it selects tables to vacuum.
(Autovacuuming used to be done by a separate process. It was integrated in the daemon in 8.1)
- having it enabled: autovacuum=on
- having stats_start_collector and stats_row_level both enabled (true in the config), for the ability to see which tables changed a lot.
When it happens:
- once every autovacuum_naptime time, postgres will choose one database and vacuum the tables in it that meet certain conditions (to avoid unnecessary work), mostly those that have seen many changes (this can be tweaked)
- it will also happen every ~2 billion transactions regardless of whether it was otherwise enabled (to avoid potential problems related to internal wraparound. It's actually hard to cause this issue, but it can't hurt to avoid it)
Cost-Based Vacuum Delay
VACUUM is in itself a lot of read operations, which implies that it easily becomes a major part of disk scheduling, which can slow production operations somewhat.
You can make VACUUM back off whenever there are database operations waiting. This will make maintenance take longer, but will lessen the impact of this maintenance on user response and, unless the database never sees any quiet time, will do most of the work in the times between real work.
This is disabled by default. To enable it, set vacuum_cost_delay to something larger than zero, namely the amount of time, in milliseconds, that the vacuuming process should sleep after the last real operation (the assumption is that others will come in soon after the last).
Through the nature of the backing OS's timing, the accuracy/resolution of this sleep may be no better than 10ms, which also implies this is the effective minimum. Apparently 10 or 20 is also usually enough(verify) to balance with user responsiveness.
The other vacuum_cost_* settings are set to decent defaults for the average case, but you may have specific cases:
- if you decide to vacuum during off hours, you may want to tweak it to finish fast, while
- in a 24/7 system you may want to tweak it for minimal impact, even if that means it'll take longer.
TODO: check out suggestion that since VACUUM holds a write lock, these delays can sometimes be counterproductive.(verify)
(B-tree) indices are spread on disk, so in tables that see a lot of alterations they will easily suffer from the same empty space problem, and may degrade - within weeks if the table is doing very large amounts of writes and/or deletes. This applied more before 7.4 than it does now. Space is reclaimed, but in a page-based way, meaning that pathological cases can still lead to index bloat.
In such cases, you may want to rebuild the indices every now and then. A REINDEX is possible, but requires an exclusive lock. A 'hot' reindex can be done by doing a DROP INDEX and CREATE INDEX within a transaction(verify) (which will still require a write lock).
Table fill factor
Fill factor below the default 100 means that INSERTs will leaving space in each page, so that a later UPDATE can place a new copy in the same page rather than elsewhere.
This because it typically helps performance when multiple versions of the same row are in stored the same page. (if the new row version goes to a new page, we involve that new page and update the index. If it goes to the same, we write to just that one page. The improvement may be less on busy pages as page writes are aggregated)
- If you expect heavy UPDATE load, consider lowering fill factor.
- If not, you will just lower the efficiency of your storage - and lower your performance
- For write-once tables, 100 makes sense. Less space is used, less disk reading is done.
- if a row is ≥4k you won't get more than
- it can be hard to estimate how large a row is, with
- leaves space in a page so that UPDATE can go to the same page
- Will we UPDATE or just DELETE entries
Index fill factor
Again, the concept is how full to pack pages.
The details are more interesting because of the way each index type works.
For example, for B-tree indices the default is 90, and lower values make for fewer page splits in heavily updated pages.
Process-wise, you have roughly: ((verify) whether this has since changed)
Postmaster process (one)
- the thing on port 5432
- mostly does authentication and handing off a client to a backend process
Utility process (one)
- WAL writer
- WAL sender/receiver (in pg9)
- stats collection
Backend processes (one per active connection/client, limited by max_connections)
- various details and tweaks apply per backend, including...
- catalog cache
- Backends share a bunch of things (all via SHM?), including...
- shared buffers
- wal buffers
- clog buffers
- other buffers
Disk details: WAL, fsync, logging
- For context
The Write-ahead log (WAL) is journaling of transactions on their way to the actual database blocks, which are the table data that the database might hand to users. The WAL is used for recovery, can be used for replication. Some details relates to the MVCC nature of postgres.
Once a session commits a transaction, the data is stored in the WAL buffers in RAM, then soon written into a WAL segments on disk (files in your pg_xlog directory, each 16MB).
WAL checkpoint interval
Checkpoints mean points in the WAL where postgres guarantees everything before then in the WAL is in the database blocks (heap and index).
How often checkpoints are done is largely a tradeoff in the amount of time involved in recovery (which checks the xlog for the last checkpoint record, essentially the time/state in the at which everything on disk was consistent, and works the journal from there).
People may increase it when doing a lot of small transactions, because it can help performance and/or IO load patterns.
- many people decide that recovery is rare enough that that slightly better performance is worth extra time taken during recovery
- yet seek-based reasons to increase this barely matter if you're on SSD.
- Checkpoint writing is done with some backoff that avoids dominating IO. This too is tweakable.
- If checkpoints are happening very quickly (default is faster than every 30s), postgres will complain in its logs.
If the cause is not you doing a huge import, or many manual CHECKPOINTs, you may wish to put them further apart, to spare IO a bit.
Checkpoints will happen, with default settings, the soonest of the following two cases:
- at most five minutes (checkpoint_timeout=300)
- at most three segments apart (checkpoint_segments=3)
- where a checkpoint segment is 16MB, so 3 means 48MB of data
- People have suggested that this can be considerably larger, at least for high-write-volume cases. A setting of 10 (160MB) isn't crazy on a modern setup, and a few factors more can make sense on some setups.
On wal_buffers size:
- Useful to make large enough to hold data from any single transaction (and possibly a few).
- Used to default to 64KB - which is very conservative, because when it's full it will block.
- Since postgres 9 it defaults to 1/32nd (~3%) of shared_buffers (up to the WAL segment size, 16MB), which is also a good suggestion for earlier versions.
- More than a few MB doesn't help speed (because it's written to disk fairly quickly anyway), though can rarely hurt (though you want to check your kernel's SHM limits (no point in stealing limited SHM from other useful things). More than 16MB is often nonsense.
WAL and IO
There is some additional tweaking possible. Consider that if there are a bunch of related transactions that will all commit very soon (because they are short-lived), then waiting for a little time might allows us to write all these transactions with fewer fsyncs. The two settings:
- commit_siblings (default 5): the amount of transactions that must be currently running before we consider this behaviour
- commit_delay (default is 0, meaning it is disabled): The amount of time (in microseconds) to wait for another commit/WAL record. This can be useful e.g. when applications consider autocommit sensible.
- Note that setting this too high will make the WAL writer sit around pointlessly. Unless you expect this to help for your workload, this feature is not that useful(verify). Example setting is 1000 (1000us, which is 1ms).
You can, for any database on platter disks, expect commit rate to be at most one per disk revolution. This may be slightly better with synchronous_commit (or fsync) off.
- On synchronous_commit
This setting controls whether to block while WAL flushes data to disk. Default is on.
If off, the data might be written to disk a few hundred ms (3*wal_writer_delay) after success is reported.
It poses no risk of corrupt tables, but a crash will lose whatever WAL data not flushed at that time.
- When looking for IO relief and not caring about the last few seconds of data, disabling this is a much better idea than disabling fsync.
- Can also be disabled per transaction, which is a more fine grained way of making non-critical commits back off a bit.
SET synchronous_commit TO OFF
- note that some things are always synchronous, such as DROP TABLE and things like two-phase transactions
- Database fsync
First, see fsync in general.
For most database engines, recovery after a crash is up to the last verifiably correct part, which is the last bit of data+journaling that was fully written to disk.
Usually that is very recent state, because we fsync often - the most robust way is to follow each transaction commit with an fsync that waits for disk to say "yes, it's on there".
If any part of this plays any looser, that means data committed at SQL level may not in fact not on disk. Do that in a controlled way and, roughly speaking, verifiable recovery may be up to a minute or two rather than seconds ago.
Do it in a less controlled way, e.g. fsync=off which basically hands behaviour over to your OS (and your tweaking of it), and in relatively pathological cases, this may even lead to unrecoverable states. But you tweaking could bring it out, so when in any doubt, leave it on.
...if only because you were probably doing that for speed, and there are safer ways to get most of that. Aside from the obvious "Use an SSD".
Most OSes implement fsync variants that mean "accept and collect data for a very short amount of time, to use disk rotations more efficiently" meaning they'll get to it very soon instead of right now.
- This gets you comparable speed increases with less risk and recovery up to more recent state.
- And may already be the default for you.
If you are willing to make the tradeoff to "losing a few minutes ago is fine", set synchronous_commit=off -- see above. Roughly speaking this also delays the write, but in a much more controlled way that should always be recoverable.
Separating different IO streams
You may have:
- system drive (syslogs and such)
- system swap partition/file
- transaction log (xlog)
- postgres WAL
- postgres data
- postgres indexes (verify)
- postgres temporary data (verify)
...all on the same disk.
If feasible, you can consider putting at least the major ones on different physical drives (JBOD style). It'll help latency in general, and when heavy writes happen it can avoid doing some things effectively twice on the same disk (WAL+tables).
How much depends entirely on the kind of load.
- If your load is almost entirely reading, it matters less
- there's no waiting on writes
- for small enough databases, reads are served by RAM anyway
- If read-heavy and limited by read throughput, then more platters in your RAID is more valuable
- in general it can matter how easily things are paged in or out (if you have other heavy RAM/disk users)
- how often joins are very large
- how often the system log is fsynced (often is better for debug, not ideal for IO)
- ...and so on.
This is mostly about when things start being bound by disk, and write-heavy - in which case you also want to look SSDs.
Most postgres parts can be moved by symlinking the directory it's in to a location on another drive. Make sure the permissions are correct, of course.
- for table data you can do it a little more fine-grained by using CREATE TABLESPACE (creates a new storage location) and handing its name to CREATE TABLE and ALTER TABLE (verify)
Note that WAL is shared by one 'cluster', but if you have several, e.g. for different users, you may wish to put all of these on SSD.
copying the xlogs
pg9 added streaming replication, which is basically a networked communication of WAL records.
- minimal - enough for crash recovery
- archive - helps archiving
- hot_standby - lets you create read-only slaves
Note that while the logs record all data changes and they can be used for replication, only so much past data is kept, meaning these logs are not useful as backups.
This relates to postgres considering your system when planning reads, writes, use of indices, and other aspects of query execution. Most of these settings take a little reading to learn to use.
One example is effective_cache_size, your estimate of much postgres data sits in the OS cache. When larger, postgres will assume that it is more likely that an index it should read from disk is already in OS cache, and figure using an index is more likely to pay off than a sequential scan. The setting should not be overestimated, and you should be aware of the difference a server beding dedicated or not makes.
While there are reasons to change other values (e.g. random_page_cost), the most interesting value here is probably
Generally you will either want it to go to syslog or, if that does not allow enough level filtering control, postgres's own logging.
log_destination (defaults to stderr) - one of
- stderr - postgres default (distro packages may alter this)
- csvlog - parseable comma-separated format (e.g. to then store in a database), applies only when using logging_collector
- eventlog (windows only)
logging_collector (defaults to on) - captures each postgres process's stdout and sends it to log file, stored in...
log_directory (defaults to 'pg_log')
You probably also care about
- log_min_messages (default is WARNING) - log level to log at all
- you can configure more details (see the docs)
- log_min_error_statement (default is ERROR) - log level to log with more contextual information (e.g. current query), meant for debugging
- client_min_messages - (default is INFO) - log level filter towards connected clients
- log_min_duration_statement (defaults to -1, which is off) - slow query logging
- a positive values in milliseconds.
- note you can do this per connection with e.g.
- can make sense for maintenance / bg stuff you know takes longer
- some permission blah?(verify)
- There are some tools you may like, see e.g. https://wiki.postgresql.org/wiki/Logging_Difficult_Queries
Full text search notes
Since 8.3, postgres has fulltext search.
The mechanics are fairly basic, and because it's indexable, it can scale well for exact search, and fuzzy-via-normalization.
@@ operator (from the docs):
Full text searching in PostgreSQL is based on the match operator @@, which returns true if a tsvector (document) matches a tsquery (query)
- contains document in parsed form
- basically a list of (lexeme,position,weight) (verify)
- you often want to use to_tsvector to create a (normalized) one from text
- contains query in parsed form
- may combine multiple terms using AND, OR, and NOT operators
- you often want to use either to_tsquery() or plainto_tsquery()
Note that lexemes must be normalized the same way (between data/index and query) or you'll probably get very few results. While falling back on defaults generally works, it's probably more robust to hardcode regconfigs in an app.
ts_rank() / ts_rank_cd() will rank results using weights (in the query and/or data; see details below).
It works on a whole result set, so at scale you probably do not want to run this on your entire set. You could choose to first filter with @@, or to run it on a LIMITed set.
Practical details - indices, columns, regconfigs
You can play with this without alterations - find find an existing text column (no need for extra columns or indices) and do something like:
SELECT textid FROM mydocs WHERE to_tsvector(body) @@ to_tsquery('term');
In practice you only ever want to run searches against a tsvector index.
So, create an index, choosing either GiST or GIN. As in general with those two:
- GIN searches faster, but updates slower, so are probably preferred on relatively static tables.
- Indexes are larger than GiST's (so mean more memory/IO)
- GiST updates faster, but searches may be slower
- Handier for often-updated data (particularly if it has a smallish amount of lexemes)
tsvector column or not?
You have a choice between:
- given a text column, storing tsvectors in a separate column, and add an index on that column
- takes more space than strictly required, but...
- take from multiple text columns, add weights, and do ranking in a way that was not completely predetermined
- makes it easier to massage your text outside of postgres
- makes it easier to exeriment/debug, in particular inspecting what the parsing does
ALTER TABLE mydocs ADD COLUMN searchvector tsvector; CREATE INDEX mydocs_ftextidx ON mydocs USING gin(searchvector);
- given a text column, telling the database to update the index directly
- less space than the previous version, because there's no generally-unused tsvector column
- is a little more black-box, in that you don't get to see it or do much with it
CREATE INDEX mydocs_body_index ON mydocs USING gin(to_tsvector('english', body));
If you've gone for the former (tsvector-in-a-column) approach, then you could update the tsvector column yourself, something like
UPDATE mydocs set title='foo', searchvector=to_tsvector('simple','foo') where id=%s
...or have it done for you via a trigger, like:
CREATE TRIGGER searchvectorupdate BEFORE INSERT OR UPDATE ON mydocs FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(searchvector, 'pg_catalog.english', body);
The built-in tsvector_update_trigger() function mostly just does a to_tsvector(). If you want slightly fancier handling, e.g. parsing from multiple fields, adding weights, then you may want a custom function (probably plpgsql) to do something like (see docs for full example)
setweight(to_tsvector('pg_catalog.english', coalesce(new.title,)), 'A') || setweight(to_tsvector('pg_catalog.english', coalesce(new.body,)), 'D');
- Mix configs at your own risk.
- You usually want the config on your ts_query and ts_vector to be the same,
- It won't bork out - but different configs may stem in different ways, so reduce to different lexemes, meaning you will get fewer or no matches
- It's also a good idea to use these explicitly, to avoid a default throwing you off.
- You can tell various parts that convert to a tsvector to take the config name from another column.
- but see notes above
- You may like to normalize in ways you understand
- e.g. use 'simple' instead of any language,
- use extensions like unaccent
- ...or your own normalization, in your own code (and either store the result in a column, or embed that code into postgres)
Inspecting tsvectors for very common things (that you may want to remove)
SELECT * FROM ts_stat('SELECT searchvector FROM mydocs') ORDER BY nentry DESC, ndoc DESC, word LIMIT 250;
plainto_tsquery() expects plain text and constructs a basic AND search, e.g.
'search & term'::tsquery
When you want complex queries, rewrite into postgres's syntax.
You have to to_tsquery() is used when you want operators - but you have to ensure the syntax is valid (or you'll get a ERROR: syntax error in tsquery).
This gets interesting in combination.
Special characters are , , , , , , whitespace, and null.
- , , and are your typical boolean operators. Usable with , parentheses.
- indicates prefix matching, like
- Single quoted strings are necessary when adding phrases as such, and can trigger the dictionary.
- Terms themselves are single words, or single-quoted words/phrases (which can trigger the dictionary?)
Some examples (removing SQL's quoting/escaping for readability):
supernova & !crab 'supernova' & 'sun' cat:* &!'catsuit'
Proximity matching, weighed terms, and ranking
Positions are always stored, e.g.
'quick:1 brown:2 fox:3'::tsvector
Which is enough to give higher value to words that happen in proximity.
Use ts_rank_cd (over ts_rank). For the following it would give 0.05, 1.0, 1.0, purely because there is less distance in the latter two
select ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('quick & fox')), ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('brown & fox')), ts_rank_cd('quick:1 brown:2 fox:3'::tsvector, to_tsquery('quick & brown & fox'));
Read the docs if you want more control.
Note that if you want strictly-adjacent phrase matching, older versions would force you to do your own inspection on @@'d results, e.g.
SELECT * FROM articles WHERE fulltext @@ 'word1 & word2'::tsquery AND fulltext LIKE '%word1 word2%'Since 9.6 there is have usable within to_tsquery.
- <-> is strict proximity, <2> for within 2 words, <3> for within 3, etc.
- note that this is ordered, so not symmetric
Optionally, the tsvector elements you search have weights, in four categories, because it is primarily meant as a coarse weighing-in scheme, e.g. to be able to say
- article's keywords should contribute a lot
- title words should contribute a bunch
- body words should contribute a little
When you put together a tsvector from parts you can overwrite the weight for parts, allowing you to re-weigh your indexed tsvector according (or even re-weigh during a query).
For example, "fat cat fat rat" with some weighing that prefers nouns and particularly rats might become:
'fat:2,4C cat:3B rat:5A'::tsvector
The letters are the four weight categories. The default is D, and D weighing is omitted when printing these.
Weighing means different terms contribute differently (the way tsvector elements are weighed by default A=1.0, B=0.4, C=0.2, and D=0.1). The values for the following
select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector, to_tsquery('fat')) select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector, to_tsquery('fat & cat')) select ts_rank( 'fat:2,4C cat:3B rat:5A'::tsvector, to_tsquery('fat & rat'))
work out as approximately 0.2, 0.4, and 0.6.
- altering a tsvector with weights involved looks something like:
UPDATE tablename SET searchvector = setweight(to_tsvector(coalesce(title, )), 'A') || setweight(to_tsvector(coalesce(keyword,)), 'B');
- this acts as if the text was concatenated (position indexes of concatenated tsvectors are higher in the result)
- coalesce ("returns first non-NULL argument") is used here to avoid cases of 'something || NULL' as the result of that is defined as NULL
- you can't weigh queries
- Yes, you can put weights in a tsquery, and yes, with such a query has weight is valid expression
- but it does something rather specific: it matches that term only when the tsvector has it in the same weight(-class-thingie).
- For example, is false.
- ts_rank() with weights in the tsquery seems to treat them as if those weights were not present(verify))}}
- normalization allows to consider the document length, in particular to have long documents score higher mostly because it has more words. Default is not to do that, values specify a few different strategies.
configurations / regconfigs, a.k.a. parsing behaviour
Default parsing is good enough for tests and, if you know its limitations, production.
Each query will use the default search configuration, unless told otherwise.Go to psql, do for configs (or for dictionaries).
simple, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, romanian, russian
For fiddling, see http://www.postgresql.org/docs/8.4/static/textsearch-configuration.html
Keep in mind that you may not want to rely on default config, since not everything does the same thing. For example:
select to_tsquery('english', 'The & Fat & Rats'), to_tsquery('simple','The & Fat & Rats');
"'fat' & 'rat'" "'the' & 'fat' & 'rats'"
- template1 is what every CREATE DATABASE copies, effectively a skeleton database
- This lets you add site defaults, or site-default modifications
- template0 is an alternative you can forcibly use
- its main point is verbatim restores, which are portable only without first adding site defaults
host-based authentication (hence the filename), is mostly about figuring what kind of authentication applies, depending on how you're connecting (and sometimes to not allow a connection at all - this part has some functional overlap with network-level firewalling).
This is a layer that applies partly before we even talk authentication or privileges.
- ...except that trust/peer/ident can effectively bypass auth in specific conditions.
This config file is possibly easiest to find via (locate since database installs aren't always in the same place, sudo because your typical login user may not be allowed to read it).
Note that after editing, you can reload without a restart using SIGHUP (often ) 
Line syntax is roughly:
CONNTYPE DATABASE USER [ADDRESS] METHOD [OPTIONS]
- domain socket for unix
- for TCP/IP without SSL
- for TCP/IP with SSL
- for TCP/IP (with or without SSL)
DATABASE can be:
- database names (multiple need to be comma-separated)
- for database names matching role name
- (note: all does not imply this, you want two lines if you want that)
USER can be
- a username
- a group name (prefixed with +)
- a comma-separated list of users/groups
- for any currently configured IP on the local host
- for anyone on the networks currently configured on the local host
- a IPv4/IPv6 address
- a IPv4/IPv6 address with CIDR-style mask (amount of bits)
- a hostname
- a domain name (start wit a dot, it will matche everything under there)
METHOD can be:
-  -
- ((note: for libpq, domain sockets apply when you fill in nothing for the host, rather than 'localhost'. Other things may have their own connstring logic)
- RFC 1413). - like peer, but over TCP by querying using the ident protocol (think
- ...which, note, is not an auth protocol
- (plain text, please avoid)
- - SSL client certificates
- External auth:
OPTIONS are further NAME=VALUE style options to the auth module
On peer auth (and ident)
Peer auth, to recap the above description, applies only when connecting via local domain sockets, and asks the (*nix) kernel for the connecting user's username (via getpeereid), and allows passwordless trust when local username matches the role name.
Note that this shifts auth to the ability to become a particular user on that particular host (have its password, or appropriate sudo rights).
It's e.g. quite convenient for local db admin tools.
Securitywise it has its upsides and downsides.
It reduces attack surface when that the database doesn't need to listen to IP, as in that case there is no password to steal.
If your app is fine running on the same host, you could give it only peer access.
On the other hand, it means your auth config is now in several places (harder to oversee), and this is not portable to e.g. windows.
Default pg_hba often has peer for all databases, though usually, after a plain install the only user that this applies to is (verify)
local all all peer host all all 127.0.0.1/32 md5 host all all ::1/128 md5
Ident auth is roughly the same idea, but asks via identd instead, meaning it will work for TCP connections (...given you've set up identd).
You want this only when you understand the security implications, in particular that you are moving parts of your auth off your host (while not being auth as such). In some tightly controlled mutually-trusting hosts this may be convenient, in anything that doesn't live in such a bubble it's an unnecessary attack vector. Avoid use unless you understood that and are fine with it.
FATAL: Peer authentication failed for user "postgres"
Since peer only applies to domain sockets, this comes down to that your effective username doesn't match the database username.
when you wanted peer, then you're probably running as the wrong user.
When you actually wanted to log in with a password the issue is probably that your database client library / connector assuming you want a domain socket when you didn't -- which implies that the wrong pg_hba line applies.
The fix is to specify an IP address from your client, so that it won't choose to use a domain socket. (note that when you don't specify a hostname to libpq, it assumes you want domain socket)
FATAL: Ident authentication failed for user "postgres"
This often means your OS has installed a local(-only?) identd setup.
It seems some distros do this, specifically for postgres, to imitate peer (possibly figuring it's more portable than the syscall that peer uses). For example Redhat's initdb defaults to ident auth 
In that case you can learn how RH/other wants you to use this, but I prefer to use md5 (or maybe trust or peer) as I figure appropriate.
(i.e. not using ident at all, it annoys me)
pgadmin notes and errors
Note that you can imply domain socket by leaving address field blank. (is this actually a libpq thing?(verify))
On prewarming caches with indices and/or data
Seeing what your server is up to
If you log in via shells, it can be convenient to use
I've used the following to expose details via a web page:
SELECT count(*), query FROM pg_stat_activity GROUP BY query;
Also working on a CLI utility that display more from that table more usefully.
"idle" and "idle in transaction"
Idle are those connected, not currently talking to the database, and not in the middle of a transaction.
This will never block other queries from being done.
But will prevent new connections, as they still count towards the connection limit.
If your clients are just too lazy to disconnect (probably for a long time because they forgot about them), these will probably add up over time, and you want to make them close connections as soon as they can.
Note that using a connection pooler, you would expect a fairly-constant handful of connections to always be there and idle, ready to be given out.
Idle in transaction means a connection that has started a transaction, has not committed it or rolled it back, and (if you see these, most probably) hasn't talked to the database for a while.
Idle in transaction can lead to blocking other queries, depending on implicit and explicit locks, and the read-or-write nature of the operations involved.
Usually, idle in transaction means the app does a few seconds of work between parts of its transaction.
Within a single maintenance script this is often nothing to worry about.
It is generally not what you want to happen in regular operations, as that usually means the application is lazy enough that it can be considered buggy, because multiple of these are more likely to make things wait on each other.
- apps should commit changes as soon as sensible
- apps should rollback after errors
- if you don't do explicit BEGINs, then most any command implicitly starts a transaction.
- in practice that means it's good style to rollback after a SELECT
- if at all possible, apps should figure out the entire transaction before starting it
Note that database client libraries may implicitly start a transaction. For example, in psycopg, the first command implies a BEGIN.
- Workarounds and solutions
Fix your app to
- catch handle database errors gracefully.
- not have open transactions while it's doing computation rather than DB stuff
- smaller transactions can often make sense
- close connections you don't use
- Use auto-commit where sensible (it avoids transactions)
- You could try it as a default, and use transactions only where you need them.
...whichever apply. If you can't, timeouts may help a bit.
- Finding the offending code
(Chances are you'll see something like an error case that doesn't lead to a rollback)
The query is telling, but you often won't see it. Things like the pg_top util or the pg_stat_activity table show the 'current query' as idleness. The previous query isn't listed.
I find logging is often simplest. You can have postgres log_line_prefix add log information that you can match to pg_top - most importantly the PID:
log_line_prefix [%p] %u@%d %r # ...particularly if you want to build some of your own reporting around this
It can also help to have long-running-query logging:
# log all queries that take longer than 1000ms, and their actual duration. log_min_duration_statement = 1000
If you don't control postgres, then consider logging each pid+query pair to a file, or a memcache if you're fancy.
Estimating disk size of things
table and index size
Keep in mind that
- counts direct table size
- does not count TOAST
- does not count indices
- counts direct table size
- counts TOAST
- does not count indices
- counts direct table size
- counts TOAST
- counts indices
- (...it's equivalent to pg_table_size() + pg_indexes_size())
You can ask for the stored size of a value, and thereby estimate the size of parts of a row.
- individual row size can be fetched with something like:
select pg_column_size(tablename.*) from tablename
- average row size can be done with adding an avg() to the above,
- ...or by dividing table size by the row count, which will also count tuple headers, padding, etc. (verify)
- remember how NULLs (don't) count.
- keep in mind you're not counting indices. To consider those too, you may be more interested in something like:
select pg_total_relation_size('relname')::real / count(*) from tablename
- (and you can avg() or sum() them for a table) (since 8.4) says how much space it takes to store a value
- When TOAST applies...
- a columns may take four bytes more (the reference into the toast storage) (verify)
- when compresses happened, it's the compressed size.
- verify) gives the uncompressed size -- of text/byte fields only (
- keep in mind you're not counting indices.
- when e.g. calculating the percentage of storage used by a column, you likely want to include index size
It's trickier to compare all a table's columns individually, within a single SQL statement, because pg_column_size takes values, not column names. You can write a server-side function (makes sense if you want regular monitoring), but it may be easier cheat and do it from the client side, e.g.
def colsizesums(tablename, curs): ''' For the given table name, in the currently selected database, returns each column's: - name - type - total size (bytes, e.g. 995782220) - total size (string, e.g. 950 MB) - average row size (bytes, rounded a bit, e.g. 588.7) Note that this does a tablescan for each column, so isn't fast. To do all at once you need a little more cheating. Note also that it's not injection-safe (second select), ...so don't expose this without hardcoding table names. ''' import re tablename = re.sub('[^A-Za-z0-9_]','',tablename) # basic sanitization ret= curs.execute("""SELECT column_name,data_type FROM information_schema.columns WHERE table_name = %s ORDER BY ordinal_position""", (tablename,)) for colname,coltype in curs.fetchall(): # yes, that's interpolated. Should be safe _enough_, # assuming table and column names are not insane. curs.execute("""SELECT sum(pg_column_size(%s)), pg_size_pretty(sum(pg_column_size(%s))), avg(pg_column_size(%s)) FROM %s"""%(colname,colname,colname, tablename)) row=curs.fetchone() ret.append( (colname, coltype, int(row), row, round(row,1)) ) return ret
Which gives output like:
[('id', 'integer', 531360, '519 kB', 4.0), ('path', 'text', 18399078, '18 MB', 138.5), ('ext', 'character varying', 537550, '525 kB', 4.0), ('datetime', 'timestamp without time zone', 949232, '927 kB', 8.0), ('stillpresent', 'boolean', 132840, '130 kB', 1.0), ('data', 'text', 33423962, '32 MB', 251.6)]
- NULL values barely add storage
- on top of constant row overhead, anyway - NULL is not stored as values, instead there is a per-tuple bitmap of which columns are present in this tuple
- you can hand in literals.
- rows can be handed into pg_column_size (and not octet_length, directly). Asking for rows, or tuples, will also count the 24-byte header, so:
select pg_column_size( 1::real ) + pg_column_size( 1::real ) = 8 select pg_column_size( (1::real,1::real) ) = 32
- you can estimate TOAST compression in a (text/byte) column like:
SELECT avg(pg_column_size(colname)::real / octet_length(colname)::real) from tablename;
- note that this may be very little when TOAST decides it fits in the table itelf.
- Counting only actually-compressed things is a little harder
Fast count from a table
- If you need an accurate count:
In general, if you do:
select count(*) from mytable
you'll find this is slow for very large tables, more or less regardless of anything.
This because to give a fully accurate answer, it has to scan a lot.
The reason lies in MVCC. You could say this counting operation is considered rare, and less important than MVCC's tendency to relieve locking for more typical queries.
Since 9.2 you can occasionally satisfy a subset of counting queries with an index, which is typically faster than a table scan. It comes with some restrictions, though, so evaluate whether you really need this.
- If you need a fast and accurate count
Keep a counter, keep it updated. Using triggers on table alterations may be the easiest way. It's extra work on most alterations, though, so only do this when it's important.
- If a recent-ish estimate is good enough
SELECT reltuples::bigint AS estimate FROM pg_class where oid='myschema.mytable'::regclass;
The accuracy of this result varies with how often you ANALYSE (with autovacuum in place you're typically good).
Alternatively, have a background job do a count every minute or two, store it somewhere, and have everything else fetch that count. Note that when this still applies sequential scans this still stops scaling when the data doesn't fit RAM.
A database is created with a specific character encoding (typically defaults to UTF-8 a.k.a. UNICODE; initdb picks these up from template1).
An older default may also be SQL_ASCII, which is not great if if you ever store anything but ASCII. Similarly, postgres supports codepages, but you usually don't want to use them (unless you know you really do).
The encoding is applied to its text types in text operations(verify).
The database encoding is essentially a default, in that in many operations you can force others.
You can set the client coding with , or the more standard (though slightly wider)
The client connection also has an encoding - which the database knows about, to be able to convert between database and client encoding (when necessary and possible), so as a user you only need to worry about the connection coding.
On top on the types in your programming language of choice, some clients may be configured for extra conversion. What comes in is still a byte coding, and if your language has a unicode string type, it can be very convenient to have it converted to that automatically.
Personally, I dislike implicit convenience conversions. I like to use UTF8 as the database and client encoding, and explicitly convert on the client side when necessary. Yes, that makes for longer code, but it's also clearer (also because not all text needs to be decoded. A case I run into frequently is the *nix filesystem API, which is essentially just arbitrary bytes (other than NUL and slash) - it just happens that most people use utf8 by convention, not least because their LANG is usually that)
Client details are up to the client, and API.
It seems that e.g. psycopg2
- client encoding defaults to database encoding
- unicode is encoded according to client encoding
- python2 str is not encoded, so has to be correct according to the server coding(verify)
- (so both may work as you wish)
- incoming: you get a str
- in python2: str means bytestrings, so hands things back in client encoding(verify)
- arguably the most controlled way is to both set client encoding, and hardcode the decode
- arguably the more convenient way is to register a typecaster to get it in unicode always. On a cursor or right after import
- in python3: decodes to unicode (using the client encoding)
- in python2: str means bytestrings, so hands things back in client encoding(verify)