Relational database and SQL notes



These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

See also other articles in the Database category.



Speed and tweaking

Much of this is RDBMS-general.

Some of it came from suggestions specifically for one database engine or another.


General

Memory

tl;dr: having more physical memory is usually a good idea.


Anything that can come from memory instead of disk tends to be much faster.

This was always true for platter disk. On SSDs the difference is much less pronounced.


Configuring the database to use said memory well is wise. Defaults for many database engines are typically conservative, so this is worth half an hour of looking at.

That said, the OS's disk cache will generically help even if you don't, particularly if the database is the only process doing a lot of reading - e.g. on a host dedicated to the database.

If it's not dedicated, there is some added value to dedicating a portion of RAM to database-managed caches, in that you should get somewhat more predictable performance in more variable situations.


You can get a lot more tweaky and specific. Say, you might redesign a schema to separate very-common data from rarely-fetched data, making it more likely that the latter won't clobber the former from caches (particularly if these work out as smaller and larger tables, respectively).
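For a rough sketch of that kind of split (table and column names here are made up): moving a wide, rarely-read column into its own table keeps the hot rows small and cache-friendly.

CREATE TABLE product (
    id    integer PRIMARY KEY,
    name  varchar(200),
    price numeric(10,2)
);

-- the wide, rarely-read part lives in its own table, fetched only when needed
CREATE TABLE product_detail (
    product_id       integer PRIMARY KEY REFERENCES product(id),
    long_description text
);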


Disk and other hardware

tl;dr:

  • Avoid things hitting platter disks - latency will rocket up due to seek speeds adding up
if you do tests for this, model the fact that a large number of concurrent operations makes those seeks add up much faster than a single sequential test would suggest
  • SSDs are friendlier
if they are too pricey to store everything on, you may still consider using them for your most active tables
  • a mass of (transactioned) write operations can strangle IO due to having to happen separately.
Flushing less often helps here - at the expense of how many seconds of recent changes you can lose after a crash


Databases: IO bound

The slowest part of a database-backed app (including most webapps) is typically the database, because the database server is often not CPU-limited but disk-limited (or otherwise IO-limited).


Less-technical businesses in particular have a habit of assuming things will go faster in proportion to the hardware's price tag, and throw money at the problem (bigger servers). This works somewhat, but is not very effective in the longer run.

Disk access is slow enough without having to share it, and running two database engines on the same disk (e.g. by server virtualization done the wrong way) can be fatal to performance.

In not-overly-pathological cases, two household computers would hold up far better than one powerful server, potentially purely because there are two separate disks.


Note there are app-side tricks, such as memcaches, that can relieve this to a surprising degree (depending on case, and design).

The value of SSDs

The issue with platter disks is not (or rarely) their throughput, and more typically the latency - the per-operation overhead (particularly of "ensure this is on disk"), which is limited by seek speed, i.e. by physical rotation and head movement.


SSDs don't have those physics in the way, and can guarantee lower latency on average - and, given certain use constraints and some management, almost always.


This means the IOPS rate will be higher, allowing a higher transaction rate (without touching data safety).


Also keep in mind that a database engine tends to write

  • transaction journaling
  • table data
  • logs
  • sometimes implied swap

...and it can make sense to split those onto distinct devices - particularly if some of them are still platter.



The value of caches

Indexing

An art as well as a science, because while the most common queries may be obvious for your app, it's hard to design perfectly for real-world use and data distribution.

Notes:

  • there's often an 80/20 thing going on, figure out e.g.
which one or two queries will be using most of the time
which happen less regularly but hit a lot of data
and then...


  • EXPLAIN is your friend.
To check how and when indices are used
a slow-query log is a good source for queries not hitting indices


  • Anticipate data growth. Long-term scaling is more important than short-term improvement.
  • know that indices are not free
indexes have to be updated in related write operations, so are only worth it if there are a good bunch of select operations that they actually improve
some indexes cost more than they are worth.
  • You can consider indexes on columns when (see the sketch after these notes):
    • you know query conditions will often use that column,
    • for which the values will be fairly unique, and/or
    • the result set always relatively small
  • Realize what contributes to index size
(e.g. sometimes normalization helps a lot)
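To make the above concrete, a sketch of typical index creation (table and column names are made up; the partial-index form is PostgreSQL-specific):

-- single-column index for a common lookup
CREATE INDEX orders_customer_idx ON orders (customer_id);

-- multi-column index, useful when queries filter on both columns together
CREATE INDEX orders_customer_date_idx ON orders (customer_id, order_date);

-- partial index (PostgreSQL): indexes only the rows you actually query for, so it stays small
CREATE INDEX orders_open_idx ON orders (id) WHERE status = 'open';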



Schema tweaking

Query writing and EXPLAINing

There are often alternative ways to write a query with different index-using implications. Particularly when you use things like GROUP BY, think about what indices you have. If you're not sure it'll be used, you can use EXPLAIN to see the difference in the way the query optimizer will actually execute your query. Table scanning (looking at all rows) is obviously bad for large tables.

Think in sets rather than for loops, which will also help you avoid sub-queries and use the generally more efficient joins.

Try to rewrite for indices, for example keeping expressions on indexed columns as trivial comparisons rather than something complex, as in the example below.
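A hedged example of that kind of rewrite (MySQL-flavoured syntax, a hypothetical events table, and an index assumed on created_at) - keeping the indexed column bare lets the index be used, and EXPLAIN shows the difference:

-- harder to use an index: the indexed column is wrapped in a function
EXPLAIN SELECT * FROM events WHERE DATE(created_at) = '2020-01-01';

-- index-friendly: the indexed column is compared directly against constants
EXPLAIN SELECT * FROM events
WHERE created_at >= '2020-01-01' AND created_at < '2020-01-02';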


On foreign keys

fragmentation

Connection pooling

You will often write code like "connect to database, do stuff in database, close database connection".

Every "connect to database" is a small holdup, because that'll take more than a few milliseconds (authentication, possibly a secure connection, select database, set connection parameters, and other overhead).


Connection pooling refers to keeping a number of connections open, so that you can skip some of that overhead and go straight to the "do stuff in database".


It is mostly a concept, and can be done multiple ways.

It can be a separate process, it can be a pool kept within a client process. The basics aren't hard to implement yourself within a specific app, yet consider that...

A more generic solution needs to worry about how distinct host/port/user/database parameter combinations map to distinct connections

...which argues for something external to your own code.

It's nice if it reconnects transparently (and handles other edge cases) and overrides certain functions (particularly connection close)

...which argues for a library.

It's nice to have minimal code changes (rather than having to change every database connect into a pool request)

...which argues for a proxy process that takes your TCP connections (and does the database connection for you).

Specific to...

PostgreSQL

See Postgres#Notes_on_performance_and_innards


MySQL

Memory configuration
Basic
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The most interesting tweak is to give more than the default amount of memory to either...

  • InnoDB: use innodb_buffer_pool_size (innodb_buffer_pool in older versions(verify))
  • MyISAM: use key_buffer_size (key_buffer in older versions(verify))

To avoid having to divide your memory between engines, settle on one and use it for everything (most people seem to settle on InnoDB eventually).


You can often give it a good chunk of your free memory. If the server is dedicated, use up to perhaps 80% of memory (leaving the rest for other things such as MySQL per-connection buffers, and OS cache) - though note that the OS cache will often be an effective supporting cache, particularly for data sets that fit in memory.

On non-dedicated servers it may be useful to leave at least a few hundred megs for apache and other things, because if you push the allocation so far up that the OS has to swap out parts of MySQL to serve everything, the effect can be just as bad (or worse) than not having all data in memory.
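For example, on a reasonably recent MySQL (older versions need this in my.cnf and a restart; the 4GB figure is purely an illustration):

-- current value, in bytes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- give InnoDB 4GB of buffer pool
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;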


Note that MyISAM only ever caches index data(verify), while InnoDB caches both index data and row data.

(both can be a good thing, depending on your goals, data, schema design, size of the data, and amount of RAM).


Query cache
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

You could investigate how much you have to gain by using the MySQL query cache, and then check whether it being bigger would help.

Consider also that if it does help, you can often do it more effectively yourself with memcached or such.


Others
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Note that some allocation settings apply per connection instead of globally - for example the sort buffer size.


open tables, open files
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

If you use MyISAM: each MyISAM table is actually three different files, so you need the per-process file handle maximum to be 3 times the number of tables you want to be able to have open(verify). Note that with MyISAM, open_files_limit must be at least 3*table_cache.


Note that some OSes put a limit on per-process open files. If it is limited, you should not set MySQL above that number, or it will respond somewhat unreliably. Note also that various OSes allow you to set their open file limit higher.



table_cache has nothing to do with caching table data; it seems to be the number of table references you can have open at a time, be it the permanent on-disk ones (such references can get re-used through this cache) or temporary ones (which are generally destroyed fairly quickly).(verify)

Note that each join is potentially a temporary table, so to be safe, you may want to set table_cache to max_connections times the average/max number of joins in common queries.

At the same time, table_cache should not imply a higher amount of open file descriptors than open_files_limit supports.

The default for table_cache is fine for most database designs and with the default max_connections value; it usually only needs changing in unusual cases.




Binlogging

(mysql's name, most databases do this one way or the other)

...is data journaling, and is often on by default. It's a boon if you want to recover as much data as possible after a crash (it won't recover changes that hadn't been written yet when the server crashed, but that sounds fair enough, right?), and binlogs are also used in replication (clustering).

Binlogging slows the database down somewhat, and the binlogs grow huge as it's basically a fully descriptive copy of everything you do to the database.

If you feel daily or weekly backups are enough, you can disable binlogging per database (binlog-ignore-db=dbname in my.cnf) or server-wide.

If you want binlogging, and are using platter drives, you can save seek time by making them go to a distinct drive.

To just empty them (this seems to remove only what isn't necessary for recovery to the current state(verify)), do a
FLUSH MASTER
(verify).


Slow Query log

The slow query log logs what queries are slow. Any query that takes longer than long_query_time specifies (in seconds) will be logged.

This can be rather useful to see what your bogged-down server is doing without having to profile your client code.

It can also be useful to see where you could perhaps use an index -- or where the query optimizer is misjudging.
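For example, in MySQL (these can also go in my.cnf; shown here as runtime settings):

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;              -- log anything slower than 1 second
SHOW VARIABLES LIKE 'slow_query_log_file';   -- where the log is written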


Query Cache

MySQL can be told to keep a cache that maps from query string to the results it gives, avoiding disk access completely. This is a simple memcache and sounds cool, but in practice it is only interesting on tables that:

  • rarely change (writes invalidate cache entries)
  • return fairly small result sets (small enough to easily fit in the space allocated for this feature, alongside another bunch of them)
  • using queries that are identical
  • are queried at least somewhat regularly
  • using queries that are deterministic (see note below)

About determinism: an expression containing something like logdate>(current_date() - interval 31 days) is non-deterministic, since current_date returns something different each time you call it. Since in this case you're probably looking for approximately a month's worth of things to give statistics over, a few seconds won't matter. You could calculate the timestamp for midnight a month ago, then use that value in statistics queries. This value is a constant and will only change once a day, meaning the query cache can be used.
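A sketch of that rewrite (hypothetical log table; the application computes the cutoff once and sends it as a constant):

-- not cacheable: CURRENT_DATE() makes the query non-deterministic
SELECT count(*) FROM log WHERE logdate > (CURRENT_DATE() - INTERVAL 31 DAY);

-- cacheable: the application substitutes a constant cutoff (e.g. midnight a month ago),
-- so the exact same query string repeats until the cutoff changes
SELECT count(*) FROM log WHERE logdate > '2023-01-01 00:00:00';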


Most other database software does not have this feature, mostly because there are few cases in which it is a particularly useful feature. In practice, it improves fewer cases than even the above constraints suggest. You often want to add your own caching logic, meaning that an external memcache (e.g. memcached) is more practical.

Note that whenever one of these reasons implies the query cache is ineffective (particularly from common invalidation), it being enabled will slightly slow you down as there is constant management going on (entering on selects, invalidation on table updates). For this reason, you may want to enable the query cache but have it apply only for queries you know it is useful for and explicitly add the SQL_CACHE keyword to.


Consider, for example :

  • a table of a few dozen application settings. You probably fetch values often, it rarely ever changes, so you naturally want it to be in memory. The thing is that it was probably cached anyway.
  • looking up names of magazines by ISSN (e.g. select name from issns where issn=something), from a table that is probably rarely updated. There are over a million ISSNs, meaning the table would be large. The size itself is not too important since you would be crazy to not have an index on such a table, but the distribution is: it wouldn't be surprising to see lookups for a top few thousand come up as often as for the bottom million, and it would be nice to know that it's remembering many by-individual-ISSN queries, so that it won't have to do a disk read for those.
  • your own case :)
Use

If you have the query cache on and you do exactly the same query as you have done before, it will take the last cached result set from memory (if it was in there: if it fit and qualified).

  • Query
    SHOW VARIABLES LIKE 'query_cache_%'
    to see the configuration.
  • Use
    SHOW STATUS LIKE 'qcache_%';
    to see statistics of how much the query cache is currently being used
  • See also [1] for reference.


Note that depending on how the qc is configured (see query_cache_type), it may apply to everything you don't tell it not to work on (SELECT SQL_NO_CACHE ...), or work only on demand (SELECT SQL_CACHE ...). The latter may not be the most automatic, but it is often the simplest way to avoid cycling out useful entries due to unnecessary query caching.
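For example, with query_cache_type set to on-demand (2), only explicitly marked queries get cached (MySQL-specific; note that the query cache was removed entirely in MySQL 8.0; table names here are made up):

SELECT SQL_CACHE name, value FROM app_settings;       -- worth caching: small, rarely changes
SELECT SQL_NO_CACHE * FROM orders WHERE id = 1234;    -- not worth caching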

SQL

Note on GROUP BY

Join types

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


INNER join

The join you typically use is the inner join.

It goes through multiple tables, and when a condition is matched, it emits a row that typically combines values from both tables.

For example:

select person.name, department.name
from   person, department
where  person.deptid = department.id


You don't need to type JOIN or INNER JOIN, but you can. For example:

SELECT person.name, department.name
FROM person
JOIN department ON person.deptid = department.id

I personally don't, but you can argue that it's clearer, pointing out that all operations on multiple tables are some sort of join.




Named inner-join variants:

  • an equi-join is an inner join that uses an equality comparison (often identifiers).
  • a natural join is an inner join that implies all same-named columns must be equi-joined on
(and only one of each equi-joined column shows up, unlike in a manual equi/inner join).
It's partly just laziness.


CROSS JOIN

When you want all possible combinations of rows from two tables, you want a cross join (a.k.a. Cartesian join).

You don't need or do this very often, but it has its uses.


It happens when you select from multiple tables and don't have a restricting condition. (you could consider it a variant of inner join if you want)

SELECT a.thing, b.thing
FROM a, b

You can also do that with an explicit CROSS JOIN:

SELECT a.thing, b.thing
FROM a
CROSS JOIN b



OUTER JOINs: LEFT JOIN, RIGHT JOIN, and FULL JOIN

Note:

LEFT JOIN is short for LEFT OUTER JOIN
RIGHT JOIN short for RIGHT OUTER JOIN
FULL JOIN short for FULL OUTER JOIN


When the condition fails in a left/right join, we still emit a row, but with NULLs on the other side.

This is useful when tabulating incomplete data, particularly if it is inherently incomplete.


An example will help here. Say you are adding nice names for ID3v2 field names (e.g. "Lead Performer" for TPE1), but haven't finished yet.

Say you have a table filedata:

FILENAME       FIELDID   FIELDVALUE
/music/a.mp3   TIT2      Christopher Lydon
/music/a.mp3   TPE1      Amanda Palmer

And a table fielddescr:

FIELDID    DESCRIPTION
TIT2       Title

Then:

SELECT filedata.filename, fielddescr.description,   filedata.fieldvalue
FROM filedata
LEFT JOIN fielddescr
       ON filedata.fieldid=fielddescr.fieldid

Will give you:

/music/a.mp3   Title      Christopher Lydon
/music/a.mp3   NULL       Amanda Palmer

...whereas for an inner join, that last row would have been omitted.


This is called LEFT JOIN because

the left (first mentioned) table is the important one, while
the column(s) coming from the right one (the one you mention after 'LEFT JOIN', fielddescr) may contain NULLs.


A RIGHT JOIN is the same thing with the other table being the important one.

Left and right are in some ways the same thing, but there are many cases where one makes for a less awkward, more readable query.



A FULL JOIN is like the above, where neither table is the important one.

Say you have a customers and orders table.

Then

SELECT          customers.name, orders.order_id
FROM            customers
FULL OUTER JOIN orders ON customers.customer_id=orders.customer_id;

will give customers without orders, as well as orders without customers (probably on top of mostly rows that have both)



Semi-joins, Anti-join
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Correlated subquery

A correlated subquery is one where a subquery uses a value from the outer query.

Which easily leads to more work.


For example, in

SELECT employee_name
  FROM employees emp
  WHERE salary > ( SELECT AVG(salary) FROM employees WHERE department = emp.department);

those are two different queries.

In this particular case, the query might be executed for every employee, but if the query optimizer is smart, it could cache the per-department averages it calculates, so the inner query will only need to be executed once for every department that you select employees from.


For another example:

An uncorrelated subquery might be: (fetching tags for files we no longer have)

SELECT distinct guid   FROM tags_by_guid
WHERE guid not in  (SELECT guid FROM files)

Uncorrelated queries often allow fewer optimizations.

Even if you consider it obvious that this should be a join because the thing you're matching on is the same sort of thing, the query optimizer can't know it can make that assumption.

As such, in the above the subquery will probably be executed each time (as EXPLAIN will mention).


You would want to write it as a correlated subquery instead. There are a few options, one of them:

SELECT distinct guid from tags_by_guid tbg
WHERE not exists (select 1 from files where files.guid=tbg.guid)

...may lead that to be executed as a single join (which should be faster even though the inner query may still be a sequential scan; the reasons behind this case, including how much faster it is, are more involved).

See the related Postgresql_notes#Understanding_EXPLAIN_.2F_queries_to_rewrite_for_indices, specifically 'Sequential scan because of a lack of join', for more explanation.


Isolation levels

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Isolation levels are a controlled way to loosen strictly ACID behaviour.

Roughly speaking, it describes what non-ACID behaviour is allowed when you don't do explicit and fairly complete locking.(verify)


The three things most often mentioned that you want to avoid:

  • Dirty reads: getting data ignorant of any transaction or locking
Means we don't have to wait at all
May mean a connection reads data that is not yet committed by another
when multiple clients are working with data that has to make coherent sense, you can assume this will go wrong at some point
  • Nonrepeatable reads: getting different values for the same row when trying to read it multiple times in the same transaction
...e.g. because of another transaction's committed alterations. Reads can be non-dirty, but still non-repeatable.
The nonrepeatable reads we usually care about are those within the same transaction. There is also the concept of "nonrepeatable within the same query"; the fact that we use the same word for both is confusing, yes.
  • Phantom behavior / phantom reads refers to transactions 'magically' seeing new rows
i.e. seeing that the result set changes (rather than that specific, identifiable row values change)
e.g. caused by WHERE clauses that select from another table at different times, which can find new matching rows after they are committed by something else.


Approximately speaking,

dirty reads are reading non-committed data from another transaction
nonrepeatable reads are often because another transaction COMMITted an UPDATE
phantoms are often because another transaction COMMITted an INSERT or DELETE)

Approximately, because with more scrutiny there are a few more specific cases/behaviours.



You don't have to think about all of that because SQL-92 defines isolation levels:

  • READ UNCOMMITTED
    • Basically 'anything you find will do', does not use/observe read locks
    • allows dirty read
    • allows nonrepeatable reads
    • allows phantom behaviour
  • READ COMMITTED
    • Won't read uncommitted data, doesn't ensure that read rows won't change within the length of this transaction
    • avoids dirty reads (only)
    • allows nonrepeatable reads
    • allows phantom behaviour
"Won't read uncomitted data" and "allows nonrepeatable reads" seem to conflict. The point lies in the difference between a query and a transaction. Nonrepeatable reads in a query means it will not see different values if it needs to read a row twice. But two identical queries in the same transaction can still see data if another transaction has completely committed data to the relevant table. If you want all state to be repeatable within your transaction, you want either locks, or the following:
  • REPEATABLE READ
    • Won't read uncommitted data, and does ensure read rows will not change within the transaction. Put another way, you're essentially reading from the same snapshot each time.
    • avoids dirty reads
    • avoids nonrepeatable reads (the main point, since it's the name of the thing)
    • still allows phantom behaviour (...wider locking may avoid it, without being quite the same as SERIALIZABLE. Depends.)
    • usually implemented by placing locks on all data relevant to the query
  • SERIALIZABLE
    • strictest -- no transaction interaction possible at all
    • Tends to be noticeably slower, because it is often implemented with zealous locking, or sometimes even by never handling two transactions on the same object(s) at once(verify).
    • avoids dirty reads
    • avoids nonrepeatable reads
    • avoids phantom behaviour


Some database engines may offer their own additional levels (e.g. SNAPSHOT[2]).

Note that oracle and postgres say serializable when they mean snapshot(verify)



Note that most of these isolations aren't ACID - that's sort of the point.

ACID is great theory, but basically implies "do things entirely in series always, or it becomes too complex to handle"

It turns out that parallelism is easily faster only under some conditions.

So the question becomes a tradeoff: which guarantees you can give up in exchange for extra speed.



Most database engines can be asked to behave at different levels (sometimes you want stricter to be highly correct, sometimes looser if you know the speed increase does not come with incorrectness for this specific query).
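Asking for a level typically looks something like the following (details vary per engine):

-- MySQL: applies to the next transaction (use SET SESSION ... for the whole connection)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- PostgreSQL: set it inside the transaction block, right after BEGIN
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ...queries...
COMMIT;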


For example:

MySQL:

  • MyISAM does not support transactions at all, which is probably best described as READ UNCOMMITTED since multiple connections can interleave queries in any order
  • InnoDB supports all four(verify), and defaults to REPEATABLE READ


Postgresql:

  • its MVCC nature means it internally has the equivalents of READ COMMITTED and SERIALIZABLE.
  • Default is READ COMMITTED(verify)
  • Asking for the other two gives you the next strictest:
READ UNCOMMITTED → READ COMMITTED
REPEATABLE READ → SERIALIZABLE
  • Note that given its MVCC design, reads tend to spend less time waiting for locks, so looser settings often aren't useful






DB2 has its own terms, goes into a few more distinctions, and may be interesting reading.




Considerations to allowing dirty reads
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Setting the TRANSACTION ISOLATION LEVEL to READ UNCOMMITTED (or using e.g. MSSQL's per-query WITH(NOLOCK)) allows dirty reads to happen, which means the value is read out without locking considerations.

This can be useful when any value currently in there will do. This may be true in the case of social-site profile information, continually updated statistics tables, and such. Any value from now to a few seconds ago will be perfectly good - using locks and listening to transactional accuracy will only really mean more waiting.

Of course, dirty reads may at any time return uncommitted, old, or duplicate data (because you are asking the database to violate ACIDity for the read), meaning you should only do this when you know you'll do no harm, such as when you only read the value to show it and no write/update will ever be based on it. (Much the same goes for non-repeatable reads, really)
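A sketch of the 'any recent value will do' case (MSSQL syntax; table and column names are made up):

-- per-query table hint
SELECT view_count FROM profile_stats WITH (NOLOCK) WHERE profile_id = 42;

-- or loosen it for the whole session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;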


Note that MVCC-style engines, such as postgresql and MySQL's InnoDB, avoid a lot of classical style locking by isolating data versions, which means there is little speed increase in allowing requests to give you dirty versions.


See also

Constraints and indices

PRIMARY, INDEX, UNIQUE, UNIQUE INDEX, CLUSTER

  • UNIQUE constraints
    • are a logical concept on a column tuple that tells inserts/updates to not violate this
    • are usually implemented by also having an index, but not always
    • may enable the scheduler to give you indexed lookups, but this is not at all guaranteed behaviour
  • An INDEX stores where to find data, usually on-disk, and avoids walking through all data. Indices are most useful when they allow operations to work on (greatly) reduced sets of data.
    • Usually this means tests for equality, such as WHEREs in lookups or various JOINs.
    • They may also help GROUP BY and/or ORDER BY, under certain conditions (verify)
    • Avoid them when there is likely little gain but possible loss from overhead, e.g. when:
      • ...the respective WHERE(/JOIN) will not be able to reduce the result set much
      • ...indices need to be adjusted very often (e.g. high volume of changes not done in batches)
      • (arguably) there are very few column values, such as in enumerations or often NULL (this will mostly duplicate the data and provide little gain)
      • ...field contents are/can be very long content, e.g. entire text documents (however, indices may be configurable to use only first n characters, which depending on the data may be a good place to tune)
    • Indices on multiple columns are possible
      • ...but only useful if you use these column tuples in a where or join; it does not give you indices on the separate columns. You usually should have a specific reason to do this.
      • Because of likely implementations, it often pays to order them from most to least restrictive.
  • UNIQUE INDEX means specifically getting both at once, presumably in the same index structure. This may also be useful information for the query scheduler. This often acts as the same thing as UNIQUE(verify).
  • A PRIMARY key constraint is largely a UNIQUE constraint and related index (usually), and also enforces that the value is not NULL.


  • CLUSTER sorts data on-disk based on an index.
    • Done immediately, once. Only done when you explicitly ask it to, so for continuously updated tables this is worthless.
    • Seems to be an in-place alternative to selecting into another table with an ORDER BY (verify)
    • Has the most use (arguably only has use) when you mostly/always select a small set of the data, and the index predicts this well, because it enables these selects to be in a smaller part of the disk. Note that subsequent inserts will not be clustered and lessen the improvement over time. (See the sketches after this list.)
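A few sketches of the above (PostgreSQL-flavoured syntax; names are made up):

-- UNIQUE constraint (typically backed by an index)
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);

-- UNIQUE INDEX: constraint and index in one go
CREATE UNIQUE INDEX users_email_idx ON users (email);

-- multi-column index, most restrictive column first
CREATE INDEX orders_cust_date_idx ON orders (customer_id, order_date);

-- one-off physical reordering of the table along an index (PostgreSQL)
CLUSTER orders USING orders_cust_date_idx;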

The I-don't-know-what-I-just-inserted problem

When you INSERT something and rely on the database to create a new unique identifier (SERIAL column in postgres, AUTO_INCREMENT in MySQL, IDENTITY in MSSQL, etc.), standard SQL has no way to let you directly/unambiguously return that new value.

When this automatically generated identifier is the row's only identifier, this means it's effectively impossible to fetch/reference just the row you just inserted.


Most solutions require separate communication of the row identifier, either by using a backend-specific function to select this single value, or by doing the generation of identifiers yourself and using it explicitly in the insert (where generating it yourself may e.g. mean using the sequence these are generated from, or even something like UUIDs).


Get-the-value solutions imply that you have to check that the insert succeeded in the first place. It may also mean that doing multiple-row inserts can be impractical, since it tends to imply getting only the first (or last) ID that was set.

Some solutions may need calls to happen sequentially and in the same connection or transaction. Beware of creating race conditions.


Fetch the last last-set/generated value

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Most engines have an automatically-generated-ID thing, and let you fetch the last generated value.

This can be easiest, but it means your SQL is not portable between database backends.


In PostgreSQL:

You can use:

SELECT currval('tablename_columnname_seq')

currval returns the value most recently obtained by nextval for this sequence in the current session.

In other words, it's safe with respect to inserts done by other connections.

(Side note: currval is considered a volatile function, so you often don't want to use it in a condition or subquery directly).


Since postgres 8.2, you can also do:

INSERT INTO foo DEFAULT VALUES RETURNING id

...which is like a combined insert-and-select - what follows RETURNING is a basic select expression, but the most common use is probably returning the column containing the sequence-assigned IDs.




In MSSQL: You can fetch:

  • SELECT IDENT_CURRENT('tablename') reports the last inserted value for a table, regardless of connection/session
  • SELECT @@IDENTITY reports the last value set in the context of the current connection, regardless of table
  • SELECT SCOPE_IDENTITY() - like @@IDENTITY, but restricted to the scope, probably that of the insert you just did (verify). This avoids problems when you e.g. have triggers that may cause other identity field values to be generated (a usage sketch follows below)
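A usage sketch (MSSQL; hypothetical table):

INSERT INTO things (name) VALUES ('example');
SELECT SCOPE_IDENTITY();   -- the identity value generated by that insert, in this scope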



In MySQL:

Use LAST_INSERT_ID() (from SQL, or mysql_insert_id() from code).

This returns the last ID that was set into an auto_increment column (even if you handed it along explicitly), within the context of the same/current connection.

This is pretty robust, but the feature does not work with MyISAM when it uses 'Use Concurrent Inserts'.
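For example (hypothetical table):

INSERT INTO things (name) VALUES ('example');
SELECT LAST_INSERT_ID();   -- the auto_increment value set by that insert, for this connection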

Manual new ID

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

serial/auto_increment behaviour generally guarantees that the next value is higher than any it has handed out before (roughly max(ids_in_table)+1), and usually does so by being a simple counter belonging to a column/table, which only gets incremented and never wraps.

This makes it very simple and fast, although the non-wrapping is an inherent and potentially big problem, and perhaps more importantly, if you want your application to be able to use more than one backend, using such non-standard features means you need different code to handle different backends, something you may want to avoid.


A slower solution would be to calculate that max ID yourself, such as with a 'select max(id)', or '...order by id, limit 1' sort of thing. However, both of these are slow for big tables, so you may want to store a counter separately.

Postgres helps by allowing you to create sequences. These are used internally for serial fields, but can also be used to imitate a serial field since selecting from a sequence also increases its value.
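A sketch of that (PostgreSQL; names are made up):

CREATE SEQUENCE thing_ids;

-- nextval() both returns and advances the counter, so concurrent sessions never get the same value
INSERT INTO things (id, name) VALUES (nextval('thing_ids'), 'example');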

Without it, however, you have to make sure no two clients can get the same ID to insert, by putting the select and update of the counter in a transaction. However, since MySQL's MyISAM does not support transactions, this approach is not safe there.



Semi-sorted

Upsert

"Insert this row. If it already exists according to a uniqueness constraint, update that row instead" (or, in other cases, do nothing instead)


Consider a table

word   varchar(50)   primary key
count  integer


There are extensions in various database engines to allow something like:

MySQL:

INSERT INTO         wordcount (word,count)  VALUES ('a_word',1)  
ON DUPLICATE KEY    UPDATE    count=count+1;


Postgres introduced ON CONFLICT in version 9.5 [3]

INSERT INTO         wordcount (word,count)  VALUES ('a_word',1)  
ON CONFLICT (word)  DO UPDATE SET count = wordcount.count + 1;

(TODO: test that example)


Some also allow similar behaviour in other ways. On postgres you could do it with a rule, something like:

CREATE OR REPLACE RULE strings_counter AS
ON INSERT TO strings
WHERE (EXISTS ( SELECT 1 FROM strings tester WHERE new.string = tester.string)) 
DO INSTEAD  UPDATE strings SET count = strings.count + 1
            WHERE strings.string = new.string;



Backend agnostic code will require two statements, because upserts are necessarily a small transaction (implementing it cleanly and efficiently seems to be the reason postgres postponed it for a while).


In fact, some DIY solutions, like the following, will have possible races, others will have possible deadlocks - and under low load you may never see either of these problems.


If you usually want to update, then you can instead:

UPDATE wordcounter set count=count+1 where word=myword

and if that affects zero rows:

INSERT INTO wordcounter values (myword, 1)


If you usually want to fall back to do nothing (e.g. "insert a row if it wasn't there") it may often be more efficient to do a SELECT, then e.g. INSERT (or UPDATE) based on whether you had 0 or 1 rows.

counting is slow

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Using SQL's COUNT for a precise amount of things in a large table is slow.


This is largely an implication of ACIDity. In particular isolation means you need to actually check the visibility of everything you want to count

Which potentially amounts to a sequential scan, though there are optimized cases.

COUNT DISTINCT is worse, because most of those optimizations do not apply.(verify)



If ballpark is good enough, there are various faster ways - for an overall table COUNT, not so much COUNT DISTINCT or anything else that implies conditional counts.


Things you can consider doing include:

  • count precisely only every so often, and cache the results.
makes sense if e.g. a daily update is good enough
  • store a count yourself
possibly in the database, which you can keep accurate e.g. by doing this via triggers (...at minor cost at insert/delete time)
possibly not in the database, because there are lower-overhead, deadlock-free ways to count atomically


  • if an estimate is enough (with some caveats you should know about - but which may be fine for you), most database engines store one somewhere, and have backend-specific ways of fetching it.
e.g in postgres you can ask
the stats collector (from pg_stat_all_tables),
the counts at last VACUUM / ANALYZE (from pg_class)
keep in mind that these are often only a little behind - but there are cases where they are significantly off
note that this is sometimes quite similar to the cached-daily-update case


  • conditional counts can sometimes be assisted with an index (see the sketch after this list)
consider e.g. a background process looking for an "update from file" field
which, if run frequently enough, typically finds it set on only a tiny fraction of items
then a partial index on updateme=true is both small itself, and supports (only) queries like SELECT count(*) where updateme=true;
  • e.g. postgres 9.2 introduced index-only scans, which means that under certain conditions, queries (including certain count queries) can be served entirely or largely from the index, which can be much faster.
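Sketches of the estimate and the partial-index options mentioned above (PostgreSQL; the table, and the boolean updateme column, are made up):

-- rough row count from the planner's statistics, instead of a full count(*)
SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'files';

-- small partial index that supports counting just the flagged rows
CREATE INDEX files_updateme_idx ON files (id) WHERE updateme;
SELECT count(*) FROM files WHERE updateme;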

consider not using auto_increment/serial

Fetching random items and samples

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

If you want a statistically robust random sample without thinking, you can often use ORDER BY RANDOM (exact syntax varies), which is often slow, because picking from most of the table means reading much of an index/table.

There are faster tricks

  • you can get faster results with fairly decent randomness when you know certain things about your table beforehand
  • when your needs are less strict, e.g. "a chunk of examples that is picked somewhat arbitrarily"


order by random

  • properly random
  • slow: probably O(n lg n) with the table size at best, and often worse in practice - you probably do not want to do this unless the table will always stay smallish.

Example:

SELECT * from strings ORDER BY random() LIMIT 10

(note: random() may be called rand() or something else)

random offset

  • best for single item or when a contiguous range at a random position is good enough
  • requires knowing amount of items
  • good for random selection of single items, often not so much for larger samples
  • linear with table size (probably(verify))


Idea: When you (approximately or exactly) know the number of items, you can pick a spot between 1 and that amount and use it as an offset. The SQL syntax varies. Details are typically found in LIMIT documentation.

If you fetch many items, they'll be related by some sort of sorting, which may not be random enough, or be perfectly fine for your purposes.


Knowing the amount of items exactly is preferable, in that if your estimate is a little high, you would sometimes select nothing. If you know how much off the estimate can be, you can be conservative with your offset.


Postgres optimization: fetching the amount of items recorded at last vacuum time (SELECT reltuples FROM pg_class WHERE relname='tablename') is faster than a basic-SQL select count(*) from tablename. The difference can matter for (many) small samples from large tables.

In MySQL+InnoDB, you can use SHOW TABLE STATUS similarly, but its value is a much rougher estimate that in some situations is worthless; see the docs.
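A sketch of the random-offset trick (hypothetical strings table; the second form is PostgreSQL-flavoured):

-- the application picks a random offset in [0, rowcount), say 4217:
SELECT * FROM strings LIMIT 1 OFFSET 4217;

-- or entirely in SQL, if you know there are roughly 100000 rows (PostgreSQL):
SELECT * FROM strings LIMIT 1 OFFSET floor(random() * 100000)::int;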

select a fraction of rows

  • randomly allows items to become part of the result
  • useful for portions (choosing threshold by (approximate) amount of rows)
  • decent randomness - within the set it considers (consider: table order, limit)
  • should be linear time with min(table size, amount you want) (verify)
which, if you want to select from all the table, is roughly a tablescan


However, note:

its relation to the data is a bit funny
it relies on knowing roughly how much data there is / that there is enough

Consider:

-- selects 50% from the first 200ish it sees
select * from tab where rand()<0.5 limit 100  

-- selects 10% of the first 1000ish rows
select * from tab where rand()<0.1 limit 100


select next ID

  • can be the fastest trick (for single item), but the one with the most footnotes / caveats.
  • randomness can be good, depending on the table's history (see notes below)
  • typically an index-based lookup, then avoiding both sorting and skipping past records. If so, it's the fastest option for larger tables


Main assumptions:

  • your rows have IDs on them and they are indexed (often true)
  • you know little or nothing was deleted

...because

that means you have a known range, where thinking up an integer is pretty-properly random,
and the index means 'look up this or the next highest id' (to be robust to a few holes due to deletes) will use an index and so be fast (noticeably faster than linear time anyway)


For example, to get a random string from a (id,string) table, first get an ID in the given range

SELECT FLOOR( RANDOM()*(SELECT max(id) FROM strings))

And then use it to select the row with the next highest ID (these two are typically combined in a single statement)

SELECT * FROM strings WHERE id >= that_randid LIMIT 1


There are a lot of variants and a lot of possible tweaks to this trick. For example, you can often know (or estimate) max(id) in cheaper ways (e.g. remember it on inserts, select it from the relevant sequence in postgres)


Tables with a lot of deletes are problematic, because the random integer in 1..max(id) now isn't properly distributed.

The more sparse or unbalanced the IDs in your whole range, the worse the randomness is and the more you may e.g. see specific items.

notes

  • SELECT count(*) is often linear time with table size, so remembering an approximate count may be preferable over using count(). For a lot of software there is a non-standard, cheaper, and often more approximate way of estimating the amount of rows (some notes on this for postgres and mysql are mentioned below)
  • Watch bordercases that can mean fewer results than you think (like 0), particularly on range fetches - the real code you'll need often needs to cheat a little more than that shown above.
  • LIMIT 1 (or any smallish number) is often fast because it tells the database that after it has found one thing, it can immediately stop looking in the index or table.
  • Similarly, a limit with an offset can also work fairly well, to fetch a chunk of items at a random position. It's often slower, though, as it needs to skip past data.
  • You generally want to avoid sorting when possible, as that's O(n*log(n)) at best
  • the speed of selecting max(id) or count(*) may differ depending on indices, optimizations, and possibly on engine

Multiple cursors

.NET notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

.NET has a database interface made up of a good bunch of classes. It is nicely generic, if quite verbose.


Connections to data sources

a Connection object represents a connection to a particular data source, and is subclassed for types of connection, such as SqlConnection, OleDbConnection, OdbcConnection, OracleConnection.


You set up an object using a connection string, then call Open() when you need a connection.


It's common to use a using to have disconnection/threadpool releases (if applicable) handled automatically when execution exits the code block:

using (SqlConnection conn = new SqlConnection(connectionString)) {
    conn.Open();
    // all work that needs the connection
}

Other ways are calling Close() explicitly, or handing CommandBehavior.CloseConnection to commands, but you should only do so when you are sure you need it, as it would lessen threadpool effectiveness.


Connection pooling has to be supported by the database adapter, and may be disabled via the connection string.


You can use an event handler to listen to connection state changes if you wish.

Commands and their execution

Commands represent operations in a database. They may be instantiated directly or be created using Connection.CreateCommand() or a Transaction object.

You can set the exact command yourself, e.g.:

SqlCommand countStrings = new SqlCommand(
      "SELECT count(*) FROM strings", connection);

You can also use one of InsertCommand, UpdateCommand, and DeleteCommand (or more exactly, their derivatives)

SqlParameter

To avoid SQL injection, you can use the fairly standard place-a-marker-in-the-actual-query, tell-it-what-to-replace-it-with approach.

There are many ways to do this, including the following two:

SqlCommand cmd = new SqlCommand("SELECT count(*) from tablename where email=@email AND password=@2", connection);
// short form: add a parameter and its value in one call
cmd.Parameters.AddWithValue("@email", email);
cmd.Parameters.Add("@2", SqlDbType.Text).Value = password;
// A much longer, equivalent form (instead of the AddWithValue line above) would be:
SqlParameter emailParam = new SqlParameter();
emailParam.ParameterName = "@email";
emailParam.Value         = email;
cmd.Parameters.Add(emailParam);

// You may sometimes want to use Convert.ToSomething() (e.g. Convert.ToInt32()) to catch bad data on the code side rather than letting the database handle it.


Parameters are also used to specify mappings to DataSet objects.


Readers

ExecuteReader() returns an object that allows you to read many rows. For example:

SqlDataReader reader = command.ExecuteReader();  // optionally with a CommandBehavior

while (reader.Read()) {  // Read() advances to the next row; its column values are then accessible. For example:
    Console.WriteLine(String.Format("{0}", reader[0]));
}
Scalars

ExecuteScalar() returns the first column of the first row, and ignores the rest.

Use when you expect exactly one such result, such as when you use aggregate functions to select just the min, max, or count of something.

For example:

numberOfStrings = (Int32)command.ExecuteScalar();


Non-queries

ExecuteNonQuery is used for commands that do not query for data, such as SQL INSERT, DELETE, UPDATE, and SET.

Returns the number of rows affected.

Begin...: Asynchronous variants
Stored procedures

DataSets

A DataSet is an in-memory representation of database data, allowing disconnected/offline access to data fetched earlier.

DataSets are usually filled by creating a DataAdapter based on a query and having it fill the object with data.

Alternatively, you can add data (rows) manually, as well as create the tables and their columns from scratch.


DataSets can also be used to update, insert, and delete data, although this has a few more problems than command-based code.

A DataSet may contain one or more DataTables. In some cases, just the latter will do, and you can fill it just the same (using a different overload of a data adapter).


Untyped DataSets are generic DataSet objects for which the column types only become clear when they are being filled, so type checking can only be done at runtime.

Typed DataSets are those for which you explicitly declare the typing before compilation, so strong typing can apply at compile time. This does imply you need a typed dataset for each different (major) query.

Typed datasets are declared based on schemas stored in xsd format (XML schema definitions).


BLOBs

There are also a few minor behaviour things that can bug people, such as:

  • the encoding/collation features, which vary significantly in behaviour for a few specific combinations of collation and text column type (one of utf8_general_ci/TEXT, utf8_general_ci/VARCHAR, utf8_bin/TEXT, utf8_bin/VARCHAR acts differently from the rest. Can you guess which?). This bugs people because it kills the concept of transparent storage and of all text being created equal, which seems sour given that MySQL's relevant code presumably aims to be smart about exactly that.