Relational database and SQL notes

For other database related things, see:

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

See also other articles in the Database category.



Speed and tweaking

Much of this is RDBMS-general.

Some of it came from suggestions specifically for one database engine or another.


General

Memory

tl;dr: Anything that can come from memory instead of disk tends to be much faster, so having more physical memory is usually good.

This holds particularly when the database is the primary thing that the host (or VM, or container) is doing.


Configuring it to use said memory well is wise -- defaults for many database engines are typically a bit conservative, so this is worth half an hour of looking at.

That said, the OS disk cache will generically help even if you don't, particularly if the database is the only process doing a lot of reading.

...but if you can dedicate a portion of RAM to database-managed cache, you should get somewhat more predictable performance in more variable situations.


You can get a lot more tweaky and specific. Say, you might redesign a schema to separate very-common data from rarely-fetched data, making it more likely that the latter won't clobber the former from caches (particularly if these work out as smaller and larger tables, respectively).
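
As a minimal sketch of that idea (table and column names invented here), the rarely-fetched bulk can live in a side table keyed by the same id, so the hot table stays small and cache-friendly:

-- wide original: users(id, name, email, last_login, bio, preferences_blob)
-- split into a small hot table and a rarely-read detail table:
CREATE TABLE users (
  id         integer PRIMARY KEY,
  name       varchar(100),
  email      varchar(100),
  last_login timestamp
);
CREATE TABLE user_details (
  user_id          integer PRIMARY KEY REFERENCES users(id),
  bio              text,
  preferences_blob text
);
-- the common query only ever touches the small table:
SELECT name, email FROM users WHERE id = 123;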


Disk and other hardware

tl;dr:

  • Avoid things hitting platter disks - latency will rocket up as seek times add up
it may be okayish in simple tests, but many users hitting disk is bad
  • SSDs are friendlier. Sometimes best applied to your core tables.
  • a mass of (transactioned) write operations can strangle IO, because each has to happen separately.
Flushing less often helps here - at the expense of how many seconds of recent changes are lost after a crash


Databases: IO bound

The slowest part of a database-backed app (including most webapps) is typically the database, because the database server is often not CPU-limited but disk-limited (or sometimes otherwise IO-limited).


Particularly less-technical businesses have a habit of thinking that things will go faster depending on the hardware's price tag, and throw money at the problem (bigger servers). This works somewhat, but is not very effective in the longer run.

Disk access is slow enough without having to share it, and running two database engines on the same disk (e.g. by server virtualization done the wrong way) can be fatal to performance.

In not-overly-pathological cases, two household computers would hold up far better than one powerful server, potentially purely because there are two separate disks.


Note there are app-side tricks, such as memcaches, that can relieve this to a surprising degree (depending on case, and design).

Things a database writes; Disk vs. SSD

The disk problem is rarely throughput; it is typically per-operation overhead, which on platter drives is mostly seek time.

SSDs are quite interesting here, since they mostly remove the seek-time problem.


Note also that a database (and the system it's on) has multiple distinct things it writes to, e.g. table data, database journaling, logs, and swap - and other programs do their own writing too. It sometimes makes sense to split those onto distinct devices, particularly if some of them are platter.


There is also typically a tradeoff of "if we go to disk less often, we may lose a few more seconds of data in the next crash, but also be faster".

Flushing can reasonably be made less frequent (see the sketch below) when

  • you have a UPS and a shutdown script
  • you don't store any critical data, e.g.
when it applies to databases used as caches,
when people will accept having to re-upload their most recent cat picture
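
For illustration, the commonly mentioned knobs for that tradeoff look like the following; check your engine's documentation before copying, and treat the values as placeholders:

-- MySQL/InnoDB: flush the transaction log roughly once per second
-- instead of at every commit (risks about a second of recent commits on crash)
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- PostgreSQL: report commits before the WAL is flushed to disk
-- (again risks the last moments of committed work, not table corruption)
SET synchronous_commit = off;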


Keep in mind, though, that networking is also a limiting factor, in that once it is saturated, faster disk / database reaction makes no difference.

This is also true for distributed caches: RAM is fast enough that networked cache speed depends mostly on the network.

The value of RAID
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Read speed is helped by most RAID levels - some a little, some more.

Write speed less so; it is a more complex question, particularly for a small number of disks.

RAID5 typically has a negative effect on writes; RAID0+1 or RAID1+0, while it may cost you one or two more disks, is similarly robust and faster.


Hardware versus software RAID:

  • hardware RAID tends to be faster, and less of a CPU hog
  • software RAID may be faster than cheap onboard RAID (even if not fakeRAID)
  • software RAID is often easily transported to another system
while hardware RAID's hardware can be a liability, in that replacements may no longer be sold by the time yours fails (so it may be difficult to get the data off your perfectly-happy drives)

Indexing

An art as well as a science, because while the most common queries may be obvious for your app, it's hard to design perfectly for real-world use and data distribution.

Notes:

  • there's often an 80/20 thing going on, so figure out e.g.
which one or two queries take most of the time
which happen less regularly but hit a lot of data
and then tune for those
  • EXPLAIN is your friend, to check how and when indices are used (example below)
a slow-query log is a good source for queries not hitting indices


  • Anticipate data growth. Long-term scaling is more important than short-term improvement.
  • Know indices are not free
indexes have to be updated in related write operations, so are only worth it if there are a good bunch of select operations that they actually improve
some indexes cost more than they are worth.
  • You can consider an index on a column when:
    • you know query conditions will often use that column,
    • its values are fairly unique, and/or
    • the result set is always relatively small
  • Realize what contributes to index size
(e.g. sometimes normalization helps a lot)
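
A minimal sketch of that workflow (table and column names invented): run a suspect query through EXPLAIN, and if it table-scans, add an index on the column(s) the WHERE actually uses.

-- hypothetical slow query taken from the slow-query log:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- if the plan shows a full table scan, try:
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- and EXPLAIN again to confirm the index is now actually used
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;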



Schema tweaking

Use the smallest types that will work; RDBMSes in general have a lot of storage overhead, which also implies memory use, so oversized types are more wasteful than necessary and lessen the effectiveness of throwing memory at the database.


There are also more low-level details you may want to watch. For example, if your primary key is a large string, try to use an auto_incrementing integer for identification/referencing/indexing instead.
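
A sketch of that (names invented; serial is postgres, AUTO_INCREMENT is the MySQL equivalent):

CREATE TABLE articles (
  id    serial PRIMARY KEY,    -- small surrogate key used for references and joins
  url   varchar(500) UNIQUE,   -- the long natural key, still enforced unique
  title varchar(200)
);
-- other tables reference the small integer, not the long string:
CREATE TABLE article_tags (
  article_id integer REFERENCES articles(id),
  tag        varchar(50)
);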


Query writing and EXPLAINing

There are often alternative ways to write a query with different index-using implications. Particularly when you use things like GROUP BY, think about what indices you have. If you're not sure it'll be used, you can use EXPLAIN to see the difference in the way the query optimizer will actually execute your query. Table scanning (looking at all rows) is obviously bad for large tables.

Think in sets rather than for loops, which will also help you avoid sub-queries and use the generally more efficient joins.

Try to rewrite for indices, for example by making conditions on indexed columns trivial comparisons rather than complex expressions.
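
For example, wrapping an indexed column in a function usually defeats the index, while a plain range comparison on the same column can use it (a sketch; YEAR() is MySQL-style, and exact behaviour depends on the optimizer):

-- likely a table scan: the indexed column is buried inside an expression
SELECT * FROM log WHERE YEAR(logdate) = 2019;

-- index-friendly: the indexed column is compared directly to constants
SELECT * FROM log WHERE logdate >= '2019-01-01' AND logdate < '2020-01-01';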



Study statistics

phpmyadmin has a 'Status' section, which shows you a good number of things that may point you towards changes that will serve a particular server. This includes things like the query cache (you can see its size under 'Variables').


Connection pooling

Establishing a connection for each request/interaction that a user does makes for code that is easy to read, but the connection process is often noticeably slow.

Connection pooling refers to keeping a number of connections open, storing them in a pool/cache, to be handed off to client code the next time it wants a database connection, cutting down on connection setup at the time of the request.

It is mostly a concept. Exactly how the pooling is done and how a coder should use it will differ between implementations. You may have to explicitly do something differently to get connection pooling, or it may be done transparently (overriding certain functions, particularly the connection close), in which case connections are also easily re-used. Often enough, transparent reconnection is also handled.


Specific to...

PostgreSQL

See Postgres#Notes_on_performance_and_innards


MySQL

Memory configuration
Basic
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The most interesting tweak is to give more than the default amount of memory to either...

  • InnoDB: use innodb_buffer_pool_size (innodb_buffer_pool in older versions(verify))
  • MyISAM: use key_buffer_size (key_buffer in older versions(verify))

To avoid having to divide your memory between engines, settle on one and use it for everything (most people seem to settle on InnoDB eventually).


You can often give it a good chunk of your free memory. If the server is dedicated, use up to perhaps 80% of memory (leaving the rest for other things such as MySQL per-connection buffers, and OS cache) - though note that the OS cache will often be an effective supporting cache, particularly for data sets that fit in memory.

On non-dedicated servers it may be useful to leave at least a few hundred megs for apache and other things, because if you push the allocation so far up that the OS has to swap out parts of MySQL to serve everything, the effect can be just as bad (or worse) than not having all data in memory.


Note that MyISAM only ever caches index data(verify), while InnoDB caches both index data and row data.

(both can be a good thing, depending on your goals, data, schema design, size of the data, and amount of RAM).
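
For example, to check and raise the InnoDB pool (the value is in bytes; newer MySQL versions can resize it at runtime, older ones need it set in my.cnf and a restart -- treat the number as a placeholder):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- e.g. give it 4GB on a dedicated 8GB box, adjusted to your own situation:
SET GLOBAL innodb_buffer_pool_size = 4294967296;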


Query cache
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

You could investigate how much you have to gain by using the MySQL query cache, and then check whether it being bigger would help.

Consider also that if it does help, you can often do it more effectively yourself with memcached or such.


Others
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Note that some allocation settings apply per connection instead of globally - for example the sort buffer size.


open tables, open files
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

If you use MyISAM: each MyISAM table is actually three different files, so you need the per-process file handle max to be 3 times the amount of tables you want to be able to have open(verify). Note that with MyISAM, open_files_limit must be at least 3*table_cache.


Note that some OSes put a limit on per-process open files. If it is limited, you should not set MySQL above that number, or it will respond somewhat unreliably. Note also that various OSes allow you to set their open file limit higher.



table_cache has nothing to do with caching table data; it seems to be the amount of table references you can have open at a time, be it the permanent on-disk ones (such references can get re-used through this cache) or temporary ones (which are generally destroyed fairly quickly).(verify)

Note that each join is potentially a temporary table, so for safety, you may want to set table_cache to max_connections times the average/max amount of joins in common queries.

At the same time, table_cache should not imply a higher amount of open file descriptors than open_files_limit supports.

The default for table_cache is fine for most database designs and with the default max_connections value; it usually only needs changing in unusual cases.
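
To see what you currently have (note that table_cache is called table_open_cache in newer MySQL versions):

SHOW GLOBAL VARIABLES LIKE 'open_files_limit';
SHOW GLOBAL VARIABLES LIKE 'table_cache';
SHOW GLOBAL VARIABLES LIKE 'max_connections';
-- Opened_tables climbing quickly is a hint that the table cache is too small:
SHOW GLOBAL STATUS LIKE 'Opened_tables';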


See also:


Binlogging

(mysql's name, most databases do this one way or the other)

...is data journaling, and is often on by default. It's a boon if you want to recover as much data as possible after a crash (it won't recover things that hadn't completed when the server crashed, but that sounds fair enough, right?), and binlogs are also used in replication (clustering).

Binlogging slows the database down somewhat, and the binlogs grow huge as it's basically a fully descriptive copy of everything you do to the database.

If you feel daily or weekly backups are enough, you can disable binlogging per database (binlog-ignore-db=dbname in my.cnf) or server-wide.

If you want binlogging, and are using platter drives, you can save seek time by making the binlogs go to a distinct drive.

To just empty them (this seems to remove only what isn't necessary for recovery to the current state(verify)), do a
FLUSH MASTER
(verify).


Slow Query log

The slow query log logs what queries are slow. Any query that takes longer than long_query_time specifies (in seconds) will be logged.

This can be rather useful to see what your bogged-down server is doing without having to profile your client code.

It can also be useful to see where you could perhaps use an index -- or where the query optimizer is misjudging.
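
Enabling it can be done at runtime in reasonably recent MySQL versions (older ones want it in my.cnf plus a restart); the threshold is up to you:

SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;    -- log anything that takes more than 1 second
SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';    -- where the log ends up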


Query Cache

MySQL can be told to keep a cache that maps from query string to the results it gives, avoiding disk access completely. This is a simple memcache and sounds cool, but in practice it is only interesting on tables that:

  • rarely change (writes invalidate cache entries)
  • return fairly small result sets (small enough to easily fit in the space allocated for this feature, alongside another bunch of them)
  • using queries that are identical
  • are queried at least somewhat regularly
  • using queries that are deterministic (see note below)

About determinism: an expression containing something like logdate>(current_date() - interval 31 day) is non-deterministic, since current_date returns something different each time you call it. Since in this case you're probably looking for approximately a month's worth of things to give statistics over, a few seconds won't matter. You could calculate the timestamp for midnight a month ago, then use that value in statistics queries. This value is a constant and will only change once a day, meaning the query cache can be used.
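
A sketch of the difference (table and column names invented; the literal date stands for whatever the application computed):

-- non-deterministic, so never served from the query cache:
SELECT count(*) FROM log WHERE logdate > (CURRENT_DATE() - INTERVAL 31 DAY);

-- the application fills in 'midnight a month ago' as a literal, so the exact
-- same query text repeats all day and the cache can serve it:
SELECT count(*) FROM log WHERE logdate > '2019-07-15 00:00:00';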


Most other database software does not have this feature, mostly because there are few cases in which it is particularly useful. In practice, it improves fewer cases than even the above constraints suggest. You often want to add your own caching logic, meaning that an external memcache (e.g. memcached) is more practical.

Note that whenever one of these reasons implies the query cache is ineffective (particularly from common invalidation), it being enabled will slightly slow you down as there is constant management going on (entering on selects, invalidation on table updates). For this reason, you may want to enable the query cache but have it apply only for queries you know it is useful for and explicitly add the SQL_CACHE keyword to.


Consider, for example :

  • a table of a few dozen application settings. You probably fetch values often, it rarely ever changes, so you naturally want it to be in memory. The thing is that it was probably cached anyway.
  • looking up names of magazines by ISSN (e.g. select name from issns where issn=something), from a table that is probably rarely updated. There are over a million ISSNs, meaning the table would be large. The size itself is not too important since you would be crazy to not have an index on such a table, but the distribution is: it wouldn't be surprising to see lookups for a top few thousand come up as often as for the bottom million, and it would be nice to know that it's remembering many by-individual-ISSN queries, so that it won't have to do a disk read for those.
  • your own case :)
Use

If you have the query cache on and you do exactly the same query as you have done before, it will take the last cached result set from memory (if it was in there: if it fit and qualified).

  • Query
    SHOW VARIABLES LIKE 'query_cache_%'
    to see the configuration.
  • Use
    SHOW STATUS LIKE 'qcache_%';
    to see statistics of how much the query cache is currently being used
  • See also [1] for reference.


Note that depending on how the qc is configured (see query_cache_type), it may apply to everything you don't tell it not to work on (SELECT SQL_NO_CACHE ...), or work only on demand (SELECT SQL_CACHE ...). The latter may not be the most automatic, but it is often the simplest way to avoid it cycling out useful entries because of unnecessary query caching.

SQL

Note on GROUP BY

Join types

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


INNER join

The join you typically use is the inner join.

It goes through multiple tables, and when a condition is matched, it emits a row, often consisting of values from both tables.

For example:

select person.name, department.name
from   person, department
where  person.deptid = department.id

You don't need to type JOIN or INNER JOIN, but you can. For example:

SELECT person.name, department.name
FROM person
JOIN department ON person.deptid = department.id

I prefer the former because it's shorter, and it's closer to my intuition of matching from side-by-side tables.



Named inner-join variants:

  • an equi-join is an inner join that uses an equality comparison (often identifiers).
  • a natural join is an inner join that implies all same-named columns must be equi-joined on
(and only one of each equi-joined column shows up, unlike in a manual equi/inner join).
It's partly just laziness.


CROSS JOIN

When you want all possible row pairs from two tables, you want a cross join (a.k.a. Cartesian join).

You don't need it very often, but it certainly has its uses.


It happens when you select from multiple tables and don't have a restricting condition. (you could consider it a variant of inner join if you want)

SELECT a.thing, b.thing
FROM a, b

You can also do that with an explicit CROSS JOIN:

SELECT a.thing, b.thing
FROM a
CROSS JOIN b


OUTER JOINs: LEFT JOIN, RIGHT JOIN

Note:

LEFT JOIN is short for LEFT OUTER JOIN
RIGHT JOIN is short for RIGHT OUTER JOIN


When the condition fails in a left/right join, we still emit a row, but with NULLs on the other side.

This is useful when tabulating incomplete data, particularly if it is inherently incomplete.


An example will help here. Say you are adding nice names for ID3v2 field names (e.g. "Lead Performer" for TPE1), but haven't finished yet.

Say you have a table filedata:

FILENAME       FIELDID   FIELDVALUE
/music/a.mp3   TIT2      Cristopher Lydon
/music/a.mp3   TPE1      Amanda Palmer

And a table fielddescr:

FIELDID    DESCRIPTION
TIT2       Title

Then:

SELECT filedata.filename, fielddescr.description,   filedata.fieldvalue
FROM filedata
LEFT JOIN fielddescr
       ON filedata.fieldid=fielddescr.fieldid

Will give you:

/music/a.mp3   Title      Cristopher Lydon
/music/a.mp3   NULL       Amanda Palmer

...whereas in an inner join the last row would have been omitted. (in practice you may also select FIELDID to fall back to showing that)


This is called LEFT JOIN because the left (first mentioned) table is the primarily important one, while the column(s) coming from the right one (the one you mention after 'LEFT JOIN', fielddescr) may contain NULLs.


A RIGHT JOIN is the same thing with the other table being the important one. This can make for somewhat more readable queries.


Semi-joins, Anti-join

Correlated subquery

A query in which a subquery has to be re-evaluated for each row (...with WHERE, or group of rows with HAVING), which is usually very slow, and regularly avoidable.


Whether a query is correlated or not depends on the presence of correlated references, for example the subquery testing a value that comes from the enclosing query. In simple cases, you can rewrite this into a join.
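
For example (a sketch with invented tables), an EXISTS-style correlated subquery can often be expressed as a join, which the planner tends to handle much better:

-- correlated: the subquery conceptually runs once per person row
SELECT name
FROM   person p
WHERE  EXISTS (SELECT 1 FROM orders o WHERE o.person_id = p.id);

-- roughly equivalent join formulation:
SELECT DISTINCT p.id, p.name
FROM   person p
JOIN   orders o ON o.person_id = p.id;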


This is related to Postgresql_notes#Understanding_EXPLAIN_.2F_queries_to_rewrite_for_indices (specifically 'Sequential scan because of a lack of join').


Isolation levels

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Isolation levels are a controlled way to loosen ACID behaviour.

Basically, it describes what non-ACID behaviour could happen if you don't do explicit and fairly complete locks.(verify)


The three things most often mentioned that you want to avoid:

  • Dirty reads is reading of data ignorant of any transaction or locking
Means we don't have to wait at all
May mean a connection reads data that is not yet committed by another
easily trouble when combining pieces of data that have to make coherent sense
  • Nonrepeatable reads: getting different values for the same row when trying to read it multiple times in the same transaction
...e.g. because of another transaction's committed alterations. Reads can be non-dirty, but still non-repeatable.
Before you go on: Nonrepeatable reads within a query means something different from non-repeatable reads in a transaction. The fact that we use the words for both is confusing, yes.
  • Phantom behavior / phantom reads refers to transactions 'magically' seeing new rows
i.e., seeing that the result set changes (rather than that specifically identifiable row values change)
e.g. caused by WHERE clauses that select from another table at different times, which can find new matching rows after they are committed by something else. (The harsh solution is to lock. An often acceptable solution is storing results in a temporary table)


Approximately speaking,

dirty reads are reading non-committed data from another transaction
nonrepeatable reads are often because another transaction COMMITted an UPDATE
phantoms are often because another transaction COMMITted an INSERT or DELETE)

Approximately, because with more scrutiny there are a few more specific cases/behaviours.



You don't have to think about all of that because SQL-92 defines isolation levels:

  • READ UNCOMMITTED
    • Basically 'anything you find will do', does not use/observe read locks
    • allows dirty read
    • allows nonrepeatable reads
    • allows phantom behaviour
  • READ COMMITTED
    • Won't read uncommitted data, doesn't ensure that read rows won't change within the length of this transaction
    • avoids dirty reads (only)
    • allows nonrepeatable reads
    • allows phantom behaviour
"Won't read uncomitted data" and "allows nonrepeatable reads" seem to conflict. The point lies in the difference between a query and a transaction. Nonrepeatable reads in a query means it will not see different values if it needs to read a row twice. But two identical queries in the same transaction can still see data if another transaction has completely committed data to the relevant table. If you want all state to be repeatable within your transaction, you want either locks, or the following:
  • REPEATABLE READ
    • Won't read uncommitted data, and does ensure read rows will not change within the transaction. Put another way, you're essentially reading from the same snapshot each time.
    • avoids dirty reads
    • avoids nonrepeatable reads (the main point, since it's the name of the thing)
    • still allows phantom behaviour (...wider locking may avoid it, without being quite the same as SERIALIZABLE. Depends.)
    • usually implemented by placing locks on all data relevant to the query
  • SERIALIZABLE
    • strictest -- no transaction interaction possible at all
    • Tends to be noticeably slower, because it is often implemented with zealous locking, or sometimes even by never handling two transactions on the same object(s) at once(verify).
    • avoids dirty reads
    • avoids nonrepeatable reads
    • avoids phantom behaviour

Some database engines may offer their own additional levels (e.g. SNAPSHOT[2]). Note that oracle and postgres say serializable when they mean snapshot(verify).



Note that most of these aren't ACID - that's sort of the point. ACID is great theory, but basically says "do things in series, or it becomes too hard to handle".

It turns out that a lot of parallelism only easily makes things faster under some conditions, so strictness usually gains you relatively little.

And you can get it faster if you loosen it more - which may even still be correct behaviour, but you have to know that it is.



Most database engines can be asked to behave to different levels (sometimes you want stricter to be highly correct, sometimes looser if you know the speed increase does not come with incorrectness for this specific query).


For example:

MySQL:

  • MyISAM does not support transactions at all, which is best described as READ UNCOMMITTED since multiple connections can interleave queries in any order
  • InnoDB supports all four(verify), and defaults to REPEATABLE READ


Postgresql:

  • its MVCC nature means it internally has the equivalents of READ COMMITTED and SERIALIZABLE.
  • Default is READ COMMITTED(verify)
  • Asking for the other two gives you the next strictest:
READ UNCOMMITTED → READ COMMITTED
REPEATABLE READ → SERIALIZABLE
  • Note that given its MVCC design, reads tend to spend less time waiting for locks, so looser settings often aren't useful
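
Asking for a level is fairly standard SQL, though the scope (next transaction, current session, or inside the transaction) differs a little per engine; the table name here is invented:

-- MySQL-style: applies to the next transaction in this session
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT count(*) FROM accounts;
COMMIT;

-- PostgreSQL also accepts it inside the transaction, before the first query:
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM accounts;
COMMIT;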




See also:

/admin/c0007870.htm


DB2 has its own terms and goes into a few more distinctions and may be interesting reading. See also:




Considerations to allowing dirty reads
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Setting the transaction isolation level to READ UNCOMMITTED (with SET TRANSACTION ISOLATION LEVEL, or e.g. MSSQL's per-query WITH(NOLOCK)) allows dirty reads to happen, which means the value is read out without locking considerations.

This can be useful when any value currently in there will do. This may be true in the case of social-site profile information, continually updated statistics tables, and such. Any value from now to a few seconds ago will be perfectly good - using locks and listening to transactional accuracy will only really mean more waiting.

Of course, dirty reads may at any time return uncommitted, old, or duplicate data (because you are asking the database to violate ACIDity for the read), meaning you should only do this when you know you'll do no harm, such as when you only read the value to show it and no write/update will even be based on it. (Much the same goes for non-repeatable reads, really)


Note that MVCC-style engines, such as postgresql and MySQL's InnoDB, avoid a lot of classical style locking by isolating data versions, which means there is little speed increase in allowing requests to give you dirty versions.


See also

Constraints and indices

PRIMARY, INDEX, UNIQUE, UNIQUE INDEX, CLUSTER

  • UNIQUE constraints
    • are a logical concept on a column tuple that tells inserts/updates to not violate this
    • are usually implemented by also having an index, but not always
    • may enable the scheduler to give you indexed lookups, but this is not at all guaranteed behaviour
  • An INDEX stores where to find data, usually on-disk, and avoids walking through all data. Indices are most useful when they allow operations to work on (greatly) reduced sets of data.
    • Usually this means tests for equality, such as WHEREs in lookups or various JOINs.
    • They may also help GROUP BY and/or ORDER BY, under certain conditions (verify)
    • Avoid them when there is likely little gain but possible loss through overhead, e.g. when:
      • ...the respective WHERE(/JOIN) will not be able to reduce the result set much
      • ...indices need to be adjusted very often (e.g. high volume of changes not done in batches)
      • (arguably) there are very few column values, such as in enumerations or often NULL (this will mostly duplicate the data and provide little gain)
      • ...field contents are/can be very long content, e.g. entire text documents (however, indices may be configurable to use only first n characters, which depending on the data may be a good place to tune)
    • Indices on multiple columns are possible
      • ...but only useful if you use these column tuples in a where or join; it does not give you indices on the separate columns. You usually should have a specific reason to do this.
      • Because of likely implementations, it often pays to order them from most to least restrictive.
  • UNIQUE INDEX means specifically getting both at once, presumably in the same index structure. This may also be useful information for the query scheduler. This often acts as the same thing as UNIQUE(verify).
  • A PRIMARY key constraint is largely a UNIQUE constraint and related index (usually), and also enforces that the value is not NULL.


  • CLUSTER sorts data on-disk based on an index.
    • Done immediately, once. Only done when you explicitly ask it to, so for continuously updated tables this is worthless.
    • Seems to be an in-place alternative to selecting into another table with an ORDER BY (verify)
    • Has the most use (arguably only has use) when you mostly/always select a small set of the data, and the index predicts this well, because it enables these selects to be in a smaller part of the disk. Note that subsequent inserts will not be clustered and lessen the improvement over time. (see the sketch below)
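
In SQL terms the above looks something like the following (names invented; CLUSTER as shown is the PostgreSQL command):

-- uniqueness constraint and an index in one go:
CREATE UNIQUE INDEX people_email_idx ON people (email);

-- multi-column index, most restrictive column first:
CREATE INDEX events_when_idx ON events (happened, kind);

-- PostgreSQL: physically reorder the table according to that index, once:
CLUSTER events USING events_when_idx;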

The I-don't-know-what-I-just-inserted problem

When you INSERT something and rely on the database to create a new unique identifier (SERIAL column in postgres, AUTO_INCREMENT in MySQL, IDENTITY in MSSQL, etc.), standard SQL has no way to let you directly/unambiguously return that new value.

When this automatically generated identifier is the row's only identifier, this means it's effectively impossible to fetch/reference just the row you just inserted.


Most solutions require separate communication of the row identifier, either by using a backend-specific function to select this single value, or by doing the generation of identifiers yourself and using it explicitly in the insert (where generating it yourself may e.g. mean using the sequence these are generated from, or even something like UUIDs).


Get-the-value solutions imply that you have to check that the insert succeeded in the first place. It may also mean that doing multiple-row inserts can be impractical, since it tends to imply getting only the first (or last) ID that was set.

Some solutions may need calls to happen sequentially and in the same connection or transaction. Beware of creating race conditions.


Fetch the last-set/generated value

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Most engines have an automatically-generated-ID thing, and let you fetch the last generated value. This can be easiest, though it means your SQL is not portable between database backends.


In PostgreSQL:

You can use:

SELECT currval('tablename_columnname_seq')
currval
returns the value most recently obtained by nextval for this sequence in the current session. In other words, it's transaction-safe.

(Side note: currval is considered a volatile function, so you often don't want to use it in a condition or subquery directly).


Since postgres 8.2, you can also do:

INSERT INTO foo (somecolumn) VALUES ('somevalue') RETURNING id

...which is like a combined insert-and-select - what follows RETURNING is a basic select expression, but the most common use is probably returning the column containing the sequence-assigned IDs.


See also:


In MSSQL: You can fetch:

  • SELECT IDENT_CURRENT('tablename') reports the last inserted value for a table, regardless of connection/session
  • SELECT @@IDENTITY reports the last value set in the context of the current connection, regardless of table
  • SELECT SCOPE_IDENTITY() - like @@IDENTITY, but restricted to the scope, probably that of the insert you just did (verify). This avoids problems when you e.g. have triggers that may cause other identity field values to be generated



In MySQL:

Use LAST_INSERT_ID() (from SQL, or mysql_insert_id() from code).

This returns the last ID that was set into an auto_increment column (even if you handed it along explicitly), within the context of the same/current connection.

This is pretty robust, but the feature does not work with MyISAM when it uses 'Use Concurrent Inserts'.

Manual new ID

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

serial/auto_increment behaviour generally guarantees that the next value is at least max(ids_in_table)+1, and usually does so by being a simple counter belonging to a column/table, which gets incremented and never wraps.

This makes it very simple and fast, although the non-wrapping is an inherent and potentially big problem, and perhaps more importantly, if you want your application to be able to use more than one backend, using such non-standard features means you need different code to handle different backends, something you may want to avoid.


A slower solution would be to calculate that max ID yourself, such as with a 'select max(id)', or an '...order by id desc limit 1' sort of thing. However, both of these are slow for big tables, so you may want to store a counter separately.

Postgres helps by allowing you to create sequences. These are used internally for serial fields, but can also be used to imitate a serial field, since selecting nextval from a sequence also increments its value.
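
A sketch of doing that in PostgreSQL (table name invented):

CREATE SEQUENCE thing_id_seq;

-- generate the ID yourself, then use it explicitly in the insert:
SELECT nextval('thing_id_seq');    -- say it returns 42
INSERT INTO things (id, name) VALUES (42, 'example');
-- ...so you already know the ID of the row you just inserted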

Without it, however, you have to make sure no two clients can get the same ID to insert, by putting the select and update of the counter in a transaction. Since MySQL's MyISAM does not support transactions, this approach is not safe there.



Semi-sorted

Upsert

"Insert this row. If it already exists according to a uniqueness constraint, update that row instead" (or, in other cases, do nothing instead)


Consider a table

word   varchar(50)   primary key
count  integer


There are extensions in various database engines to allow something like:

MySQL:

INSERT INTO         wordcount (word,count)  VALUES ('a_word',1)  
ON DUPLICATE KEY    UPDATE    count=count+1;

Postgres introduced ON CONFLICT in version 9.5 [3]
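
For the same wordcount example, that looks something like (postgres 9.5 or later):

INSERT INTO wordcount (word, count) VALUES ('a_word', 1)
ON CONFLICT (word)    DO UPDATE SET count = wordcount.count + 1;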


Some also allow it in other ways. On postgres before 9.5 you could e.g. get the same effect with a rule, something like:

CREATE OR REPLACE RULE strings_counter AS
ON INSERT TO strings
WHERE (EXISTS ( SELECT 1 FROM strings tester WHERE new.string = tester.string)) 
DO INSTEAD  UPDATE strings SET count = strings.count + 1
            WHERE strings.string = new.string;


If you want a version that works on all backends, it's going to require two statements, because upserts are necessarily a small transaction, even if it's hiding in the syntax (implementing it cleanly and efficiently seems to be the whole reason postgres postponed it for a while).

In fact, some DIY solutions like the following will have possible races, others will have possible deadlocks.

When you want to usually do-nothing-instead, it's probably more efficient to do a SELECT, then e.g. INSERT (or UPDATE) based on whether you had 0 or 1 rows.

If you usually want to update, then you can instead:

UPDATE wordcounter set count=count+1 where word=myword

and if that affects zero rows:

INSERT INTO wordcounter values (myword, 1)


counting is slow

Fetching random items and samples

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

If you want a guaranteed, statistically robust random sample without thinking, you can often use ORDER BY RANDOM (exact syntax varies). This is often slow, because picking from most of the table means reading much of an index/table.

There are faster tricks:

  • you can get faster results with fairly decent randomness when you know certain things about your table beforehand
  • or when what you want is less strict, e.g. "a chunk of examples that is picked somewhat arbitrarily"


order by random

  • properly random
  • slow: probably O(n lg n) with the table size at best, and apparently exponential with proper randomness - you probably do not want to do this unless the table will always stay smallish.

Example:

SELECT * from strings ORDER BY random() LIMIT 10

(note: random() may be called rand() or something else)

random offset

  • best for a single item, or when a contiguous range at a random position is good enough
  • requires knowing amount of items
  • good for random selection of single items, often not so much for larger samples
  • linear with table size (probably(verify))


Idea: When you (approximately or exactly) know the number of items, you can pick a spot between 1 and that amount and use it as an offset. The SQL syntax varies. Details are typically found in LIMIT documentation.

If you fetch many items, they'll be related by some sort of sorting, which may not be random enough, or be perfectly fine for your purposes.


Knowing the amount of items exactly is preferable, in that if your estimate is a little high, you would sometimes select nothing. If you know how much off the estimate can be, you can be conservative with your offset.


Postgres optimization: fetching the amount of items recorded at last vacuum time (SELECT reltuples FROM pg_class WHERE relname='tablename') is faster than a basic-SQL select count(*) from tablename. The difference can matter for (many) small samples from large tables.

In MySQL+InnoDB, you can use SHOW TABLE STATUS similarly, but its value is a much rougher estimate that in some situations is worthless; see the docs.
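
A sketch of the two-step version (the count can be the cheaper estimate mentioned above, and the offset is a random number your application computes from it):

-- step 1: get a row count, exact or estimated
SELECT count(*) FROM strings;    -- say it returns 100000

-- step 2: skip to a random position in 0..99999 and take one row, or a chunk
SELECT * FROM strings LIMIT 1 OFFSET 73194;
SELECT * FROM strings LIMIT 25 OFFSET 73194;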

select a fraction of rows

  • randomly allows items to become part of the result
  • useful for portions (choosing threshold by (approximate) amount of rows)
  • decent randomness - within the set it considers (consider: table order, limit)
  • should be linear time with min(table size, amount you want) (verify)
which, if you want to select from all the table, is roughly a tablescan

However, note:

its relation to the data is a bit funny
it relies on knowing roughly how much data there is / that there is enough

Consider:

-- selects 50% from the first 200ish it sees
select * from tab where rand()<0.5 limit 100  

-- selects 10% of the first 1000ish rows
select * from tab where rand()<0.1 limit 100


select next ID

  • best for single item
  • typically an index-based lookup, then avoiding both sorting and skipping past records. If so, it's the fastest option for larger tables
  • randomness can be good, depending on the table's history (see notes below)
  • can be the fastest trick, but the one with the most footnotes / caveats.


Main assumptions:

  • your rows have IDs on them and they are indexed (often true)
  • you know little or nothing was deleted

...because

that means you have a known range, where thinking up an integer is pretty-properly random,
and the index means 'look up this or the next highest id' (to be robust to a few holes due to deletes) will use an index and so be fast (noticeably faster than linear time anyway)


For example, to get a random string from a (id,string) table, first get an ID in the given range

SELECT FLOOR( RANDOM()*(SELECT max(id) FROM strings))

And then use it to select the row with the next highest ID (and these two are typically in single statement)

SELECT * FROM strings WHERE id >= that_randid LIMIT 1


There are a lot of variants and a lot of possible tweaks to this trick. For example, you can often know (or estimate) max(id) in cheaper ways (e.g. remember it on inserts, select it from the relevant sequence in postgres)


Tables with a lot of deletes are problematic, because the random integer in 1..max(id) now isn't properly distributed.

The more sparse or unbalanced the IDs in your whole range, the worse the randomness is and the more you may e.g. see specific items.

notes

  • SELECT count(*) is often linear time with table size, so remembering an approximate count may be preferable over using count(). For a lot of software there is a non-standard, cheaper, and often more approximate way of estimating the amount of rows (some notes on this for postgres and mysql are mentioned below)
  • Watch bordercases that can mean fewer results than you think (like 0), particularly on range fetches - the real code you'll need often needs to cheat a little more than that shown above.
  • LIMIT 1 (or any smallish number) is often fast because it tells the database that after it has found one thing, it can immediately stop looking in the index or table.
  • Similarly, a limit with an offset can also work fairly well, to fetch a chunk of items at a random position. It's often slower, though, as it needs to skip past data.
  • You generally want to avoid sorting when possible, as that's O(n*log(n)) at best
  • the speed of selecting max(id) or count(*) may differ depending on indices, optimizations, and possibly on engine

Multiple cursors

.NET notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

.NET has a database interface made up of a good bunch of classes. It is nicely generic, if quite verbose.


Connections to data sources

a Connection object represents a connection to a particular data source, and is subclassed for types of connection, such as SqlConnection, OleDbConnection, OdbcConnection, OracleConnection.


You set up an object using a connection string, then call Open() when you need a connection.


It's common to use a using block so that disconnection / connection-pool release (if applicable) is handled automatically when execution exits the code block:

using (SqlConnection conn = new SqlConnection(connectionString)) {
    conn.Open();
    // all work that needs the connection
}

Other ways are calling Close() explicitly, or handing CommandBehavior.CloseConnection to commands, but you should only do so when you are sure you need it, as it would lessen connection pool effectiveness.


Connection pooling has to be supported by the database adapter, and may be disabled via the connection string.


You can use an event handler to listen to connection state changes if you wish.

Commands and their execution

Commands represent operations in a database. They may be instantiated directly, or be created using Connection.CreateCommand() or a Transaction object.

You can set the exact command yourself, e.g.:

SqlCommand countStrings = new SqlCommand(
      "SELECT count(*) FROM strings", connection);

You can also use one of InsertCommand, UpdateCommand, and DeleteCommand (or more exactly, its derivatives)

SqlParameter

To avoid SQL injection, you can use the fairly standard place-a-marker-in-the-actual-query, tell-it-what-to-replace-it-with approach.

There are many ways to do this, including the following two:

SqlCommand cmd = new SqlCommand("SELECT count(*) from tablename where email=@email AND password=@2", connection);
cmd.Parameters.AddWithValue("@email", email);
cmd.Parameters.Add("@2", SqlDbType.Text).Value = password;
// A much longer form would be:
SqlParameter emailParam = new SqlParameter();
emailParam.ParameterName = "@email";
emailParam.Value         = email;
cmd.Parameters.Add(emailParam);

// You may sometimes want to use Convert.ToSomething() to catch bad data on the code side rather than letting the database handle it.


Parameters are also used to specify mappings to DataSet objects.


Readers

ExecuteReader() returns an object that allows you to read many rows. For example:

SqlDataReader reader = command.ExecuteReader();  // optionally with a CommandBehavior

while (reader.Read()) {  // advances to the next row; its columns are then readable, e.g.:
    Console.WriteLine(String.Format("{0}", reader[0]));
}
Scalars

ExecuteScalar() returns the first column of the first row, and ignores the rest.

Use when you expect exactly one such result, such as when you use aggregate functions to select just the min, max, or count of something.

For example:

numberOfStrings = (Int32)command.ExecuteScalar();


Non-queries

ExecuteNonQuery is used for commands that do not query for data, such as SQL INSERT, DELETE, UPDATE, and SET.

Returns the number of rows affected.

Begin...: Asynchronous variants
Stored procedures

DataSets

A DataSet is an in-memory representation of database data, allowing disconnected/offline access to data fetched earlier.

DataSets are usually filled by creating a DataAdapter based on a query and having it fill the object with data.

Alternatively, you can add data (rows) manually, as well as create the tables and their columns from scratch.


DataSets can also be used to update, insert, and delete data, although this has a few more problems than command-based code.

DataSet may contain one or more DataTables. In some cases, just the latter will do, and you can fill it just the same (using a different overload of a data adapter).


Untyped DataSets are generic DataSet objects for which the column types only become clear as they are being filled, so type checking can only be done at runtime.

Typed DataSets are those for which you explicitly declare the typing before compilation, so strong typing can apply at compile time. This does imply you need a typed dataset for each different (major) query.

Typed datasets are declared based on schemas stored in xsd format (XML schema definitions).


BLOBs

There are also a few minor behaviour things that can bug people, such as:

  • the encoding/collation features that significantly vary in behaviour in a few specific combinations of collations and text column types (one of utf8_general_ci/TEXT, utf8_general_ci/VARCHAR, utf8_bin/TEXT, utf8_bin/VARCHAR acts differently from the rest. Can you guess which?), simply because they kill the concept of transparent storage and of all text being created equal, which seems sour as MySQL's relevant code is probably originally aimed at both providing and being smart about it, and occasionally useful enough for it.