RDBMS use and speed tweaking notes


Speed and tweaking

Much of this is RDBMS-general.

Some of it came from suggestions specifically for one database engine or another.


General

Memory

Anything that can come from memory instead of disk tends to be much faster.

This was always clearly true for platter disk; on SSDs the difference is much less pronounced, but it still exists.


Whether 'memory' means the database's own cache (more dedicated) or the OS's page cache (more dependent on whether other things use memory) matters less, but it does matter somewhat, as the latter involves syscall overhead and some interfaces that may be a little more serial than you'd like.

...so dedicating some RAM to caches within the database tends to make for both higher and more predictable performance. But it takes RAM from other things.

The OS's disk cache can be almost as good, but is only consistently good if the database is effectively the only big user of it (e.g. on a host dedicated to the database).


Relatedly: if your data can be more compact, that means more of your table data will fit into memory.





Disk and other hardware

tl;dr:

  • If you store on platter disks, consider it archival, and hit it as little as you can. Some use is perfectly fine, but once multiple workers hit it enough to queue, latency will rocket up due to seek times adding up
if you do tests for this, model the fact that there will be a large amount of concurrent operations from multiple clients
  • SSDs are much friendlier in this regard. Particularly for read-heavy loads (heavy writes have more footnotes)
if they are too pricy to store everything on, you may still consider using them for your most active databases / tables
...but the price difference is still getting smaller, so at some point this stops being an argument
  • a mass of (small, transactioned) write operations can strangle IO due to having to happen separately.
Flushing less often helps here - at the expense of how many seconds of recent changes are lost after crash


Databases: IO bound

The slowest part of a system involving a database is typically that database, because the database server itself is often not CPU-bound but IO-bound.


Some less-technical businesses have a habit of thinking that price tag defines hardware speed, and throw money at the problem (bigger servers).

This works somewhat, but e.g. platter disk is slow enough without having to share it, and running two database engines on the same disk (e.g. by server virtualization done the wrong way) can be fatal to performance.

In not-overly-pathological cases, two household computers (or just separate disks) would hold up far better than one powerful server.

(this was especially pressing on platter, a little less so on SSD)


Note there are app-side tricks, such as memcaches, that can relieve this to a surprising degree (depending on case, and design).

The value of SSDs

The issue with platter disks is not (or rarely) their throughput, but more typically their latency - the overhead of each operation (particularly "ensure this is on disk") is limited by seek speed, in turn limited by physical rotation and head movement.


SSDs don't have those physics in the way, and can guarantee lower latency, both on average and fairly consistently.

This means high IOPS rate, allowing a higher transaction rate without touching data safety.


Also keep in mind that a database engine tends to write

  • transaction journaling
  • table data
  • logs
  • sometimes implied swap

...and it can make sense to split those onto distinct devices (more so on platter than on SSD).



The value of caches

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Memory in general

Any data that is in RAM doesn't have to come from disk. This matters much less in these SSD days, but still helps.

If you can dedicate RAM to it, consider doing so.
Even if you don't, your OS's page cache is helping.

See also #Memory above.



Consider memcaches

Database connection setup is often not the most trivial thing, which is one reason a separate in-RAM memcache service (with no auth and such) can serve more requests per second than a regular RDBMS, even when that RDBMS is serving all data from RAM.

Under load, it can also lessen waiting for connection slots. Consider also that a memcache can easily be sharded, while the database often cannot.

This obviously comes at the cost of ensuring that cache stays correct.



Caching query plans

Refers to the backend reusing the way it turned a particular SQL string into a query plan.

Products vary on whether they do this for ad-hoc queries (even if the SQL string is identical), or only on request.


So if you want this, you should probably tell it to do this. This is known as a prepared statement.


Notes:

  • some products do this only on request, others do this for all queries (parametrized ones, may come down to library details).
  • Prepared statements are often only reusable within a single session.
  • (so) if you use the query only once in a session, you reuse nothing so you save nothing.
  • the client gains are usually tiny, in that the cost of connecting is larger
  • in high volume serving, with some very common queries, you may find the system performs/saturates better if those common things are prepared queries
  • ...beyond that, prepared statements are easy to over-hype
and benchmarks can easily lie by
forgetting connection setup latency
forgetting to run on actual data, meaning it's measuring the overhead only
  • Prepared statements aren't standard SQL(verify), so syntax varies a little per backend.
  • 'Query cache' can refer to this, or the next point
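As a concrete illustration, here is roughly what this looks like at the SQL level in PostgreSQL (a sketch; the table and values are made up, and most client libraries wrap this for you when you use parametrized queries):

PREPARE find_user (text) AS
    SELECT id, name FROM users WHERE email = $1;

EXECUTE find_user('foo@example.com');
EXECUTE find_user('bar@example.com');   -- reuses the same plan within this session

DEALLOCATE find_user;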



Caching query results


The idea of a query cache is that if the backend

  • gets exactly the same query as before
  • knows the data it worked from hasn't changed

...then it could return the same data without having to execute anything.


Notes:

  • 'Query cache' can refer to this, or the previous point
  • Works only for selects.
  • Only worth it for more complex queries
  • Works only if the data didn't change
  • Yes, libraries or your own code can do this. But if possible, you want the backend to do this
It's the only part that can ensure consistency. All other parts have to worry about manual cache invalidation and such
and note that for simple selects, it may make no speed difference



Indexing

An art as well as a science, because while the most common queries may be obvious for your app, it's hard to design perfectly for real-world use and data distribution.

Notes:

  • there's often an 80/20 thing going on, so figure out e.g.
which one or two queries will be using most of the time
which queries happen less regularly but hit a lot of data (and whether that's a maintenance thing that is fine to take longer)
and then focus your indexing effort on those


  • EXPLAIN is your friend.
To check how and when indices are used
a slow-query log is a good source for queries not hitting indices


  • Anticipate data growth. Long-term scaling is more important than short-term improvement.
  • know that indices are not free
indexes have to be updated in related write operations, so are only worth it if there are a good bunch of select operations that they actually improve
some indexes cost more than they are worth.
  • You can consider indexes on columns when:
    • you know query conditions will often use that column,
    • for which the values will be fairly unique, and/or
    • the result set is always relatively small
  • Realize what contributes to index size
(e.g. sometimes normalization helps a lot)




A primary index is an index on a field or set of fields that includes the primary key - which implies it is unique and without duplicates.

A secondary index is one that may have duplicates.


For example, an index on user ID would be a primary index, and an index on user name would be secondary.
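A minimal sketch of both in SQL (hypothetical table and column names; syntax is close to standard, details vary per backend):

-- primary key: unique, not NULL, typically backed by an index
CREATE TABLE users (
    id    bigint PRIMARY KEY,
    name  text NOT NULL
);

-- secondary index: may contain duplicate values
CREATE INDEX users_name_idx ON users (name);

-- multi-column index: helps queries that filter on (name, id) together,
-- not automatically on each column separately
CREATE INDEX users_name_id_idx ON users (name, id);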

Schema tweaking

Query writing and EXPLAINing

There are often alternative ways to write a query with different index-using implications. Particularly when you use things like GROUP BY, think about what indices you have. If you're not sure it'll be used, you can use EXPLAIN to see the difference in the way the query optimizer will actually execute your query. Table scanning (looking at all rows) is obviously bad for large tables.

Think in sets rather than for loops, which will also help you avoid sub-queries and use the generally more efficient joins.

Try to rewrite for indices, for example making expressions with indexed columns in them trivial comparisons rather than complex.
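For example (a PostgreSQL-flavoured sketch, with a made-up orders table): wrapping an indexed column in a function usually prevents index use, while an equivalent range condition on the bare column can use an index on created.

EXPLAIN SELECT * FROM orders
 WHERE date_trunc('day', created) = '2024-01-01';            -- likely a sequential scan

EXPLAIN SELECT * FROM orders
 WHERE created >= '2024-01-01' AND created < '2024-01-02';   -- can use an index on created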


On foreign keys

fragmentation

Connection pooling

You will often write code like "connect to database, do stuff in database, close database connection".

Every "connect to database" is a small holdup, because that'll take more than a few milliseconds (authentication, possibly a secure connection, select database, set connection parameters, and other overhead).


Connection pooling refers to keeping a number of connections open, so that you can skip some of that overhead and go straight to the "do stuff in database".


It is mostly a concept, and can be done multiple ways.

It can be a separate process, it can be a pool kept within a client process. The basics aren't hard to implement yourself within a specific app, yet consider that...

A more generic solution needs to worry about which host/port/user/database parameter combinations should become distinct connections

...arguing for something external to your own code.

It's nice if it reconnects transparently (and handles other edge cases) and overrides certain functions (particularly a connection close)

...arguing for a library.

It's nice to have minimal code changes (rather than having to change every database connect into a pool request)

...arguing for a proxy process that takes your TCP connections (and does the database connection for you).

Specific to...

PostgreSQL

See Postgres#Notes_on_performance_and_innards


MySQL

Memory configuration
Basic
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

The most interesting tweak is to give more than the default amount of memory to either...

  • InnoDB: use innodb_buffer_pool_size (innodb_buffer_pool in older versions(verify))
  • MyISAM: use key_buffer_size (key_buffer in older versions(verify))

To avoid having to divide your memory between engines, settle on one and use it for everything (most people seem to settle on InnoDB eventually).


You can often give it a good chunk of your free memory. If the server is dedicated, use up to perhaps 80% of memory (leaving the rest for other things such as MySQL per-connection buffers, and OS cache) - though note that the OS cache will often be an effective supporting cache, particularly for data sets that fit in memory.

On non-dedicated servers it may be useful to leave at least a few hundred megs for apache and other things, because if you push the allocation so far up that the OS has to swap out parts of MySQL to serve everything, the effect can be just as bad as (or worse than) not having all data in memory.


Note that MyISAM only ever caches index data(verify). InnoDB caches both index data and row data.

(either can be a good thing, depending on your goals, data, schema design, size of the data, and amount of RAM).
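For example, checking and (on MySQL 5.7 and later) adjusting the InnoDB buffer pool at runtime might look like the following; on older versions this is a my.cnf setting plus a restart. The 8 GiB figure is just an illustration:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';      -- value is in bytes

SET GLOBAL innodb_buffer_pool_size = 8589934592;    -- 8 GiB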


Query cache
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

You could investigate how much you have to gain by using the MySQL query cache, and then check whether it being bigger would help.

Consider also that if it does help, you can often do it more effectively yourself with memcached or such.


Others
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Note that some allocation settings apply per connection instead of globally - for example the sort buffer size.


open tables, open files
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

If you use MyISAM: each MyISAM table is actually three different files, so you need the per-process file handle limit to be 3 times the number of tables you want to be able to have open(verify). Note that with MyISAM, open_files_limit must be at least 3*table_cache.


Note that some OSes put a limit on per-process open files. If it is limited, you should not set MySQL above that number, or it will respond somewhat unreliably. Note also that various OSes allow you to set their open file limit higher.



table_cache has nothing to do with caching table data; it seems to be the amount of table references you can have open at a time, be it the permanent on-disk ones (such references can get re-used through this cache) or temporary ones (which are generally destroyed fairly quickly).(verify)

Note that each join is potentially a temporary table, so for safety, you may want to set table_cache to max_connections times the average/max amount of joins in common queries.

At the same time, table_cache should not imply a higher amount of open file descriptors than open_files_limit supports.

The default for table_cache is fine for most database designs and with the default max_connections value; it usually only needs changing in unusual cases.




Binlogging

(mysql's name, most databases do this one way or the other)

...is data journaling and is often on by default. It's a boon if you want to recover as much data as possible after a crash (it won't recover things that hadn't been done yet when the server crashed, but that sounds fair enough, right?), and it is also used in replication (clustering).

Binlogging slows the database down somewhat, and the binlogs grow huge as it's basically a fully descriptive copy of everything you do to the database.

If you feel daily or weekly backups are enough, you can disable binlogging per database (binlog-ignore-db=dbname in my.cnf) or server-wide.

If you want binlogging, and are using platter drives, you can save seek time by making them go to a distinct drive.

To just empty them (this seems to remove only what isn't necessary for recovery to the current state(verify)), do a FLUSH MASTER(verify).


Slow Query log

The slow query log logs what queries are slow. Any query that takes longer than long_query_time specifies (in seconds) will be logged.

This can be rather useful to see what your bogged-down server is doing without having to profile your client code.

It can also be useful to see where you could perhaps use an index -- or where the query optimizer is misjudging.
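In MySQL you can switch it on at runtime, roughly like the following (these can also go in my.cnf; the log path is just an example):

SET GLOBAL slow_query_log      = 1;
SET GLOBAL long_query_time     = 1;                          -- seconds
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';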


Query Cache

MySQL can be told to keep a cache that maps from query string to the results it gives, avoiding disk access completely. This is a simple memcache and sounds cool, but in practice it is only interesting on tables that:

  • rarely change (writes invalidate cache entries)
  • return fairly small result sets (small enough to easily fit in the space allocated for this feature, alongside another bunch of them)
  • using queries that are identical
  • are queried at least somewhat regularly
  • using queries that are deterministic (see note below)

About determinism: an expression containing something like logdate > (current_date() - interval 31 day) is non-deterministic, since current_date returns something different each time you call it. Since in this case you're probably looking for approximately a month's worth of things to give statistics over, a few seconds won't matter. You could calculate the timestamp for midnight a month ago, then use that value in statistics queries. This value is a constant and will only change once a day, meaning the query cache can be used.


Most other database software does not have this feature, mostly because there are few cases in which it is a particularly useful feature. In practice, it improves fewer cases than even the above constraints suggest. You often want to add your own caching logic, meaning that an external memcache (e.g. memcached) is more practical.

Note that whenever one of these reasons implies the query cache is ineffective (particularly from common invalidation), it being enabled will slightly slow you down as there is constant management going on (entering on selects, invalidation on table updates). For this reason, you may want to enable the query cache but have it apply only for queries you know it is useful for and explicitly add the SQL_CACHE keyword to.


Consider, for example :

  • a table of a few dozen application settings. You probably fetch values often, it rarely ever changes, so you naturally want it to be in memory. The thing is that it was probably cached anyway.
  • looking up names of magazines by ISSN (e.g. select name from issns where issn=something), from a table that is probably rarely updated. There are over a million ISSNs, meaning the table would be large. The size itself is not too important since you would be crazy to not have an index on such a table, but the distribution is: it wouldn't be surprising to see lookups for a top few thousand come up as often as for the bottom million, and it would be nice to know that it's remembering many by-individual-ISSN queries, so that it won't have to do a disk read for those.
  • your own case :)
Use

If you have the query cache on and you do exactly the same query as you have done before, it will take the last cached result set from memory (if it was in there: if it fit and qualified).

  • Query SHOW VARIABLES LIKE 'query_cache_%' to see the configuration.
  • Use SHOW STATUS LIKE 'qcache_%'; to see statistics of how much the query cache is currently being used
  • See also [1] for reference.


Note that depending on how the qc is configured (see query_cache_type), it may apply to everything you don't tell it not to work on (SELECT SQL_NO_CACHE ...), or work only on demand (SELECT SQL_CACHE ...). The latter may not be the most automatic, but it is often the simplest way to avoid it cycling out useful entries due to unnecessary query caching.
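For example, with query_cache_type set to DEMAND, only queries you mark get cached (a sketch using the ISSN example above, with a made-up ISSN; note that MySQL 8.0 removed the query cache entirely):

SELECT SQL_CACHE    name FROM issns WHERE issn = '1234-5678';
SELECT SQL_NO_CACHE name FROM issns WHERE issn = '1234-5678';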

SQL

Versions

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

SQL versions:

  • SQL-86
  • SQL-89
  • SQL-92, also known as SQL2
    • Major revision
  • SQL:1999, ISO/IEC 9075:1999, also known as SQL99, and as SQL3
  • SQL:2003, ISO/IEC 9075:2003
    • XML functionality
    • MERGE
  • SQL:2006
  • SQL:2008, ISO/IEC 9075:2008
    • more XML functionality
  • SQL:2011
temporal data
https://sigmodrecord.org/publications/sigmodRecord/1203/pdfs/10.industry.zemke.pdf
  • SQL:2016
  • SQL:2019


If you want SQL that is likely to do the same thing on everything made in the last twenty+ years, you may be best off with SQL-92 or SQL:1999.

That said, such complete support is rare.

most software is written for one specific backend anyway (possibly also using some of its proprietary features),
...or maybe two or three (and is then often written more generically)

Join types

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


INNER JOIN

The join you typically use is the inner join.

It goes through multiple tables, and when a condition is matched, it emits a row that typically combines values from both tables.

For example:

select person.name, department.name
from   person, department
where  person.deptid = department.id


You don't need to type JOIN or INNER JOIN, but you can. For example:

SELECT person.name, department.name
FROM person
JOIN department ON person.deptid = department.id

I personally don't, but you can argue that it's clearer, pointing out that all operations on multiple tables are some sort of join.




Named inner-join variants:

  • an equi-join is an inner join that uses an equality comparison (often identifiers).
  • a natural join is an inner join that implies all same-named columns must be equi-joined on
(and only one of each equi-joined column shows up, unlike in a manual equi/inner join).
It's partly just laziness.
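For example, assuming deptid is the only column name the two (hypothetical) tables share, these are equivalent; the USING form is usually considered clearer than NATURAL:

SELECT pname, dname
FROM   person JOIN department USING (deptid);

SELECT pname, dname
FROM   person NATURAL JOIN department;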
CROSS JOIN

When you want all possible combinations of rows from two tables, you want a cross join (a.k.a. Cartesian join).

You don't need to do this very often, but it has its uses.


It happens when you select from multiple tables and don't have a restricting condition. (you could consider it a variant of inner join if you want)

SELECT a.thing, b.thing
FROM a, b

You can also do that with an explicit CROSS JOIN:

SELECT a.thing, b.thing
FROM a
CROSS JOIN b



OUTER JOINs: LEFT JOIN, RIGHT JOIN, and FULL JOIN

Note:

LEFT JOIN is short for LEFT OUTER JOIN
RIGHT JOIN short for RIGHT OUTER JOIN
FULL JOIN short for FULL OUTER JOIN


When the condition fails in a left/right join, we still emit a row, but with NULLs on the other side.

This is useful when tabulating incomplete data, particularly if it is inherently incomplete.


An example will help here. Say you are adding nice names for ID3v2 field names (e.g. "Lead Performer" for TPE1), but haven't finished yet.

Say you have a table filedata:

FILENAME       FIELDID   FIELDVALUE
/music/a.mp3   TIT2      Christopher Lydon
/music/a.mp3   TPE1      Amanda Palmer

And a table fielddescr:

FIELDID    DESCRIPTION
TIT2       Title

Then:

SELECT filedata.filename, fielddescr.description,   filedata.fieldvalue
FROM filedata
LEFT JOIN fielddescr
       ON filedata.fieldid=fielddescr.fieldid

Will give you:

/music/a.mp3   Title      Christopher Lydon
/music/a.mp3   NULL       Amanda Palmer

...whereas for an inner join, that last row would have been omitted.


This is called LEFT JOIN because

the left (first-mentioned) table is the primarily important one, while
the column(s) coming from the right one (the one you mention after 'LEFT JOIN', fielddescr) may contain NULLs.


A RIGHT JOIN is the same thing with the other table being the important one.

Left and right are in some ways the same thing, but there are many cases where one makes for a less awkward, more readable query.



A FULL JOIN is like the above, where neither table is the important one.

Say you have a customers and orders table.

Then

SELECT          customers.name, orders.order_id
FROM            customers
FULL OUTER JOIN orders ON customers.customer_id=orders.customer_id;

will give customers without orders, as well as orders without customers (probably on top of mostly rows that have both)

Semi-joins, Anti-join
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Correlated subquery

A correlated subquery is one where a subquery uses a value from the outer query.


This has implications to its execution - it easily leads to more work.


For example, in

SELECT employee_name
  FROM employees emp
  WHERE salary > ( SELECT AVG(salary) FROM employees emp2 WHERE emp2.department = emp.department);

the inner query might be executed for every employee, but if the query optimizer is smart, it could cache the per-department averages it calculates, so the inner query will only need to be executed once for every department that you select employees from, rather than once per employee row.


In various cases, there may be uncorrelated and correlated subquery forms of essentially the same query.

This matters because often, more optimizations apply to the correlated subquery form than to the uncorrelated one.

Consider the following ('fetching tags for files we no longer have')

SELECT distinct guid   FROM tags_by_guid
WHERE guid not in  (SELECT guid FROM files)

This is an uncorrelated subquery. Even if you consider it obvious that this should be a join, because the thing you're matching on is the same sort of thing, the query optimizer cannot know it can make that assumption.

As such, in the above the subquery will probably be executed each time (as EXPLAIN will mention).


You would want to write it as a correlated subquery instead. There are a few options, one of them is:

SELECT distinct guid   FROM tags_by_guid tbg
WHERE not exists (select 1 from files where files.guid=tbg.guid)

The inner query may still be a sequential scan, but it is likely to be executed only once, with the next step being a single join.

Note that how much difference there is depends a bunch on the specific case, and the constraints and indices involved.

So you probably want to use EXPLAIN to see if that's true and whether it's using indices.
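Another common rewrite of the same 'tags for files we no longer have' query is an explicit anti-join via LEFT JOIN ... IS NULL, which many planners handle well (a sketch, using the same tables as above):

SELECT distinct tbg.guid
FROM tags_by_guid tbg
LEFT JOIN files f ON f.guid = tbg.guid
WHERE f.guid IS NULL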


Constraints and indices

PRIMARY, INDEX, UNIQUE, UNIQUE INDEX, CLUSTER

  • UNIQUE constraints
    • are a logical concept on a column tuple that tells inserts/updates to not violate this
    • are usually implemented by also having an index, but not always
    • may enable the scheduler to give you indexed lookups, but this is not at all guaranteed behaviour
  • An INDEX stores where to find data, usually on-disk, and avoids walking through all data. Indices are most useful when they allow operations to work on (greatly) reduced sets of data.
    • Usually this means tests for equality, such as WHEREs in lookups or various JOINs.
    • They may also help GROUP BY and/or ORDER BY, under certain conditions (verify)
    • Avoid when there is likely little gain, but possibly a loss from overhead, e.g. when:
      • ...the respective WHERE(/JOIN) will not be able to reduce the result set much
      • ...indices need to be adjusted very often (e.g. high volume of changes not done in batches)
      • (arguably) there are very few column values, such as in enumerations or often NULL (this will mostly duplicate the data and provide little gain)
      • ...field contents are/can be very long content, e.g. entire text documents (however, indices may be configurable to use only first n characters, which depending on the data may be a good place to tune)
    • Indices on multiple columns are possible
      • ...but only useful if you use these column tuples in a where or join; it does not give you indices on the separate columns. You usually should have a specific reason to do this.
      • Because of likely implementations, it often pays to order them from most to least restrictive.
  • UNIQUE INDEX means specifically getting both at once, presumably in the same index structure. This may also be useful information for the query scheduler. This often acts as the same thing as UNIQUE(verify).
  • A PRIMARY key constraint is largely a UNIQUE constraint and related index (usually), and also enforces that the value is not NULL.


  • CLUSTER sorts data on-disk based on an index.
    • Done immediately, once. Only done when you explicitly ask it to, so for continuously updated tables this is worthless.
    • Seems to be an in-place alternative to selecting into another table with an ORDER BY (verify)
    • Has the most use (arguably only has use) when you mostly/always select a small set of the data, and the index predicts this well, because it enables these selects to be in a smaller part of the disk. Note that subsequent inserts will not be clustered and lessen the improvement over time.
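A short sketch of the syntax (PostgreSQL-flavoured, with made-up table and column names; in practice you would pick either the constraint or the unique index, not both):

-- UNIQUE constraint (typically backed by an index):
ALTER TABLE users ADD CONSTRAINT users_email_uniq UNIQUE (email);

-- UNIQUE INDEX, explicitly both at once:
CREATE UNIQUE INDEX users_email_idx ON users (email);

-- CLUSTER (postgres): rewrite the table on disk in the order of an index, once:
CREATE INDEX orders_customer_idx ON orders (customer_id);
CLUSTER orders USING orders_customer_idx;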

The I-don't-know-what-I-just-inserted problem

When you INSERT something and rely on the database to create a new unique identifier (SERIAL column in postgres, AUTO_INCREMENT in MySQL, IDENTITY in MSSQL, etc.), standard SQL has no way to let you directly/unambiguously return that new value.

When this automatically generated identifier is the row's only identifier, this means it's effectively impossible to fetch/reference just the row you just inserted.


Most solutions require separate communication of the row identifier, either by using a backend-specific function to select this single value, or by doing the generation of identifiers yourself and using it explicitly in the insert (where generating it yourself may e.g. mean using the sequence these are generated from, or even something like UUIDs).


Get-the-value solutions imply that you have to check that the insert succeeded in the first place. It may also mean that doing multiple-row inserts can be impractical, since it tends to imply getting only the first (or last) ID that was set.

Some solutions may need calls to happen sequentially and in the same connection or transaction. Beware of creating race conditions.


Fetch the last last-set/generated value

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Most engines have an automatically-generated-ID thing, and let you fetch the last generated value.

This can be easiest, but it means your SQL is not portable between database backends.


In PostgreSQL:

You can use:

SELECT currval('tablename_columnname_seq')

currval returns the value most recently obtained by nextval for this sequence in the current session. In other words, it is safe with respect to other concurrent sessions and transactions.

(Side note: currval is considered a volatile function, so you often don't want to use it in a condition or subquery directly).


Since postgres 8.2, you can also do:

INSERT INTO foo DEFAULT VALUES RETURNING id

...which is like a combined insert-and-select - what follows RETURNING is a basic select expression, but the most common use is probably returning the column containing the sequence-assigned IDs.




In MSSQL: You can fetch:

  • SELECT IDENT_CURRENT('tablename') reports the last inserted value for a table, regardless of connection/session
  • SELECT @@IDENTITY reports the last value set in the context of the current connection, regardless of table
  • SELECT SCOPE_IDENTITY() - like @@IDENTITY, but restricted to the scope, probably that of the insert you just did (verify). This avoids problems when you e.g. have triggers that may cause other identity field values to be generated



In MySQL:

Use LAST_INSERT_ID() (from SQL, or mysql_insert_id() from code).

This returns the last ID that was set into an auto_increment column (even if you handed it along explicitly), within the context of the same/current connection.

This is pretty robust, but the feature does not work with MyISAM when it uses 'Use Concurrent Inserts'.
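A minimal sketch (MySQL, made-up table, assuming id is an AUTO_INCREMENT column):

INSERT INTO users (name) VALUES ('example');
SELECT LAST_INSERT_ID();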

Manual new ID

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

serial/auto_increment behaviour generally guarantees that the next value is higher than any handed out before, and usually does so by being a simple counter belonging to a column/table, which gets incremented and never wraps.

This makes it very simple and fast, although the non-wrapping is an inherent (if rarely hit) limit, and perhaps more importantly, if you want your application to be able to use more than one backend, using such non-standard features means you need different code to handle different backends, something you may want to avoid.


A slower solution would be to calculate that max ID yourself, such as with a 'select max(id)', or '...order by id desc limit 1' sort of thing. However, both of these can be slow for big tables (without a usable index), so you may want to store a counter separately.

Postgres helps by allowing you to create sequences. These are used internally for serial fields, but can also be used to imitate a serial field since selecting from a sequence also increases its value.

Without it, however, you have to make sure no two clients can get the same ID to insert, by putting the select and update of the counter in a transaction. However, since MySQL's MyISAM does not support transactions, this approach is not safe there (you can end up with duplicate IDs).



Semi-sorted

Upsert

"Insert this row. If it already exists according to a uniqueness constraint, update that row instead (or, in other cases, do nothing instead)"


It turns out doing this atomically will actually require a transaction, because doing it non-atomically has possible race conditions.

Because it's a common enough need, it's often special-cased.


Consider a table

word   varchar(50)   primary key
count  integer


There are some backend specific solutions, like:

MySQL:

INSERT INTO         wordcount (word,count)  VALUES ('a_word',1)  
ON DUPLICATE KEY    UPDATE    count=count+1;


Postgres introduced ON CONFLICT in version 9.5 [2]

INSERT INTO         wordcount (word,count)  VALUES ('a_word',1)  
ON CONFLICT (word)  DO UPDATE SET count=count+1;

(TODO: test and elaborate)


Backends that have rule systems can be abused in similar ways. For example, in postgres before 9.5 you could do:

CREATE OR REPLACE RULE strings_counter AS
ON INSERT TO strings
WHERE (EXISTS ( SELECT 1 FROM strings tester WHERE new.string = tester.string)) 
DO INSTEAD  UPDATE strings SET count = strings.count + 1
            WHERE strings.string = new.string;


Backend-agnostic in-code solutions will require two statements, because upserts are necessarily a small transaction (implementing it cleanly and efficiently seems to be why postgres postponed it for a while).

Not doing this in a transaction will have possible races or possible deadlocks. Which you probably won't see under light load, but which will be a real problem in production.


You can write code to do two statements, for example:

If you usually want to update, then you can instead:

UPDATE wordcount set count=count+1 where word=myword

and if that affects zero rows (you usually get that information about a query), do:

INSERT INTO wordcount values (myword, 1)


Or

SELECT count from wordcount where word=myword

and if that returned zero rows, decide to do either

INSERT INTO wordcount values (myword, 1)

otherwise

UPDATE wordcount set count=count+1 where word=myword


The last form may also be preferable for the "insert a row if it wasn't there" case where you don't need to update it, because it means it's usually just one SELECT (a read operation).

counting is slow

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Using SQL's COUNT on a large table is slow

Mostly because the specs require it to be correct, and if your database engine conforms to ACID to a reasonable degree, that is a nontrivial thing to answer.

In particular isolation means a count correct at that time often needs to actually check the visibility of everything you want to count.

Which potentially amounts to a sequential scan.

There are sometimes optimized-but-still-correct cases (special-cased by the engine because people just occasionally want or need this), but don't count on that.


COUNT DISTINCT is worse, because most of those optimizations do not apply.(verify)


If ballpark is good enough, there are various faster ways - for an overall table COUNT, not so much COUNT DISTINCT or anything else that implies conditional counts.


Things you can consider doing include:

  • count precisely only every so often, and cache the results.
makes sense if e.g. a daily update is good enough
  • store a count yourself
possibly in the database, which you can keep accurate e.g. by doing this via triggers (...at minor cost at insert/delete time)
possibly not in the database, because there are lower-overhead, deadlock-free ways to count atomically
...but check whether they stay accurate, and/or correct them with the occasional slower precise count


  • if an estimate is enough (with some caveats you should know about - but which may be fine for you), most database engines store one somewhere, and have backend-specific ways of fetching it.
e.g in postgres you can ask
the stats collector (from pg_stat_all_tables)
the counts at last VACUUM / ANALYZE (from pg_class)
with autovacuum on, they may often be fairly close, but there are cases where they are significantly off (e.g. around quickly-changing tables, or large changes in otherwise-fairly-constant tables)
note that this is sometimes quite similar to the cached-daily-update case


  • conditional counts can sometimes be assisted with an index (see the sketch after this list)
consider e.g. a background process looking for a "which files to re-read from disk" field, which in regular use should stay small, so a partial index on reread=true is both small itself, and supports (only) queries like SELECT count(*) WHERE reread=true;


  • postgres 9.2, for example, introduced index-only scans, which means that under certain conditions, queries (including certain count queries) can be served entirely or largely from the index, which can be much faster (...assuming that index is not huge)
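A PostgreSQL-flavoured sketch of the last few points (the files table and its reread flag are made up):

-- rough row-count estimates, instead of a precise count(*):
SELECT reltuples::bigint FROM pg_class           WHERE relname = 'files';
SELECT n_live_tup        FROM pg_stat_all_tables WHERE relname = 'files';

-- partial index supporting a specific conditional count:
CREATE INDEX files_reread_idx ON files (reread) WHERE reread = true;
SELECT count(*) FROM files WHERE reread = true;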


consider not using auto_increment/serial

Fetching random items and samples

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Define random.

There is a big difference between

  • "just quickly give me something data-like to test my code on, I don't even care if it's frequently the same thing

and

  • "give me something that should show at least least hundreds of examples when poked, out of a much larger set (a hundred adjacent items is probably fine)"

and

  • "give me a random choice/sample out of the whole set so that I can do accurate statistics"


The first is relatively easy to do quickly and efficiently.

The second moderately so, with more logic.

The third is almost impossible to do efficiently at scale, in part because having each item in a distribution be equally likely basically depends on knowing how many items there are in that set - nothing to do with databases. And even if you know that number precisely at all times, fetching something of an arbitrary offset in an arbitrary-sized list is not something a lot of things are easily optimized for, particularly not if items can be added and removed.


You can often cheat well enough to get something between the second and third.

There are faster tricks

  • you can get faster results with fairly decent randomness when you know certain things about your table beforehand
  • when your wish is less robust, e.g. "a chunk of examples that is picked somewhat arbitrarily"


order by random

Idea: ask SQL to shuffle the table, then select the first, or first-so-many

Upsides:

  • properly random for 1, and possibly for any amount of samples (depends on implementation of ORDER BY RANDOM)

Downsides:

  • slow: probably O(n lg n) with the table size at best, and O(n²) at worst(verify) (depends on implementation of order by random)


A perfectly good solution for tiny tables, but you probably do not want to do this in the general case.


Example:

SELECT * from strings ORDER BY random() LIMIT 10

(note: random() may be called rand() or something else)

random offset

Idea: When you (approximately or exactly) know the number of items, you can pick a spot between 1 and that amount and use that in a LIMIT n OFFSET m.

Upsides

  • moderately fast to select
  • good when either
    • you want a single item with good randomness
    • a contiguous range at a random position is good enough

Downsides:

  • requires knowing amount of items
  • ...fairly precisely, which itself is slow (to do properly, there are workarounds)
  • still seems to be O(n) with table size
e.g. in some tests, offset 20000 limit 1 took milliseconds, offset 200000 limit 1 took 100ms, offset 2000000 limit 1 took 1.5 seconds


  • ...so not great for larger samples - either
    • randomness not good (if selecting a range at an offset)
    • speed is not good (if repeating this as often as you need)
  • If the real row count is lower than the count in your query, then you may occasionally select nothing -- depending on your actual select statement.


The SQL syntax varies a little. The performance may too (that is, sometimes it can do better than 'fetch entire result and pick one'). Details are typically found in LIMIT documentation.


If you fetch many items, they'll be related by some sort of sorting, which may not be random enough, or be perfectly fine for your purposes.


Knowing the amount of items exactly is preferable, in that if your estimate is a little high, you would sometimes select nothing. If you know how much off the estimate can be, you can be conservative with your offset.


Postgres optimization: fetching the amount of items recorded at last vacuum time (SELECT reltuples FROM pg_class WHERE relname='tablename') is often much faster than a basic-SQL select count(*) from tablename.

The difference can matter for (many) small samples from large tables.


In MySQL+InnoDB, you can use SHOW TABLE STATUS similarly, but its value is a much rougher estimate that in some situations is worthless; see its docs.

select a fraction of rows

Consider:

-- selects 50% from the first 200ish it sees
select * from tab where rand()<0.5 limit 100  

-- selects 10% of the first 1000ish rows
select * from tab where rand()<0.1 limit 100


Upsides:

  • moderate speed, moderately random
  • can be a good way to get a handful of examples
  • need only know the row count approximately
  • should be linear time with min(table size, amount you want via limit) (verify)
though note that if you want to select from all the table, this is roughly a tablescan


Downsides:

  • biased to look only at a small range, in the data's natural ordering (which probably has no meaning for you)
  • can't control number of outputs exactly



select next ID

  • can be the fastest trick (for single item), but the one with the most footnotes / caveats.
  • randomness can be good, depending on the table's history (see notes below)
  • typically an index-based lookup, then avoiding both sorting and skipping past records. If so, it's the fastest option for larger tables


Main assumptions:

  • your rows have IDs on them and they are indexed (often true)
  • you know little or nothing was deleted

...because

that means you have a known range, where thinking up an integer is pretty-properly random,
and the index means 'look up this or the next highest id' (to be robust to a few holes due to deletes) will use an index and so be fast (noticeably faster than linear time anyway)


For example, to get a random string from a (id,string) table, first get an ID in the given range

SELECT FLOOR( RANDOM()*(SELECT max(id) FROM strings))

And then use it to select the row with the next highest ID (these two are typically combined into a single statement, as sketched below)

SELECT * FROM strings WHERE id >= that_randid LIMIT 1
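Combined into one statement (a PostgreSQL-flavoured sketch, using the strings table from above) this might look like:

SELECT * FROM strings
WHERE id >= (SELECT floor(random() * max(id)) FROM strings)
ORDER BY id
LIMIT 1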


There are a lot of variants and a lot of possible tweaks to this trick. For example, you can often know (or estimate) max(id) in cheaper ways (e.g. remember it on inserts, select it from the relevant sequence in postgres)


Tables with a lot of deletes are problematic, because the random integer in 1..max(id) now isn't properly distributed.

The more sparse or unbalanced the IDs in your whole range, the worse the randomness is and the more you may e.g. see specific items.


TABLESAMPLE

Defined in SQL:2003, TABLESAMPLE selects a sample from the underlying table.


It seems the options of how to sample, and how to specify what you want, actually vary with implementation. (verify)

The percentage variant can act rather like WHERE random()<0.01 but implementations are free to be cleverer and touch less data than that would.


Note that TABLESAMPLE happens before any conditions apply, so it is possible to filter out all potentially matching rows, even if they're there.
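For example, in PostgreSQL (9.5+), using the strings table from earlier; SYSTEM samples whole pages (faster, clumpier), BERNOULLI samples individual rows:

SELECT * FROM strings TABLESAMPLE SYSTEM (1);      -- roughly 1% of pages
SELECT * FROM strings TABLESAMPLE BERNOULLI (1);   -- roughly 1% of rows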



notes

  • SELECT count(*) is often linear time with table size, so remembering an approximate count may be preferable over using count()
For a lot of database software there is a cheaper, but non-standard way of estimating the amount of rows (some notes on this for postgres and mysql are mentioned below)
  • Watch bordercases that can mean fewer results than you think (including no results), particularly on range fetches - the real code you'll need often needs to cheat a little more than that shown above.
  • LIMIT 1 (or any smallish number) is often fast because it tells the database that after it has found one thing, it can immediately stop looking in the index or table.
  • Similarly, a limit with an offset can also work fairly well, to fetch a chunk of items at a random position. It's often slower, though, as it needs to skip past data.
  • You generally want to avoid sorting when possible, as that's O(n*log(n)) at best
  • the speed of selecting max(id) or count(*) may differ depending on indices, optimizations, and possibly on engine

Common Table Expressions

Multiple cursors

.NET notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

.NET has a database interface made up of a good bunch of classes. It is nicely generic, if quite verbose.


Connections to data sources

a Connection object represents a connection to a particular data source, and is subclassed for types of connection, such as SqlConnection, OleDbConnection, OdbcConnection, OracleConnection.


You set up an object using a connection string, then call Open() when you need a connection.


It's common to use a using to have disconnection/threadpool releases (if applicable) handled automatically when execution exits the code block:

using (SqlConnection conn = new SqlConnection(connectionString)) {
    conn.Open();
    // all work that needs the connection
}

Other ways are calling Close() explicitly, or handing CommandBehavior.CloseConnection to commands, but you should only do so when you are sure you need it, as it would lessen threadpool effectiveness.


Connection pooling has to be supported by the database adapter, and may be disabled via the connection string.


You can use an event handler to listen to connection state changes if you wish.

Commands and their execution

Commands represent operations in a database. They may be instantiated directly, or be created using Connection.CreateCommand() or a Transaction object.

You can set the exact command yourself, e.g.:

SqlCommand countStrings = new SqlCommand(
      "SELECT count(*) FROM strings", connection);

You can also use one of InsertCommand, UpdateCommand, and DeleteCommand (or more exactly, its derivatives)

SqlParameter

To avoid SQL injection, you can use the fairly standard place-a-marker-in-the-actual-query, tell-it-what-to-replace-it-with approach.

There are many ways to do this, including the following two:

SqlCommand cmd = new SqlCommand("SELECT count(*) FROM tablename WHERE email=@email AND password=@2", connection);
// Short forms:
cmd.Parameters.AddWithValue("@email", email);
cmd.Parameters.Add("@2", SqlDbType.Text).Value = password;
// A much longer form would be:
SqlParameter emailParam = new SqlParameter();
emailParam.ParameterName = "@email";
emailParam.Value         = email;
cmd.Parameters.Add(emailParam);

// You may sometimes want to use Convert.ToSomething() to catch bad data on the code side rather than letting the database handle it.


Parameters are also used to specify mappings to DataSet objects.


Readers

ExecuteReader() returns an object that allows you to read many rows. For example:

SqlDataReader reader = command.ExecuteReader();  // optionally with a CommandBehavior

while (reader.Read()) {  // Read() advances to the next row. For example:
    Console.WriteLine(String.Format("{0}", reader[0]));
}
Scalars

ExecuteScalar() returns the first column of the first row, and ignores the rest.

Use when you expect exactly one such result, such as when you use aggregate functions to select just the min, max, or count of something.

For example:

numberOfStrings = (Int32)command.ExecuteScalar();


Non-queries

ExecuteNonQuery is used for commands that do not query for data, such as SQL INSERT, DELETE, UPDATE, and SET.

Returns the number of rows affected.

Begin...: Asynchronous variants
Stored procedures

DataSets

A DataSet is an in-memory representation of database data, allowing disconnected/offline access to data fetched earlier.

DataSets are usually filled by creating a DataAdapter based on a query and having it fill the object with data.

Alternatively, you can add data (rows) manually, as well as create the tables and their columns from scratch.


DataSets can also be used to update, insert, and delete data, although this has a few more problems than command-based code.

DataSet may contain one or more DataTables. In some cases, just the latter will do, and you can fill it just the same (using a different overload of a data adapter).


Untyped DataSets are generic DataSet objects for which the column types only become clear as they are being filled, so type checking can only be done at runtime.

Typed DataSets are those for which you explicitly declare the typing before compilation, so the strong typing can apply at compile time. This does imply you need a typed dataset for each different (major) query.

Typed datasets are declared based on schemas stored in xsd format (XML schema definitions).


BLOBs

There are also a few minor behaviour things that can bug people, such as:

  • the encoding/collation features vary significantly in behaviour for a few specific combinations of collation and text column type (one of utf8_general_ci/TEXT, utf8_general_ci/VARCHAR, utf8_bin/TEXT, utf8_bin/VARCHAR acts differently from the rest. Can you guess which?). This undermines the idea of transparent storage and of all text being created equal, which seems sour, as MySQL's relevant code was probably originally aimed at both providing that and being smart about it.