SQLite notes

{{Database related}}


For some introduction, see [[Some_databases,_sorted_by_those_types#SQLite]]


===On text coding===
<!--
SQLite defaults to assume UTF-8.

Assume, in that the C library does not ''verify it is valid''.


Behaviour may well depend on the library you use.
-->
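A minimal sketch of what that looks like from Python's sqlite3 (default settings; the table name is made up): str values end up as TEXT (UTF-8 in the file), while bytes end up as BLOBs and are never validated as text.

 import sqlite3
 conn = sqlite3.connect(":memory:")
 conn.execute("CREATE TABLE t (s TEXT)")
 conn.execute("INSERT INTO t VALUES (?)", ("héllo",))      # str is stored as TEXT (UTF-8 on disk)
 conn.execute("INSERT INTO t VALUES (?)", (b"\xff\xfe",))  # bytes become a BLOB, and are not checked as text
 for row in conn.execute("SELECT s, typeof(s) FROM t"):
     print(row)                                            # ('héllo', 'text')  then  (b'\xff\xfe', 'blob')
 conn.close()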


===On SQLite typing===
{{stub}}

This page mentions a lot of what you want to know.


For contrast: where most RDBMSes are statically typed (converts data to the column's type on storage, fails if it can't)...

...SQLite is more like [[dynamic typing]] in that SQLite decides the type from the ''value'' you give it,
and only ''somewhat'' cares about the according column's type.


A little more precisely, the schema defines a '''type affinity''', not a rigid type.

'type affinity' here meaning 'the value may still be of any type, but we prefer this one if that works'

The SQL types map to the affinity of INTEGER, REAL, NUMERIC, TEXT, BLOB, or NULL
: e.g. INTEGER is split into seven specific-sized things -- on disk, anyway; in memory all are loaded into int64{{verify}}
: TEXT is stored as UTF-8, UTF-16LE, or UTF-16BE (according to the database encoding){{verify}}

For example{{verify}}
* if you store into a numeric column, it will do something like
** if not well formed as an integer or real, it's stored as TEXT
** if it's a real number (with a decimal part), store it as REAL (seems like float64, so if there was more precision in the text, that will be lost)
** if it's an integer, find the smallest integer type that will store it
** if it's an integer larger than int64 can store, it will try REAL
** hex is stored as text


Assume that the schema is not used for conversions going out.

AFAICT, what you get out is whatever type that got stored, and it is the logic on insert that is interesting.

...at least, that's what dynamically typed languages tend to do with it. Statically typed languages (including SQLite's own API) may get you to convert it, and/or make you test for it{{verify}}.
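As a minimal sketch (Python's sqlite3, made-up table name), you can watch the affinity logic decide per value:

 import sqlite3
 conn = sqlite3.connect(":memory:")
 conn.execute("CREATE TABLE t (n INTEGER)")                  # INTEGER here is an affinity, not a hard type
 conn.execute("INSERT INTO t VALUES (?)", ("123",))          # well-formed integer text   -> stored as integer
 conn.execute("INSERT INTO t VALUES (?)", (1.5,))            # can't be an exact integer  -> stored as real
 conn.execute("INSERT INTO t VALUES (?)", ("not a number",)) # not numeric at all         -> kept as text
 for row in conn.execute("SELECT n, typeof(n) FROM t"):
     print(row)    # (123, 'integer'), (1.5, 'real'), ('not a number', 'text')
 conn.close()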


===On concurrency===
 
Say the writers: "SQLite is not designed to replace Oracle. It is designed to replace fopen()"
 
 
That said, it does about as much as it can with what the filesystem provides, which is pretty great on a single device.
 
...''but'' keep in mind there are limits - and that filesystems vary. In particular some network filesystems do not do enough.
 
 
====Concurrency from distinct tasks====
 
 
'''Multiple threads'''
 
There is usually nothing to be gained from distinct threads accessing the same database, so maybe avoid doing this.
 
Yet if you insist: SQLite can be compiled to be thread-safe, and usually is. {{comment|(the reason you can compile ''without'' is that removing this safety is ''slightly'' faster for cases you know will always be single-threaded)}}
 
You can query that compile option from the database at runtime if you want to be sure.
 
[https://www.sqlite.org/threadsafe.html This page] points out there is a further distinction between multi-threaded (where threads ''must not'' share a connection) and serialized (where that appears to be fine).
 
 
Some libraries may have their own decisions - e.g. python3's sqlite3 will complain ''SQLite objects created in a thread can only be used in that same thread'', [https://ricardoanderegg.com/posts/python-sqlite-thread-safety/ though that seems to be overzealous].
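As a small sketch (Python's sqlite3; note that sqlite3.threadsafety only reflects the actual build on Python 3.11+, and "example.db" is a placeholder), this is how you could check, and how to opt out of Python's per-thread check if you really want to share a connection:

 import sqlite3
 print(sqlite3.threadsafety)                               # PEP 249 value; hard-coded 1 before Python 3.11
 conn = sqlite3.connect(":memory:")
 opts = [o for (o,) in conn.execute("PRAGMA compile_options")]
 print([o for o in opts if o.startswith("THREADSAFE")])    # e.g. ['THREADSAFE=1']
 conn.close()
 # if you insist on sharing one connection between threads, you can disable python's own check:
 shared = sqlite3.connect("example.db", check_same_thread=False)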
 
 
 
'''Multiple processes'''


Multiple processes can access the same database.


Assume that:
* multiple can read concurrently
* writes are not concurrent with other writes
* writes are not concurrent with reads
It is somewhat up to clients to error out, or wait and/or time out{{verify}}
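For instance, in Python's sqlite3 (a sketch; "example.db" is a placeholder) the connection's timeout controls how long to wait on a locked database before raising ''OperationalError: database is locked'':

 import sqlite3
 # wait up to 5 seconds for locks instead of erroring out immediately
 conn = sqlite3.connect("example.db", timeout=5.0)
 # the same knob at the SQLite level, in milliseconds:
 conn.execute("PRAGMA busy_timeout = 5000")
 conn.close()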


Things work a little better with WAL, a non-default journaling mode, but it comes with more requirements.




Limitations:
* locking (and the journaling that relies on it) relies on some filesystem semantics
: assume this will not work properly over some network filesystems (e.g. NFS, some SMB, a.k.a. windows network mounts).


* You should ''not'' carry an open connection through a [[fork]]


====More on concurrent operations====
{{stub}}


<!--
Safety from multiple processes ''relates'' to how proper atomic commits are done,
namely in that it's largely details in the rollback journal, or WAL.




Also keep in mind that you can always unintentionally mess up your own ability to have concurrency, e.g. by
* keeping transactions open (e.g. autocommit off),
* counting on garbage collection to close a connection, rather than an explicit close()
* database middleware doing their own thing to make things worse
:: this can make a lot of sense in a "make everything act the same" way
Comparison:
* in the absence of writing, both equally allow lots of concurrent reading
 
* WAL allows writes to collect while reading is still going on; rollback journal does not
 
* WAL is more about concurrency than speed
: there are only some cases where more concurrency means more speed
 
 
 
 
The details of the two also dictate what operations can be concurrent or not.
 
 
 
The default is using a '''rollback journal'''.
: This
:: copies the original version of what we are altering in a separate file
:: we alter the database file itself directly
:: and close=commit=remove that separate file.
: after a crash, the separate file is replayed to fix the database file
: rollback of an ongoing transaction is similar
: which means
:: you can have many concurrent readers
:: ...but reading is exclusive with writing
 
 
 
The alternative is ''WAL'' style journaling
: It
:: collects writes into a separate file (from multiple transactions)
:: the database is as of yet untouched (so separate from ongoing reads)
:: only writes into the database file at checkpoint time.
: which means
:: reads often don't block writes (as they easily do with a rollback journal).
:: writes don't block reads
:: writes ''do'' still block other writes (there's only one WAL)
:: checkpointing (=writing WAL contents into the database) ''can'' be largely concurrent with reading ''but'' there are more hold-up edge cases.
 


:: writes are a little faster, because they happen once, not twice as with rollback journaling
:: reads can be slower because each reader must sometimes check whether the content they need is in the WAL
:: you can balance it towards somewhat faster reads, or writes, by controlling the WAL size
If you're familiar with MVCC, it's ''not'', but it's halfway there.
As such, read-heavy workloads are probably more concurrent.
-->
<!--
Say, in a nonsense benchmark I can have 10 concurrent clients at 30 requests/sec each with no issue
...but in real code I can get it to break with two processes and a low transaction rate






WAL needs to be re-incorporated into the main database, and it does this when the last connection to the database closes. With a very busy site you might find it takes a few seconds for all connections to be closed, but 100K hits per day should not be a problem.
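If waiting for the last connection to close is not practical, you can also ask for a checkpoint yourself (a sketch; the filename is a placeholder):

 import sqlite3
 conn = sqlite3.connect("example.db")
 conn.execute("PRAGMA journal_mode=WAL")
 # push WAL contents back into the main database file now; TRUNCATE also resets the -wal file to zero bytes
 conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
 conn.close()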




journal_mode:
* DELETE: rollback journal, delete file (default)
* TRUNCATE: rollback journal, truncates file (cheaper in some situations - fewer filesystem interactions)
* PERSIST: rollback journal, zeroes out the header (cheaper in some situations)
* MEMORY: rollback journal in RAM - no disk IO, but if we crash in the middle of a write, the database is likely to corrupt
* WAL: WAL journaling - more concurrent, but requires SQLite >= 3.7.0
* OFF: no atomic commit or ROLLBACK, and database is likelier to corrupt. Not recommended.
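For example, from Python (a sketch; the filename is a placeholder). Note that the pragma answers with the mode actually in effect, and that WAL, unlike the other modes, is persistent, i.e. recorded in the database file itself:

 import sqlite3
 conn = sqlite3.connect("example.db")
 print(conn.execute("PRAGMA journal_mode=WAL").fetchone())   # ('wal',) if it took effect
 conn.close()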
 




Depends.

It often comes down to ''amount of interaction * average time spent in transaction > timeout value'',
but the actual reason why is harder to discover.


With services in general, once your load is large enough that requests queue up,
you have issues.








And yeah, you can delay the point at which that happens,
like trying to keep a transaction as short as possible
(try deciding what to send before, not after, you start a transaction).


Sure, in a good case you might get a three times higher commit rate
with WAL enabled before things fall over, but if your load was dataset bound,
the amount of erroring out will be the same.


...or no difference in commit rate at all, when the reason it was timing out
had to do with ''inaction''.


It does seem that write-bound jobs are maybe 3x as fast with WAL.


Because the timeout is defined in seconds, any reason that someone is doing nothing (whether it is their own behaviour or caused by something external) is an issue.






As such, your "database is locked" error is more likely to be caused by
: transactions that won't be committed until later
: cursors left open (e.g. counting on the gc to get them)
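So the boring fix is usually just being explicit about transaction and connection lifetime, e.g. (a sketch; the kv table and filename are placeholders):

 import sqlite3
 conn = sqlite3.connect("example.db", timeout=5.0)
 try:
     with conn:                                              # commits on success, rolls back on exceptions
         conn.execute("INSERT INTO kv VALUES (?, ?)", ("key", "value"))
 finally:
     conn.close()                                            # don't leave it to the garbage collector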








-->
<!--
https://fly.io/blog/sqlite-internals-rollback-journal/
-->


See also:
* https://www.sqlite.org/wal.html


===CLI admin===
<!--
 
Note that the CLI is intentionally simple, so
 
 
Sometimes even just getting a column name to be readable (in {{inlinecode|.mode column}}) requires trickery, because
 .width auto
just looks at XX, so sometimes forcing things like:
 .width 10 10 30 10 auto
 sqlite> SELECT * FROM sqlite_master;
works a little better.
 
 
https://www.sqlite.org/cli.html#changing_output_formats
 
 
 
 
Vacuum database:
vacuum
 
 
List tables:
.tables
-- or
 SELECT name FROM sqlite_schema; -- since 3.33, before that:
 SELECT name FROM sqlite_master;
 
List indexes:
.indexes
-- or
 PRAGMA index_list('TABLENAME'); -- per table
 
 
Describe table
PRAGMA table_info('TABLENAME');
(you might like {{inlinecode|.headers on}} and {{inlinecode|.mode column}})
 
In SQL form:
.schema tablename
 
 
 
 
Get table size (in bytes?)
SELECT SUM("pgsize")  FROM "dbstat"  WHERE name='TABLENAME';
Which you may prefer in a "give me the largest" form like:
SELECT name ,SUM(pgsize)/1024 table_size  FROM "dbstat" GROUP BY name ORDER BY table_size desc;
(note: requires SQLITE_ENABLE_DBSTAT_VTAB)
 
 
 
 
 
.schema
 
.fullschema also adds things like the stat_ tables (the intent seems to be debugging that can recreate the query planner's decisions).
 
 
.databases - open databases in the current connection, where main is your data, and temp seems to be a tablespace for temporary tables.
 
.open opens another database file (after closing the current)
 
 
 
 
.read and .import are for importing data from SQL and CSV files
 
 
.dump writes the current database out as one SQL text dump, usually used from the shell prompt, something like
 sqlite3 example.db .dump | gzip -c > example_db.sql.gz
 
As the docs mention this is pure SQL so you ''could'' stream it to another database engine.
 
 
 
'''dbstat'''
 
dbstat is a virtual table, meaning it calculates things on the fly.
 
It can be useful to calculate things like the size of indexes, though.
 
 
Beware that this means that e.g.
SELECT DISTINCT name FROM dbstat;
is not something you want because it touches a ''lot'' of data (and you can get that list elsewhere, see e.g. sqlite_master or PRAGMA index_list for table-specific indexes)
 
 
and e.g. the distinction between something like:
SELECT sum(pgsize) FROM dbstat WHERE name='sqlite_autoindex_kv_1';
and
SELECT name,sum(pgsize) as size  FROM dbstat  WHERE name LIKE 'sqlite_autoindex_%' GROUP BY name;
can be larger than you think
 
 
 
 
-->
 
===OPTIMIZE, ANALYZE, etc===


<!--
'''OPTIMIZE'''
Roughly speaking,
the query planner counts cases where (on a given connection) it would have been useful to have (any, or newer) ANALYZE results at hand;
optimize then does an ANALYZE for the tables where that seems worth it.
As such, it can be quite useful to do an OPTIMIZE before you close a connection -- if the table never changes this is usually a no-op,
and the times at which it ''does'' do work are probably good for subsequent access.
ANALYZE (by default) does a full scan of every index, which ''can'' be slow for larger data.
You can limit the amount of work any one ANALYZE does via {{inlinecode|PRAGMA analysis_limit}}.
You should consider this an ''approximate'' ANALYZE (it may still be accurate if there was little work to be done, but don't count on this).
https://www.sqlite.org/lang_analyze.html
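In code that would look something like the sketch below (Python; the filename is a placeholder, and the 400 here is just an example cap):

 import sqlite3
 conn = sqlite3.connect("example.db")
 # ... normal work on the connection ...
 conn.execute("PRAGMA analysis_limit = 400")   # cap per-index work, making any ANALYZE approximate
 conn.execute("PRAGMA optimize")               # only analyzes where the planner thinks it would help
 conn.close()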
sqlite_stat1
: useful mainly to the query planner (contains information about tables and indices)
: allowed to be ''read'' (but not altered) by applications, which is sometimes useful
: https://www.sqlite.org/fileformat2.html#stat1tab
sqlite_stat4
: can be useful for the query planner (contains information about keys within an index, or primary key of a WITHOUT ROWID table)
: won't be updated from a partial ANALYZE
sqlite_stat2
: Only a thing between 3.6.18 and 3.7.8; basically replaced by 3 and 4
sqlite_stat3
: can be useful for the query planner (contains information about the keys in an index)
: 4 reads 3, but does not write it;
: you can consider 3 deprecated if you use 4{{verify}}
https://www.sqlite.org/fileformat2.html#stat4tab
ANALYZE kv;
select * from sqlite_stat1;
kv|sqlite_autoindex_kv_1|433 1
sqlite> select count(*) from KV;
433
BUT note that
* that 433 is the size of the ''index'', which we only know matches the table's row count because we know it comes from a PRIMARY KEY / UNIQUE, NOT NULL constraint
:
* before an ANALYZE, sqlite_stat1 will not exist
Apparently SQLite can only use one index per table (per query),
so EXPLAIN can be more important than it is for other databases,
to tell whether it's choosing the right one -- and in some cases you can push it towards a better index choice.
https://stackoverflow.com/questions/12947214/sqlite-analyze-breaks-indexes
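A quick way to check is EXPLAIN QUERY PLAN; a sketch in Python (the kv table and its k column are placeholders):

 import sqlite3
 conn = sqlite3.connect("example.db")
 for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM kv WHERE k = ?", ("a",)):
     print(row)   # the detail column says e.g. 'SEARCH kv USING INDEX sqlite_autoindex_kv_1 (k=?)' or 'SCAN kv'
 conn.close()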
List all indexes
* SELECT name  FROM sqlite_master  WHERE type='index';
* or, on the prompt, .indexes
SQLite automatically creates internal indexes for UNIQUE and PRIMARY KEY constraints, [https://sqlite.org/fileformat2.html#intschema]
: which will be called something like  sqlite_autoindex_table_''number''
There are also per-query temporary indexes created by the query optimizer [https://sqlite.org/optoverview.html#autoindex] when it estimates that creating that index in RAM is faster.
-->
===Errors===
<!--
'''String or BLOB exceeded size limit''' seems to mean that one (or both) of:
* a query size exceeded SQLITE_MAX_SQL_LENGTH (1MB by default)
:: you can avoid this by using parametric values (binding) instead of literal query strings
* a row, text field, or blob field exceeded SQLITE_MAX_LENGTH (1GB by default)
(possibly other limits)
You can set these while compiling.
SQLITE_MAX_LENGTH may be at most 2^31-1, and SQLITE_MAX_SQL_LENGTH may be no larger than SQLITE_MAX_LENGTH (and possibly also no larger than 1GB?{{verify}}) -- not that wanting a 2GB query is very sane...
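For example, in Python (a sketch; the kv table and filename are placeholders), the bound version keeps the statement text tiny no matter how big the value is:

 import sqlite3
 conn = sqlite3.connect("example.db")
 big = "x" * (10 * 1024 * 1024)                     # 10 MB of text
 # literal SQL would put all 10 MB into the statement text, tripping the SQL length limit:
 #   conn.execute("INSERT INTO kv VALUES ('k', '%s')" % big)
 # a bound parameter keeps the statement tiny; the value itself is only subject to SQLITE_MAX_LENGTH:
 conn.execute("INSERT INTO kv VALUES (?, ?)", ("k", big))
 conn.close()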




