Searching algorithms
This article/section is a stub — probably a pile of half-sorted notes, and not well-checked, so it may have incorrect bits. (Feel free to ignore, fix, or tell me)
Contents
- 1 Intro
- 2 Keyword search
- 3 Exact-substring search
- 4 Multiple-substring
- 5 Approximate-substring
- 6 Tree and graph searches
- 6.1 Binary search tree
- 6.2 Heaps (and queues)
- 6.2.1 Binary heap
- 6.2.2 Binomial heap
- 6.2.3 Fibonacci heap
- 6.2.4 Leftist heap
- 6.2.5 Pairing heap
- 6.2.6 Skew heap
- 6.2.7 Van Emde Boas tree
- 6.2.8 2-3 heap
- 6.2.9 B-heap
- 6.2.10 d-ary heap
- 6.2.11 weak heap
- 6.2.12 Treap
- 6.2.13 (Randomized) Meldable heap
- 6.2.14 Radix heap / Monotone priority queue
- 6.2.15 Brodal queue
- 6.3 Tries
- 6.4 Search concepts
- 6.5 VP^{sb}-tree
- 6.6 SB-tree
- 6.7 Top tree
- 6.8 LC-RS tree
- 6.9 Balancing trees
- 7 Unsorted tree stuff
- 8 Unsorted
Intro
Keyword search
That is, we don't care about substrings / randomly positioned patterns.
Exact-substring search
Depending on your use, you'll care more about some of:
- search time
- preprocessing time
- memory use
In the below:
- m is the pattern length
- n is the text length
- σ is the alphabet size
Naive / brute force
Try matching the string at each possible position.
Summary:
- No preprocessing
- Constant extra space (very little)
- Search phase finds all occurrences in O(mn)
For a pattern of length m in a text of length n, that means O(m·n) character comparisons in the worst case.
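A minimal sketch of the brute-force approach (function name is mine, not from any library):

```python
def naive_search(needle, haystack):
    """Try to match the needle at every possible position.

    No preprocessing, constant extra space, O(m*n) worst case."""
    m = len(needle)
    hits = []
    for i in range(len(haystack) - m + 1):
        if haystack[i:i + m] == needle:
            hits.append(i)
    return hits
```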
FSA (matcher)
Build a deterministic finite state automaton (a.k.a. DFA) to recognize a specific pattern/string.
At initial cost of building it, you can get linear-time matching.
Useful when the matching needs to be done repeatedly, the pattern does not change, and the data may.
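A sketch of that idea — the table construction below is the standard KMP-style DFA build (state = length of the longest pattern prefix matching a suffix of the input so far); names are mine:

```python
def build_dfa(pattern, alphabet):
    """Build the transition table for a DFA recognizing the pattern."""
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # state the automaton restarts in after a mismatch
    for j in range(1, m):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]   # copy mismatch transitions
        dfa[j][pattern[j]] = j + 1
        x = dfa[x][pattern[j]]
    for c in alphabet:              # transitions out of the accept state,
        dfa[m][c] = dfa[x][c]       # so overlapping matches are also found
    return dfa

def dfa_search(text, pattern, alphabet):
    """One table lookup per character: O(n) matching after
    O(m * sigma) preprocessing."""
    dfa = build_dfa(pattern, alphabet)
    state, hits = 0, []
    for i, c in enumerate(text):
        state = dfa[state].get(c, 0)
        if state == len(pattern):
            hits.append(i - len(pattern) + 1)
    return hits
```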
Suffix trees
Suffix trees allow substring searches in O(l+r) (query length l, result count r), though at the cost of extra memory — O(n) nodes for the compressed tree (a naive uncompressed suffix trie can take O(n²)). They can be built in O(n), e.g. with Ukkonen's algorithm.
The name is a slight misnomer — what you can then look up are arbitrary substrings; only the process of building involves suffixes directly.
http://mila.cs.technion.ac.il/~yona/suffix_tree/
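To illustrate just the structure (not the efficient construction), here is a naive suffix trie — the O(n²) uncompressed cousin of a real suffix tree, which compresses chains of single-child nodes; names are mine:

```python
def build_suffix_trie(text):
    """Naive suffix trie: insert every suffix into a dict-of-dicts.
    O(n^2) time and space; real suffix trees do much better."""
    text = text + "$"  # sentinel so no suffix is a prefix of another
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root

def has_substring(trie, pattern):
    """Every substring of the text is a prefix of some suffix,
    so a match is just a path from the root."""
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True
```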
Burrows-Wheeler Transform
The Burrows-Wheeler transform is, by itself, just a transform that reorders a string. Its properties make it interesting for compression and search.
It happens to be reversible (which, looking at its input and output, seems a little magical).
It also tends to put similar patterns near each other. This makes it interesting for compression (bzip2, for example, is built around BWT) — and the reversibility is what makes that use possible.
The BWT is related to the suffix array (/ suffix tree) line of thought, and one way to calculate (and explain) the BWT is via suffix arrays. The latter help answer questions like "where does this arbitrary substring appear in the text?" without searching all the data.
The BWT more easily supports questions like "how often does this arbitrary substring appear in the text?" without searching all the data, in O(length of pattern). (verify)
The hits themselves can only be retrieved by interpreting the data, in O(n). (verify)
See terms like LF Mapping, FM Index
See also:
- B Langmead, 2013, "Introduction to the Burrows-Wheeler Transform and FM Index"
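A rotation-based sketch of the transform and its inversion (real implementations derive the BWT from a suffix array rather than materializing all rotations, and invert it in O(n) via the LF mapping):

```python
def bwt(s):
    """Forward BWT: last column of the sorted rotation matrix."""
    s = s + "$"  # unique sentinel, sorts before everything else
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last):
    """Invert by repeatedly prepending the last column and sorting,
    rebuilding the rotation matrix one column at a time."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("$"))[:-1]
```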
Boyer-Moore and variants
Where a naive substring search would try to look for a pattern at each successive index, BM does pre-analysis of the pattern so that when it doesn't match, it knows how far ahead it can skip.
This usually saves a bunch of work when the pattern is long and/or non-repetitive, such as natural text.
...and, because of the cost of that pre-analysis, it pays off most when you use the pattern repeatedly or on a large haystack.
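The Horspool simplification of Boyer-Moore shows the core trick, the bad-character shift table (a sketch; full BM adds the good-suffix rule on top):

```python
def horspool_search(pattern, text):
    """Boyer-Moore-Horspool: compare the window right-to-left and, on a
    mismatch, shift by how far the window's last character is from its
    last occurrence in the pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = {}
    for i in range(m - 1):          # final position excluded on purpose
        shift[pattern[i]] = m - 1 - i
    hits = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            hits.append(i)
            i += 1                  # minimal shift, keeps overlapping hits
        else:
            i += shift.get(text[i + m - 1], m)
    return hits
```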
See also:
Knuth-Morris-Pratt
Pre-analysis that also determines how far to skip when a match fails, but more specifically avoids re-examining text it has already matched (it scans left-to-right and never moves the text pointer backwards).
This helps for small-alphabet and/or repetitive patterns, e.g. DNA sequences, where it does better than at least the most classical Boyer-Moore.
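A sketch: the failure table records, for each pattern prefix, the length of its longest proper border (a prefix that is also a suffix), which is exactly how far a failed match can safely restart:

```python
def kmp_search(pattern, text):
    """Knuth-Morris-Pratt: O(m) table build, O(n) scan,
    text pointer never backs up."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]         # fall back to the next-shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits = []
    k = 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]         # keep going, allows overlapping hits
    return hits
```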
See also:
Aho-Corasick
Pre-analyzes multiple search patterns so that they can all be searched for at once.
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
Z-Algorithm
Multiple-substring
That is, match on multiple patterns at the same time.
Rabin-Karp
http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_string_search_algorithm
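The idea is a rolling hash: the window's hash is updated in O(1) per position, and only hash matches are verified character by character (a sketch; the base and modulus here are arbitrary choices):

```python
def rabin_karp(pattern, text, base=256, mod=1_000_003):
    """Rolling-hash search: O(n + m) expected, O(mn) worst case."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)    # weight of the outgoing character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # verify on hash match, since different strings can collide
        if t_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:               # roll: drop text[i], add text[i+m]
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits
```

Because hashes for many patterns can be checked against the same rolling window hash, this generalizes naturally to multi-pattern search.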
Aho–Corasick
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
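A compact sketch: a trie of all patterns plus BFS-computed failure links, so the text is scanned once regardless of how many patterns there are (names are mine):

```python
from collections import deque

def build_automaton(patterns):
    """Trie over all patterns, plus failure links (each pointing at the
    longest proper suffix that is also a trie path) and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for c in pat:
            if c not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][c] = len(goto) - 1
            state = goto[state][c]
        out[state].append(pat)
    queue = deque(goto[0].values())     # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] += out[fail[t]]      # inherit matches ending here
    return goto, fail, out

def ac_search(text, patterns):
    """Single pass over the text, reporting (position, pattern) hits."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, c in enumerate(text):
        while state and c not in goto[state]:
            state = fail[state]
        state = goto[state].get(c, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```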
Commentz-Walter
http://www.google.com/search?q=Commentz-Walter
Approximate-substring
Bitap / Baeza-Yates–Gonnet
http://en.wikipedia.org/wiki/Bitap_algorithm
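The exact-match core of bitap (the "shift-and" formulation): the set of active partial matches lives in the bits of a single integer, so each text character costs a few word operations. The approximate variants extend this with one state word per allowed error; this sketch shows only the exact case:

```python
def bitap_exact(pattern, text):
    """Bitap, exact matching: bit i of the state means
    'pattern[:i+1] ends at the current text position'."""
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state = 0
    hits = []
    for i, c in enumerate(text):
        # extend every partial match by one character, start a new one
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):      # full pattern matched
            hits.append(i - m + 1)
    return hits
```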
Tree and graph searches
While trees are often introduced as an entirely abstract data structure, a lot of trees have rather specific properties for specific needs.
For example, self-balancing trees avoid becoming overly expensive with a lot of updates.
Some tree-search algorithms explore general tree properties,
and some are tightly bound to the properties of a specific tree.
As such, this section is as much about data structures as about search algorithms.
Binary search tree
Heaps (and queues)
A heap is a tree that satisfies the heap property: each node's value is larger-than-or-equal (or smaller-than-or-equal, depending on purpose) to its parent's.
In other words, it's a tree with orderedness as part of its nature, though exactly how varies.
Due to that nature, heaps are also often the implementation behind the concept of a (priority) queue,
(often to keep insert and removal costs low/constant)
and there are some heap implementations designed specifically for queues.
For a general comparison, see https://en.wikipedia.org/wiki/Heap_(data_structure)
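Python's heapq module makes the (min-)heap property easy to poke at — it keeps a binary min-heap in a plain list:

```python
import heapq

# heapq maintains: heap[i] <= heap[2*i + 1] and heap[i] <= heap[2*i + 2]
# for every index i (where those children exist).
heap = []
for x in [5, 1, 9, 3, 7]:
    heapq.heappush(heap, x)       # O(log n) insert

assert heap[0] == 1               # the minimum always sits at the root
popped = [heapq.heappop(heap) for _ in range(5)]
assert popped == [1, 3, 5, 7, 9]  # O(log n) removals come out sorted
```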
Binary heap
https://en.wikipedia.org/wiki/Binary_heap
Binomial heap
https://en.wikipedia.org/wiki/Binomial_heap
Fibonacci heap
https://en.wikipedia.org/wiki/Fibonacci_heap
Leftist heap
https://en.wikipedia.org/wiki/Leftist_tree
Pairing heap
https://en.wikipedia.org/wiki/Pairing_heap
Skew heap
https://en.wikipedia.org/wiki/Skew_heap
Van Emde Boas tree
https://en.wikipedia.org/wiki/Van_Emde_Boas_tree
2-3 heap
https://en.wikipedia.org/wiki/2%E2%80%933_heap
B-heap
https://en.wikipedia.org/wiki/B-heap
d-ary heap
https://en.wikipedia.org/wiki/D-ary_heap
weak heap
https://en.wikipedia.org/wiki/Weak_heap
Treap
https://en.wikipedia.org/wiki/Treap
(Randomized) Meldable heap
https://en.wikipedia.org/wiki/Randomized_meldable_heap
Radix heap / Monotone priority queue
https://en.wikipedia.org/wiki/Monotone_priority_queue
Brodal queue
https://en.wikipedia.org/wiki/Brodal_queue
Tries
Suffix tree
(See also BWT; they are in part a logical extension of the suffix tree idea)
Radix tree
Hash tree
Ternary search tree
Y-fast trie
Search concepts
Branch and bound
A* search
Visit possibilities according to their (heuristically determined) promise, best to worst.
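A sketch over an abstract graph — the neighbors and heuristic callables are assumptions of this example, not any particular library's API:

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """A*: expand nodes in order of g (cost so far) + h (heuristic
    estimate of remaining cost)."""
    frontier = [(heuristic(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, step_cost in neighbors(node):
            ng = g + step_cost
            if ng < best_g.get(nxt, float("inf")):  # found a cheaper route
                best_g[nxt] = ng
                heapq.heappush(
                    frontier,
                    (ng + heuristic(nxt), ng, nxt, path + [nxt]))
    return None  # goal unreachable
```

With an admissible heuristic (one that never overestimates remaining cost), the first time the goal is popped, its path is optimal.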
Beam search
Like A*, but keeps only a fixed number of the most promising paths at each step, rather than all of them.
VP^{sb}-tree
Different from SB-tree.
SB-tree
Variant of B-tree that is good at sequential fetches.
Different from VP^{sb}-tree.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.9482&rep=rep1&type=pdf
Top tree
http://en.wikipedia.org/wiki/Top_tree
LC-RS tree
Balancing trees
T-tree
Unsorted tree stuff
LLRB tree
AA tree
AVL tree
Scapegoat tree
Splay tree
Treap
Bx-tree
UB-tree
https://en.wikipedia.org/wiki/UB-tree
(a,b)-tree
https://en.wikipedia.org/wiki/(a,b)-tree
Dancing tree
Similar to B+ trees, but only rebalance on request.
https://en.wikipedia.org/wiki/Dancing_tree
HTree
https://en.wikipedia.org/wiki/HTree
B-trees and its friends
B-tree
B+ tree
B*-tree
Specific-purpose trees
These serve only rather specific needs.
MVP tree
(Multi Vantage Point)
Meant for metric-space data.
Allows looking for hits close to a query, according to that metric, so e.g. useful for 'nearby location' sort of things.
Based on a VP tree (which is more clearly just a type of BSP tree), adding storage of distances, and adding multiple vantage points to partition the space at each level (which is computationally expensive, so makes more sense for index-once-lookup-many cases than for rebalancing ones (verify)).
Tends to outperform M-trees?
http://en.wikipedia.org/wiki/MVP_Tree
https://code.google.com/p/mvptree-library/ (C implementation)
https://github.com/trolldbois/phash-graph-mvp/blob/master/mvptree.py (Python implementation)
M-tree
For metric-space data, otherwise similar to R-trees, B-trees.
M^{+}-tree
For metric-space data. Tries to mix the good parts of M-trees and MVP-trees(verify)
BSP tree
Quadtree
Octree
k-d tree
Implicit k-d tree
VP tree
Vantage Point trees
http://stevehanov.ca/blog/index.php?id=130
Merkle tree
http://en.wikipedia.org/wiki/Merkle_tree
Unsorted
- http://www.cs.ucr.edu/~stelo/oldevents.html
- http://igm.univ-mlv.fr/~lecroq/string/index.html
- http://en.wikipedia.org/wiki/Category:Search_algorithms
Factor Oracle