# String search algorithms

**This article/section is a stub**— some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

# Intro

# Keyword search

That is, we don't care about substrings / randomly positioned patterns.

# Exact-substring search

You'll variably care more about some of about

- search time
- preprocessing time
- memory use

In the below:

- m is the pattern length
- n is the text length
- σ is the alphabet size

## Naive / brute force

Try matching the string at each possible position.

Summary:

- No preprocessing
- Constant extra space (very little)
- For a needle of length n in haystack of size m, all occurences takes O(m*n)

## FSA (matcher)

Build a deterministic finite state automaton (a.k.a. DFA) to recognize a specific pattern/string.

At initial cost of building it, you can get matching that is linear time with the haystack length.

Useful when the matching needs to be done repeatedly, the the pattern does not change and the data *may*.

## Suffix trees

Suffix trees are basically a sort of indexing of a fixed haystack.

It allow substring searches to be executed in O(l+r) average-case (query length l, result length r).

It however requires memory use of O(n log n) average (O(n^2) worst), and O(n) build time.

Suffix trees are a slight misnomer - what you index are substrings, while only the process of building involves suffixes directly.

http://mila.cs.technion.ac.il/~yona/suffix_tree/

## Burrows-Wheeler Transform

**This article/section is a stub**— some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The Burrows-Wheeler transform is itself a reordering string transform, only.

It happens to be reversible (which, looking at its input and output, seems a little magical).

The BWT is related to the suffix array (/ suffix-tree) line of thought, and one way to calculate (and explain) the BWT is via suffix arrays.

Because it happens to often put similar patterns near each other, this is interesting for compression (and e.g. bzip2 is built around BWT).
The reversibility makes it easier/possible to use it.

The BWT used as an index supports questions like "how much does this arbitrary substring appear in the text" without searching all the data, so in O(1) (well, O(length_of_pattern)). *(verify)*

The hits themselves can only be retrieved by interpreting the data, in O(n) *(verify)*

See also terms like LF Mapping, FM Index

See also:

- B Langmead, 2013, "Introduction to the Burrows-Wheeler Transform and FM Index"

## Boyer-moore and variants

Where a naive substring search would try to look for a pattern at each successive index, BM does pre-analysis of the pattern so that when it doesn't natch, it knows how far ahead it can skip.

This usually saves a bunch of work when

- the pattern is long and/or non-repetitive,such as natural text.

- ...and, due to needing pre-analysis, when you will be using this pattern repeatedly or in a large haystack.

See also:

## Knuth-Morris-Pratt

Pre-analysis that also considers skips when matches fail, but more specifically considers avoiding partial re-matches (by matching right-to-left).

Which helps for simple-alphabet and/or repetitive patterns, e.g. DNA sequences, where it can do better than at least the most classical Boyer-Moore.

See also:

## Z-Algorithm

# Multiple-substring

That is, match on multiple patterns at the same time.

## Aho-Corasick

Does pre-analysis of multiple search patterns to be searched for at once.

https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

## Rabin-Karp

http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_string_search_algorithm==

## Aho–Corasick

https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

## Commentz-Walter

http://www.google.com/search?q=Commentz-Walter

# Approximate-substring

## Bitap / Baeza-Yates–Gonnet

http://en.wikipedia.org/wiki/Bitap_algorithm

# Tree and graph searches

While trees are often introduced as an entirely abstract data structure, a lot of trees have rather specific properties for specific needs.

For example, self-balancing trees avoid becoming overly expensive with a lot of updates.

Some tree-search algorithms explore general tree properties,
and some are tightly bound to the properties of a specific tree.
As such, this section involves as much data structures as search algorithms.

### Binary search tree

### Heaps (and queues)

A heap is a tree that satisfies the **heap property**, basically that a node has a value with is larger-than-or-equal (or less-than-or-equal, depending on purpose) than its parent's.

In other words, it's a tree with orderedness as part of its nature, though exactly how varies.

Due to that nature, they are also *often* the implementation behind the **concept of a (priority) queue**,
(often to make inserts and removals cost low/constant)
and there are some heap implementations specifically for queues.

For a general comparison, see https://en.wikipedia.org/wiki/Heap_(data_structure)

#### Binary heap

https://en.wikipedia.org/wiki/Binary_heap

#### Binomial heap

https://en.wikipedia.org/wiki/Binomial_heap

#### Fibonacci heap

https://en.wikipedia.org/wiki/Fibonacci_heap

#### Leftist heap

https://en.wikipedia.org/wiki/Leftist_tree

#### Pairing heap

https://en.wikipedia.org/wiki/Pairing_heap

#### Skew heap

https://en.wikipedia.org/wiki/Skew_heap

#### Van Emde Boas tree

https://en.wikipedia.org/wiki/Van_Emde_Boas_tree

#### 2-3 heap

https://en.wikipedia.org/wiki/2%E2%80%933_heap

#### B-heap

https://en.wikipedia.org/wiki/B-heap

#### d-ary heap

https://en.wikipedia.org/wiki/D-ary_heap

#### weak heap

https://en.wikipedia.org/wiki/Weak_heap

#### Treap

https://en.wikipedia.org/wiki/Treap

#### (Randomized) Meldable heap

https://en.wikipedia.org/wiki/Randomized_meldable_heap

#### Radix heap / Monotone priority queue

https://en.wikipedia.org/wiki/Monotone_priority_queue

#### Brodal queue

https://en.wikipedia.org/wiki/Brodal_queue

### Tries

#### Suffix tree

(See also BWT; they are in part a logical extension of the suffix tree idea)

#### Radix tree

#### Hash tree

#### Ternary search tree

#### Y-fast trie

### Search concepts

#### Branch and bound

#### A* search

Visit possibilities according to their (heuristically determined) promise, best to worst.

#### Beam search

Like A*, but only checks the most promising paths, not all.

### VP^{sb}-tree

Different from SB-tree.

### SB-tree

Variant of B-tree that is good at sequential fetches.

Different from VP^{sb}-tree.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.9482&rep=rep1&type=pdf

### Top tree

http://en.wikipedia.org/wiki/Top_tree

### LC-RS tree

### Balancing trees

#### T-tree

# Unsorted tree stuff

#### LLRB tree

#### AA tree

#### AVL tree

#### Scapegoat tree

#### Splay tree

#### Bx-tree

#### UB-tree

https://en.wikipedia.org/wiki/UB-tree

#### (a,b)-tree

https://en.wikipedia.org/wiki/(a,b)-tree

#### Dancing tree

Similar to B+ trees, but only rebalance on request.

https://en.wikipedia.org/wiki/Dancing_tree

#### HTree

https://en.wikipedia.org/wiki/HTree

### B-trees and its friends

#### B-tree

#### B+ tree

#### B*-tree

## Specific-purpose trees

Serves *only* a rather specific need.

### MVP tree

(Multi Vantage Point)

Meant for metric-space data.

Allows looking for hits close to a query, according to that metric, so e.g. useful for 'nearby location' sort of things.

Based on a VP tree (which is more clearly just a type of BSP tree), adding storage of distances, and adding multiple vantage points to partition the space at each level (which is computationally expensive, so makes more sense for index-once-lookup-many cases than for rebalancing ones*(verify)*).

Tends to outperform M-trees?

http://en.wikipedia.org/wiki/MVP_Tree

https://code.google.com/p/mvptree-library/ (C implementation)

https://github.com/trolldbois/phash-graph-mvp/blob/master/mvptree.py (Python implementation)

### M-tree

For metric-space data, otherwise similar to R-trees, B-trees.

### M^{+}-tree

For metric-space data. Tries to mix the good parts of M-trees and MVP-trees*(verify)*

### BSP tree

#### Quadtree

#### Octree

#### k-d tree

#### Implicit k-d tree

#### VP tree

Vantage Point trees

http://stevehanov.ca/blog/index.php?id=130

### Merkle tree

http://en.wikipedia.org/wiki/Merkle_tree

# Unsorted

- http://www.cs.ucr.edu/~stelo/oldevents.html
- http://igm.univ-mlv.fr/~lecroq/string/index.html
- http://en.wikipedia.org/wiki/Category:Search_algorithms

Factor Oracle