String Matching

First of all we should know what string (A string is a sequence of characters.) matching is. Simply the definition for the String matching is the problem of finding occurrence(s) of a pattern string within another string or body of text.

This is also known as string searching, text searching.There are many different algorithms for efficient searching. (Let m be the length of the pattern, n be the length of the searching text and k = |Σ| be the size of the alphabet).

Brute force string search (naive string search).

An algorithm to find a string within another string or body of text by trying each position one at a time. There are many far faster string matching algorithm.

n = length(T);   m = length(P);
    for i = 0 to n-m do
     j       0
    while (j < m and T[i+j] = p[j]) do
          j         j+1
          if j = m then return i
    return “ no substring matching“
- Matching time - Θ(nm)
- Space - None

Boyer-Moore Algorithm.

A string matching algorithm that compares characters from the end of the pattern to its beginning. When characters don't match, searching jumps to the next possible match: the farthest of a table like that used in the Knuth-Morris-Pratt algorithm and the next matching position in the pattern.
- ```
T = ...IN THE UNTITLED WORKS BY
P =             EDITED
```
```
 
```
  Matching from the back, we first encounter a mismatch at "L". Because "L" does not occur in P, the shift can be increased by 4 without missing a potential match. In general, the mismatched character could be in P. In this case the shift can only be increased to the point where that character occurs in P. Here is a simplified version of Boyer-Moore using this bad-character heuristic.

Preprocessing time - Θ(m + k)
Matching time - best Ω(n/m) , worst O(mn)
Space - Θ(k)

Karp-Rabin (Rabin-Karp) Algorithm.

Rabin-Karp – the idea Compare a string's hash values, rather than the strings themselves. For efficiency, the hash value of the next position in the text is easily computed from the hash value of the current position.

- Preprocessing time - Θ(m)
- Matching time - average Θ(n + m) , worst Θ((n−m)m)
- Space - O(1)

Knuth-Morris-Pratt algorithm.

A string matching algorithm that turns the search string into a finite state machine, then runs the machine with the string to be searched as the input string.

kmpMatch(T,P)
n = length(T);
m = length(P);
q = 0;            % no.of chars matched
 a= prefixFn(P)        % precompute a
for i = 1 to n
    while q > 0 and P[q+1] <> T[i]
                    then q = a[q]
    if P[q+1] = T[i] then q++
    if q = m then print “valid shift at“ i-m
    q = a[q]    % goto next match

Preprocessing time - Θ(m).
Matching time - Θ(n).
Space - Θ(m).

Bitap algorithm

- A string matching algorithm which keeps an array of bits, R, showing if prefixes of the pattern don't match at the current place. Before searching, mismatch arrays are computed for each character in the alphabet and saved in an array, S. For the next position, with the character c, R = shift(R) or S[c]. If the last bit of R is 0, the pattern matches.