Substring Matching : Rabin–Karp algorithm

3/1/2014

To check whether a given string is a substring of another string is a very common programming operation. However, I was not able to find a succinct, complete description of some of the algorithms behind substring searching. Hopefully, readers will find this post and the associated code useful!

The fundamental string searching (matching) problem is defined as follows: given two strings - a text and a pattern, determine whether the pattern appears in the text. The problem is also known as "the needle in a haystack problem."

The Naive Method
The idea is straightforward -- for every position in the text, consider it a starting position of the pattern and see if you get a match.

The naive method exhibits a worst case time complexity of O(n*m) because we potentially compare each element of the text with every element of the pattern. In other words, the naive method generates EVERY possible substring of the text and compares it with pattern.

Rabin-Karp Algorithm (RK)
Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987. The Rabin–Karp algorithm focuses on speeding up the generation of a substring derived from text and its comparison to the pattern with the help of Hash Function.

The method behind the RK algorithm is :
Let the Pattern be P (of length L) and the text be T (of length n).

Hash P to get h(P) [This takes O(L) time]
Iterate through all length L substrings of T, hashing those substrings and comparing to h(P) [ This takes O(n*L) ]
If a substring hash value does match h(P), do a string comparison on that substring and P, stopping if they do match and continuing if they do not. [ O(L) ]

In other words, the RK algorithm simply hashes EVERY possible substring on the text and compares it with the hash of the pattern. At this point you must be wondering how this is any better than the naive implementation. But as we shall see shortly, the RK algorithm improves its run time by using a rolling hash. To understand what a rolling hash is, we first need to know what a typical hashing function would look like.

The Choice of Hash Function

It should be easy to compare two hash values. For example, if the range of the hash function is a set of suﬃciently small nonnegative integers, then two hash values can be compared with a single machine instruction
The number of false positives induced by the hash function should be similar to that achieved by a “random” function. If the range of the hash function is of size m, we’d like each hash value to be achieved by approximately the same number of L-symbol strings (where L is the length of the pattern)
It should be easy (e.g., a constant number of machine instructions) to compute h(Si+1) given h(Si)

What if we hash each string to the sum of the ASCII values of its characters?

Let us take a step back from string and walk through the 3 step above by considering integer arrays.
Let the pattern P and the text T be:
P = [9,0,2,1,0]
T=[4,8,9,0,

The length 5 substrings of T would be:
S0 = [9,8,7,1,2]
S1 = [8,7,1,2,3]
S2 = [7,1,2,3,4]
.......
For each of these substring, our hash function to generate a hash value (an integer). Let the size of the Hash Table be m. Our hash function will be:

In other words, we will take the length 5 array of integers and concatenate the integers into
a 5 digit number, then take the number mod m. (we take mod m so that we can narrow down the 5 digit number into a number in the range of 0 - m. Remember the hash number generated is used as an index into the hash table of size m).

Now h(P) is 90210 mod m
h(S0) is 48902 mod m
h(S1) is 89021 mod m

Do you see the relationship between h(S0) and h(S1) ? In fact, we can generate h(S1) by using h(S0)! We start with 48902, remove the first digit to get 8902, multiply by 10 to get 89020, and then add the next digit to get 89021. i.e

We can now imagine a window sliding over all the substrings in S. Calculating the hash value of the next substring only touches 2 elements: the element leaving the window and the element
entering the window. Finding the hash value of the next substring is now a O(1) operation.
In this numerical example, we looked at single digit integers and set our base b = 10 so that
we can interpret the arithmetic easier. To generalize for other base b and substrings of length L, our hash function is:

and the formula to calculate the next Hash would be:

Back to Strings
Since strings can be interpreted as an array of integers, we can apply the same method we used on numbers to the initial problem, improving the runtime. The algorithm steps are now:

Hash P to get h(P) [ O(L) ]
Hash the first substring of S of length L [O(L)]
Use the rolling hash method to calculate the subsequent O(n) substrings in S, comparing the hash values to h(P) [This is O(n) ]
If a substring hash value does match h(P), do a string comparison on that substring and P, stopping if they do match and continuing if they do not. [O(L)] (Why? because of Collisions! We still need to check if the strings match exactly, even though their hash values are same.)

This speeds up the algorithm and as long as the total time spent doing string comparison is O(n), then the whole algorithm is also O(n). We can run into problems if we expect O(n) collisions in our hash table, since then we spend O(nL) in step 4. Thus we have to ensure that our table size is O(n) (where n is the length of the text) so that we expect O(1) total collisions and only have to go to step 4 O(1) times. In this case, we will spend O(L) time in step 4, which still keeps the whole running time at O(n).

Now we need to choose the value of the base b (see the formula highlighted in the black box). b is often chosen to be a power of 2 so that b^L-1 (b power L-1) can be computed fast (example: b=16 or b=32). But what happens if b^L-1 becomes too large? This can easily cause an integer overflow. Which is why we need mod m.

What about m? What should be the size of the Hash table? m is often a prime number.
I was lucky to find this excellent discussion on StackOverflow about the value of m and the "nature of math" and strongly suggest that you read it:
http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus

Let us now summarize our constants:

Let us choose b to be 256 ( a power of 2)
m should be a prime number. We will generate this prime number using the Class BigInteger (java.math)
We will precompute b^L-1 mod m (again check the formula in the highlighted box). Instead of repeatedly computing b^L-1 to generate the rolling hash values, we should rather precompute this.
So let b^L-1 mod m be R

We can now proceed with our algorithm:
Algorithm: RabinKarpPatternSearch (pattern, text)
Input: pattern array of length L, text array of length n

patternHash = computePatternSignature (pattern);
Optimization: compute b^L-1 mod m just once. So R = b^L-1 mod m
texthash = compute signature of text[0]...text[L-1] (i.e compute the hash of the first substring of the text).
textCursor = 0;
while textCursor != end of text
if textHash = patternHash // Potential match.
if exact_match (pattern, text, textCursor) // Match found
return textCursor
endif
// Different strings with same signature, so continue search.
endif
textCursor = textCursor + 1
//Use O(1) computation to compute next signature:
textHash = compute signature of text[textCursor],...,text[textCursor+L-1]
endwhile
return -1

Output: position in text if pattern is found, -1 if not found.

Code: You may find the complete Java implementation at:
https://github.com/sarveshsaran/RabinKarp

0 Comments

Substring Matching : Rabin–Karp algorithm

Leave a Reply.

Sarvesh Saran

Archives

Categories