Category: Programming

Substring Matching : Rabin–Karp algorithm

3/1/2014

Checking whether a given string is a substring of another string is a very common programming operation. However, I was not able to find a succinct, complete description of some of the algorithms behind substring searching. Hopefully, readers will find this post and the associated code useful!

The fundamental string searching (matching) problem is defined as follows: given two strings - a text and a pattern, determine whether the pattern appears in the text. The problem is also known as "the needle in a haystack problem."

The Naive Method
The idea is straightforward -- for every position in the text, consider it a starting position of the pattern and see if you get a match.

The naive method exhibits a worst case time complexity of O(n*m), where n is the length of the text and m is the length of the pattern, because we potentially compare each element of the text with every element of the pattern. In other words, the naive method generates EVERY possible substring of the text that has the pattern's length and compares it with the pattern.
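As a minimal sketch of the naive method described above (the class and method names are illustrative, not taken from the post's repository), every start position is tried in turn and characters are compared until a mismatch:

```java
final class NaiveSearch {
    /** Returns the first index where pattern occurs in text, or -1 if it does not occur. */
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        // Try every position where a full-length window still fits in the text.
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            // Compare character by character until a mismatch or the end of the pattern.
            while (j < m && text.charAt(start + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return start; // all m characters matched
            }
        }
        return -1;
    }
}
```

The two nested loops are where the O(n*m) worst case comes from, for example with a text of many 'a' characters and a pattern that ends in a single 'b'.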

Rabin-Karp Algorithm (RK)
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987. It focuses on speeding up the comparison of each substring of the text with the pattern by using a hash function.

The method behind the RK algorithm is:
Let the pattern be P (of length L) and the text be T (of length n).

1. Hash P to get h(P). [This takes O(L) time]
2. Iterate through all length L substrings of T, hashing those substrings and comparing to h(P). [This takes O(n*L)]
3. If a substring hash value does match h(P), do a string comparison on that substring and P, stopping if they do match and continuing if they do not. [O(L)]

In other words, the RK algorithm simply hashes EVERY possible substring on the text and compares it with the hash of the pattern. At this point you must be wondering how this is any better than the naive implementation. But as we shall see shortly, the RK algorithm improves its run time by using a rolling hash. To understand what a rolling hash is, we first need to know what a typical hashing function would look like.

The Choice of Hash Function

It should be easy to compare two hash values. For example, if the range of the hash function is a set of sufficiently small nonnegative integers, then two hash values can be compared with a single machine instruction.
The number of false positives induced by the hash function should be similar to that achieved by a “random” function. If the range of the hash function is of size m, we’d like each hash value to be achieved by approximately the same number of L-symbol strings (where L is the length of the pattern).
It should be easy (e.g., a constant number of machine instructions) to compute h(S_{i+1}) given h(S_i).

What if we hash each string to the sum of the ASCII values of its characters?

Let us take a step back from strings and walk through the 3 steps above by considering integer arrays.
Let the pattern P and the text T be:
P = [9,0,2,1,0]
T=[4,8,9,0,

The length 5 substrings of T would be:
S0 = [4,8,9,0,2]
S1 = [8,9,0,2,1]
S2 = [9,0,2,1,...]
.......
For each of these substrings, our hash function generates a hash value (an integer). Let the size of the hash table be m. Our hash function will be:

h(S) = (S[0]*10^4 + S[1]*10^3 + S[2]*10^2 + S[3]*10 + S[4]) mod m

In other words, we will take the length 5 array of integers and concatenate the integers into a 5 digit number, then take the number mod m. (We take mod m so that we can narrow down the 5 digit number into a number in the range 0 to m-1. Remember that the hash value generated is used as an index into the hash table of size m.)

Now h(P) is 90210 mod m
h(S0) is 48902 mod m
h(S1) is 89021 mod m

Do you see the relationship between h(S0) and h(S1)? In fact, we can generate h(S1) by using h(S0)! We start with 48902, remove the first digit to get 8902, multiply by 10 to get 89020, and then add the next digit to get 89021. That is:

89021 = (48902 - 4*10^4) * 10 + 1

We can now imagine a window sliding over all the substrings in T. Calculating the hash value of the next substring only touches 2 elements: the element leaving the window and the element entering the window. Finding the hash value of the next substring is now an O(1) operation.
In this numerical example, we looked at single digit integers and set our base b = 10 so that the arithmetic is easier to follow. To generalize to another base b and substrings of length L, our hash function is:

h(S_i) = (T[i]*b^(L-1) + T[i+1]*b^(L-2) + ... + T[i+L-1]*b^0) mod m

and the formula to calculate the next hash would be:

h(S_{i+1}) = ((h(S_i) - T[i]*b^(L-1)) * b + T[i+L]) mod m
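To make the sliding-window update concrete, here is a small sketch of a rolling hash over an integer array. The class name, method names and parameter names are my own; the arithmetic follows the remove-first-digit, multiply-by-b, add-next-digit description above, with every intermediate value reduced mod m.

```java
final class RollingHashDemo {
    /** Hash of t[start .. start+L-1] read as a base-b number, reduced mod m (Horner's rule). */
    static long hash(int[] t, int start, int L, int b, long m) {
        long h = 0;
        for (int i = start; i < start + L; i++) {
            h = (h * b + t[i]) % m;
        }
        return h;
    }

    /**
     * O(1) update from one window to the next.
     * highPower must be b^(L-1) mod m; leaving is the element that exits the
     * window and entering is the element that joins it.
     */
    static long roll(long h, int leaving, int entering, int b, long highPower, long m) {
        // Remove the leading element; adding m first keeps the value non-negative.
        long withoutFirst = (h + m - (leaving * highPower) % m) % m;
        // Shift everything one place to the left and append the new element.
        return (withoutFirst * b + entering) % m;
    }
}
```

With the digits 4,8,9,0,2,1 (the ones implied by the hashes 48902 and 89021), b = 10, L = 5 and a modulus larger than 99999, hash over the first window gives 48902 and a single roll with leaving = 4 and entering = 1 gives 89021.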

Back to Strings
Since strings can be interpreted as an array of integers, we can apply the same method we used on numbers to the initial problem, improving the runtime. The algorithm steps are now:

1. Hash P to get h(P). [O(L)]
2. Hash the first substring of T of length L. [O(L)]
3. Use the rolling hash method to calculate the hashes of the subsequent O(n) substrings in T, comparing each hash value to h(P). [This is O(n)]
4. If a substring hash value does match h(P), do a string comparison on that substring and P, stopping if they do match and continuing if they do not. [O(L)] (Why? Because of collisions! We still need to check if the strings match exactly, even though their hash values are the same.)

This speeds up the algorithm and as long as the total time spent doing string comparison is O(n), then the whole algorithm is also O(n). We can run into problems if we expect O(n) collisions in our hash table, since then we spend O(nL) in step 4. Thus we have to ensure that our table size is O(n) (where n is the length of the text) so that we expect O(1) total collisions and only have to go to step 4 O(1) times. In this case, we will spend O(L) time in step 4, which still keeps the whole running time at O(n).

Now we need to choose the value of the base b (see the formula for the next hash above). b is often chosen to be a power of 2 so that b^(L-1) (b to the power L-1) can be computed fast (example: b=16 or b=32). But what happens if b^(L-1) becomes too large? This can easily cause an integer overflow, which is why we need mod m.

What about m? What should be the size of the Hash table? m is often a prime number.
I was lucky to find this excellent discussion on StackOverflow about the value of m and the "nature of math" and strongly suggest that you read it:
http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus

Let us now summarize our constants:

Let us choose b to be 256 (a power of 2).
m should be a prime number. We will generate this prime number using the class BigInteger (java.math).
We will precompute b^(L-1) mod m (again, check the formula for the next hash). Instead of repeatedly computing b^(L-1) to generate the rolling hash values, we compute it once.
So let R = b^(L-1) mod m.

We can now proceed with our algorithm:
Algorithm: RabinKarpPatternSearch (pattern, text)
Input: pattern array of length L, text array of length n

    patternHash = computePatternSignature(pattern)
    // Optimization: compute b^(L-1) mod m just once.
    R = b^(L-1) mod m
    // Compute the hash of the first substring of the text.
    textHash = signature of text[0],...,text[L-1]
    textCursor = 0
    while textCursor <= n - L
        if textHash = patternHash                      // Potential match.
            if exact_match(pattern, text, textCursor)  // Match found.
                return textCursor
            endif
            // Different strings with same signature, so continue search.
        endif
        textCursor = textCursor + 1
        if textCursor <= n - L
            // Use O(1) computation (rolling hash with R) to compute next signature:
            textHash = signature of text[textCursor],...,text[textCursor+L-1]
        endif
    endwhile
    return -1

Output: position in text if pattern is found, -1 if not found.
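The pseudocode above could be sketched in Java roughly as follows. This is not the implementation from the repository linked below; the class name, the method name and the choice of a random 31-bit prime are assumptions made for illustration. The base b = 256, the BigInteger-generated prime m and the precomputed R = b^(L-1) mod m follow the constants summarized earlier.

```java
import java.math.BigInteger;
import java.util.Random;

final class RabinKarpSketch {
    private static final int B = 256; // base b, as chosen in the post

    /** Returns the first index of pattern in text, or -1 if it is not found. */
    static int search(String pattern, String text) {
        int L = pattern.length();
        int n = text.length();
        if (L == 0) return 0;
        if (L > n) return -1;

        // Assumption: a random 31-bit probable prime, so all products below fit in a long.
        long m = BigInteger.probablePrime(31, new Random()).longValue();

        // R = b^(L-1) mod m, computed once.
        long R = 1;
        for (int i = 1; i < L; i++) R = (R * B) % m;

        long patternHash = hash(pattern, L, m);
        long textHash = hash(text, L, m);

        for (int i = 0; i + L <= n; i++) {
            // Equal hashes are only a potential match; confirm with a real comparison.
            if (textHash == patternHash && text.regionMatches(i, pattern, 0, L)) {
                return i;
            }
            if (i + L < n) {
                // Rolling step: drop text[i], shift by b, append text[i+L].
                textHash = (textHash + m - (R * text.charAt(i)) % m) % m;
                textHash = (textHash * B + text.charAt(i + L)) % m;
            }
        }
        return -1;
    }

    /** Hash of the first L characters of s in base B, reduced mod m. */
    private static long hash(String s, int L, long m) {
        long h = 0;
        for (int i = 0; i < L; i++) h = (h * B + s.charAt(i)) % m;
        return h;
    }
}
```

Because every hash hit is verified with regionMatches, the result does not depend on which prime is drawn; the prime only affects how many false positives have to be checked.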

Code: You may find the complete Java implementation at:
https://github.com/sarveshsaran/RabinKarp


An Application of LRU (Least Recently Used) Data Structure: Your phone's home screen

2/28/2014


This is a screenshot of my Samsung Galaxy S3's home screen. I have (over a period of time) carefully curated my most recently used apps and placed them on the home screen. This saves me a LOT of time, allowing me to quickly launch applications that I have recently accessed. Now wouldn't it be cool if Samsung's TouchWiz did this automatically? Imagine getting rid of all the frustration of searching for an app you accessed only an hour ago.

The Android home screen is a good use case for a very important data structure. In order to keep the most recently used apps on the home screen, we will need a data structure that keeps track of such applications and automatically removes the Least Recently Used (LRU) app. Our home screen has 12 slots, so the size of our LRU cache would be 12. (Do not confuse this with a CPU cache, which needs guarantees of consistency and efficiency. I'm using the term "cache" to simply mean a list of our most recently used applications.)

Let us now think about how we can implement an LRU cache. Our cache must support the following operations:

Keep the most recently used apps at the front of the list.
When the user opens an app not in the list, add this app to the front of the list.
If the list is full, remove the least recently used app from the list.

One way to do this would be to use a Doubly Linked List (to store the apps) and a HashMap that stores the appID as the key and a reference to the node in the Doubly Linked List as the value. This would allow us to quickly check whether an app (with a particular appID) is present in the list or not (making lookup an O(1) operation). If it is not present, adding a new node at the head of the list is an O(1) operation. If the list is full, removing and updating the tail of the list is again O(1). Using a doubly linked list also allows us to remove a node from the middle of the list and promote it to the head in constant time (a singly linked list would have to walk the list to find the node's predecessor, which is why it is not a good idea).
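A minimal sketch of that doubly linked list plus HashMap design might look like the following. The class name AppLru, the method names and the use of sentinel head and tail nodes are illustrative choices, not something prescribed by the post.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AppLru {
    private static final class Node {
        final String appId;
        Node prev, next;
        Node(String appId) { this.appId = appId; }
    }

    private final int capacity;
    private final Map<String, Node> index = new HashMap<>();
    // Sentinel nodes avoid null checks at both ends of the list.
    private final Node head = new Node(null);
    private final Node tail = new Node(null);

    AppLru(int capacity) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be >= 1");
        this.capacity = capacity;
        head.next = tail;
        tail.prev = head;
    }

    /** Records that appId was just opened, evicting the least recently used app if full. */
    void open(String appId) {
        Node node = index.get(appId);            // O(1) lookup
        if (node != null) {
            unlink(node);                        // O(1) because the node knows both neighbors
        } else {
            if (index.size() == capacity) {
                Node lru = tail.prev;            // least recently used app
                unlink(lru);
                index.remove(lru.appId);
            }
            node = new Node(appId);
            index.put(appId, node);
        }
        linkAfterHead(node);                     // most recently used goes to the front
    }

    /** Apps ordered from most to least recently used. */
    List<String> apps() {
        List<String> out = new ArrayList<>();
        for (Node n = head.next; n != tail; n = n.next) out.add(n.appId);
        return out;
    }

    private void unlink(Node n) {
        n.prev.next = n.next;
        n.next.prev = n.prev;
    }

    private void linkAfterHead(Node n) {
        n.next = head.next;
        n.prev = head;
        head.next.prev = n;
        head.next = n;
    }
}
```

For example, with a capacity of 2, opening maps, keep, maps and then camera leaves the list as [camera, maps]: keep was the least recently used app when camera arrived.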

Fortunately for us, java.util provides a data structure that can be used as an LRU cache: the LinkedHashMap (http://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashMap.html). The LinkedHashMap provides a special constructor to create a hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently (access-order).
The removeEldestEntry(Map.Entry) method of LinkedHashMap can be overridden to impose a policy for removing stale mappings automatically when new mappings are added to the map. Here's a simple implementation of an LRU data structure using a LinkedHashMap:

In the implementation below, we pass capacity+1 to the super class because a LinkedHashMap first adds the new node and only then deletes the least recently used node.
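A minimal sketch of such a class, assuming a generic key/value cache, is shown here. The class name LruCache is illustrative, and this is not necessarily the listing that produced the output shown in the post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    LruCache(int capacity) {
        // capacity + 1: the map briefly holds one extra entry before the eldest is removed.
        // The third argument (true) selects access order instead of insertion order.
        super(capacity + 1, 1.0f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put/putAll after an insertion; returning true evicts the eldest entry.
        return size() > capacity;
    }
}
```

With a capacity of 2, putting "a" and "b", reading "a" with get, and then putting "c" evicts "b", so iterating over keySet() yields a and then c (least recently accessed first).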

OUTPUT:
newsstand, viber, whatsapp, maps, amazon, linkedin, youtube, outlook, kindle, facebook, camera, keep,


Breadth-first search and Depth-first search

1/23/2014


Breadth-first search (BFS) is a graph traversal algorithm that explores nodes in the order of their distance from the roots. Here distance is defined as the minimum path length from a root to the node. Its pseudo-code would look something like:

Here the white nodes are those not marked as visited, the gray nodes are those marked as visited and that are in frontier, and the black nodes are visited nodes no longer in frontier. Rather than having a visited flag, we can keep track of a node's distance in the field v.distance. When a new node is discovered, its distance is set to be one greater than its predecessor v.
Basically, when frontier is a first-in, first-out (FIFO) queue, we get breadth-first search. All the nodes on the queue have a minimum path length within one of each other. In general, there is a set of nodes to be popped off, at some distance k from the source, and another set of elements, later on the queue, at distance k+1.
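As a sketch of the idea, here is one way BFS with a FIFO frontier and a distance array could be written in Java. The representation (vertices numbered 0..n-1, adjacency lists as List<List<Integer>>) and the names are assumptions for illustration; a distance of -1 stands in for the color white.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

final class BfsSketch {
    /** Returns dist[v] = minimum number of edges from root to v, or -1 if v is unreachable. */
    static int[] distances(List<List<Integer>> adj, int root) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);                  // -1 plays the role of "white"
        Queue<Integer> frontier = new ArrayDeque<>();
        dist[root] = 0;                         // the root turns gray
        frontier.add(root);
        while (!frontier.isEmpty()) {
            int v = frontier.remove();          // v leaves the frontier: it is now black
            for (int w : adj.get(v)) {
                if (dist[w] == -1) {            // white neighbor discovered
                    dist[w] = dist[v] + 1;      // one further than its predecessor v
                    frontier.add(w);            // w turns gray
                }
            }
        }
        return dist;
    }
}
```

Because the queue is first-in, first-out, every vertex at distance k is removed before any vertex at distance k+1, which is what makes the recorded distances minimal.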

Here's a small example. In the graph above, our root is S.

1. At the beginning, color all the vertices white.
2. Initialize an empty queue Q (the frontier).
3. Add the node S to the frontier. Color it gray.
4. Remove S from the frontier. Mark it as black.
5. Mark all its white neighbors (a, b, c) as gray and add them to the frontier.
6. The rest of the algorithm simply repeats steps 4 and 5 for the node at the front of Q until Q is empty.

Time Analysis
Let us assume that the input graph G is stored with an adjacency list.

Coloring all vertices white (at the beginning of BFS) takes O(|V|) time, where V is the set of vertices in G.
Then, every edge in E (the set of edges in G) is processed at most twice.
Therefore, the total running time is O(|V| + |E|).

Depth-first search (DFS) : What if we were to replace the FIFO queue with a LIFO stack? In that case we get a completely different order of traversal, namely DFS. With a stack, the search will proceed from a given node as far as it can before backtracking and considering other nodes on the stack.

You can think of DFS as a person walking through the graph following arrows and never visiting a node twice except when backtracking, when a dead end is reached. The diagram below shows the DFS traversal of a graph starting from node A.
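The switch from a FIFO queue to a LIFO stack can be sketched as below. The names and the integer vertex numbering are illustrative. One detail differs from simply swapping the container in BFS: a vertex is marked visited when it is popped rather than when it is pushed, so that the visiting order matches a true depth-first walk.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class DfsSketch {
    /** Returns the vertices reachable from root in the order DFS first visits them. */
    static List<Integer> visitOrder(List<List<Integer>> adj, int root) {
        boolean[] visited = new boolean[adj.size()];
        List<Integer> order = new ArrayList<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            int v = stack.pop();
            if (visited[v]) continue;           // a vertex can be pushed more than once
            visited[v] = true;
            order.add(v);
            List<Integer> next = adj.get(v);
            // Push in reverse so that the first-listed neighbor is explored first.
            for (int i = next.size() - 1; i >= 0; i--) {
                if (!visited[next.get(i)]) stack.push(next.get(i));
            }
        }
        return order;
    }
}
```

With A..E numbered 0..4 and edges A->B, A->C, B->D, D->E, visitOrder from A returns [0, 1, 3, 4, 2]: the search runs down A, B, D, E before backtracking to C.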

Time Analysis
Let us assume that the input graph G is stored with an adjacency list.

There can be at most |V| calls to DFS_visit.
Then, every edge in E (the set of edges in G) is processed at most twice.
Therefore, the total running time is O(|V| + |E|), the same as BFS.
The sequence of calls to DFS forms a tree. For the graph above the tree is:

    A
   / \
  B   C
  |
  D
  |
  E

So the DFS algorithm maintains an amount of state that is proportional to the length of the current path from the root. On a balanced binary tree, DFS maintains state proportional to the height of the tree, or O(log |V|).
In BFS, the amount of state (the queue size) corresponds to the size of the perimeter of nodes at distance k from the starting node. In both algorithms the amount of state can be O(|V|) in the worst case.

Note:
If we want to search the whole graph, then a single recursive traversal may not suffice. If we had started a traversal with node C, we would miss all the rest of the nodes in the graph. To do a depth-first search of an entire graph, we call DFS on an arbitrary unvisited node, and repeat until every node has been visited.

Graph Representation : Adjacency List Vs Adjacency Matrix

A graph can be stored either as a matrix or a list of nodes. The correct choice depends on the problem.

An adjacency matrix uses O(n*n) memory, where n is the number of nodes.
It has fast lookups to check for the presence or absence of a specific edge, but it is slow to iterate over all edges.
Adjacency lists use memory in proportion to the number of nodes plus the number of edges, which can save a lot of memory if the graph is sparse. It is fast to iterate over all edges, but checking for the presence or absence of a specific edge is slightly slower than with the matrix.

Topological Sort
One of the most useful algorithms on graphs is topological sort, in which the nodes of an acyclic graph are placed in an order consistent with the edges of the graph. This is useful when you need to order a set of elements, for example, suppose you have a set of tasks to perform, but some tasks have to be done before other tasks can start. In what order should you perform the tasks? This problem can be solved by representing the tasks as nodes in a graph, where there is an edge from task 1 to task 2 if task 1 must be done before task 2. Then a topological sort of the graph will give an ordering in which task 1 precedes task 2. Obviously, to topologically sort a graph, it cannot have cycles.

A key observation in Topological Sorting is that a node finishes (is marked black) after all of its descendants have been marked black. Therefore, a node that is marked black later must come earlier when topologically sorted. For example, in the traversal example above, nodes are marked black in the order C, E, D, B, A. Reversing this, we get the ordering A, B, D, E, C. This is a topological sort of the graph. Interestingly enough, a postorder traversal generates nodes in the reverse of a topological sort.

Algorithm:
The algorithm for Topological sort is similar to DFS.

We perform a depth-first search over the entire graph, starting anew with an unvisited node if previous starting nodes did not visit every node.
As each node is finished (colored black), put it on the head of an initially empty list.
This ensures that a node that is marked black later, appears at the head of the list.
This clearly takes time linear in the size of the graph: O(|V| + |E|).
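The steps above can be sketched in Java as follows, assuming an acyclic graph given as adjacency lists over vertices 0..n-1 (the class and method names are illustrative). Each vertex is pushed onto the head of the result as it finishes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class TopoSortSketch {
    /** Returns the vertices in a topological order. Assumes the graph has no cycles. */
    static List<Integer> sort(List<List<Integer>> adj) {
        boolean[] visited = new boolean[adj.size()];
        Deque<Integer> result = new ArrayDeque<>();
        // Start anew from every vertex that earlier traversals did not reach.
        for (int v = 0; v < adj.size(); v++) {
            if (!visited[v]) visit(adj, v, visited, result);
        }
        return new ArrayList<>(result);
    }

    private static void visit(List<List<Integer>> adj, int v,
                              boolean[] visited, Deque<Integer> result) {
        visited[v] = true;
        for (int w : adj.get(v)) {
            if (!visited[w]) visit(adj, w, visited, result);
        }
        // v is finished (black): all of its descendants are already in the list,
        // so placing v at the head puts it before all of them.
        result.addFirst(v);
    }
}
```

With A..E numbered 0..4 and adjacency lists A: [C, B], B: [D], D: [E], the vertices finish in the order C, E, D, B, A and sort returns [0, 1, 3, 4, 2], that is A, B, D, E, C. The recursion depth can reach |V| on a long chain, so a very deep graph may need an explicit stack instead.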

Detecting Cycle
We can use the idea of Topological sorting to detect a cycle in a graph. Since a node finishes after its descendants, a cycle involves a gray node pointing to one of its gray ancestors that hasn't finished yet. If one of a node's successors is gray, there must be a cycle.

To detect cycles in graphs, therefore, we choose an arbitrary white node and run DFS. If that completes and there are still white nodes left over, we choose another white node arbitrarily and repeat. Eventually all nodes are colored black. If at any time we follow an edge to a gray node, there is a cycle in the graph. Therefore, cycles can be detected in O(|V| + |E|) time.
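A minimal sketch of this check for a directed graph, using the three colors explicitly (the names and the integer vertex numbering are illustrative):

```java
import java.util.Arrays;
import java.util.List;

final class CycleDetector {
    private enum Color { WHITE, GRAY, BLACK }

    /** Returns true if the directed graph contains a cycle. */
    static boolean hasCycle(List<List<Integer>> adj) {
        Color[] color = new Color[adj.size()];
        Arrays.fill(color, Color.WHITE);
        // Restart from any vertex that is still white after earlier traversals.
        for (int v = 0; v < adj.size(); v++) {
            if (color[v] == Color.WHITE && visit(adj, v, color)) return true;
        }
        return false;
    }

    private static boolean visit(List<List<Integer>> adj, int v, Color[] color) {
        color[v] = Color.GRAY;                   // v is on the current DFS path
        for (int w : adj.get(v)) {
            if (color[w] == Color.GRAY) return true;   // edge back to an unfinished ancestor
            if (color[w] == Color.WHITE && visit(adj, w, color)) return true;
        }
        color[v] = Color.BLACK;                  // v and all of its descendants are finished
        return false;
    }
}
```

An edge to a black node is harmless: that node finished earlier, so it cannot lead back to the current path. Only an edge to a gray node closes a cycle.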

Java Code:
In the example code below, the sample graph used is relatively dense, and hence I use an adjacency matrix to represent the graph. To quickly look up a vertex, I enforce the following naming/lookup convention:
a b c d e f g h i j k
0 1 2 3 4 5 6 7 8 9 10
so the vertex 'a' is stored at index 0 in the array of vertices. An edge between vertex 'a' and 'b' would hence be an edge between 0 and 1.

A Vertex hence has a label/value and a state (initially white).
public int[][] matrix;
public Vertex[] vertices;

We store the list of vertices in an array and the graph is stored in an adjacency matrix.
You can find the full source code at:
https://github.com/sarveshsaran/ProgrammingSnippets/blob/master/GraphDFS.java


Graph traversals

1/20/2014


In the first post of a series of posts, I would like to introduce some elementary graph traversal techniques and discuss the abstract tricolor algorithm. While there are several resources on graph traversals on the internet, most resources either get bogged down with terminology or go straight to the code.
Topics

tricolor algorithm
breadth-first search
depth-first search
cycle detection
topological sort
connected components

Tricolor Algorithm
Abstractly, graph traversal algorithms can be expressed in terms of the tricolor algorithm. In the tricolor algorithm each node of the graph is assigned a color that changes over time:

White nodes are undiscovered nodes that have not yet been seen in the current traversal and may even be unreachable.
Black nodes are nodes that are reachable and that the algorithm is done with.
Gray nodes are nodes that have been discovered but that the algorithm is not done with yet. They are at the frontier between the white and black nodes.

A typical graph traversal algorithm starts with no black nodes and the root is gray. As the algorithm proceeds, white nodes turn into gray nodes and gray nodes convert to black ones. Eventually there are no gray nodes left and the algorithm is done. The tricolor algorithm maintains a key invariant at all times: there are no edges from white nodes to black nodes. This is clearly true at the beginning (root is gray,all others are white) and it is also true at the end because we know that any remaining white nodes cannot be reached from the black nodes (a node turns black only when it has been fully explored).

The Algorithm

Initially color all nodes white (unexplored).
Color the root gray.
while some gray node x exists
    color some white successors of x gray
    if x has no white successors left, color it black

This algorithm is abstract. It is up to the implementation of a particular specialized traversal algorithm to :

Decide which gray node(s) x to start with
Decide which neighbors of a gray node to color gray
Delay coloring a node black.

One key advantage in defining graph search/traversal in terms of the tricolor algorithm is that the tricolor algorithm works even when gray nodes are worked on concurrently, as long as the black-white invariant is maintained.


Substring Matching : Using Suffix Trees

12/27/2013


In the last post I discussed the Rabin–Karp algorithm for substring matching. While this algorithm exhibits a best case complexity of O(n) [where n is the size of the text], using hash functions for string matching has certain limitations. A hash function relies on us knowing the exact length of the pattern. What if you don’t know how long the pattern is going to be? In cases where the pattern varies, using a suffix tree makes a lot of sense.

Suffix trees are much faster when the text is fixed and known first while the patterns vary.
Time Complexity: O(m) to process the text once (m = text size), then only O(n) for each new pattern (n = pattern length).

In the rest of the post, I am going to describe what a suffix tree is, how to construct and query it.

Prefixes & Suffixes
For a string S:

Prefix of S: substring of S beginning at the first position of S
Suffix of S: substring that ends at its last position.

Example:
S=AACTAG

Prefixes: AACTAG,AACTA,AACT,AAC,AA,A
Suffixes: AACTAG,ACTAG,CTAG,TAG,AG,G

P is a substring of S if P is a prefix of some suffix of S. Suffix Trees exploit this property of strings to solve the substring pattern matching problem.

Suffix Tree

A suffix tree ST for an m-character string S is a rooted directed tree with exactly m
leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S.

The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to the leaf i exactly spells out the suffix of S that starts at position i.

Does a suffix tree always exist? Not necessarily.
If one suffix Sj of S matches a prefix of another suffix Si of S, then the path for Sj would not end at a leaf.
Example:
S = xabxa
S1 = xabxa and S4 = xa. In this case, the suffix S4 is also a prefix of suffix S1.

How to avoid this problem?

Assume that the last character of S appears nowhere else in S.
Add a new character $ not in the alphabet to the end of S.

Suffix Tree for S= xabxa$

How to Construct a Suffix Tree (Naive Method)

Let us first build a suffix tree the naive way.

Time Complexity: Naive method - O(m^2) (m = text size)
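To give a feel for the quadratic cost, here is a sketch of a simpler relative of the structure described above: an uncompressed suffix trie, in which every edge carries a single character instead of a substring. It is not the compressed suffix tree itself, and the class and method names are illustrative. It assumes '$' does not occur in the text.

```java
import java.util.HashMap;
import java.util.Map;

final class SuffixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    /** Inserts every suffix of text + "$" character by character: O(m^2) steps in total. */
    SuffixTrieSketch(String text) {
        String s = text + "$";                   // terminator, so every suffix ends at a leaf
        for (int i = 0; i < s.length(); i++) {
            Node node = root;
            for (int j = i; j < s.length(); j++) {
                node = node.children.computeIfAbsent(s.charAt(j), c -> new Node());
            }
        }
    }

    /** True if pattern is a prefix of some suffix, i.e. a substring of the text. */
    boolean contains(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) {
            node = node.children.get(pattern.charAt(i));
            if (node == null) return false;
        }
        return true;
    }
}
```

For the text xabxa, contains("bxa") and contains("abx") are true while contains("xb") is false, and each query takes time proportional to the pattern length. A real suffix tree merges chains of single-child nodes into one edge labeled with a substring, which keeps the number of nodes linear in m.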

CODE:
Here's an excellent implementation of the suffix tree described above: http://en.literateprograms.org/Suffix_tree_(Java) and
http://en.literateprograms.org/Special:DownloadCode/Suffix_tree_(Java)

In the next post, I will discuss Ukkonen’s linear-time construction that takes time O(m) to build a suffix tree.

REFERENCES:

http://www.stanford.edu/~mjkay/gusfield.pdf
http://www.cs.ucf.edu/~shzhang/Combio11/lec3.pdf
http://www.cs.duke.edu/courses/fall12/compsci260/resources/suffix.trees.in.detail.pdf


Real programmers don't need comments. We can read code!

5/14/2013


So I found myself staring at 300 lines of indecipherable code. No comments, no pointers as to what is going on... and all I had to do was fix a small bug. I've always found fixing bugs fun and crucial. If you don't fix the small bugs then you cannot move on to that cool feature you wanted to add (that will change the face of the company and promote you to CEO). After hours of scratching my head, running the debugger, staring at the screen, playing ping pong, more staring at the screen, the code was still indecipherable. So I caught hold of my colleague and asked if he knew who wrote this code, and all I got was... "Why do you need him anyway? Surely you can read code!".

Well... yes. I can read code and so can any computer science graduate. But fixing code should not be like cracking a Cold War era encrypted message. Code needs comments. No matter how smart your team is, no matter if the guy who wrote the code is NEVER leaving the company. Comments make life easy and save time.

And no...this does not count:

return 1; // returns 1

Stack Overflow on the best and worst of comments and one of my favorites from Dilbert.


Sarvesh Saran