One word review: **visionary**

This is my review of the book "Zero to One: Notes on Startups, or How to Build the Future" by Peter Thiel. It's quite an ambitious title, but the book delivers on its promise (except for the last two chapters with the weird illustrations). I was excited to read this book primarily because of its author. Peter Thiel is a co-founder of PayPal (which recently separated from eBay and went public) and Palantir, and was an early investor in Facebook. He is an extremely successful venture capitalist; his firm Founders Fund has an impressive portfolio including Stripe and ZocDoc (ZocDoc being one of the most promising health tech startups to come out of New York City in a while).

The main premise of the book is to reinforce the belief that investing in technology (not just computers) will lead to a better future. The book has lots of nuggets of wisdom, but here are my favorites:


- *To go from 0 to 1 is to build something new.* To go from **1 to n** is to **iteratively improve** or build upon existing technology. Going from 0 to 1 will lead to a better future; going from 1 to n turns you into HP or Microsoft.
- *Chance has very little to do with success* (sorry, Malcolm Gladwell): People succeed not necessarily because of their upbringing but because they rise to the occasion, invent, and work hard.
- *The geek shall inherit the world, but the geek must learn to work with others:* Only a classic piece of literature can be made in isolation. To build an industry you need a team.
- *The smaller and more focused the team, the greater the chances of success:* This is something I 100% agree with. The agile software manifesto talks about the same notion.
- *To be weird, peculiar, and unique is good:* Geeks are weird, socially awkward, sometimes plain crazy. They defy conventional wisdom, and this is why they take the less beaten path and succeed. They are less vulnerable to preconceived notions and less likely to repeat other people's mistakes.
- *Everyone sells:* Engineers must learn not to underestimate the importance of sales and delivery. No matter how good your product is, you still need to convince people to use it.
- *Monopoly is a good thing, undifferentiated competition is bad:* A startup must choose a specific user base (PayPal, for example, chose eBay sellers), dominate that space, and then look to expand. Facebook chose Harvard students; Tesla chose the luxury green market.
- *Planned optimism is the key to the future:* Believe in the miracle of technology, but have a plan. A bad plan is better than no plan. To be agile means to adapt, but you still need a plan. It was unplanned optimism that led to the dot-com bubble.
- *Build a valuable company that no one else is building:* A company should add value but must also be **valuable**. This basically means that your company must have a useful product but must also generate revenue. People must be willing to pay for your product.

A framework to evaluate your startup/idea/project:
- Is it innovative? This is the key!
- Is now a good time for this product?
- Will it be a monopoly in a small segment?
- Do you have the right team?
- Do you have Sales and Delivery?
- Will your product last? Or will an existing big player make it worthless? This is a crucial point and reminds me of a bunch of e-commerce startups that failed.
- Do you have a secret? Does your product have a killer feature that no other product in the market has?

Reinforcement learning, per Wikipedia, is an area of machine learning. In short:

- Reinforcement learning maps **situations** to **actions** so as to **maximize** a numerical reward.
- Unlike supervised learning, the learner is not told which actions to take, but must **discover** which actions yield the most reward **by trying** them.
- It is highly useful in cases of significant uncertainty about the environment.

The multi-armed bandit (MAB) problem is best understood through this analogy:

A gambler at a row of slot machines has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a reward from a distribution specific to that machine. The objective is to maximize the sum of rewards earned through a sequence of lever pulls.


- Let's index the arms by *a*; the probability distribution over possible rewards **r** for each arm *a* can be written as *p_a(r)*.
- We have to **find the arm with the largest mean reward** *μ_a = E_a[r]*.
- In practice, the *p_a(r)* are non-stationary.

So how does one come up with an optimal (albeit approximate) strategy to explore and exploit so as to reap maximum rewards? There are two classes of well-known strategies in the literature:

Epsilon Greedy

Select the best lever most of the time, and pull a random lever some of the time (show random ads sometimes, and the best ad most of the time). In this strategy we choose to pull a new lever (explore) with a frequency of *epsilon* (hence the name). Hence *epsilon* is the fraction of times we sample a lever randomly, and *1 − epsilon* is the fraction of times we choose optimally. An epsilon value of 1.0 will cause the algorithm to always explore, while a value of 0.0 will cause it to always exploit the best performing lever.
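As an illustration, here is a minimal simulation of the epsilon-greedy strategy on Bernoulli ("reward / no reward") levers. The reward probabilities `true_probs` are invented for the demo; a real system would observe actual rewards instead.

```python
import random

def epsilon_greedy(epsilon, n_arms, true_probs, n_steps=10000):
    """Simulate an epsilon-greedy bandit on Bernoulli arms.

    `true_probs` (each lever's hidden reward probability) is made up
    for this simulation.
    """
    counts = [0] * n_arms    # pulls per lever
    values = [0.0] * n_arms  # running mean reward per lever
    total_reward = 0.0
    for _ in range(n_steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)   # explore: random lever
        else:
            arm = values.index(max(values))  # exploit: best lever so far
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # incremental update of the running mean
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return counts, values, total_reward

random.seed(0)
counts, values, total = epsilon_greedy(0.1, 3, [0.2, 0.5, 0.8])
# With epsilon = 0.1, the best lever (index 2) receives the vast
# majority of the pulls once its estimate converges.
```

Note how epsilon directly trades exploration against exploitation: most pulls go to the current best estimate, while a fixed fraction keeps sampling the other levers.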


Thompson Sampling

Thompson Sampling is a randomized algorithm based on Bayesian ideas. The first version of this Bayesian heuristic is more than 80 years old, dating to Thompson (1933). It is a member of the family of randomized probability matching algorithms. The basic idea is to *assume a simple prior distribution* on the underlying parameters of the reward distribution of every lever, and at every time step, play a lever according to its *posterior probability* of being the best arm.

In other words, we encode our belief about where the expected reward *μ_a* is for lever *a* in a probability distribution *p(μ_a | data)*. For example, if each lever is an advert, *p(μ_a | data)* could be the probability distribution of the mean CTR of the advert given historical data about the advert. When the reward is binomial (click vs. no click), a Beta-Binomial model is usually a convenient choice.
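A minimal sketch of Beta-Bernoulli Thompson Sampling in that spirit. The CTR values below are invented for the simulation; each arm's posterior is Beta(successes + 1, failures + 1), i.e., starting from a uniform Beta(1, 1) prior.

```python
import random

def thompson_sampling(true_ctrs, n_steps=10000):
    """Beta-Bernoulli Thompson Sampling sketch (`true_ctrs` are invented)."""
    n_arms = len(true_ctrs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(n_steps):
        # Sample a plausible CTR from each arm's Beta posterior ...
        samples = [random.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        # ... and play the arm whose sampled CTR is largest.
        arm = samples.index(max(samples))
        if random.random() < true_ctrs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

random.seed(0)
succ, fail = thompson_sampling([0.02, 0.05, 0.10])
pulls = [s + f for s, f in zip(succ, fail)]
# The 10% CTR advert ends up with the lion's share of the impressions.
```

Because the arm is chosen by sampling from the posterior, exploration happens automatically: arms with wide (uncertain) posteriors still occasionally produce the largest sample.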


Some real-world exploration/exploitation examples:

Restaurant Selection

- Exploitation: Go to your favorite restaurant
- Exploration: Try a new restaurant

Oil Drilling

- Exploitation: Continue using the existing well
- Exploration: Drill at a new location

Online Advertising

- Exploitation: Show the most successful advert
- Exploration: Show a different advert


To recap, the two classes of strategies:

- Greedy: the *best* lever (based on previous trials) is always pulled, except when a (uniformly) random action is taken. A popular example of a greedy strategy is **Epsilon Greedy**.
- Probabilistic Matching: the number of pulls for a given lever should *match* its actual probability of being the optimal lever. Example: **Thompson Sampling**.

Epsilon Greedy


Key Advantages:

- Very easy to implement.
- Will not get stuck in some local optimal state.
- The best performing arm will be used most of the time.

Key Disadvantages:

- How do you pick the value of *epsilon*? This is a tricky problem to solve, and the wrong epsilon value could lead to either zero exploration or too much exploration.

Thompson Sampling


Key Advantages:

- Easy to implement.
- Robust against delayed feedback: Often feedback about rewards is not immediately available. Imagine each lever being a product on sale on an e-commerce site; the reward is sold vs. not sold, and the knowledge that a product was sold may not be immediately available to the bandit system. In this case, Thompson Sampling (being randomized) will keep exploring instead of getting stuck showing an underperforming product.

Key Disadvantages:

- Exploration may cease if the algorithm converges too quickly. Continuing with the previous example, a product that gets a few sales may continue getting more sales (simply because it is shown more often) and may prevent new products from being shown. Thompson Sampling hence needs careful tuning and experimentation.

References

- Kuleshov, Volodymyr, and Doina Precup. "Algorithms for multi-armed bandit problems." *arXiv preprint arXiv:1402.6028* (2014).
- Chapelle, Olivier, and Lihong Li. "An empirical evaluation of Thompson sampling." *Advances in Neural Information Processing Systems*. 2011.
- Agrawal, Shipra, and Navin Goyal. "Analysis of Thompson sampling for the multi-armed bandit problem." *arXiv preprint arXiv:1111.1797* (2011).
- https://en.wikipedia.org/wiki/Multi-armed_bandit

The American education system is unlike that in many other countries. Education is primarily the responsibility of state governments, so there is little standardization in the curriculum. The individual states have great control over what is taught in their schools and over the requirements that a student must meet. There is hence huge variation in courses, subjects, and other activities; it almost always depends on where the school is located.

Building a standardized testing system for K-12 education is no simple task. Wouldn't it be cool if we got rid of pen-and-paper tests across schools? Isn't the basic premise of testing a group of diverse students on the same exam flawed? Each student has their own level of concentration, learning patterns, memory, special interests, natural intuition, etc., and expecting a whole grade of students to reach the same level of curriculum proficiency is unreasonable. Is it fair to compare a student who is weak in Spanish to someone whose native language is Spanish? Or to compare the math scores of a student with a special interest in sports to a student who loves math? We need computer adaptive tests in schools: tests that adapt to each student's unique ability; tests that can be given again and again, that capture not just the final score but the student's progress across the school year; tests that are neither easy nor tough, that will appeal to the math whiz and will be sufficient for the rest.

More importantly, standardization allows schools to compare the performance of their students with students across the country. This is crucial, as it allows principals and administrators to accurately judge the performance of not just students but also their teachers! Teaching is undoubtedly one of the most important jobs in the world, and a standardized adaptive system can only help improve teaching quality. So how can computer adaptive tests bring uniformity when each school district has its own curriculum? This is a complex problem, and teachers, education psychologists, and child psychologists will have to work together on a solution. In the meantime, simply adopting adaptive tests in each school district will be a big step forward.

Technology is making big strides in higher education (Coursera, for example), and it is about time it is used effectively in the K-12 education system. Technology can provide tools that help identify children with reading difficulties, monitor the progress of students suffering from ADHD, challenge bright students to their full potential, and give school managers and authorities a reliable platform to compare and monitor the performance of their teaching staff.

Some novel examples of how technology is helping kids learn are:


- A project that I am personally involved in: FAST. The __Formative Assessment System for Teachers (FAST)__ is a suite of highly efficient assessment tools designed for screening, progress monitoring, and program evaluation as part of a Response to Intervention (RtI) model of service delivery.
- CogCubed creates games and models player behavior. These cognitive games can be used by clinicians, consumers, researchers, and developers.
- storysmart is a new suite of apps that provide both recreational and therapeutic activities for elementary-school-aged children, designed to help them develop social communication, social cognition, critical thinking, and narrative skills.
- Pearson School: core curriculum products and a learning management system.

This article will explore the key differences and tradeoffs between the two at a fairly high level (but with enough technical detail!). It is important to understand that a columnar database is a *physical* concept.

A relational database is a logical concept. A columnar database, or column-store, is a physical concept. Column-oriented databases may be relational or not, just as row-oriented databases may adhere more or less to relational principles.

Let us look at an example:

Here's a representation of a table called **sales**.

This is how the sales table will be stored in a **row-oriented database**:

**On Disk:**

Date       Store   Product   Customer   Price
-------------------------------------------------------------
2015-09-1  store1  product1  customer1  1.0
2015-09-1  store1  product2  customer2  4.0
2015-09-2  store2  product2  customer3  1.0

And the **in-memory** representation (assume each field is 8 bytes for simplicity, and ignore the comma delimiters):

Address0:  2015-09-1,store1,product1,customer1,1.0
Address40: 2015-09-1,store1,product2,customer2,4.0
Address80: 2015-09-2,store2,product2,customer3,1.0

The same table in a **column store**:

**On Disk:**

Date: 2015-09-1, 2015-09-1, 2015-09-2
Store: store1, store1, store2
Product: product1, product2, product2
Customer: customer1, customer2, customer3
Price: 1.0, 4.0, 1.0

**In Memory:**

Address0:  2015-09-1,2015-09-2
Address16: store1,store2
Address32: product1,product2
Address64: customer1,customer2,customer3
Address88: 1.0,4.0,1.0

(Note that adjacent duplicate values appear to be stored only once; this kind of run-length compression is a natural fit for column stores.)

Now let's talk about an important aspect of database performance:

**Locality**: Are the attributes we want to fetch next to each other? For example, imagine I want to sum the total sales (for all stores) on 2015-09-1. In a column store, I only need to look at the memory block from address 88 onwards (plus the date column to filter). The probability of these blocks being in memory (cache) will be fairly high.
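The two layouts can be sketched with plain Python lists (real column stores manage raw memory, compression, and disk pages, of course; this is only illustrative):

```python
# The sales data from the example above.
rows = [
    ("2015-09-1", "store1", "product1", "customer1", 1.0),
    ("2015-09-1", "store1", "product2", "customer2", 4.0),
    ("2015-09-2", "store2", "product2", "customer3", 1.0),
]

# Row store: records are contiguous; columns are interleaved.
row_store = rows

# Column store: each attribute lives in its own contiguous array.
column_store = {
    "date":     [r[0] for r in rows],
    "store":    [r[1] for r in rows],
    "product":  [r[2] for r in rows],
    "customer": [r[3] for r in rows],
    "price":    [r[4] for r in rows],
}

# Total sales on 2015-09-1: the column store touches only two
# contiguous arrays (date and price), not every field of every record.
total = sum(price
            for date, price in zip(column_store["date"],
                                   column_store["price"])
            if date == "2015-09-1")
print(total)
```

In the row store, the same query would have to scan every field of every record just to reach the date and price values.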

**Tradeoffs:**

There are several interesting tradeoffs depending on the access patterns. If data is stored on disk, then a query that needs to access only a single record (i.e., all or some of the attributes of a single row of a table) forces a column-store to seek several times (once per column/file of the table referenced in the query) to read just that single record. However, if a query needs to access many records, then large swaths of entire columns can be read, amortizing the seeks to the different columns.

In a conventional row-store, in contrast, if a query needs to access a single record, only one seek is needed as the whole record is stored contiguously, and the overhead of reading all the attributes of the record (rather than just the relevant attributes requested by the current query) will be negligible relative to the seek time. However, as more and more records are accessed, the transfer time begins to dominate the seek time, and a column-oriented approach begins to perform better than a row-oriented approach.**For this reason, column-stores are typically used in analytic applications, with queries that scan a large fraction of individual tables and compute aggregates or other statistics over them**.


To summarize:

- Column-store systems completely vertically partition a database into a collection of individual columns.
- Each column is stored separately on disk.


Mobile application developers have plenty of things to worry about. As mobile devices become more powerful, end users expect more out of them. For every PC/Mac application, they expect an equivalent mobile app (and it had better look good). Mobile applications have evolved from simple games (remember Snake?) and utility apps (everything from a handy flashlight to a calculator to an alarm) to full-fledged applications that do everything from helping you manage your money, taking and editing photographs, video chatting, browsing the internet, recommending music, and keeping you healthy to helping you find a date. As the gap between PCs and mobile devices decreases, mobile app developers face unique challenges at every stage of development. They have to worry about different platforms, the choice between native and web, different screen sizes, different frameworks, complex backend processing, backend algorithms, and more.

The complexity of developing mobile apps is compounded by the difficulty of back-end processing. Mobile backend-as-a-service (BaaS) providers help relieve app developers of some of these responsibilities. This area is now the new venue for the ecosystem wars, with both established companies and several startups competing for developer attention. Typical BaaS features include:


- **D**ata: Store app data on the cloud
- **P**ush Notifications: Send notifications to a user's phone
- **S**ocial Media Integration: Log in through Facebook, Twitter, Live
- **C**ustom Code Deployment: A very important feature. Deploy your own custom logic to the cloud
- **H**osting: Most mobile apps need a landing website
- **T**hird-Party Data Integration: Again, a very critical feature

Surprisingly, Microsoft and Google are relatively new players, and with Facebook's acquisition of Parse, things are heating up. Kinvey, StackMob, and Parse appear to be the most mature providers, with the most features. I will update this post soon with my personal experience using Parse and Azure.

- Flexibility: If your main criterion is flexibility (you have a large team of developers, for example), then it's best to go for Windows Azure or the Google App Engine (GAE). They put a lot of emphasis on custom code. With Azure and GAE, you set up code that runs on every request and can fire off additional requests, read more data, send push notifications, and control the request as a whole. They offer the capability of running your own backend on a cloud instance (at a price). Hence, larger, higher-end enterprise customers will pay more for additional flexibility/protection.
- Ease of Use: If you're short of time, money, and other resources, Parse or Kinvey is the way to go. They have mature APIs and offer a lot of off-the-shelf features that can get you started in no time. Both of them now provide the ability to run custom code, but you will not have access to the raw instance as with Azure or GAE.

As an avid user of Reddit, I was inspired to study the network structure of Reddit for a social network analysis class.

Reddit is a social news website driven by user content. Reddit is a data-rich website with a complex network structure. Reddit comment threads may trail for more than two weeks and one single post can easily exceed 1000 comments, which are mainly replies to other comments rather than direct responses to the original posts. There is hence an implicit relationship based on shared interests between the comments, and between the comment and the post, which can be used to construct a social network.

**Step 1: Data Collection**

Currently no large Reddit datasets exist. Thus a key part of this project was to crawl Reddit and retrieve the content. There are three parts to crawling Reddit:

1. Get list of subreddits

2. Get list of submissions for each subreddit

3. Get content and comments for each submission

Reddit’s threaded comment system provides the most interesting data. Each comment contains the user, a timestamp, a list of replies in the form of comments, and the score (upvotes minus downvotes) for that individual comment. *At the end of the crawling phase, I had collected the profiles of 229,254 users, with 696,165 comments from 6,864 posts.*

The crawler was written using Python and the PRAW library. PRAW, an acronym for “*Python Reddit API Wrapper*”, is a python package that allows for simple access to Reddit’s API. PRAW aims to be as easy to use as possible and is designed to follow all of reddit’s API rules.
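Rather than reproduce the full PRAW crawler, here is a sketch of the recursive comment-tree walk at its core. The `Comment` class below is a stand-in for the objects PRAW returns (the real crawler fetched them through Reddit's API); the field names and sample thread are illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Comment:
    """Stand-in for a PRAW comment: author, score, and threaded replies."""
    author: str
    score: int
    replies: List["Comment"] = field(default_factory=list)

def walk(comment: Comment, depth: int = 0):
    """Yield (depth, author, score) for a comment and all nested replies."""
    yield depth, comment.author, comment.score
    for reply in comment.replies:
        yield from walk(reply, depth + 1)

# A tiny hand-built thread: one top-level comment with a nested reply chain.
thread = Comment("alice", 42, [
    Comment("bob", 7, [Comment("alice", 3)]),
    Comment("carol", 1),
])

flat = list(walk(thread))
max_depth = max(d for d, _, _ in flat)
print(max_depth)  # 2
```

The same traversal, applied to every top-level comment of a submission, yields the nesting depths used in the radial trees and depth distributions below.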

**Step 2: Visualizing the Discussion Structure: Radial Tree to the Rescue**

Even though Reddit has a convenient interface for participating in discussions, the ability to examine the structure of the comments from the comment list is very limited. However, a special tree representation (as shown in [Gomez et al.]) of the post comments provides a convenient way to visualize the discussion structure. The central node is defined as the post itself. Any comments made directly on this post are attached as children in a radial pattern around this central node. Nested comments are attached similarly to their parent until the entire comment tree is visible. This tree structure grows outward from the central node as the discussion takes place over time. The figure below shows a fairly popular post that received over 750 comments represented with the radial tree structure. The next figure shows a post with a similar number of comments; however, this post has very few discussions reaching deeper nesting levels, and almost all of the comments appear as direct replies to the post.


The structure of the trees is highly heterogeneous. For some posts, the tree reaches a high depth with very few comments due to intense discussion among a few users. In other posts, there are hundreds of comments in the first two nesting levels and very few outside of that. Sometimes the majority of the discussion actually happens in one of the child threads and the tree has a skewed appearance.

Next, we plot the distribution of comment depths for all posts. As you can see, the majority of comments are made in the first few nesting levels, however, we can see that there is a substantial amount of discussion going on in the deeper levels.


What is Reddit without controversy? But how exactly do we capture "controversy"? Reddit allows you to sort posts in several ways: “Top” lists the most popular posts by number of comments, “Hot” lists recently popular posts, and “Controversial” lists posts that attract large numbers of both upvotes and downvotes.

There are many ways to define how controversial a post is. Perhaps the simplest is the total number of comments: posts that triggered a lot of discussion are probably controversial. However, while a post might have lots of comments, there may not be much reciprocal discussion. The maximum depth a post reaches therefore seems like a better measure. This too has problems, however: if two users get engaged in a long discussion, the corresponding post will be considered highly controversial even if all the other comments appear in the first few nesting levels. We would like a method that accounts for both types of bias. The measure we will use is an adapted version of the **h-index**.

We used the following adapted version of h-index: The h-index of a post is the deepest nesting level of the radial tree with at least h comments.

When ranking posts by h-index, we need a way to break ties because many posts have the same h-index values. I used the same method as [Gomez et al.], which is to prioritize posts that reached higher h-index values with fewer comments.
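The adapted h-index can be sketched in a few lines, under the reading that it is the deepest level ℓ containing at least ℓ comments (the exact counting convention follows [Gomez et al.]; this is one reasonable interpretation, and the sample depths are invented):

```python
from collections import Counter

def post_h_index(comment_depths):
    """Adapted h-index of a post: the deepest nesting level h that
    contains at least h comments. `comment_depths` is a flat list of
    the nesting level (1 = direct reply to the post) of every comment.
    """
    per_level = Counter(comment_depths)
    h = 0
    for level in sorted(per_level):
        if per_level[level] >= level:
            h = max(h, level)
    return h

# 10 direct replies, 4 comments at level 2, 3 at level 3, 1 at level 4:
depths = [1] * 10 + [2] * 4 + [3] * 3 + [4]
print(post_h_index(depths))  # 3
```

A post with hundreds of shallow comments and a post carried by one long two-person thread both score low, which is exactly the bias correction we wanted.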

What follows next in this post is an interesting project I did last semester. The key idea was to use user data from Foursquare to recommend venues that a user is likely to visit.

This conclusion could help a data engineer who does not have the time or resources to build sophisticated models. Examples of sophisticated models include those that combine a user's taste (through ratings and reviews), income/economic segment, and social network influence (information from Facebook) to come up with a high-accuracy prediction. An easier alternative is to simply analyze the geographical location of the venue and the visitors frequenting nearby venues. So we need to complete the prediction puzzle:

In probabilistic terms, we need to find p(go | like, close), i.e., the probability that a user will go to a venue given that they like it and that it is close to the places they usually visit.

This simple probabilistic model will give a reasonably good estimate for finding the people most likely to visit a particular venue of interest. So how do we find P(like) and P(close)?

- Compute a user’s center of mass
- Center of mass = average location over all check-ins
- Probability of traveling a certain distance to a venue = (number of check-ins made within the venue's radius) / (total check-ins)
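The steps above can be sketched directly. The check-in coordinates, venue, and radius below are invented for illustration, and the distance formula is an equirectangular approximation (adequate at city scale):

```python
import math

def center_of_mass(checkins):
    """Average (lat, lon) over all of a user's check-ins."""
    lats = [lat for lat, lon in checkins]
    lons = [lon for lat, lon in checkins]
    return sum(lats) / len(lats), sum(lons) / len(lons)

def dist_km(a, b):
    """Equirectangular approximation of the distance between two points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371 * math.hypot(x, y)  # Earth radius in km

def p_close(checkins, venue, radius_km):
    """P(close): fraction of check-ins made within `radius_km` of the venue."""
    within = sum(1 for c in checkins if dist_km(c, venue) <= radius_km)
    return within / len(checkins)

# Hypothetical user who mostly checks in around lower Manhattan.
user = [(40.73, -73.99), (40.74, -73.98), (40.72, -74.00), (40.80, -73.95)]
venue = (40.735, -73.985)
print(center_of_mass(user))
print(p_close(user, venue, radius_km=3.0))  # 3 of 4 check-ins nearby -> 0.75
```

P(like) comes from the ratings side of the model; combining the two gives the p(go | like, close) estimate described above.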

How do we compute P(like) from Foursquare ratings and check-in data?

If you really think about it, this problem can be formulated in simple "recommender systems" terms, i.e., how to recommend venues (items) to people (users). So we can run state-of-the-art recommender-system algorithms on the ratings and check-in data.

In the future, one can augment this model by adding time, age/income, influence of friends, reviews, etc., to build an ensemble method that gives the best accuracy. However, location is a good place to start. I have attached my project slides with this email, and perhaps one day I will break the individual components down and explain the model in detail.

The fundamental string searching (matching) problem is defined as follows: given two strings - a text and a pattern, determine whether the pattern appears in the text. The problem is also known as "the needle in a haystack problem."

The idea is straightforward -- for every position in the text, consider it a starting position of the pattern and see if you get a match.

The naive method exhibits a worst-case time complexity of O(n*m) because we potentially compare each element of the text with every element of the pattern. In other words, the naive method generates EVERY possible substring of the text and compares it with the pattern.
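A straightforward sketch of the naive method (my own illustration, not from the post): try every starting position in the text and compare character by character.

```java
public class NaiveSearch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        for (int i = 0; i + m <= n; i++) {           // every candidate start position
            int j = 0;
            while (j < m && text.charAt(i + j) == pattern.charAt(j)) j++;
            if (j == m) return i;                     // full match found at position i
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("needle in a haystack", "hay")); // 12
        System.out.println(search("aaaa", "ab"));                  // -1
    }
}
```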

**Rabin-Karp Algorithm (RK)**

The **Rabin–Karp algorithm** is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987. The Rabin–Karp algorithm focuses on speeding up the generation of substrings of the text and their comparison to the pattern with the help of a **hash function**.

The method behind the RK algorithm is:

Let the pattern be P (of length L) and the text be T (of length n).

- Hash P to get h(P). [This takes O(L) time]
- Iterate through all length-L substrings of T, hashing each substring and comparing it to h(P). [This takes O(n*L)]
- If a substring's hash value does match h(P), do a character-by-character comparison of that substring and P, stopping if they do match and continuing if they do not. [O(L)]

In other words, the RK algorithm simply hashes EVERY possible substring of the text and compares it with the hash of the pattern. At this point you must be wondering how this is any better than the naive implementation. But as we shall see shortly, the RK algorithm improves its running time by using a **rolling hash**. To understand what a rolling hash is, we first need to know what a good hashing function looks like. Ideally:

- It should be easy to compare two hash values. For example, if the range of the hash function is a set of sufficiently small nonnegative integers, then two hash values can be compared with a single machine instruction.
- The number of false positives induced by the hash function should be similar to that achieved by a "random" function. If the range of the hash function is of size m, we'd like each hash value to be achieved by approximately the same number of L-symbol strings (where L is the length of the pattern).
- It should be easy (e.g., a constant number of machine instructions) to compute h(S_(i+1)) given h(S_i).

*The Choice of Hash Function*

**What if we hash each string to the sum of the ASCII values of its characters?**
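As a quick sanity check on why this simple hash is problematic (my own illustrative snippet, not from the post): summing character codes makes every anagram collide, since addition ignores character positions.

```java
public class AsciiSumHash {
    // Hash a string to the sum of its character (ASCII) values.
    static int hash(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hash("abc")); // 294
        System.out.println(hash("cba")); // 294 -- collides with "abc"
    }
}
```

This is why Rabin-Karp instead weights each character by a power of the base b, so that position matters.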

Let us take a step back from strings and walk through the three steps above by considering integer arrays.

Let the pattern P and the text T be:

P = [9,0,2,1,0]

T = [4,8,9,0,2,1,0]

The length-5 substrings of T would be:

S0 = [4,8,9,0,2]

S1 = [8,9,0,2,1]

S2 = [9,0,2,1,0]

For each of these substrings, our hash function generates a hash value (an integer). Let the size of the hash table be *m*. Our hash function will be:

h(S_i) = (the digits of S_i concatenated into a 5-digit number) mod m



In other words, we will take the length-5 array of integers, concatenate the integers into a 5-digit number, and then take that number mod m. (We take mod m so that the 5-digit number maps into the range 0 to m-1. Remember, the hash value generated is used as an index into the hash table of size m.)

Now h(P) is 90210 mod m

h(S0) is **48902** mod m

h(S1) is **89021** mod m

Do you see the relationship between h(S0) and h(S1)? In fact, **we can generate h(S1) from h(S0)**! We start with 48902, remove the first digit (subtract 4 × 10000) to get 8902, multiply by 10 to get 89020, and then add the next digit to get 89021. That is, h(S1) = (h(S0) − 4·10^4)·10 + 1 (mod m).


We can now imagine a window sliding over all the substrings of the text. Calculating the hash value of the next substring only touches two elements: the element leaving the window and the element entering the window. Finding the hash value of the next substring is now an *O(1)* operation.
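The remove-leading-digit / shift / add-trailing-digit update from the example above can be checked numerically (a tiny illustrative snippet of my own; the mod m step is omitted here for clarity):

```java
public class RollingDemo {
    public static void main(String[] args) {
        int b = 10, L = 5;
        int bPow = (int) Math.pow(b, L - 1); // 10^4 = 10000

        int h = 48902;                 // h(S0), before taking mod m
        int leaving = 4, entering = 1; // element leaving / entering the window
        int next = (h - leaving * bPow) * b + entering;
        System.out.println(next);      // 89021 = h(S1), before taking mod m
    }
}
```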

In this numerical example, we looked at single-digit integers and set our base **b = 10** so that we can interpret the arithmetic more easily. **To generalize to another base b** and substrings of length L, our hash function is:

h(S_i) = ( s_i·b^(L-1) + s_(i+1)·b^(L-2) + … + s_(i+L-1) ) mod m


and the **formula to calculate the next hash** would be:

h(S_(i+1)) = ( (h(S_i) − s_i·b^(L-1)) · b + s_(i+L) ) mod m

Since strings can be interpreted as an array of integers, we can apply the same method we used on numbers to the initial problem, improving the runtime. The algorithm steps are now:

- Hash P to get h(P). [O(L)]
- Hash the first length-L substring of T. [O(L)]
- Use the rolling hash method to calculate the hashes of the remaining O(n) substrings, comparing each hash value to h(P). [This is O(n)]
- If a substring hash value does match h(P), do a string comparison of that substring and P, stopping if they do match and continuing if they do not. [O(L)] (Why? Because of collisions! We still need to check whether the strings match exactly, even though their hash values are the same.)

I was lucky to find this excellent discussion on StackOverflow about the value of m and the "nature of math" and strongly suggest that you read it:

http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus

Let us now summarize our constants:

- Let us choose b to be 256 (a power of 2).
- m should be a prime number. We will generate this prime number using the class BigInteger (java.math).
- We will *precompute b^(L-1) mod m* (again, check the formula above). Instead of repeatedly computing b^(L-1) while generating the rolling hash values, we precompute it once.
- So let *b^(L-1) mod m* be **R**.

We can now proceed with our algorithm:

```
patternHash = computePatternSignature(pattern)
// Optimization: compute b^(L-1) mod m just once. So R = b^(L-1) mod m
textHash = compute signature of text[0]...text[L-1]   // hash of the first substring of the text
textCursor = 0
while textCursor != end of text
    if textHash == patternHash                        // potential match
        if exact_match(pattern, text, textCursor)     // match found
            return textCursor
        endif
        // different strings with the same signature, so continue the search
    endif
    textCursor = textCursor + 1
    // use O(1) computation to compute the next signature:
    textHash = compute signature of text[textCursor]...text[textCursor + L - 1]
endwhile
return -1
```

Code: You may find the complete Java implementation at:

https://github.com/sarveshsaran/RabinKarp

This is a screenshot of my Samsung Galaxy S3's home screen. I have (over a period of time) carefully curated my most recently used apps and placed them on the home screen. This saves me a LOT of time, allowing me to quickly launch applications that I have recently accessed. Now wouldn't it be cool if Samsung's TouchWiz did this automatically? Imagine getting rid of all the frustration of searching for an app you accessed only an hour ago.

The Android home screen is a good use case for a very important data structure. In order to keep the most recently used apps on the home screen, we need a data structure that keeps track of such applications and automatically removes the least recently used one when space runs out: an LRU (least recently used) cache.

Let us now think about how we can implement an LRU cache. Our cache must support the following operations:

- Keep the most recently used apps at the front of the list.
- When the user opens an app not in the list, add this app to the front of the list.
- If the list is full, remove the least recently used app from the list.

One way to do this would be to use a doubly linked list (to store the apps) and a HashMap that stores the appID as the key and a reference to the node in the doubly linked list as the value. This allows us to quickly check whether an app (with a particular appID) is present in the list (making lookup an O(1) operation). If it is not present, adding a new node at the head of the list is an O(1) operation. If the list is full, removing and updating the tail of the list is again O(1). Using a doubly linked list also allows us to remove a node from the list and promote it to the head in constant time (hence a singly linked list is not a good choice).

Fortunately for us, java.util provides a data structure that can be used as an LRU cache: the LinkedHashMap (http://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashMap.html). The LinkedHashMap provides a special constructor to create a hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently (access-order).

The LinkedHashMap requires that the removeEldestEntry method be overridden to tell the map when it should remove its eldest (least recently accessed) entry.

In the implementation below, we pass capacity + 1 to the superclass constructor because a LinkedHashMap first adds the new node before deleting the least recently used one.
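A minimal sketch of such an LRU cache; this mirrors the description above but is not necessarily the author's exact listing:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache built on LinkedHashMap. accessOrder = true makes iteration run from
// least- to most-recently accessed; removeEldestEntry evicts once we exceed capacity.
public class LRUCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LRUCache(int capacity) {
        // capacity + 1: the map adds the new entry before removing the eldest one.
        super(capacity + 1, 1.0f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }

    public static void main(String[] args) {
        LRUCache<String, Integer> apps = new LRUCache<>(3);
        apps.put("maps", 1);
        apps.put("camera", 2);
        apps.put("kindle", 3);
        apps.get("maps");        // touch "maps": it becomes most recently used
        apps.put("whatsapp", 4); // evicts "camera", the least recently used
        System.out.println(apps.keySet()); // [kindle, maps, whatsapp]
    }
}
```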

(The home-screen apps in the screenshot: newsstand, viber, whatsapp, maps, amazon, linkedin, youtube, outlook, kindle, facebook, camera, keep, …)

Here the white nodes are those not yet marked as visited, the gray nodes are those marked as visited and still in the frontier, and the black nodes are visited nodes no longer in the frontier. Rather than keeping a visited flag, we can keep track of a node's distance in the field *v.distance*. When a new node is discovered, its distance is set to one greater than that of its predecessor v.

Basically, when the frontier is a first-in, first-out (FIFO) queue, we get breadth-first search. All the nodes on the queue have minimum path lengths within one of each other. In general, there is a set of nodes to be popped off at some distance *k* from the source, and another set of elements, later on the queue, at distance *k+1*.

Here's a small example. In the graph above, our root is S.

1. At the beginning, color all the vertices white.
2. Initialize an empty queue Q.
3. Add the node S to the frontier: color it gray and enqueue it.
4. Remove (dequeue) S and mark it black.
5. Mark all its white neighbors gray and add them to Q.
6. The rest of the algorithm simply repeats the above until Q is empty.

Let us assume that the input graph G is stored with an adjacency list.

- Coloring all vertices white (at the beginning of BFS) takes O(|V |) time, where V is the set of vertices in G.
- Then, every edge in E (the set of edges in G) is processed at most twice.
- Therefore, the total running time is O(|V | + |E|).
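Putting the steps together, here is a minimal Java sketch of BFS over an adjacency list (my own example graph, not the one pictured in the post). Distances double as colors: -1 is white, and a node turns gray when enqueued and black when dequeued.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class BFS {
    // Returns each node's distance from the source (-1 if unreachable).
    static int[] bfs(List<List<Integer>> adj, int source) {
        int[] distance = new int[adj.size()];
        Arrays.fill(distance, -1);          // -1 == white (unvisited)
        Deque<Integer> frontier = new ArrayDeque<>();
        distance[source] = 0;               // source turns gray
        frontier.add(source);
        while (!frontier.isEmpty()) {
            int v = frontier.poll();        // v turns black
            for (int w : adj.get(v)) {
                if (distance[w] == -1) {    // white neighbor: mark it gray
                    distance[w] = distance[v] + 1;
                    frontier.add(w);
                }
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        // Example graph: 0 -> 1, 0 -> 2, 1 -> 3
        List<List<Integer>> adj = List.of(List.of(1, 2), List.of(3), List.of(), List.of());
        System.out.println(Arrays.toString(bfs(adj, 0))); // [0, 1, 1, 2]
    }
}
```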

You can think of DFS as a person walking through the graph following arrows and never visiting a node twice except when backtracking, when a dead end is reached. The diagram below shows the DFS traversal of a graph starting from node A.

Let us assume that the input graph G is stored with an adjacency list.

- There can be at most |V| calls to DFS_visit
- Then, every edge in E (the set of edges in G) is processed at most twice.
- Therefore, the total running time is O(|V| + |E|), the same as BFS.
- The sequence of calls to DFS forms a tree. For the graph above the tree is:

(The figure here showed the DFS call tree for the graph above: A at the root, with B, C, D, and E beneath it.)

- So the DFS algorithm maintains an amount of state that is proportional to the length of the current path from the root. On a balanced binary tree, DFS maintains state proportional to the height of the tree, or O(log |V|).
- In BFS, the amount of state (the queue size) corresponds to the size of the perimeter of nodes at distance *k* from the starting node. In both algorithms the amount of state can be O(|V|) in the worst case.

Note:

If we want to search the whole graph, then a single recursive traversal may not suffice. If we had started a traversal with node C, we would miss all the rest of the nodes in the graph. To do a depth-first search of an entire graph, we start a traversal from one node and, when it completes, start again from any node that is still unvisited, repeating until every node has been visited.
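A sketch of such a whole-graph DFS (illustrative code of my own, not the linked GraphDFS.java): the outer loop restarts the recursion from every still-white node, so even a disconnected graph is fully covered.

```java
import java.util.ArrayList;
import java.util.List;

public class DFS {
    static void dfsVisit(List<List<Integer>> adj, boolean[] visited,
                         int v, List<Integer> order) {
        visited[v] = true;
        order.add(v);                         // record the visit (preorder)
        for (int w : adj.get(v))
            if (!visited[w]) dfsVisit(adj, visited, w, order);
    }

    // DFS over the whole graph: restart from every unvisited node.
    static List<Integer> dfsAll(List<List<Integer>> adj) {
        boolean[] visited = new boolean[adj.size()];
        List<Integer> order = new ArrayList<>();
        for (int v = 0; v < adj.size(); v++)
            if (!visited[v]) dfsVisit(adj, visited, v, order);
        return order;
    }

    public static void main(String[] args) {
        // Two disconnected components: 0 -> 1 and 2 -> 3
        List<List<Integer>> adj = List.of(List.of(1), List.of(), List.of(3), List.of());
        System.out.println(dfsAll(adj)); // [0, 1, 2, 3]
    }
}
```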

A graph can be stored either as a matrix or a list of nodes. The correct choice depends on the problem.

- **An adjacency matrix** uses O(n*n) memory, where n is the number of nodes. It has fast lookups to check the presence or absence of a specific edge, but it is slow to iterate over all edges.
- **Adjacency lists** use memory in proportion to the number of edges, __which can save a lot of memory when the adjacency matrix would be sparse__. They are fast to iterate over all edges, but checking the presence or absence of a specific edge is slightly slower than with the matrix.

**Topological Sort**

One of the most useful algorithms on graphs is topological sort, in which the nodes of an acyclic graph are placed in an order consistent with the edges of the graph. This is useful when you need to order a set of elements. For example, suppose you have a set of tasks to perform, but some tasks have to be done before other tasks can start. In what order should you perform the tasks? This problem can be solved by representing the tasks as nodes in a graph, where there is an edge from task 1 to task 2 if task 1 must be done before task 2.

A key observation in topological sorting is that a node finishes (is marked black) only after all of its descendants have been marked black.

The algorithm for Topological sort is similar to DFS.

- We perform a depth-first search over the entire graph, starting anew with an unvisited node if previous starting nodes did not visit every node.
- As each node is finished (colored black), put it on the head of an initially empty list.
- This ensures that a node that is marked black later appears nearer the head of the list.
- This clearly takes time linear in the size of the graph: O(|V| + |E|).
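The finish-order idea above can be sketched as follows (illustrative code of my own, not the linked implementation): each node is prepended to the output list as it finishes, so every node ends up before its descendants.

```java
import java.util.LinkedList;
import java.util.List;

public class TopoSort {
    static void visit(List<List<Integer>> adj, boolean[] visited,
                      int v, LinkedList<Integer> order) {
        visited[v] = true;
        for (int w : adj.get(v))
            if (!visited[w]) visit(adj, visited, w, order);
        order.addFirst(v); // v finishes only after all its descendants
    }

    // DFS over the whole graph, prepending each node as it finishes.
    static List<Integer> topoSort(List<List<Integer>> adj) {
        boolean[] visited = new boolean[adj.size()];
        LinkedList<Integer> order = new LinkedList<>();
        for (int v = 0; v < adj.size(); v++)
            if (!visited[v]) visit(adj, visited, v, order);
        return order;
    }

    public static void main(String[] args) {
        // Task 0 before 1 and 2; tasks 1 and 2 before 3
        List<List<Integer>> adj = List.of(List.of(1, 2), List.of(3), List.of(3), List.of());
        System.out.println(topoSort(adj)); // [0, 2, 1, 3]
    }
}
```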

We can use the idea of topological sorting to detect a cycle in a graph.

To detect cycles in graphs, therefore, we choose an arbitrary white node and run DFS. If that completes and there are still white nodes left over, we choose another white node arbitrarily and repeat. Eventually all nodes are colored black. If at any time we follow an edge to a gray node, there is a cycle in the graph. Therefore, cycles can be detected in O(|V| + |E|) time.

Java Code:

In the example code below, the sample graph used is relatively dense, and hence I use an adjacency matrix to represent the graph. To quickly look up a vertex, I enforce the following naming/lookup convention:

```
a b c d e f g h i j k
0 1 2 3 4 5 6 7 8 9 10
```

so the vertex 'a' is stored at index 0 in the array of vertices. An edge between vertex 'a' and 'b' would hence be an edge between 0 and 1.

A Vertex hence has a label/value and a state (initially white).

```java
public int[][] matrix;
public Vertex[] vertices;
```

We store the list of vertices in an array and the graph is stored in an adjacency matrix.

You can find the full source code at:

https://github.com/sarveshsaran/ProgrammingSnippets/blob/master/GraphDFS.java
