What follows in this post is an interesting project I did last semester. The key idea was to use user data from Foursquare to recommend individuals to venues. The primary conclusion I reached after trying several recommender models was: geographic distance is as important as the user’s taste in all venue categories. In simpler terms, this means that the likelihood of a person visiting a restaurant depends a LOT on the geographic location of the restaurant and not on the user’s taste alone! So how could this conclusion help a data engineer come up with an answer to the venue recommendation problem?
This conclusion could help a data engineer who does not have the time or resources to build sophisticated models. Examples of sophisticated models include ones that combine a user’s taste (through ratings and reviews), income/economic segment, and social network influence (information from Facebook) to come up with a high-accuracy prediction. An easier alternative is to simply analyze the geographical location of the venue and the visitors frequenting nearby venues. So we need to complete the prediction puzzle:
Whereabouts + Preferences = ?
In probabilistic terms, we need to find P(go | like, close), i.e. given the probability that a user likes a venue and the probability that he is willing to travel a certain distance, what is the probability that he will visit the venue?
This simple probabilistic model gives a reasonably good estimate of which people are most likely to visit a particular venue of interest. So how do we find P(like) and P(close)?
- For P(close), compute the user’s center of mass
- Center of mass = average location over all of the user’s check-ins
- Probability of traveling a certain distance for a venue = (no. of check-ins made within the venue radius) / (total no. of check-ins)
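The two quantities in the list above can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions, not the exact project code: check-ins are plain (latitude, longitude) pairs, the "venue radius" is read as a fixed distance around the venue (the `radius_km` parameter is hypothetical), and distances use the haversine formula. The function names are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def center_of_mass(checkins):
    """Average location over all of a user's check-ins.

    A plain mean of latitudes and longitudes; a reasonable approximation
    when the check-ins sit within one city (assumption).
    """
    if not checkins:
        raise ValueError("user has no check-ins")
    n = len(checkins)
    return (sum(lat for lat, _ in checkins) / n,
            sum(lon for _, lon in checkins) / n)


def p_close(checkins, venue, radius_km):
    """Fraction of the user's check-ins made within radius_km of the venue."""
    if not checkins:
        raise ValueError("user has no check-ins")
    near = sum(1 for c in checkins if haversine_km(c, venue) <= radius_km)
    return near / len(checkins)


# Toy data: three check-ins at one spot, one about 111 km to the north.
user_checkins = [(40.0, -74.0), (40.0, -74.0), (41.0, -74.0), (40.0, -74.0)]
print(center_of_mass(user_checkins))
print(p_close(user_checkins, venue=(40.0, -74.0), radius_km=5.0))
```

The right radius depends on the city and the venue category, so it is worth trying a few values instead of fixing one up front.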
How do we compute P(like) from Foursquare ratings and check-in data? We use an item-item collaborative filtering algorithm.
If you really think about it, this problem can be formulated in simple “recommender systems” terms, i.e. how to recommend venues (items) to people (users). So we could run state-of-the-art user-user collaborative filtering algorithms on the inverted venue-by-people matrix (whose value m_ij is 1 if user j checked in at venue i; 0 otherwise). This would give us, for each venue, a list of people who might like to visit it. However, such algorithms are not effective here, mainly because of data sparsity: a venue is visited, on average, by very few people. In fact, the number of visitors per venue follows a power law. We found that an item-item collaborative filtering algorithm gave better results than a user-user one. Since our data is sparse, we measure "likes" based on similarity not among users but among venues! Item-to-item collaborative filtering matches each of the user’s venues with similar venues.
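To make the item-item idea concrete, here is a minimal sketch using NumPy on the binary venue-by-people matrix described above. The details are my assumptions for illustration and not necessarily what the project used: cosine similarity between venue rows, and a "like" score for venue i and user j defined as the similarity-weighted share of venues similar to i that user j has visited. The function names are hypothetical.

```python
import numpy as np


def venue_similarity(m):
    """Cosine similarity between venues (rows of the venue-by-people matrix)."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # venues with no visitors stay all-zero
    unit = m / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)       # a venue should not vote for itself
    return sim


def like_scores(m):
    """Score in [0, 1] for every (venue, user) pair.

    scores[i, j] = sum_k sim[i, k] * m[k, j] / sum_k sim[i, k]
    i.e. how much of venue i's similarity mass lies on venues user j visited.
    """
    m = np.asarray(m, dtype=float)
    sim = venue_similarity(m)
    weights = sim.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0      # venues with no similar venues score 0
    return (sim @ m) / weights


# Toy matrix: rows are venues, columns are users; 1 means a check-in.
checkin_matrix = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
print(like_scores(checkin_matrix))
```

Sorting a venue's row of scores in descending order gives the list of people most likely to like that venue, which is the list we wanted in the first place.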
In the future, one can augment this model by adding time, age/income, influence of friends, reviews, etc. to build an ensemble method that gives better accuracy. However, location is a good place to start. I have attached my project slides with this email, and perhaps one day I will break the individual components down and explain the model in detail.