CS 424 Project 4

In this project we will be looking at the data that Netflix provided for its $1,000,000 challenge. In our case, given a particular rental in their database that we assume you like, the application should recommend several other rentals that you should also like, based on the ratings of others. That is, given a rental in the database, the system should find all of the other renters who liked that film, then find all the films that those renters liked, prioritize them, and display them.
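The recommendation step described above can be sketched roughly as follows. This is only one possible reading, not a required design: the threshold for "liked" (4 or 5 stars), the function name recommend, the in-memory dictionary layout, and the choice to prioritize by a simple count of shared fans are all assumptions made for illustration.

```python
from collections import Counter

LIKE_THRESHOLD = 4  # assumption: a rating of 4 or 5 stars counts as "liked"


def recommend(seed, ratings, n=5, liked=lambda stars: stars >= LIKE_THRESHOLD):
    """Return up to n (title, votes) pairs for titles liked by fans of `seed`.

    ratings: dict mapping title ID -> dict of renter ID -> star rating
             (a hypothetical in-memory layout, not the dataset's file format).
    """
    # Step 1: find every renter who liked the seed title.
    fans = {renter for renter, stars in ratings[seed].items() if liked(stars)}

    # Step 2: for every other title, count how many of those fans liked it.
    votes = Counter()
    for title, by_renter in ratings.items():
        if title == seed:
            continue
        votes[title] = sum(
            1 for renter, stars in by_renter.items()
            if renter in fans and liked(stars)
        )

    # Step 3: prioritize by vote count and drop titles no fan liked.
    return [(title, count) for title, count in votes.most_common(n) if count > 0]


# Made-up toy data: four titles, three renters.
toy = {
    "A": {1: 5, 2: 4, 3: 1},
    "B": {1: 5, 2: 5, 3: 5},
    "C": {1: 4, 3: 5},
    "D": {3: 4},
}
print(recommend("A", toy))
```

Passing a different liked predicate (for example, one that accepts only 1 or 2 stars) turns the same sketch into a "disliked" search. Raw counts favor very popular titles, so you may want a different prioritization.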

The data is available at http://archive.ics.uci.edu/ml/datasets/Netflix+Prize

The files of particular interest are:
README - always good to start with the meta-data
movie_titles.txt - contains the roughly 18,000 titles (movies, TV shows, etc.), each matched to its ID number.
training_set - directory containing one file for each title ID showing who rented it, what rating they gave it, and when they rented it.
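
As a starting point, here is a minimal sketch of reading one file from training_set. It assumes the layout described in the dataset's README: a first line holding the title ID followed by a colon, then one "renter ID,rating,date" line per rental with dates written as YYYY-MM-DD. Check the README before relying on it. The function name parse_title_file and the sample rows are made up for illustration.

```python
import io
from datetime import date


def parse_title_file(lines):
    """Parse one training_set file into (title_id, rows).

    rows is a list of (renter_id, stars, rental_date) tuples.
    Assumed layout: "<title id>:" on the first line, then
    "<renter id>,<stars>,<YYYY-MM-DD>" on each following line.
    """
    lines = iter(lines)
    title_id = int(next(lines).strip().rstrip(":"))
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        renter, stars, day = line.split(",")
        rows.append((int(renter), int(stars), date.fromisoformat(day)))
    return title_id, rows


# Made-up sample content standing in for a real file.
sample = io.StringIO("1:\n101,3,2005-09-06\n102,5,2005-05-13\n")
title_id, rows = parse_title_file(sample)
print(title_id, rows)
```

For a real file you would pass an open file object instead of the StringIO sample.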

Since there are a lot of films and a lot of renters in the database, we will again start by making some assumptions to simplify things. A rental needs to have been rented by at least 2500 people to be valid, which cuts the initial 17,770 titles down to about 5,000. A renter needs to have rented at least 50 titles to be valid, which will help you cut down the 480,000 renters.
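
The two validity rules above could be applied along these lines. This is a sketch under assumptions: the function name filter_valid and the dictionary layout are hypothetical, and both counts are taken over the full, unfiltered data (you could instead count a renter's titles only among valid titles, which would give a smaller set).

```python
from collections import Counter

MIN_RENTERS_PER_TITLE = 2500  # from the assignment text
MIN_TITLES_PER_RENTER = 50    # from the assignment text


def filter_valid(ratings,
                 min_renters=MIN_RENTERS_PER_TITLE,
                 min_titles=MIN_TITLES_PER_RENTER):
    """Keep only valid titles and, within them, only valid renters.

    ratings: dict mapping title ID -> dict of renter ID -> star rating.
    Assumption: both counts are computed on the unfiltered data.
    """
    # Count how many titles each renter has rented.
    per_renter = Counter(
        renter for by_renter in ratings.values() for renter in by_renter
    )
    valid_renters = {r for r, count in per_renter.items() if count >= min_titles}

    valid = {}
    for title, by_renter in ratings.items():
        if len(by_renter) >= min_renters:
            valid[title] = {
                r: stars for r, stars in by_renter.items() if r in valid_renters
            }
    return valid


# Made-up toy data, filtered with tiny thresholds so the effect is visible.
toy = {
    "A": {1: 5, 2: 4, 3: 1},
    "B": {1: 3, 2: 2},
    "C": {1: 4},
}
print(filter_valid(toy, min_renters=2, min_titles=2))
```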

For a C you need ...

For the given rental (title), you should see the name of the rental linked by a force-directed graph to n recommended rentals shown as other nodes in the graph. Clicking on one of the recommended rentals moves that one to the center and generates a new set of prioritized recommended rental nodes.
Show the overall popularity of the selected rental(s) as the percentage of 5 / 4 / 3 / 2 / 1 star ratings.
Chart the rental popularity of the selected rental(s) over time, i.e. how many times it was rented per month over the last n months.
Show the popularity of the recommended rentals (based on total number of rentals).
Show the quality of the recommended rentals (based on the average rating).
The program should start off with a list of n titles. You can choose these yourselves; they should show a fair amount of variation across a variety of variables and include the top rented, top rated, least rented, and lowest rated titles (from within the subset discussed above).
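
The star-rating breakdown and the rentals-per-month chart both reduce to simple counting. The sketch below shows one way to compute the numbers behind them. The function names and the (renter, stars, date) row layout are assumptions for illustration, and the drawing itself is left to whatever toolkit you use.

```python
from collections import Counter
from datetime import date


def star_percentages(rows):
    """Percentage of 5/4/3/2/1 star ratings among (renter, stars, date) rows."""
    counts = Counter(stars for _, stars, _ in rows)
    total = sum(counts.values())
    if total == 0:
        return {stars: 0.0 for stars in (5, 4, 3, 2, 1)}
    return {stars: 100.0 * counts[stars] / total for stars in (5, 4, 3, 2, 1)}


def rentals_per_month(rows):
    """Sorted list of ((year, month), count) pairs for the given rows."""
    counts = Counter((day.year, day.month) for _, _, day in rows)
    return sorted(counts.items())


# Made-up rows for one title: (renter ID, stars, rental date).
rows = [
    (1, 5, date(2005, 1, 3)),
    (2, 5, date(2005, 1, 20)),
    (3, 3, date(2005, 2, 1)),
    (4, 1, date(2005, 2, 9)),
]
print(star_percentages(rows))
print(rentals_per_month(rows))
```

Months with no rentals do not appear in the result, so a chart over the last n months would need to fill those in with zeros.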

For a B you need to add ...

Clicking on one of the recommended rentals should bring up the list of n rentals recommended by that one in addition to those from the previous one(s). Rentals that are recommended in common should be recommended more strongly.
A control to limit the popularity of rentals shown. You may find that very popular rentals overwhelm the more offbeat titles, so the user should have a control that limits the titles displayed to only those above and/or below a certain popularity level.
The bad movie version: given a film that the user doesn't like, use other renters who didn't like it to find films that they also didn't like. This implies that your starting set of films should include several bad ones.
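
One possible way to merge the recommendation lists of several selected rentals and apply the popularity control is sketched below. The ranking rule (titles recommended by more of the selected rentals come first, then by summed score) is an assumption, as are the function name combine_recommendations and its parameters; other weightings are just as defensible.

```python
from collections import Counter


def combine_recommendations(rec_lists, popularity,
                            min_pop=None, max_pop=None, n=5):
    """Merge several recommendation lists into one ranked list of titles.

    rec_lists:  one list of (title, score) pairs per selected rental.
    popularity: dict mapping title -> total number of rentals.
    min_pop / max_pop: optional popularity limits from the user control.
    """
    scores = Counter()       # summed score across all lists
    appearances = Counter()  # number of lists that recommend the title
    for recs in rec_lists:
        for title, score in recs:
            scores[title] += score
            appearances[title] += 1

    def in_range(title):
        pop = popularity[title]
        return ((min_pop is None or pop >= min_pop)
                and (max_pop is None or pop <= max_pop))

    # Titles recommended in common rank first, then higher summed score.
    ranked = sorted(
        (title for title in scores if in_range(title)),
        key=lambda t: (-appearances[t], -scores[t], t),
    )
    return ranked[:n]


# Made-up recommendation lists for two selected rentals.
recs_a = [("B", 9), ("C", 4)]
recs_d = [("C", 3), ("E", 8)]
popularity = {"B": 9000, "C": 3000, "E": 4000}

print(combine_recommendations([recs_a, recs_d], popularity))
print(combine_recommendations([recs_a, recs_d], popularity, max_pop=5000))
```

Here C comes first because both selected rentals recommend it, and the max_pop limit hides the very popular B.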

For an A you need to add ...

Genre data from the IMDb database. Combining data from multiple databases can be tricky but is often necessary. We will again limit the titles to ones that are not video games, TV shows, documentaries, adult films, etc. You will need to be a little generous in finding matches between the IMDb and Netflix titles (especially with the use of 'The' and punctuation). IMDb will have a lot more remakes with the same title than the Netflix database, but since we are only using the genre information that shouldn't matter too much. IMDb uses the original foreign language title rendered in English while Netflix uses the English title, so you will unfortunately lose most of the foreign films and many early films, but you should still have a good number of titles to work with. Once you have the genre information you can use it to label the various nodes in the graph and to organize the nodes. Some nodes will have these labels and some will not.
Decrease the restrictions on the number of votes and number of rentals needed to be valid, i.e. increase the size of the dataset that you are accessing.
Use the tool to find some interesting things and document them on your web page.
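
For the "generous" title matching between the two databases, one common approach is to reduce each title to a normalized key and join on that key. The sketch below is only an illustration: the function name normalize_title and the specific rules (lowercasing, dropping accents and punctuation, ignoring a leading or trailing article) are assumptions, and it does not handle years or other annotations that either source may attach to a title.

```python
import re
import unicodedata

ARTICLES = ("the", "a", "an")


def normalize_title(title):
    """Reduce a title to a loose matching key (hypothetical rules)."""
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace("&", " and ").strip()

    # Move a trailing article ("matrix, the") back to the front.
    match = re.match(r"^(.*),\s*(the|a|an)$", text)
    if match:
        text = f"{match.group(2)} {match.group(1)}"

    # Replace punctuation with spaces and collapse whitespace.
    words = re.sub(r"[^a-z0-9 ]+", " ", text).split()

    # Ignore a leading article, unless it is the whole title.
    if len(words) > 1 and words[0] in ARTICLES:
        words = words[1:]
    return " ".join(words)


for raw in ("The Matrix", "Matrix, The", "Spider-Man 2", "Amélie"):
    print(repr(raw), "->", repr(normalize_title(raw)))
```

Two titles are then treated as the same film when their keys are equal. This will produce some false matches (remakes, for instance), which the text above suggests is acceptable when only genre information is being borrowed.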

You should create a web page that describes your work on the project. I will be linking this web page to the course notes so please send me a 1024 x 768 jpg image of your visualization for the web. This should be named p4.<someone_in_your_groups_last_name>.jpg. Again, please make sure that your application is Mac / Windows / Linux compatible. Again, the web page should describe the contribution of each team member.

Each group will bring their visualization to class to present it and describe its features to the rest of the class. This allows everyone to see a variety of solutions to the problem, and a variety of implementations.

For those who are interested, there is a discussion here on the winning algorithms for the Netflix Prize:
http://www.netflixprize.com//community/viewtopic.php?id=1537

2009 Project 4

I Heard it through the Grapevine