Project 4

I Heard it through the Grapevine

In this project we are going to do a bit of network visualization using force-directed graphs in Processing, increase the size of the source data files to 2 gigabytes, and deal with integrating databases.

We will randomize the groups again for this project, so you will be working with new people.

In this project we will be looking at the data that Netflix provided for its $1,000,000 challenge. Given a particular rental in their database that we assume you like, the application should recommend several other rentals that you should also like, based on the ratings of others. That is, given a rental in the database, the system should find all of the other renters who liked that film, then find all the films that those renters liked, prioritize them, and display them.
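
The lookup described above can be sketched in a few lines of Python, as an offline prototype separate from the Processing visualization. This is only one possible reading of the task, not a required design. The function name `recommend`, the in-memory `ratings` dictionary, the choice of a rating of 4 or more as "liked", and ranking by the number of shared fans are all assumptions made for illustration.

```python
from collections import Counter

def recommend(seed_title, ratings, like_threshold=4, top_n=5):
    """Rank other titles by how many fans of seed_title also liked them.

    ratings maps title_id -> {renter_id: rating}, with ratings from 1 to 5.
    like_threshold is an assumed cutoff for "liked", not part of the data.
    """
    # Step 1: renters who liked the seed title.
    fans = {renter for renter, score in ratings.get(seed_title, {}).items()
            if score >= like_threshold}

    # Step 2: count, for every other title, how many of those fans liked it.
    counts = Counter()
    for title, by_renter in ratings.items():
        if title == seed_title:
            continue
        for renter, score in by_renter.items():
            if renter in fans and score >= like_threshold:
                counts[title] += 1

    # Step 3: prioritize by shared-fan count, breaking ties by title ID.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]
```

Ranking by raw counts favors titles that are popular with everyone, so a real solution may want to normalize by each title's total number of renters or weight by the ratings themselves.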

The data is available at

The files of particular interest are:
README - always good to start with the meta-data
movie_titles.txt - contains the roughly 18,000 titles (movies, TV shows, etc.) matched to their ID numbers.
training_set - directory containing one file for each title ID showing who rented it, what rating they gave it, and when they rented it.
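
As a starting point for reading the training_set files, here is a minimal Python sketch of a parser. It assumes that each file begins with a line holding the title ID followed by a colon, and that every later line has the form renter ID, rating, date, separated by commas. Check the README to confirm the layout before relying on it. The function name `parse_title_file` is hypothetical.

```python
def parse_title_file(lines):
    """Parse one per-title file into (title_id, [(renter_id, rating, date), ...]).

    lines is any iterable of text lines, such as an open file handle.
    The layout ("<title_id>:" header, then "renter,rating,date" rows)
    is an assumption to verify against the README.
    """
    it = iter(lines)
    # First line: the title ID followed by a colon, e.g. "7:".
    title_id = int(next(it).strip().rstrip(":"))

    rows = []
    for line in it:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        renter, rating, date = line.split(",")
        rows.append((int(renter), int(rating), date))
    return title_id, rows
```

In use, you would pass an open file handle, for example `with open(path) as f: title_id, rows = parse_title_file(f)`, where `path` points at one file inside training_set. Since the full data set is about 2 gigabytes, you will probably want to process one file at a time rather than hold every row in memory.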

Since there are a lot of films and a lot of renters in the database, we will again start by making some assumptions to simplify things. A rental needs to have been rented by at least 2,500 people to be valid, which cuts the initial 17,770 titles down to about 5,000.
A renter needs to have rented at least 50 titles to be valid, which will help you cut down the 480,000 renters.
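
The two validity rules above could be applied as in the following Python sketch. It assumes both counts are taken over the full, unfiltered data (the text does not say whether one filter should be applied before the other), and it reuses a hypothetical in-memory layout of title ID to {renter ID: rating}. The function name `filter_valid` is also an assumption.

```python
from collections import Counter

def filter_valid(ratings, min_renters=2500, min_titles=50):
    """Keep only valid titles and, within them, only valid renters.

    ratings maps title_id -> {renter_id: rating}.
    Both counts are computed on the unfiltered data (an assumption).
    """
    # Count how many titles each renter has rented.
    titles_per_renter = Counter()
    for by_renter in ratings.values():
        for renter in by_renter:
            titles_per_renter[renter] += 1
    valid_renters = {r for r, n in titles_per_renter.items() if n >= min_titles}

    filtered = {}
    for title, by_renter in ratings.items():
        # Drop titles rented by too few people.
        if len(by_renter) < min_renters:
            continue
        # Within a valid title, keep only ratings from valid renters.
        filtered[title] = {r: s for r, s in by_renter.items() if r in valid_renters}
    return filtered
```

Doing this filtering once, ahead of time, and saving the result to a smaller file or database table should keep the interactive application from having to touch the full data set.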

For a C you need ...

For a B you need to add ...

For an A you need to add ...

Note that this project is explicitly not using the prior rental history of the user to make recommendations. One could choose a user from the database as a starting point to gain more data to use in creating good recommendations for that person, but that's a different project.

You should create a web page that describes your work on the project. I will be linking this web page to the course notes, so please send me a 1024 x 768 jpg image of your visualization for the web. This should be named p4.<someone_in_your_groups_last_name>.jpg. As before, please make sure that your application is Mac / Windows / Linux compatible, and that the web page describes the contribution of each team member.

Each group will bring their visualization to class to present it and describe its features to the rest of the class. This allows everyone to see a variety of solutions to the problem, and a variety of implementations.

For those who are interested, there is a discussion here on the winning algorithms for the Netflix Prize:

last revision 11/5/09 - added link to discussions of winning Netflix contest algorithms, and a clarification that we are not making use of any past rental history of the user in the project