I Heard it
In this project we are going to do a bit of network visualization using
directed graphs in Processing, increase the size of the source
datafiles to 2 gigabytes, and deal with integrating databases.
We will randomize the groups again for this project, so you will be
working with new people.
In this project we
will be looking at the data that Netflix provided for its
challenge. In our case, given a particular rental in their database
that we assume you like, the application should recommend other
rentals that you should also like based on the ratings of others.
That is, given a rental in the database, the system should find all of
the renters that liked that film and then find all the films that
those renters liked, prioritize them, and display them.
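The three steps above (find the renters who liked the film, find what else they liked, prioritize) can be sketched as follows. This is only an illustration in Python for brevity, not the required Processing implementation. The function name `recommend`, the in-memory `ratings` layout, the 4-star threshold for "liked", and ranking by the number of shared fans are all assumptions; the assignment leaves those choices to you.

```python
from collections import Counter

def recommend(ratings, seed_title, n=5, liked=4):
    """Return up to n (title_id, shared_fans) pairs for seed_title.

    ratings: dict title_id -> dict renter_id -> stars (1-5).
    liked:   assumed threshold; a rating >= liked counts as "liked".
    """
    # Step 1: all renters who liked the seed title.
    fans = {r for r, stars in ratings[seed_title].items() if stars >= liked}

    # Step 2: for every other title, count how many of those fans liked it.
    votes = Counter()
    for title, by_renter in ratings.items():
        if title == seed_title:
            continue
        for renter, stars in by_renter.items():
            if renter in fans and stars >= liked:
                votes[title] += 1

    # Step 3: prioritize by shared fans (ties broken by title id).
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

# Tiny made-up data set: 4 titles, 3 renters.
toy_ratings = {
    1: {"a": 5, "b": 4, "c": 2},
    2: {"a": 5, "b": 5, "c": 5},
    3: {"a": 4, "c": 5},
    4: {"b": 1, "c": 4},
}
top = recommend(toy_ratings, 1)
```

Counting shared fans is the simplest possible priority. Normalizing by each title's overall popularity is one alternative worth trying if blockbusters dominate the results.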
The data is available at http://archive.ics.uci.edu/ml/datasets/Netflix+Prize
The files of particular interest are:
README - always good to
with the meta-data
the 18,000 titles matched to their ID number which includes
containing one file for each title ID showing who rented it,
rating they gave it, and when they rented it.
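A minimal sketch of reading one of the per-title files, again in Python for illustration. It assumes each file starts with a line holding the title ID followed by a colon, and that every following line is `renter_id,stars,date` with the date as YYYY-MM-DD. Check the README before relying on that layout. The name `parse_title_file` and the sample text are made up for the example.

```python
import io

def parse_title_file(lines):
    """Parse one per-title file.

    Returns (title_id, [(renter_id, stars, date_string), ...]).
    Assumed layout: first line "<title_id>:", then "renter,stars,date" rows.
    """
    it = iter(lines)
    title_id = int(next(it).strip().rstrip(":"))
    rows = []
    for line in it:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        renter, stars, date = line.split(",")
        rows.append((int(renter), int(stars), date))
    return title_id, rows

# Illustrative text in the assumed format (any iterable of lines works,
# including an open file object).
sample = io.StringIO("1:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n")
title_id, rows = parse_title_file(sample)
```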
Since there are a lot of films and a lot of renters in the database, we
will again start by making some assumptions to simplify things. A
rental needs to have been rented by at least 2500 people to be valid
(which cuts the initial 17,770 titles down to about 5,000). A renter needs
to have rented at least 50 titles to be valid, which will help you cut down
the 480,000 renters.
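The two validity rules can be sketched like this (Python, for illustration only). The function name `filter_valid` and the `ratings` layout are assumptions. So is the decision to compute both counts on the full, unfiltered data: applying one rule first and the other on the result would give slightly different subsets.

```python
from collections import Counter

def filter_valid(ratings, min_renters=2500, min_titles=50):
    """Keep only valid titles and valid renters.

    ratings: dict title_id -> dict renter_id -> stars.
    A title is valid if at least min_renters people rented it.
    A renter is valid if they rented at least min_titles titles.
    Both counts are taken over the unfiltered data (an assumption).
    """
    # How many titles did each renter rent?
    per_renter = Counter()
    for by_renter in ratings.values():
        per_renter.update(by_renter.keys())
    valid_renters = {r for r, c in per_renter.items() if c >= min_titles}

    # Keep popular titles, then drop the invalid renters from each one.
    kept = {}
    for title, by_renter in ratings.items():
        if len(by_renter) >= min_renters:
            kept[title] = {r: s for r, s in by_renter.items()
                           if r in valid_renters}
    return kept

# Tiny made-up example with thresholds scaled down to 2 and 2.
toy = {1: {"a": 5, "b": 3, "c": 4}, 2: {"a": 2, "b": 5}, 3: {"a": 1}}
small = filter_valid(toy, min_renters=2, min_titles=2)
```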
For a C you need ...
For a B you need to add
- for the given rental (title) you should see the name of the rental
connected by a force directed graph to n recommended rentals as other
nodes in the graph. Clicking on one of the recommended rentals moves
that one to
the center and generates a new set of prioritized recommended rentals
- show the overall popularity of that selected rental(s) as a
percentage of 5 / 4 / 3 / 2 / 1 star ratings
- you should chart the rental popularity of the selected rental
over time - i.e. how many times it was rented per month over the
period covered by the data
- show the popularity of the recommended rentals (based on
number of rentals)
- show the quality of the recommended rentals (based on the ratings)
- the program should start off with a list of n titles. You should
choose these yourselves and they should show a fair amount of spread
on a variety of variables and include the top rented, top rated, lowest
rented, lowest rated (from within the subset discussed above.)
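The star-rating percentages and the rentals-per-month chart both reduce to simple counting. The sketch below (Python, for illustration) assumes each row is a `(renter_id, stars, date)` tuple with the date as a YYYY-MM-DD string. The function names and the sample rows are made up.

```python
from collections import Counter

def star_percentages(rows):
    """Percentage of 5 / 4 / 3 / 2 / 1 star ratings for one title."""
    counts = Counter(stars for _, stars, _ in rows)
    total = sum(counts.values())
    if total == 0:
        return {s: 0.0 for s in (5, 4, 3, 2, 1)}
    return {s: 100.0 * counts[s] / total for s in (5, 4, 3, 2, 1)}

def rentals_per_month(rows):
    """Sorted list of ("YYYY-MM", count) pairs for one title."""
    # The first 7 characters of an ISO date are the year and month.
    months = Counter(date[:7] for _, _, date in rows)
    return sorted(months.items())

# Made-up rows for one title.
demo_rows = [
    (1, 5, "2005-01-03"),
    (2, 4, "2005-01-20"),
    (3, 5, "2005-02-11"),
    (4, 1, "2005-02-28"),
]
pct = star_percentages(demo_rows)
monthly = rentals_per_month(demo_rows)
```

Months with no rentals do not appear in the result, so a chart drawn from it should fill in zeros for the missing months.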
For an A you need to add
- clicking on one of the recommended rentals should bring up a
list of n
rentals recommended by that one in addition to the previous
one(s). Rentals that are in common should be more recommended
- a control to limit the popularity of rentals shown. You may find
that very popular rentals overwhelm the more offbeat titles so you
should have a control that limits the titles displayed to only those
above and/or below a certain popularity level.
- the bad movie version - given a film that the user doesn't like, use
other renters that didn't like it to find films that they also didn't
like. This implies that your starting set of films should include
several bad ones
- genre data from the IMDb database. Combining data from multiple
databases can be tricky but often necessary. We will again limit the
titles to ones that are not videogames, TV shows,
films, etc. You will need to be a little generous in finding matches
between the IMDb and Netflix titles, (especially with the use
and punctuation.) The IMDb will have a lot more remakes with the same
title than the Netflix database, but since we are only using genre
information that shouldn't matter too much. The IMDb uses the
language title rendered in English while Netflix uses the
so you will unfortunately lose most of the foreign films, and
early films, but you should still have a good number of titles to work
with. Once you have the genre information you can use it to label
various nodes in the graph, and to organize the nodes. Some nodes will
have these labels and some will not.
- decrease the restrictions on number of votes and number of
rentals to be valid - i.e. increase the size of the dataset that
you are working with
- use the tool to find some interesting things and document them
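For the genre item above, one way to be "a little generous" when matching titles is to compare normalized forms instead of the raw strings. The sketch below (Python, for illustration) lowercases, strips accents and punctuation, and moves a trailing article to the front. The function names, the specific normalization rules, and the example titles are all assumptions; look at the real mismatches in your data before settling on rules.

```python
import re
import unicodedata

def normalize_title(title):
    """Reduce a title to a loose key for matching across databases."""
    # Strip accents: decompose, then drop the combining marks.
    t = unicodedata.normalize("NFKD", title)
    t = "".join(ch for ch in t if not unicodedata.combining(ch))
    t = t.lower()
    # Move a trailing article to the front: "matrix, the" -> "the matrix".
    m = re.match(r"^(.*),\s*(the|a|an)$", t)
    if m:
        t = m.group(2) + " " + m.group(1)
    t = t.replace("&", " and ")
    # Collapse every run of punctuation or whitespace into one space.
    t = re.sub(r"[^a-z0-9]+", " ", t)
    return t.strip()

def match_genres(netflix_titles, imdb_genres):
    """Map each Netflix title id to a genre list, or None if unmatched.

    netflix_titles: dict title_id -> title string.
    imdb_genres:    dict title string -> list of genres.
    """
    index = {normalize_title(t): g for t, g in imdb_genres.items()}
    return {tid: index.get(normalize_title(name))
            for tid, name in netflix_titles.items()}

# Made-up example titles.
netflix = {1: "Matrix, The", 2: "Lilo & Stitch", 3: "Some Unmatched Film"}
imdb = {"The Matrix": ["Action", "Sci-Fi"], "Lilo and Stitch": ["Animation"]}
genres = match_genres(netflix, imdb)
```

Unmatched titles come back as `None`, which corresponds to nodes that have no genre label. Remakes that share a title collapse to a single key here, so this sketch keeps only one of their genre lists.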
Note that this
project explicitly is not using the prior rental history of the user
to make recommendations. One could choose a user from the
database as a
starting point to gain more data to use in creating good
recommendations for that person, but that's a different project.
You should create a web page that describes your work on the project. I will be
linking this web page to the course notes
so please send me a 1024 x 768 jpg image of your visualization for
The image should be named p4.<someone_in_your_groups_last_name>.jpg.
Make sure that your application is Mac / Windows / Linux compatible.
The web page should describe the contribution of each group member.
Each group will bring their visualization to class to present it and
its features to the rest of the
class. This allows everyone to see a variety of solutions to the
problem, and a variety of implementations.
For those who are interested there is a discussion here on the winning
algorithms for the Netflix Prize:
last revision 11/5/09
- added link to discussions of winning netflix contest
- added clarification that we are not making use of any past rental
history of the user in the project