2009 Project 3

Saturday Night at the Movies



In this project we are going to increase the size of the source data files to around 750 megabytes and concentrate more on massaging the data into an easily viewable form, so that we can look at trends in the Internet Movie Database (IMDb) over several decades.

We will randomize the groups again for this project, so you will be working with new people.



In this project you will be looking over the time period from 1910 to (almost) 2010 tracking the careers of a number of actors and directors. Your main visualization will be somewhat similar to the NY Times Billboard rankings from week 5 of the course with time (in years) on the x-axis and quality (as measured by the imdb) on the y-axis. By picking a given actor or director, the user should be able to see the quality of his/her work over the years as a year-by-year bar chart. By choosing another actor or director the user should be able to compare the careers of different people.
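As a rough illustration of the year-by-year view, the sketch below collapses one person's films into a single value per year. It assumes the films arrive as hypothetical (year, rating) pairs and that averaging is an acceptable way to combine several films from the same year; other choices (best film, vote-weighted mean) are just as defensible and should be documented.

```python
from collections import defaultdict

def yearly_quality(films, first_year=1910, last_year=2009):
    """Average rating per year for one person.

    `films` is an iterable of (year, rating) pairs (a hypothetical
    input shape, not something the IMDb files provide directly).
    """
    ratings_by_year = defaultdict(list)
    for year, rating in films:
        # Ignore anything outside the period covered by the project.
        if first_year <= year <= last_year:
            ratings_by_year[year].append(rating)
    # One bar per year: the mean rating of that year's films.
    return {year: sum(r) / len(r) for year, r in sorted(ratings_by_year.items())}
```

Years with no films are simply absent from the result, so the visualization can decide whether to draw a gap or an empty bar.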

Most of the work on this project will be in converting the data into an appropriate shape to visualize. The first part of that is cutting down the data size. To that end we will limit the dataset by avoiding: tv shows, tv movies, video games, adult films, direct-to-video releases, documentaries, and short subjects. That should knock the initial 300,000 film entries down a fair amount.
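A filter along these lines might look like the sketch below. The title markers and genre names used here are assumptions about how the data files label each kind of entry (quoted titles for TV series, trailing tags such as (TV), (V), and (VG), and genre labels such as Documentary, Adult, and Short); check them against the actual files before relying on them. The function name and argument shapes are hypothetical.

```python
# Assumed title conventions (verify against the actual data files):
#   "Title" (year)      -> TV series (title in double quotes)
#   Title (year) (TV)   -> TV movie
#   Title (year) (V)    -> direct-to-video release
#   Title (year) (VG)   -> video game
EXCLUDED_TAGS = ("(TV)", "(V)", "(VG)")
# Assumed genre labels for the remaining excluded categories.
EXCLUDED_GENRES = {"Documentary", "Adult", "Short"}

def keep_title(title, genres):
    """Return True if the entry looks like a theatrical feature film."""
    title = title.strip()
    if title.startswith('"'):
        return False  # TV series
    if any(tag in title for tag in EXCLUDED_TAGS):
        return False  # TV movie, direct-to-video, or video game
    if EXCLUDED_GENRES & set(genres):
        return False  # documentary, adult film, or short subject
    return True
```

For example, keep_title("Casablanca (1942)", ["Drama", "Romance"]) keeps the entry, while a quoted title or one tagged (V) is dropped.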

You should then find the top 5-10 male actors, female actors, and directors for each decade. You must do this in your code, though you can certainly use online resources to help you figure out who those people should be to make sure your code is working correctly. The top people should have done at least m films with more than n votes (where m and n may depend on the decade - document your decisions). For the purpose of this project a 'top' person is defined by working on a highly rated film. One way to start is to pick the top rated films from each decade and then look for commonalities in the cast and directors, count them up, and eventually pick the top 5-10 actors, actresses, and directors for each decade. There may be some overlap from decade to decade, which is fine. You should document the decisions that you make to create the algorithm that determines a top person.
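One possible reading of this counting approach is sketched below. It is only one of many reasonable algorithms, not the required one. The film records are hypothetical dictionaries with year, rating, votes, and people keys, and the default thresholds are placeholders rather than recommended values. Run it once per category (male actors, female actors, directors) with the matching people lists.

```python
from collections import Counter, defaultdict

def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1942 -> 1940."""
    return (year // 10) * 10

def top_people(films, min_votes=1000, min_films=2,
               films_per_decade=100, how_many=5):
    """Pick the people who appear most often in each decade's best films."""
    by_decade = defaultdict(list)
    for film in films:
        # Only films with more than n votes count at all.
        if film["votes"] > min_votes:
            by_decade[decade_of(film["year"])].append(film)

    result = {}
    for decade, decade_films in by_decade.items():
        # Take the highest rated films of the decade.
        best = sorted(decade_films, key=lambda f: f["rating"],
                      reverse=True)[:films_per_decade]
        # Count each person once per film.
        counts = Counter(p for f in best for p in set(f["people"]))
        # Require at least m films; break ties alphabetically.
        ranked = sorted((item for item in counts.items()
                         if item[1] >= min_films),
                        key=lambda item: (-item[1], item[0]))
        result[decade] = [name for name, _ in ranked[:how_many]]
    return result
```

Because min_votes and min_films are plain parameters, they can be varied per decade by calling the function on one decade's films at a time.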

Your code could have one phase where it works out who the best are and then another phase that does the interactive visualization - the code does not need to automatically compute the top people every time it runs - it could do the computation under user control, or only if that data hasn't already been computed. You could have two separate applications - one to parse through the current data and generate new data files and another to take those new data files and run the visualization. The IMDb data files are updated regularly, so your code should continue to run on new datasets unless there is a significant change to their format.
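The "compute only if it hasn't been computed already" idea can be sketched as a small caching helper. The helper name, the use of JSON, and the force flag (standing in for "under user control") are all assumptions made for illustration; any intermediate file format would work.

```python
import json
import os

def load_or_compute(cache_path, compute, force=False):
    """Return cached results if present, otherwise compute and save them.

    `compute` is a zero-argument function that does the slow parsing
    of the raw data files. Pass force=True to recompute on request.
    """
    if not force and os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    data = compute()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data
```

Note that JSON turns dictionary keys into strings, so decades stored as keys come back as "1940" rather than 1940.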

The data can be found at http://www.imdb.com/interfaces

More so than in the previous projects, the work on this one can be split between the data side and the visualization side, as long as there is a common data format to link them. You should quickly work out which of the IMDb data files you need and what final form you want the data to be in. Then you could pick a couple of sample actors or directors and generate all the appropriate data in the proper form, so the visualization work can proceed while you work out the code to pick the top people automatically.


For a C you need ...

For a B you need to add ...

For an A you need to add ...



You should create a web page that describes your work on the project. I will be linking this web page to the course notes, so please send me a 1024 x 768 jpg image of your visualization for the web. This should be named p3.<someone_in_your_groups_last_name>.jpg. As before, please make sure that your application is Mac / Windows / Linux compatible, and that the web page describes the contribution of each team member.

Each group will bring their visualization to class to present it and describe its features to the rest of the class. This allows everyone to see a variety of solutions to the problem, and a variety of implementations.

NEW: I would also like each group to email me a list of your top 5 male actors, female actors, and directors for each decade so I can also show a comparison of the names that the different algorithms generated. I would like each of the three lists in this form so it's easy to put them together:
1910
#1
last name

#2
last name

#3
last name

#4
last name

#5
last name
1920
#1
last name

#2
last name

#3
last name

#4
last name

#5
last name
etc
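Producing this list by hand invites typos, so it may be easier to generate it from your results. The sketch below assumes a hypothetical dictionary mapping each decade to an ordered list of last names, and follows the layout above: the decade, then each rank on its own line followed by the name, with a blank line between entries.

```python
def format_top_list(top_by_decade):
    """Render {decade: [last names in rank order]} in the list format above."""
    lines = []
    for decade in sorted(top_by_decade):
        lines.append(str(decade))
        entries = []
        for rank, name in enumerate(top_by_decade[decade], start=1):
            entries.append(f"#{rank}\n{name}")
        # Blank line between entries, none before the next decade.
        lines.append("\n\n".join(entries))
    return "\n".join(lines)
```

Calling it once for each of the three categories gives the three lists to email.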



I would also like to have these lists the night before the project is due so I have time to compile the lists before the presentations in class.


last revision 10/1/09