Project 3

Whatever Happened to Randolph Scott

Project due at 11:59pm Monday 11/1/10 Chicago time

Project 3 is the second group project, and the focus here is more on data integration.

In this project we will make use of the Internet Movie Database to investigate movie genres over time.

The raw data is available at

The genres.list file contains information on the genre(s) of the films in the database.

We will use a subset of the main genres: Action, Adventure, Animation, Comedy, Crime, Documentary, Drama, Fantasy, Family, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

We will ignore some of the genres, such as Adult, Short, Sport, Biography, History, Reality-TV, Talk-Show, News, and Game-Show.
Go through the genres list and remove every line that lists one of these ignored genres. If the film also has a genre or genres other than the ignored ones, it remains in the sample for the project with those remaining genres, unless one of its genres is 'Adult' or 'Short'. If 'Adult' or 'Short' is listed as a genre, all entries for that film should be removed. Be careful not to remove other films with the same name that came out in a different year.

100 Most Shocking Music Moments (2009) (TV)        Reality-TV
would get removed from the project because it's only Reality-TV (and it would also get removed, as shown below, because it's TV).

Tin Cup (1996)                        Comedy
Tin Cup (1996)                        Drama
Tin Cup (1996)                        Romance
Tin Cup (1996)                        Sport
would still exist in the project because even though the Sport genre is removed it still has Comedy, Drama, and Romance.

Hardware Wars (1978)                    Action
Hardware Wars (1978)                    Comedy
Hardware Wars (1978)                    Sci-Fi
Hardware Wars (1978)                    Short
would be completely removed since one of its genres is Short.
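The genre rules above can be sketched in Python as follows. This is only a minimal sketch, not a required approach: it assumes each data line ends with a single genre token separated from the title by whitespace, and the names KEPT, DISQUALIFYING, and filter_genres are made up for illustration. Keying on the full title string, including the year, is what keeps same-named films from different years apart.

```python
from collections import defaultdict

# The subset of main genres used in the project.
KEPT = {
    "Action", "Adventure", "Animation", "Comedy", "Crime", "Documentary",
    "Drama", "Fantasy", "Family", "Film-Noir", "Horror", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
}
# Genres that remove every entry for a film when present.
DISQUALIFYING = {"Adult", "Short"}


def filter_genres(lines):
    """Map each title (including its year) to its set of kept genres."""
    genres_by_title = defaultdict(set)
    for line in lines:
        # Assumption: the genre is the last whitespace-separated token.
        parts = line.rstrip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        title, genre = parts
        genres_by_title[title].add(genre)

    result = {}
    for title, genres in genres_by_title.items():
        if genres & DISQUALIFYING:
            continue  # 'Adult' or 'Short' removes the whole film
        kept = genres & KEPT
        if kept:  # films left with no kept genre drop out
            result[title] = kept
    return result
```

Any header or footer lines in the real file would need to be skipped or would fall out naturally because their last token is not a kept genre.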

We will also ignore any direct-to-video releases (V), video games (VG), TV movies (TV), and TV episodes {Title (#episode #)}.
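One possible way to spot these entries is to look for the markers in the title string. The sketch below assumes the markers appear literally as (V), (VG), (TV), or a {...} episode suffix, as in the examples on this page; the function name is_theatrical_film is hypothetical.

```python
import re

# Matches a (V), (VG), or (TV) marker, or a {...} episode suffix.
EXCLUDED_MARKER = re.compile(r"\((?:V|VG|TV)\)|\{.*\}")


def is_theatrical_film(title):
    """Return True if the title carries none of the excluded markers."""
    return EXCLUDED_MARKER.search(title) is None
```

This test can be applied to the title before any of the genre, release date, or ratings processing, so excluded entries never enter the sample.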

Some films also have more than one genre, e.g.:
North by Northwest (1959)                Action
North by Northwest (1959)                Adventure
North by Northwest (1959)                Drama
North by Northwest (1959)                Mystery
North by Northwest (1959)                Romance
North by Northwest (1959)                Thriller

The release-dates.list file gives more detailed release date information e.g.:
North by Northwest (1959)                USA:17 July 1959      (Los Angeles, California)
North by Northwest (1959)                USA:28 July 1959      (Chicago, Illinois)
North by Northwest (1959)                USA:6 August 1959    (New York City, New York)

Up through the 1970s it was common for films in the US to be released in different cities in different weeks and months. In this file we are interested in the earliest USA release date. Some entries give only a year; others also give a month and day.
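Picking the earliest USA date could look something like the sketch below. It assumes lines shaped like the three examples above (title, whitespace, "USA:", then a date that is a full day-month-year, a month-year, or a bare year). The function names are hypothetical. One design choice here is an assumption rather than a rule of the project: within the same year, a date that is missing its month or day sorts after a more precise one, so the more specific date wins.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Title, whitespace, then "USA:" followed by [day] [month] year.
USA_RELEASE = re.compile(
    r"^(?P<title>.+?)\s+USA:"
    r"(?:(?:(?P<day>\d{1,2}) )?(?P<month>[A-Za-z]+) )?"
    r"(?P<year>\d{4})\b")


def parse_usa_release(line):
    """Return (title, (year, month, day)) or None for non-USA lines."""
    match = USA_RELEASE.match(line)
    if match is None:
        return None
    month = MONTHS.get(match.group("month")) if match.group("month") else None
    day = int(match.group("day")) if match.group("day") else None
    return match.group("title"), (int(match.group("year")), month, day)


def _sort_key(date):
    # Assumption: missing month/day sorts last within the same year/month.
    year, month, day = date
    return (year,
            month if month is not None else 13,
            day if day is not None else 32)


def earliest_usa_releases(lines):
    """Map each title to its earliest USA release date tuple."""
    earliest = {}
    for line in lines:
        parsed = parse_usa_release(line)
        if parsed is None:
            continue
        title, date = parsed
        if title not in earliest or _sort_key(date) < _sort_key(earliest[title]):
            earliest[title] = date
    return earliest
```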

The ratings file gives the quality ratings that people have given each film, both as an aggregate number and as a distribution, e.g.:
      0000001223   91021   8.6  North by Northwest (1959)
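A line like the one above can be split into its fields with a few lines of Python. This sketch assumes the four fields are, in order, the distribution string, the number of votes, the aggregate rating, and the title, which is how the example line reads. The Rating type and parse_rating_line are hypothetical names, and the distribution string is kept as-is rather than interpreted.

```python
from typing import NamedTuple, Optional


class Rating(NamedTuple):
    distribution: str  # kept as the raw string from the file
    votes: int
    rating: float
    title: str


def parse_rating_line(line: str) -> Optional[Rating]:
    """Parse one ratings line, or return None if it is not a data line."""
    # Split into at most four fields so spaces in the title survive.
    parts = line.strip().split(None, 3)
    if len(parts) != 4:
        return None
    distribution, votes, rating, title = parts
    try:
        return Rating(distribution, int(votes), float(rating), title)
    except ValueError:
        return None  # e.g. a header line with non-numeric columns
```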

We will also limit the films to those that have already been released, as the database includes films that have been announced or are already in production. We will start with films released in 1920, since most earlier films would be considered short films today.

As with the earlier projects, you should first get a copy of the data files and take a look at them to try to understand their structure. Then you need to decide how you want to integrate these files into a form that you can visualize. This may involve writing shell scripts or programs to process the files, or loading the information into a database.
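As one illustration of the database route, the sketch below loads already-parsed data into an in-memory SQLite database and counts films per genre per decade. Everything about it is an assumption made for illustration: the table layout, the function names, the shape of the three input dictionaries, and the decision to treat a film with no USA release date on record as not yet released. The 1920 cutoff comes from the text above; last_year is an optional upper bound that the caller chooses.

```python
import sqlite3


def build_database(genres, releases, ratings, first_year=1920, last_year=None):
    """Combine parsed data into an in-memory SQLite database.

    genres:   {title: set of genre names}
    releases: {title: (year, month, day)}, month/day may be None
    ratings:  {title: (votes, rating)}
    """
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE film (
            title TEXT PRIMARY KEY, year INTEGER, month INTEGER,
            day INTEGER, votes INTEGER, rating REAL);
        CREATE TABLE film_genre (
            title TEXT REFERENCES film(title), genre TEXT);
    """)
    for title, kept in genres.items():
        release = releases.get(title)
        # Assumption: no USA release date means not yet released.
        if release is None or release[0] < first_year:
            continue
        if last_year is not None and release[0] > last_year:
            continue
        votes, rating = ratings.get(title, (None, None))
        db.execute("INSERT INTO film VALUES (?, ?, ?, ?, ?, ?)",
                   (title, *release, votes, rating))
        db.executemany("INSERT INTO film_genre VALUES (?, ?)",
                       [(title, genre) for genre in sorted(kept)])
    db.commit()
    return db


def films_per_genre_by_decade(db):
    """Return (decade, genre, count) rows ordered by decade and genre."""
    return db.execute("""
        SELECT (f.year / 10) * 10 AS decade, g.genre, COUNT(*)
        FROM film f JOIN film_genre g ON g.title = f.title
        GROUP BY decade, g.genre
        ORDER BY decade, g.genre
    """).fetchall()
```

A film with several genres is counted once under each of them, which is one reasonable reading of "genres over time" but not the only one.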

For a C you need ...

For a B you need to add ...

For an A you need to add ...

As with project 2, each Friday each member of the team should post a description of what they did on the project to the project web site. There are more opportunities in this project for different team members to focus on different areas.

As with Projects 1 and 2, you should create a web page describing your work on the application. If possible, make an embedded applet of your application. The data files may be large, so be sure to include a link where people can download your application (and the necessary data files) to run it themselves. Please make sure that your application is Mac / Windows / Linux compatible. If you can get your app to run online through a browser, then do include that version as well. The web page should describe the contribution of each team member (i.e. who worked on which interface elements, who worked on converting the data into a more usable form, etc.).

Please send me a 1024 x 768 jpg image of your visualization for the web. This should be named p2.<someone_in_your_groups_last_name>.jpg.

Each person in the group should also send me a private email ranking your coworkers on the project on a scale from 1 (low) to 5 (high) in terms of how good a coworker they were on the project. If you never want to work with them again, give them a 1. If this person would be a first choice for a partner on a future project, then give them a 5. If they did what was expected but nothing particularly good or bad, then give them a 3. By default your score should be 3 unless you have a particular reason to increase or decrease the number. Please confine your responses to 1, 2, 3, 4, or 5, with no thirds or halves. I will average out all these scores for projects 2 through 4 and keep them in mind when assigning final grades.

Each group will present its work to the rest of the class and describe its features. This allows everyone to see a variety of solutions to the problem, and a variety of implementations.

Since we have six groups to go through in 75 minutes, each group will talk for 8 minutes plus 4 minutes for questions from the other groups. The presentation time should be evenly split among all members of the group. Each group sitting in the audience will be allowed one question for the group currently presenting, so come up with a good one. In answering the questions the group presenting should be concise so all of the other groups have a chance to ask questions.

last revision 10/19/10 - made the genre issue even more clear.