In this project we are going to increase the size of the source data
files to around 750 megabytes and concentrate more on massaging the data
into an easily viewable form to look at trends in the Internet Movie
Database (IMDb) over several decades.
We will randomize the groups again for this project, so you will be
working with new people.
In this project you will be looking over the time period from 1910 to
(almost) 2010, tracking the careers of a number of actors and directors.
Your main visualization will be somewhat similar to the NY Times
Billboard rankings from week 5 of the course, with time (in years) on
the x-axis and quality (as measured by the IMDb) on the y-axis. By
picking a given actor or director, the user should be able to see the
quality of his/her work over the years as a year-by-year bar chart. By
choosing another actor or director, the user should be able to compare
the careers of different people.
Most of the work on this project will be in converting the data into an
appropriate shape to visualize. The first part of that is cutting down
the data size. To that end we will limit the dataset by excluding TV
shows, TV movies, video games, adult films, direct-to-video releases,
documentaries, and short subjects. That should knock the initial 300,000
film entries down a fair amount.
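A minimal sketch of this filtering step in Python. It assumes the title
conventions of the plain-text IMDb list files (TV series titles wrapped
in double quotes, and the suffixes (TV), (V), and (VG) marking TV movies,
direct-to-video releases, and video games) and that Adult, Documentary,
and Short appear as genre names. Verify these against the files you
actually download. The function name keep_title and the sample entries
are hypothetical.

```python
# Hypothetical filter. The markers below are ASSUMPTIONS about the IMDb
# plain-text list format and must be checked against the real files.
EXCLUDED_SUFFIXES = ("(TV)", "(V)", "(VG)")  # TV movie, direct-to-video, video game
EXCLUDED_GENRES = {"Adult", "Documentary", "Short"}


def keep_title(title, genres):
    """Return True if the entry looks like a theatrical feature film."""
    title = title.strip()
    if title.startswith('"'):            # assumed marker for a TV series
        return False
    if title.endswith(EXCLUDED_SUFFIXES):
        return False
    if EXCLUDED_GENRES & set(genres):    # any excluded genre disqualifies the film
        return False
    return True


# Made-up sample entries: (title, genres)
sample = [
    ("Some Feature (1942)", ["Drama", "Romance"]),
    ('"Some Series" (1999)', ["Comedy"]),
    ("Some Game (2001) (VG)", ["Action"]),
    ("Some Short (1925)", ["Short", "Comedy"]),
]
kept = [title for title, genres in sample if keep_title(title, genres)]
print(kept)
```

Keeping the exclusion rules in two small constants makes it easy to
document, on your web page, exactly what was dropped.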
You should then find the top 5-10 male actors, female actors, and
directors for each decade. You must do this in your code, though you can
certainly use online resources to help you figure out who those people
should be, to make sure your code is working correctly. The top people
should have done at least m films with more than n votes (where m and n
may depend on the decade; document your decisions). For the purpose of
this project a 'top' person is defined by working on a highly rated
film. One way to start is to pick the top rated films from each decade
and then look for commonalities in the cast and directors, count them
up, and eventually pick the top 5-10 actors, actresses, and directors
for each decade. There may be some overlap from decade to decade, which
is fine. You should document the decisions that you make to create the
algorithm that determines a top person.
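The "one way to start" above could be sketched as follows. This is only
one possible algorithm, not a required one. The Film record, the
function name top_people, the default thresholds, and the sample data
are all hypothetical. In practice you would run it separately for male
actors, female actors, and directors.

```python
from collections import Counter, namedtuple

# Hypothetical record; the real fields depend on how you parse the IMDb files.
Film = namedtuple("Film", "title year rating votes people")


def top_people(films, decade, m=2, n=1000, top_films=100, how_many=5):
    """Rank people by how often they appear in a decade's top-rated films.

    Only films with more than n votes count, and a person needs at least
    m such films in the decade to be eligible.
    """
    eligible = [f for f in films
                if decade <= f.year < decade + 10 and f.votes > n]
    films_per_person = Counter(p for f in eligible for p in f.people)
    best = sorted(eligible, key=lambda f: f.rating, reverse=True)[:top_films]
    hits = Counter(p for f in best for p in f.people)
    ranked = [p for p, _ in hits.most_common() if films_per_person[p] >= m]
    return ranked[:how_many]


# Made-up data for illustration only.
films = [
    Film("Film A", 1941, 8.5, 5000, ["Person A", "Person B"]),
    Film("Film B", 1943, 8.1, 4000, ["Person A"]),
    Film("Film C", 1946, 6.0, 3000, ["Person B", "Person C"]),
    Film("Film D", 1948, 9.0, 50, ["Person C"]),    # too few votes
    Film("Film E", 1952, 8.8, 9000, ["Person A"]),  # different decade
]
result = top_people(films, 1940, m=2, n=1000, top_films=2)
print(result)
```

Because m, n, and the number of top films are parameters, you can vary
them per decade and record the values you settled on.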
Your code could have one phase where it works out who the best are and
then another phase that does the interactive visualization. The code
does not need to automatically compute the top people every time it
runs: it could do the computation under user control, or only if that
data hasn't already been computed. You could have two separate
applications, one to parse through the current data and generate new
data files and another to take those new data files and run the
visualization. The IMDb data files are updated regularly, so your code
should continue to run on new datasets unless there is a significant
change to their format.
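A small sketch of the "compute only if it hasn't already been computed"
idea, assuming the results are stored as a JSON file. The function name
load_or_compute, the file name, and the stand-in ranking function are
hypothetical.

```python
import json
import os
import tempfile


def load_or_compute(cache_path, compute, force=False):
    """Return cached results if present; otherwise compute and store them.

    Pass force=True to recompute under user control, for example after
    downloading a newer set of data files.
    """
    if not force and os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh)
    result = compute()
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(result, fh, indent=2)
    return result


calls = []


def expensive_ranking():
    """Stand-in for the slow pass over the full data files."""
    calls.append(1)
    return {"1940": ["Person A", "Person B"]}


cache_file = os.path.join(tempfile.mkdtemp(), "top_people.json")
first = load_or_compute(cache_file, expensive_ranking)   # computes and saves
second = load_or_compute(cache_file, expensive_ranking)  # reads the saved file
print(first == second, len(calls))
```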
More so than in the previous projects, the work on this one can be split
between the data side and the visualization side, as long as there is a
common data format to link them. You should quickly work out which of
the IMDb data files you need and what final form you want the data to be
in. Then you could pick a couple of sample actors or directors and
generate all the appropriate data in the proper form, so the
visualization work can proceed while you work out the code to pick the
top people automatically.
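As an illustration of what such a common format might look like, here is
a hand-written record for one sample person, stored as JSON. Every field
name and value is hypothetical; your group should agree on its own
layout.

```python
import json

# Hypothetical interchange record; field names are illustrative only.
person = {
    "name": "Person A",
    "films": [
        {"title": "Film One", "year": 1941, "rating": 7.9, "votes": 12000,
         "role": "actor", "genres": ["Drama"]},
        {"title": "Film Two", "year": 1941, "rating": 6.5, "votes": 3000,
         "role": "director", "genres": ["Comedy", "Romance"]},
    ],
}

text = json.dumps(person, indent=2)   # what the data side would write
loaded = json.loads(text)             # what the visualization side would read
print(loaded["films"][0]["title"])
```

A few hand-made records like this are enough for the visualization side
to start before the automatic ranking code exists.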
For a C you need ...
- ability to choose the name of an actor or director from a list on the
  screen and see their career on the graph. If the person did more than
  one film per year then the ratings of those films should be averaged.
- if a person was both an actor and a director then it should be obvious
  which films were which
- ability to see the career of a person with the chart ranging from 0-10,
  showing the average of all films, which you should compute from the
  films in the limited dataset (no TV shows, etc.), i.e. the x-axis sits
  at 0
- ability to see the chart as ratings above / below the average of all
  films, i.e. the x-axis sits at the average
- you should check your visualization with a colour blindness checking
  web page to see that it's OK
- the interface should update quickly when the user interacts
- ability to mouse over a given column in the chart and see the name of
  that specific film (or films) for that column
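The two chart modes above differ only in where the baseline sits. A
minimal sketch, assuming one person's films are available as (year,
rating) pairs; the function names and the numbers are made up for
illustration.

```python
from collections import defaultdict
from statistics import mean


def yearly_average(films):
    """films: iterable of (year, rating) pairs for one person."""
    by_year = defaultdict(list)
    for year, rating in films:
        by_year[year].append(rating)
    return {year: mean(ratings) for year, ratings in sorted(by_year.items())}


def relative_to(baseline, averages):
    """Bar heights when the x-axis sits at the baseline instead of at 0."""
    return {year: avg - baseline for year, avg in averages.items()}


career = [(1950, 7.0), (1950, 8.0), (1952, 6.0)]  # made-up ratings
overall_average = 6.5                              # made-up dataset average
averages = yearly_average(career)
offsets = relative_to(overall_average, averages)
print(averages)
print(offsets)
```

Positive offsets are drawn above the axis and negative ones below it.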
For a B you need to add ...
- ability to compare n actors and / or directors simultaneously, each on
  their own graph, with the graphs aligned in time
- in addition to showing the average of multiple films per year, it
  should also show the minimum and maximum
- ability to code the columns by genre using a subset of the IMDb genre
  codes. If a film has multiple genres then it should have multiple
  codes. Very likely you will need to reduce the number of genres either
  by dropping or combining. Give your rationale for your choices on your
  web page.
- ability to code each decade for a person by the genres they worked in
  during that decade
- ability to show the min / max / average ratings for each decade
- ability to see a chart showing the overall career in terms of
  distribution of ratings and genres
- if there are no people selected then clicking on a decade should show
  which people were active that decade; if there are no people selected
  then clicking on a genre should show which people were active in that
  genre; clicking on a decade and a genre would show people in the
  intersection.
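Two of the items above, the per-decade min / max / average and the
decade / genre selection, could be sketched like this. The function
names and tuple layouts are hypothetical. The sketch also assumes the
intersection means a single film that is both in the decade and in the
genre, which is only one reading of the requirement.

```python
from collections import defaultdict


def decade_stats(films):
    """films: iterable of (year, rating). Returns {decade: (min, max, avg)}."""
    by_decade = defaultdict(list)
    for year, rating in films:
        by_decade[year // 10 * 10].append(rating)
    return {d: (min(r), max(r), sum(r) / len(r))
            for d, r in sorted(by_decade.items())}


def active_people(credits, decade=None, genre=None):
    """credits: iterable of (person, year, genres). None means 'no filter'."""
    found = set()
    for person, year, genres in credits:
        if decade is not None and year // 10 * 10 != decade:
            continue
        if genre is not None and genre not in genres:
            continue
        found.add(person)
    return sorted(found)


# Made-up data for illustration only.
credits = [
    ("Person A", 1941, ["Drama"]),
    ("Person B", 1947, ["Comedy"]),
    ("Person A", 1953, ["Comedy"]),
]
print(active_people(credits, decade=1940, genre="Comedy"))
print(decade_stats([(1941, 6.0), (1947, 8.0), (1953, 7.0)]))
```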
For an A you need to add ...
- ability to pan and compress/expand the years
- add the bottom 20 male actors, female actors, and directors over the
  last 100 years who have done at least 10 films with at least m votes.
  This is harder than finding the top people since you shouldn't pick on
  complete unknowns, so you may end up with just 'mediocre' actors or
  actors who are desperate or who make bad choices on a regular basis.
- you should pick several people that you would like to add who show
  interesting behavior (e.g. good actors who become good directors, good
  actors who go on to work in bad movies and then are rediscovered and
  go on to do good movies again, people with very long careers, etc.)
  and document them on your web page. Since the IMDb has inherent biases
  towards US films, films in English, and modern films, feel free to
  make choices that show a broader range.
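One way the bottom-20 selection could be approached is sketched below.
Ranking by average rating is an assumption, not something the assignment
prescribes. The function name bottom_people, the tuple layout, the
default for m, and the sample data are hypothetical.

```python
from collections import defaultdict


def bottom_people(credits, m_votes=1000, min_films=10, how_many=20):
    """credits: iterable of (person, rating, votes), one entry per film.

    Films with fewer than m_votes votes are ignored, and people with
    fewer than min_films remaining films are dropped, so that complete
    unknowns are never selected.
    """
    ratings = defaultdict(list)
    for person, rating, votes in credits:
        if votes >= m_votes:
            ratings[person].append(rating)
    averages = {p: sum(r) / len(r)
                for p, r in ratings.items() if len(r) >= min_films}
    return sorted(averages, key=averages.get)[:how_many]


# Made-up data for illustration only.
credits = (
    [("Person A", 3.0, 5000)] * 10    # eligible, low average
    + [("Person B", 6.0, 5000)] * 10  # eligible, higher average
    + [("Person C", 1.0, 5000)] * 2   # too few films
    + [("Person D", 1.0, 10)] * 12    # films have too few votes
)
worst = bottom_people(credits, how_many=2)
print(worst)
```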
You should create a web page that describes your work on the project. I
will be linking this web page to the course notes, so please send me a
1024 x 768 jpg image of your visualization for the web. This should be
named p3.<someone_in_your_groups_last_name>.jpg. Again, please make sure
that your application is Mac / Windows / Linux compatible. Again, the
web page should describe the contribution of each team member.
Each group will bring their visualization to class to present it and
describe its features to the rest of the class. This allows everyone to
see a variety of solutions to the problem, and a variety of
implementations.
NEW: I would also like each group to email me a list of your top 5 male
actors, female actors, and directors for each decade so I can also show
a comparison of the names that the different algorithms generated. I
would like each of the three lists in this form so it's easy to put them
together:
1910   #1 last name   #2 last name   #3 last name   #4 last name   #5 last name
1920   #1 last name   #2 last name   #3 last name   #4 last name   #5 last name
etc
I would also like to have these lists the night before the project is
due so I have time to compile the lists before the presentations in
class.