Project 3 - Saturday Night at the Movies - Rough Draft
Team member choice due 3/30/20 at 8:59 pm Chicago time
Project alpha version due 4/13/20 at 8:59 pm Chicago time
Project final version due 4/27/20 at 8:59 pm Chicago time
Project 3 will be the second group project and the focus
here will be on dealing with data files.
Again we will be using the main classroom wall as the target
display, and again your interface should work with touch on that wall
for a normal sized person.
You can keep your group from Project 2 or form new groups. Send Andy an
email once you know your group for Project 3 (even if it is the same as
for Project 2). I will create groups for people who do not form groups on
their own. Again, the standard group size will be 2 or 3 people.
As with Project 2 you should very quickly set up a web page for your
new group project and send the URL to Andy. Each Friday of the project
each team member should post on the project web site an overview of
what they did on the project that week.
We are going to make use of the Internet Movie Database (IMDB) and
visualize information related to genres and plot keywords over 100 years
of film. The current IMDB does not make a lot of data available for free,
but the IMDB from December 2017 did, so we are going to make use of those
data files, which are available from
ftp.fu-berlin.de/pub/misc/movies/database/frozendata/ among other sites.
It will take 2-3 gig to download the
At that point you can start filtering
the files using tools of your choice (grep, Python scripting, etc.). Just
be sure to document the process so it can be duplicated. This means
either keeping track of every tool and command you use, or writing a
script that does all of it for you, so you could start with downloading
these files and end up with the files you are going to use in your
The files we are interested in
The IMDB has more than movies, so you
can start by removing a bunch of other things:
- TV episodes (TV) or lines that start with a " can be removed
- made-for-video entries (V) can be removed
- video game entries (VG) can be removed
- internet entries (internet) can be removed, along with re-releases
(re-release) and Blu-ray releases (Blu-ray premiere)
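A first filtering pass along these lines could be sketched in Python as follows. This is only an illustration: it assumes each entry is one line of text and that the markers appear literally in the line exactly as written in the list above. The names `DROP_MARKERS`, `keep_line`, and `filter_lines` are made up for this sketch, so check a sample of the real files before relying on it.

```python
# Markers taken from the list above. How they appear in the raw files
# (literal, in parentheses) is an assumption to verify against real data.
DROP_MARKERS = (
    "(TV)",
    "(V)",
    "(VG)",
    "(internet)",
    "(re-release)",
    "(Blu-ray premiere)",
)


def keep_line(line: str) -> bool:
    """Return True if the line survives the first filtering pass."""
    # Lines that start with a double quote are dropped outright.
    if line.startswith('"'):
        return False
    # Drop the line if any of the markers occurs anywhere in it.
    return not any(marker in line for marker in DROP_MARKERS)


def filter_lines(lines):
    """Keep only the lines that pass keep_line, preserving order."""
    return [line for line in lines if keep_line(line)]
```

Because the markers include their parentheses, "(V)" does not accidentally match "(VG)". Whatever approach you use, record it so the filtering can be repeated from the original downloads.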
We are also going to limit the films
in other ways:
- any entry that doesn't have a year
(????) can be removed
- any entry with a rating of USA:X or
USA:NC-17 can be removed
- remove any entries in the following genres: Short, Adult, Reality-TV,
Talk-Show, Game-Show, News, and any genre with fewer than 100 entries.
Note that this one is different: films can have multiple genres, so if a
film has ANY of these genres, all of its lines need to be removed. This
should leave about 22 genres
- there are 250,000+ unique keywords, so remove all keywords with fewer
than 20 uses, along with any lines that use those exact keywords. This
should remove about 95+% of the keywords. Note that you need to be careful
how you remove lines with those less-used keywords: make sure you are
doing exact matches against the entire keyword and not just substrings of
the keyword.
- for the certificates we are only going to use the USA: certificates; you
can remove those from other countries. You can also remove the USA-TV
certificates (TV-G, TV-14, TV-Y, TV-PG, TV-MA), lines for any films that
were re-rated for TV (TV rating), and anything with a video game
certificate (E, E10+, T, C, M, Not Rated)
- for running times we are only going to deal with films 60+ minutes long
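The exact-match requirement for keywords is easy to get wrong with substring tools, so here is a minimal Python sketch of one way to do it. It assumes the keyword data has already been parsed into (film, keyword) pairs. That record layout and the function names are assumptions for illustration, not a description of the raw file format.

```python
from collections import Counter

MIN_USES = 20  # threshold from the list above


def frequent_keywords(records, min_uses=MIN_USES):
    """Return the set of keywords used at least min_uses times.

    records: iterable of (film, keyword) pairs (assumed layout).
    """
    counts = Counter(keyword for _, keyword in records)
    return {kw for kw, n in counts.items() if n >= min_uses}


def filter_records(records, min_uses=MIN_USES):
    """Drop every record whose keyword is used fewer than min_uses times."""
    records = list(records)  # allow two passes over the data
    keep = frequent_keywords(records, min_uses)
    # Set membership compares whole keywords, so a rare keyword such as
    # "war" never causes records tagged "post-war" to be removed.
    return [(film, kw) for film, kw in records if kw in keep]
```

A substring-based removal (for example, deleting every line containing a rare keyword's text) would also delete lines for longer, frequently used keywords that happen to contain it. Comparing whole keywords avoids that.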
Note that the name of a film plus its year is the unique identifier in
the IMDB (films with the same name that came out in the same year have
modified years to differentiate them).
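Since the name plus year is what ties the different files together, it can help to parse that identifier once and reuse it. The Python sketch below assumes identifiers look like `Title (1999)`, with a "modified year" written as `Title (1999/II)` and a missing year as `(????)`. That layout is an assumption to check against the actual files, and `parse_key` is a hypothetical helper name.

```python
import re

# Assumed layout: "Title (YYYY)" or "Title (YYYY/II)" or "Title (????)".
KEY_RE = re.compile(
    r"^(?P<title>.+) \((?P<year>\d{4}|\?{4})(?:/(?P<suffix>[IVXLCDM]+))?\)$"
)


def parse_key(key):
    """Split an identifier into (title, year, suffix).

    year is None for "????"; suffix is None when the year is unmodified.
    Returns None if the string does not match the assumed layout.
    """
    m = KEY_RE.match(key)
    if m is None:
        return None
    year = None if m.group("year") == "????" else int(m.group("year"))
    return m.group("title"), year, m.group("suffix")
```

When joining data across files, use the full identifier (including any suffix) as the key, so that two same-named films from the same year are not merged.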
Also note that the certificates in
the US have changed quite a bit over the years. Be sure to keep track of
all of them (aside from the ones we have removed) and not just the
For a C you need to:
For a B you need to be able to combine data from the different files
together and:
- show an initial set of visualizations when your application starts up,
giving an overview of the entire (remaining) dataset: number of films
released each year, number of films released each month, distribution of
running times, distribution of certificates, distribution of genres, and
top n keywords
- allow the user to see a tabular version of each visualization to get the
actual numbers
- show the average number of films released per year, the average number
of films released per month, and the average running time
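The per-year counts, their tabular form, and the average can all come from the same aggregation. The Python sketch below only illustrates that aggregation (the application itself is hosted on the Shiny server). It assumes the filtered data has been reduced to one (title, year) pair per film, and all function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def films_per_year(films):
    """Count films per release year.

    films: iterable of (title, year) pairs, one per film (assumed layout).
    """
    return Counter(year for _, year in films)


def as_table(counts):
    """Rows of (year, count) sorted by year, for the tabular view."""
    return sorted(counts.items())


def average_per_year(films):
    """Mean number of films per year, over years with at least one film."""
    return mean(films_per_year(films).values())
```

Note the design choice in `average_per_year`: years with no films do not appear in the counter, so they are not averaged in as zeros. Decide which definition you want and state it on the chart or table.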
For an A you need to add:
- allow the user to choose a decade and see the number of films released
each month, distribution of running times, distribution of certificates,
distribution of genres, and top n keywords for that decade
- allow the user to choose a genre and see the number of films released in
that genre each year, percentage of films released in that genre each
year, number of films in that genre released each month, percentage of
films released in that genre each month, distribution of running times of
films in that genre, distribution of certificates for films in that
genre, and top n keywords in that genre
- allow the user to choose a keyword from a list of the top n and see the
number of films released with that keyword each year, percentage of films
released with that keyword each year, number of films released with that
keyword each month, percentage of films released with that keyword each
month, distribution of running times of films released with that keyword,
distribution of certificates for films released with that keyword, and
distribution of genres of films with that keyword
- allow a user to manually type in a keyword and do the same thing
- allow the user to choose a certificate and see the number of films
released with that certificate each year, percentage of films released
with that certificate each year, number of films released with that
certificate each month, percentage of films released with that
certificate each month, distribution of running times of films with that
certificate, distribution of genres of films with that certificate, and
top n keywords for films with that certificate
- show another table with the number of films that satisfy the current
criteria and, if that number is less than 100, a list of the films that
satisfy the current criteria
- allow the user to specify more than one genre and update all of the B
charts appropriately
- allow the user to specify more than one keyword and update all of the B
charts appropriately
- allow the user to specify more than one certificate and update all of
the B charts appropriately
- allow users to specify combinations of genres, keywords, and
certificates
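One way to think about the genre, keyword, and certificate selections is as a single predicate applied to every film, with each chart computed from the films that pass. The Python sketch below illustrates that idea only. The film record layout (a dict with `title`, `year`, `genres`, `keywords`, `certificate`) and the matching rule (a film must have ALL selected genres and keywords, and ANY one of the selected certificates) are assumptions, since the assignment does not say how multiple selections combine. All names are hypothetical.

```python
from collections import Counter


def matches(film, genres=(), keywords=(), certificates=()):
    """True if the film satisfies the current selection (assumed rule)."""
    return (
        set(genres) <= film["genres"]          # has every selected genre
        and set(keywords) <= film["keywords"]  # has every selected keyword
        and (not certificates or film["certificate"] in certificates)
    )


def percent_per_year(films, **criteria):
    """Percentage of each year's films that satisfy the criteria."""
    totals = Counter(f["year"] for f in films)
    hits = Counter(f["year"] for f in films if matches(f, **criteria))
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


def matching_titles(films, limit=100, **criteria):
    """Titles to list when fewer than `limit` films match, else None."""
    selected = [f["title"] for f in films if matches(f, **criteria)]
    return selected if len(selected) < limit else None
```

With no criteria selected, every film matches, so the same functions reproduce the overview charts. Whichever combination rule you choose, make it visible in the interface so users know what the numbers mean.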
Students need to:
In all of these cases you need to make sure that your visualizations are
well constructed with good color and font choices and proper labeling, and
that they effectively reveal the truth about the data to the user.
Note that as part of the web page part of the grade you will need to use
your interface to show your findings, so make sure that the way your
interface displays information is clear.
Allow users to conduct an investigation on a certain topic with your
Again, your app will be evaluated running full screen with touch
interaction on the classroom wall and should not require scrolling.
For this project you should host your solution using the evl shiny
server. There is a new set of logins for this project.
There are two deadlines for this project. By the first deadline you should
have implemented the initial screen layout of your application and have
the basic functionality allowing the user to perform an example of the
various 'C' functionality on the classroom wall. This will make sure that
your group is on track and that you can focus on making a good interface
and set of visualizations, not just functional ones. Personally, I think
you should have the entire C functionality done at that point if you are
going for an A on the project as a whole. You should make this version of
the interface available on your group project page.
As part of the final turn in you should create a set of web pages that
describe your work on the project. This should include:
- 1 page with a link to your visualization solution and a description
of how to use your application and the things you can do with it.
- 1 page on the data you used, including where you got it and what you
did to it.
- 1 page with links to a zip file containing your well commented
source code, additionally needed data files, and any instructions
necessary to run it. These instructions should start from the
assumption that the reader has a web browser on their computer and
tell the user everything else they need to know and do to get it
running using RStudio.
- 1 page on what interesting things you found using your application.
This one is particularly important. Show that you can use your
application to find interesting things in the dataset and show the
screenshots to prove it.
- 1 page on the roles of the different team members
all of which should have plenty of screenshots with meaningful captions.
Web pages like this can be very helpful later on in helping you build up
a portfolio of your work when you start looking for a job, so please put
some effort into it.
Be sure to document any external libraries, tools, etc. that you make use
of - give credit where credit is due for everything that you didn't
create yourself.
You should also create a 2-3 minute
YouTube video showing the use of your application including narration
with decent audio quality. That video should be in a very obvious
place on your main project web page. The easiest way to do this is to
either shoot it at the classroom wall or use a screen-capture tool
while interacting with your application, though you will most likely
find it useful to do some editing afterwards to tighten the video up.
It's also a good idea to have a video like this available as a backup
during your presentation just in case of gremlins. You
may want to shoot this video on the wall itself.
The web page, including screen snapshots and video, needs to be done by
the deadline, so be sure to leave enough time to get that work done.
I will be linking your web page to
the course notes, so please send Andy and Sai a nice jpg image of your visualization for the web. This should be
When the project is done, each person in the group should also send Andy a
private email with no one else CC'd ranking your coworkers on the project
on a scale from 1 (low) to 5 (high) in terms of how good a coworker they
were on the project. If you never want to work with them again, give them
a 1. If this person would be a first choice for a partner on a future
project then give them a 5. If they did what was expected but nothing
particularly good or bad then give them a 3. By default your score should
be 3 unless you have a particular reason to increase or decrease the
number. If you are giving a score other than 3 you need to say why. Please
confine your responses to 1, 2, 3, 4, or 5; no 1/3s or .5s. Each
person's score on the project will be based on the overall score for the
group modified by these rankings.
Each group will show their
visualization to the class and describe its features. This allows
everyone to see a variety of solutions to the problem, and a variety of
your presentation ... several times.
All team members are
expected to participate equally in that presentation.
Project 3 groups:
Note that group numbers are changing
for project 3 and there will be new accounts on the shiny.evl.uic.edu
These new accounts will be in the form p3g#
current list of packages installed on the evl shiny server:
last revision 10/24/19