Project 3 - Saturday Night at the Movies - Rough Draft
Team member choice due 3/30/20 at 8:59 pm Chicago time
Project alpha version due 4/13/20 at 8:59 pm Chicago time
Project final version due 4/27/20 at 8:59 pm Chicago time
Project 3 will be the second group project and the focus
here will be on dealing with data files.
Again we will be using the main classroom wall as the target
display, and again your interface should work with touch on that wall
for a normal sized person.
You can keep your group from Project 2 or form new
groups. Send Andy an email once you know your group for Project 3 (even
if it is the same as for Project 2). Again the standard group size
will be 1 - 3 people per group.
As with Project 2 you should very quickly set up a web page for your
new group project and send the URL to Andy. Each Friday of the project
each team member should post on the project web site an overview of
what he/she did on the project that week.
We are going to make use of the
Internet Movie Database (IMDB) and visualize information related to
genres and plot keywords over 100 years of film.
The current IMDB does not make a lot
of data available for free, but the IMDB from December 2017 did, so we
are going to make use of those data files which are available from
ftp.fu-berlin.de/pub/misc/movies/database/frozendata/ among other sites.
It will take 2-3 gig to download the
At that point you can start filtering
the files using tools of your choice (grep, Python scripting, etc.). Just
be sure to document the process so it can be duplicated. This means
either keeping track of every tool and command you use, or writing a
script that does all of it for you, so you could start with downloading
these files and end up with the files you are going to use in your
The files we are interested in
The IMDB has more than movies, so you
can start by removing a bunch of other things:
- TV episodes (TV) or lines that start
with a " can be removed
- made for video entries (V) can be removed
- video game entries (VG) can be removed
- internet entries (internet)
can be removed along with re-releases (re-release) and blu-ray
releases (Blu-ray premiere)
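As a rough illustration of this first pass, here is a minimal Python sketch. It assumes that the markers listed above appear literally in each line and that TV entries start with a double quote; the sample lines and the names NON_FILM_TAGS and is_film_line are made up for the example, so check them against the real files before relying on it.

```python
# Markers taken from the list above. Treating them as literal substrings
# of a line is an assumption about the file layout.
NON_FILM_TAGS = (
    "(TV)", "(V)", "(VG)", "(internet)", "(re-release)", "(Blu-ray premiere)",
)

def is_film_line(line):
    """Return True if the line looks like a regular film entry."""
    if line.startswith('"'):
        # Lines that start with a double quote are TV entries.
        return False
    return not any(tag in line for tag in NON_FILM_TAGS)

# Hypothetical sample lines, not copied from the real data files.
sample = [
    '"Some Series" (2001) {Pilot (#1.1)}\t2001',
    'Some Film (1999)\t1999',
    'Some Game (2005) (VG)\t2005',
    'Direct Release (2010) (V)\t2010',
]
kept = [line for line in sample if is_film_line(line)]
```

Whatever you end up using, keep a script like this with your project so the filtering can be re-run from the original downloads.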
We are also going to limit the films
in other ways:
- any entry that doesn't have a year
(????) can be removed
- any entry with a rating of USA:X or
USA:NC-17 can be removed
- remove any entries in the following
genres: Short, Adult, Reality-TV, Talk-Show, Game-Show, News,
and any from any genres with fewer than 100 entries - note that this
one is different as films can have multiple genres, so if a film has
ANY of these genres all of its lines need to be removed, which should
leave about 22 genres
- there are 250,000+ unique keywords,
so remove all keywords with fewer than 20 uses, this should remove
more than 95% of them, and remove any lines that use those exact
keywords. Note that you need to be careful how you remove lines with
those less-used keywords to make sure you are doing exact matches with
the entire keywords and not just substrings of the keywords.
- for the certificates we are only
going to use the USA: certificates, you can remove those from other
countries. You can also remove the USA-TV certificates (TV-G, TV-14,
TV-Y, TV-PG, TV-MA) and lines for any films that were re-rated for TV
(TV rating) and anything with a video game certificate (E, E10+, T, C,
M, Not Rated)
- for running times we are only going
to deal with films 60+ minutes long
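Two of the rules above are easy to get wrong: dropping rarely used keywords with exact matching, and removing every line of a film that has ANY excluded genre. The Python sketch below shows one way to do both. It assumes each line holds a film identifier and one value separated by one or more tabs; the function names and the sample lines are hypothetical, and the threshold is lowered to 2 only to keep the example small.

```python
from collections import Counter

def parse_pairs(lines):
    """Split 'film<TAB(s)>value' lines into (film, value) tuples."""
    pairs = []
    for line in lines:
        parts = [p.strip() for p in line.rstrip("\n").split("\t") if p.strip()]
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs

def drop_rare_values(pairs, min_uses=20):
    """Keep only lines whose value is used at least min_uses times.

    Counting whole strings gives exact matches, so 'love' and
    'love-triangle' are never confused with each other.
    """
    counts = Counter(value for _, value in pairs)
    return [(film, value) for film, value in pairs if counts[value] >= min_uses]

def films_with_any(pairs, excluded):
    """Return the set of films that have at least one excluded value."""
    return {film for film, value in pairs if value in excluded}

# Hypothetical sample data.
keyword_lines = ["A (1990)\t\tlove", "B (1991)\tlove", "C (1992)\t\t\tlove-triangle"]
keywords = drop_rare_values(parse_pairs(keyword_lines), min_uses=2)

genre_lines = ["A (1990)\tDrama", "A (1990)\tShort", "B (1991)\tDrama"]
genres = parse_pairs(genre_lines)

# A film with ANY excluded genre loses all of its lines, in every file.
banned = films_with_any(genres, {"Short"})
genres = [(f, g) for f, g in genres if f not in banned]
keywords = [(f, k) for f, k in keywords if f not in banned]
```

The same set of banned films can be applied to the other files so that the filtered files stay consistent with each other.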
Note that the name of a film plus its
year is the unique identifier in the IMDB (films with the same name that
came out the same year have modified years to differentiate them).
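Since the identifier is what you will use to combine the files, it helps to split it into its parts once and reuse the result. The Python sketch below assumes that a modified year looks like (2004/I) or (2004/II) and that a missing year is written (????); both forms are assumptions to verify against the files, and the names KEY_RE and split_key are hypothetical.

```python
import re

# Assumed identifier layout: "Title (YYYY)" or "Title (YYYY/II)".
KEY_RE = re.compile(
    r"(?P<title>.*) \((?P<year>\d{4}|\?{4})(?:/(?P<suffix>[IVXLCDM]+))?\)"
)

def split_key(key):
    """Return (title, year, suffix), or None if the key does not parse.

    year is None for an unknown year; suffix is None when the year
    was not modified.
    """
    m = KEY_RE.fullmatch(key)
    if m is None:
        return None
    year = None if m["year"] == "????" else int(m["year"])
    return m["title"], year, m["suffix"]
```

Keep the full original string as the join key and use the parsed year only for grouping, so that two films with the same name and year stay distinct.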
Also note that the certificates in
the US have changed quite a bit over the years. Be sure to keep track of
all of them (aside from the ones we have removed) and not just the
For a C you need to:
For a B you
need to be able to combine data from the different files together and:
an initial set of visualizations when your application starts up
giving an overview of the entire (remaining) dataset showing
number of films released each year, number of films released
each month of the year, distribution of running times,
distribution of certificates, distribution of genres, top n keywords
Allow the user to see a tabular version of each visualization to get
the actual numbers
the average number of films released per year, the average
number of films released per month, the average running time
Allow the user to choose a decade and see number
of films released each month, distribution
of running times, distribution of certificates, distribution of
genres, top n keywords for that decade compared to the overall
distribution for the entire data set
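Comparing a decade to the whole data set is easier to read as percentages than as raw counts, since the decades contain very different numbers of films. Here is a minimal Python sketch of that comparison. It assumes each film has already been reduced to one record with a year and a single value per field, which is a simplification since films can have several genres; the record layout and function names are hypothetical.

```python
from collections import Counter

def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1994 -> 1990."""
    return year - year % 10

def share(counter):
    """Convert counts to percentages of the total."""
    total = sum(counter.values())
    return {k: 100 * v / total for k, v in counter.items()} if total else {}

def compare_decade(films, decade, field):
    """Return {value: (percent in the decade, percent overall)}."""
    overall = share(Counter(f[field] for f in films))
    subset = share(
        Counter(f[field] for f in films if decade_of(f["year"]) == decade)
    )
    return {k: (subset.get(k, 0.0), overall[k]) for k in overall}

# Hypothetical records, not real data.
films = [
    {"year": 1994, "genre": "Drama"},
    {"year": 1999, "genre": "Comedy"},
    {"year": 2001, "genre": "Drama"},
    {"year": 2005, "genre": "Drama"},
]
result = compare_decade(films, 1990, "genre")
```

The same table of paired percentages can feed both a side by side chart and the tabular view of the numbers.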
For an A you
need to add:
- Allow the user to choose a genre and see the number of films released in
that genre each year, each decade, percentage of films released in
that genre each year, number of films in that genre released each
month, percentage of films released in that genre each month,
distribution of running times of films in that genre, distribution of
certificates for films in that genre, top n keywords in that genre.
- Allow the user to choose a keyword from a list of the top n and see
the number of films released with that keyword each year, each decade,
percentage of films released with that keyword each year, number of
films released with that keyword each month, percentage of films
released with that keyword each month, distribution of
running times of films released with that keyword, distribution of
certificates for films released with that keyword, distribution of
genres of films with that keyword.
- Allow the user to choose a certificate and see the number of films
released with that certificate each year, each decade,
percentage of films released with that certificate each year, number
of films released with that certificate each month, percentage of
films released with that certificate each month, distribution
of running times of films with that certificate, distribution of
genres of films with that certificate, top n keywords for films with
that certificate.
- Allow the user to use combinations of these selections (genre +
keyword + certificate) to see films that satisfy all three
- Add another table that shows the number of films that satisfy the
a list of the top 10 films that satisfy the current criteria ordered
based on the ratings.list
- Allow the user to specify more than one genre at a time and update
all of the B charts and the A table appropriately
- Allow the user to specify more than one keyword at a time and update
all of the B charts and the A table appropriately
- Allow the user to specify more than one certificate at a time and
update all of the B charts and the A table appropriately
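One way to think about the combined selections is as a single filter that every chart and table reads from. The Python sketch below is only an illustration under stated assumptions: each film is one record holding sets of genres and keywords, one certificate, and a rating; an empty selection means no constraint; and multiple values within one category are treated as "any of", which is a design choice the text above leaves open. All names and sample records are hypothetical.

```python
def matches(film, genres=(), keywords=(), certificates=()):
    """True if the film satisfies every non-empty selection."""
    return bool(
        (not genres or film["genres"] & set(genres))
        and (not keywords or film["keywords"] & set(keywords))
        and (not certificates or film["certificate"] in certificates)
    )

def select(films, top_n=10, **criteria):
    """Return (number of matching films, titles of the top rated matches)."""
    hits = [f for f in films if matches(f, **criteria)]
    top = sorted(hits, key=lambda f: f["rating"], reverse=True)[:top_n]
    return len(hits), [f["title"] for f in top]

# Hypothetical records, not real data.
films = [
    {"title": "A (1990)", "genres": {"Drama"}, "keywords": {"love"},
     "certificate": "PG", "rating": 7.1},
    {"title": "B (1991)", "genres": {"Drama", "Crime"},
     "keywords": {"love", "murder"}, "certificate": "R", "rating": 8.2},
    {"title": "C (1992)", "genres": {"Comedy"}, "keywords": {"love"},
     "certificate": "PG", "rating": 6.0},
]
count, top_titles = select(films, genres=["Drama"], keywords=["love"])
```

If your group decides that several selected genres should mean "all of" rather than "any of", only the set test inside matches needs to change.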
Students need to:
In all of these cases you need to make sure that your visualizations are
well constructed with good color and font choices, proper labeling, and
that they effectively reveal the truth about the data to the user.
Note that as part of the web page part of the grade you will need to use your
interface to show your findings, so make sure that the way your
interface displays information is clear.
Allow the user to manually type in a keyword and see the same
visualizations from the B range
- Allow users to select or type in multiple keywords for up to three
simultaneous searches and see the results visualized to compare them.
Come up with some interesting searches.
Again, your app will be evaluated running full screen with touch
interaction on the classroom wall
and should not require scrolling.
For this project you should host your solution using the evl shiny
server. There are a new set of logins for this project.
There are two deadlines for this project. By the first deadline you should
have implemented the initial screen layout of your application and have
the basic functionality allowing the user to perform an example of the
various 'C' functionality on the classroom wall. This will make sure that
your group is on track and that you can focus on making a good interface
and set of visualizations, not just functional ones. Personally, I think
you should have the entire C functionality done at that point if you are
going for an A on the project as a whole. You should make this version of
the interface available on your group project page.
As part of the final turn in you should create a set of web pages that
describe your work on the project. This should include:
all of which should have plenty of
screenshots with meaningful captions. Web pages like this can be very helpful later on in helping
you build up a portfolio of your work when you start looking for a job
so please put some effort into it.
Be sure to document any external
libraries, tools, etc. that you make use of - give credit where credit
is due for everything that you didn't create yourself.
- 1 page with a link to your visualization solution and a description
of how to use your application and the things you can do with it.
- 1 page on the data you used, including where you got it, what you
did to it.
- 1 page with links to a zip file containing your well commented
source code, additionally needed data files, and any instructions
necessary to run it. These instructions should start from the
assumption that the reader has a web browser on their computer and
tell the user everything else he/she needs to know and do to get it
running using RStudio.
- 1 page on what interesting things you found using your application.
This one is particularly important. Show that you can use your
application to find interesting things in the dataset and show the
screenshots to prove it.
- 1 page on the roles of the different team members
You should also create a 2-3 minute
YouTube video showing the use of your application including narration
with decent audio quality. That video should be in a very obvious
place on your main project web page. The easiest way to do this is to
either shoot it at the classroom wall or use a screen-capture tool
while interacting with your application, though you will most likely
find it useful to do some editing afterwards to tighten the video up.
It's also a good idea to have a video like this available as a backup
during your presentation just in case of gremlins. You
may want to shoot this video on the wall itself.
The web page including screen
snapshots and video need to be done by the deadline so be sure to leave
enough time to get that work done.
I will be linking your web page to
the course notes so please send Andy and Sai a nice jpg image of your visualization for the web. This should be
When the project is done, each person in the group should also send Andy a
private email with no one else CC'd ranking your coworkers on the project
on a scale from 1 (low) to 5 (high) in terms of how good a coworker they
were on the project. If you never want to work with them again, give them
a 1. If this person would be a first choice for a partner on a future
project then give them a 5. If they did what was expected but nothing
particularly good or bad then give them a 3. By default your score should
be 3 unless you have a particular reason to increase or decrease the
number. If you are giving a score other than 3 you need to say why. Please
confine your responses to the whole numbers 1, 2, 3, 4, or 5 (no 1/3s or .5s). Each
person's score on the project will be based on the overall score for the
group modified by these rankings.
Each group will show their
visualization to the class and describe its features. This allows
everyone to see a variety of solutions to the problem, and a variety of
your presentation ... several times.
All team members are
expected to participate equally in that presentation.
Project 3 groups:
Note that group numbers are changing
for project 3 and there will be new accounts on the shiny.evl.uic.edu
These new accounts will be in the form p3g#
current list of packages installed on the evl shiny server:
last revision 10/24/19