Team member choice due 4/7/20 at 8:59 pm Chicago time
Project final version due 4/27/20 at 8:59 pm Chicago time
Project 3 will be the second group project and the focus
here will be on dealing with data files.
Given UIC moving to on-line courses, we will set a 1980 x
1080 touch-based display as the target platform.
You can keep your group from Project 2 or form new
groups. Send Andy an email once you know your group for Project 3 (even
if it is the same as for Project 2). Again, the standard group size
will be 1 - 3 people per group.
As with Project 2 you should very quickly set up a web page for your
new group project and send the URL to Andy. Each Friday of the project
each team member should post on the project web site an overview of
what he/she did on the project that week.
We are going to make use of the Internet Movie Database (IMDB) and
visualize information related to genres and plot keywords over 100
years of film.
The current IMDB does not make a lot
of data available for free, but the IMDB from December 2017 did, so we
are going to make use of those data files which are available from ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/
among other sites.
You will need to download 2-3 GB to get the necessary files.
At that point you can start filtering
the files using tools of your choice (grep, python scripting, etc.). Just
be sure to document the process so it can be duplicated and automated.
This means either keeping track of every tool and command you use, or
writing a script that does all of it for you, so you could start with
downloading these files and end up with the files you are going to use
in your application.
The files we are interested in
include:
release-dates.list
running-times.list
certificates.list
genres.list
keywords.list
movies.list
ratings.list
The IMDB has more than movies, so you
can start by removing a bunch of other things:
TV episodes (TV) or lines that start with a " can be removed
made for video entries (V) can be removed
video game entries (VG) can be removed
internet entries (internet) can be removed along with re-releases
(re-release) and blu-ray releases (Blu-ray premiere)
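The removal rules above can be sketched as a single line filter. This is only a minimal sketch: the function names are made up for illustration, and it assumes each data line carries its markers as literal text such as (V) or (VG), so check that against the files you actually download.

```python
# Markers named in the list above. That they appear as literal text in a
# data line is an assumption about the file layout.
NON_FILM_MARKERS = (
    "(TV)", "(V)", "(VG)", "(internet)", "(re-release)", "(Blu-ray premiere)",
)


def is_theatrical_film(line: str) -> bool:
    """Return True if a data line matches none of the removal rules."""
    if line.startswith('"'):  # lines that start with a " are removed
        return False
    return not any(marker in line for marker in NON_FILM_MARKERS)


def filter_lines(lines):
    """Yield only the lines that survive the removal rules."""
    for line in lines:
        if is_theatrical_film(line):
            yield line


# Made-up sample lines, not real IMDB entries.
sample = [
    '"Some Series" (2001) {Pilot (#1.1)}\t2001',
    "Some Film (1999)\t1999",
    "Some Video (2003) (V)\t2003",
    "Some Game (2005) (VG)\t2005",
    "Another Film (1985)\tUSA:12 July 1985\t(re-release)",
]
kept = list(filter_lines(sample))
print(kept)
```

Because the filter is a generator it can be run over a whole file line by line, which keeps the step easy to repeat from a script.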
We are also going to limit the films
in other ways:
any entry that doesn't have a year
(????) can be removed
any entry with a rating of USA:X or
USA:NC-17 can be removed
remove any entries in the following
genres: Short, Adult, Reality-TV, Talk-Show, Game-Show, News,
and any from any genres with fewer than 100 entries - note that this
one is different as films can have multiple genres, so if a film has
ANY of these genres all of its lines need to be removed, which should
leave about 22 genres
there are 250,000+ unique keywords,
so remove all keywords with fewer than 20 uses (this should remove
more than 95% of them), and remove any lines that use those exact
keywords. Note that you need to be careful how you remove lines with
those less-used keywords to make sure you are doing exact matches with
the entire keywords and not just substrings of the keywords.
for the certificates we are only
going to use the USA: certificates, you can remove those from other
countries. You can also remove the USA-TV certificates (TV-G, TV-14,
TV-Y, TV-PG, TV-MA) and lines for any films that were re-rated for TV
(TV rating) and anything with a video game certificate (E, E10+, T, C,
M, Not Rated)
for running times we are only going
to deal with films 60+ minutes long
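Two of the rules above behave differently: the genre rule removes every line of a film, while the keyword rule removes individual lines and needs exact matches. Here is a minimal sketch of both, assuming each data line is a title and a value separated by one or more tabs. The function names are hypothetical, and counting genre entries before any other removal is an assumption.

```python
from collections import Counter

EXCLUDED_GENRES = {"Short", "Adult", "Reality-TV", "Talk-Show", "Game-Show", "News"}


def split_line(line):
    """Split 'title<TAB(s)>value' into (title, value); the layout is an assumption."""
    title, _, value = line.partition("\t")
    return title, value.strip()


def drop_excluded_genres(genre_lines, min_entries=100):
    """Drop every line of any film that has at least one excluded or rare genre."""
    pairs = [split_line(line) for line in genre_lines]
    counts = Counter(genre for _, genre in pairs)
    banned = EXCLUDED_GENRES | {g for g, n in counts.items() if n < min_entries}
    bad_films = {title for title, genre in pairs if genre in banned}
    return [(t, g) for t, g in pairs if t not in bad_films]


def drop_rare_keywords(keyword_lines, min_uses=20):
    """Keep lines whose whole keyword (not a substring of it) has enough uses."""
    pairs = [split_line(line) for line in keyword_lines]
    counts = Counter(keyword for _, keyword in pairs)
    frequent = {k for k, n in counts.items() if n >= min_uses}
    return [(t, k) for t, k in pairs if k in frequent]


# Made-up lines with small thresholds so the effect is visible.
genres = ["Film A (1990)\tDrama", "Film A (1990)\tShort", "Film B (1991)\tDrama"]
print(drop_excluded_genres(genres, min_entries=1))

keywords = ["Film B (1991)\tlove", "Film C (1992)\tlove", "Film C (1992)\tlove-triangle"]
print(drop_rare_keywords(keywords, min_uses=2))
```

Testing set membership on the whole keyword avoids the substring problem: in the sample, "love-triangle" is dropped even though it contains the frequent keyword "love".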
Note that the name of a film plus its
year is the unique identifier in the IMDB (films with the same name that
came out the same year have modified years to differentiate them)
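Since name plus year is the key that ties the files together, it helps to extract it the same way everywhere. The sketch below assumes the modified years look like (1999/II), which you should verify against the data. The function name is hypothetical, and a title that itself contains a parenthesized four-digit number would need extra care.

```python
import re

# Matches "(1999)" or "(1999/II)"; an unknown year "(????)" does not match.
# The "/II" style suffix is an assumption about how same-name films are marked.
YEAR_RE = re.compile(r"\((\d{4})(?:/[IVXLC]+)?\)")


def film_key_and_year(title: str):
    """Return (unique key, release year), or None when no year is found."""
    match = YEAR_RE.search(title)
    if match is None:
        return None
    # Keep the suffix in the key so same-name, same-year films stay distinct.
    key = title[: match.end()]
    return key, int(match.group(1))


print(film_key_and_year("Some Film (1999)"))
print(film_key_and_year("Some Film (1999/II) (V)"))
print(film_key_and_year("Lost Film (????)"))
```

Entries for which this returns None are the ones without a year.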
Also note that the certificates (G,
PG, etc.) in the US have changed quite a bit over the years. Be sure to
keep track of all of them (aside from the ones we have removed) and not
just the current ones.
For a C you need to:
Show an initial set of visualizations when your application starts up
giving an overview of the entire (remaining) dataset showing
number of films released each year, number of films released
each month of the year (i.e. number of films released in
September across the whole database), distribution of running
times, distribution of certificates, distribution of genres, top
n keywords.
Allow the user to see a tabular version of each visualization to get
the actual numbers
Show the average number of films released per year, the average
number of films released per month, the average running time
Allow the user to choose a decade and see number
of films released each month, distribution
of running times, distribution of certificates, distribution of
genres, top n keywords for that decade compared to the overall
distribution for the entire data set
Allow the user to choose a year and see number
of films released each month, distribution
of running times, distribution of certificates, distribution of
genres, top n keywords for that year compared to the overall
distribution for the entire data set
Allow the user to bring up information on the dashboard about who
wrote the project, what libraries are being used to visualize it,
where the data came from, etc.
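The counts, averages, and decade comparisons above all reduce to grouping the cleaned records. Here is a minimal sketch on made-up records, with a hypothetical record layout and function names. Averaging only over years that contain at least one film is an assumption.

```python
from collections import Counter

# Hypothetical cleaned records: (film key, release year, release month 1-12).
films = [
    ("Film A (1994)", 1994, 9),
    ("Film B (1994)", 1994, 12),
    ("Film C (1999)", 1999, 9),
    ("Film D (2003)", 2003, 6),
]


def month_share(records):
    """Return {month: fraction of the records released in that month}."""
    counts = Counter(month for _, _, month in records)
    total = sum(counts.values())
    return {m: counts[m] / total for m in sorted(counts)} if total else {}


def in_decade(records, decade_start):
    """Select records whose year falls in decade_start .. decade_start + 9."""
    return [r for r in records if decade_start <= r[1] <= decade_start + 9]


per_year = Counter(year for _, year, _ in films)
avg_per_year = len(films) / len(per_year)  # averaged over years that have films
overall = month_share(films)
nineties = month_share(in_decade(films, 1990))
print(per_year, avg_per_year)
print(overall)
print(nineties)
```

Comparing fractions rather than raw counts puts a decade or a single year on the same scale as the whole data set. The raw counts are still what the tabular views need.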
For a B you need to be able to combine data from the different files
together and:
Allow the user to choose a genre and see the number of films released in
that genre each year, number in each decade, percentage of films
released in that genre each year, percentage in each decade, number of
films in that genre released each month, percentage of films released
in that genre each month, distribution of running times of films in
that genre, distribution of certificates for films in that genre, top
n keywords in that genre.
Allow the user to filter the visuals based on a selected keyword
from a list of the top n (n >= 10), certificate, or running time
(break the running times up into an appropriate set of categories).
The user should then see the visuals update to show the number of
films that satisfy those criteria released each year, each decade,
percentage of films released each year, percentage each decade, number
of films released each month, percentage of films released each month,
distribution of running times, distribution of certificates,
distribution of genres, top n keywords
associated with these films, range.
Allow the user to use combinations of these selections (genre +
keyword + running time + certificate) to see films that satisfy all four.
Allow the user to see a tabular version of each visualization to get
the actual numbers
Show the number of films that satisfy the current criteria out of
the total in the database
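One way to think about the combined selections is a single predicate in which every unset criterion is ignored. The sketch below uses a hypothetical Film record built by joining the files on the film key. The running-time categories are only an example, and a single certificate per film is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Film:
    key: str
    minutes: int
    certificate: str
    genres: set = field(default_factory=set)
    keywords: set = field(default_factory=set)


def runtime_bucket(minutes: int) -> str:
    """Hypothetical running-time categories; pick breaks that suit the data."""
    if minutes < 90:
        return "60-89"
    if minutes < 120:
        return "90-119"
    if minutes < 150:
        return "120-149"
    return "150+"


def matches(film, genre=None, keyword=None, bucket=None, certificate=None):
    """True when the film satisfies every criterion that is set (None = no filter)."""
    return (
        (genre is None or genre in film.genres)
        and (keyword is None or keyword in film.keywords)
        and (bucket is None or runtime_bucket(film.minutes) == bucket)
        and (certificate is None or film.certificate == certificate)
    )


# Made-up films for illustration.
films = [
    Film("Film A (1994)", 95, "PG", {"Drama"}, {"love"}),
    Film("Film B (1999)", 130, "R", {"Drama", "Crime"}, {"murder"}),
    Film("Film C (2003)", 100, "PG", {"Comedy"}, {"love"}),
]
selected = [f for f in films if matches(f, genre="Drama", certificate="PG")]
summary = f"{len(selected)} of {len(films)} films"
print(summary)
```

The same selected list can feed every chart and table, and the summary string gives the count out of the total.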
For an A you need to add:
Add a list of the top 10 films that satisfy the current criteria ordered
based on the ratings.list
Allow the user to specify more than one genre at a time and update
all of the B and C charts and tables appropriately
Allow the user to specify more than one keyword at a time and update
all of the B and C charts and tables appropriately
Allow the user to specify more than one certificate at a time and
update all of the B and C charts and tables appropriately
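Multiple selections and the top 10 list can be sketched as set-based filtering followed by a partial sort. The record layout and function names below are hypothetical. Treating several genres or certificates as "match any of them" is an assumption, since the requirement does not say whether any or all should match.

```python
import heapq

# Hypothetical merged records: (film key, rating, genres, certificate).
films = [
    ("Film A (1994)", 8.1, {"Drama"}, "PG"),
    ("Film B (1999)", 7.4, {"Crime", "Drama"}, "R"),
    ("Film C (2003)", 6.9, {"Comedy"}, "PG"),
    ("Film D (2010)", 8.8, {"Action"}, "PG-13"),
]


def select(records, genres=frozenset(), certificates=frozenset()):
    """Keep films matching ANY chosen genre and ANY chosen certificate.

    An empty selection means "no filter" for that field.
    """
    return [
        r for r in records
        if (not genres or r[2] & genres)
        and (not certificates or r[3] in certificates)
    ]


def top_rated(records, n=10):
    """Return the n highest-rated films, best first."""
    return heapq.nlargest(n, records, key=lambda r: r[1])


chosen = select(films, genres={"Drama", "Action"}, certificates={"PG", "PG-13"})
print([r[0] for r in top_rated(chosen)])
```

heapq.nlargest avoids sorting the full result when only ten rows are shown.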
Graduate students need to:
Allow the user to manually type in a keyword and see the same
visualizations from the B range
Allow users to select or type in multiple keywords for up to three
simultaneous searches and see the results visualized to compare them.
Come up with some interesting searches.
In all of these cases you need to make sure that your visualizations are
well constructed with good color and font choices, proper labeling, and
that they effectively reveal the truth about the data to the user.
Note that as part of the web page part of the grade you will need to use
your interface to show your findings, so make sure that the way your
interface displays information is clear.
Again, your app will be evaluated running full screen with touch
interaction on the classroom wall
and should not require scrolling.
For this project you should host your solution using the evl shiny
server. There is a new set of logins for this project in the form p3gX
where X is your group number. You should use this new login for project
3 to preserve the time stamps on everything from project 2.
There are two deadlines for this project. By the first deadline you should
have implemented the initial screen layout of your application and have
the basic functionality allowing the user to perform an example of the
various 'C' functionality on the target platform. This will make sure that
your group is on track and that you can focus on making a good interface
and set of visualizations, not just functional ones. Personally, I think
you should have the entire C functionality done at that point if you are
going for an A on the project as a whole. You should make this version of
the interface available on your group project page.
As part of the final turn in you should create a set of web pages that
describe your work on the project. This should include:
1 page with a link to your visualization solution and a description
of how to use your application and the things you can do with it.
1 page on the data you used, including where you got it, what you
did to it.
1 page with links to a zip file containing your well commented
source code, additionally needed data files, and any instructions
necessary to run it. These instructions should start from the
assumption that the reader has a web browser on their computer and
tell the user everything else he/she needs to know and do to get it
running using RStudio.
1 page on what interesting things you found using your application.
This one is particularly important. Show that you can use your
application to find interesting things in the dataset and show the
screenshots to prove it.
1 page on the roles of the different team members
All of these pages should have plenty of screenshots with meaningful
captions. Web pages like this can be very helpful later on in helping
you build up a portfolio of your work when you start looking for a job
so please put some effort into it.
Be sure to document any external
libraries, tools, etc. that you make use of - give credit where credit
is due for everything that you didn't create yourself.
You should also create a 2-3 minute
YouTube video showing the use of your application including narration
with decent audio quality. That video should be in a very obvious
place on your main project web page. You will most likely find it
useful to do some editing after capturing the video to tighten the
video up. It's also a good idea to have a video like this available as
a backup during your presentation just in case of gremlins.
You may want to shoot this video
on the wall itself.
The web page including screen
snapshots and video need to be done by the deadline so be sure to leave
enough time to get that work done.
I will be linking your web page to the course notes so please send Andy
and Sai a nice jpg image of your visualization for the web. This should be
named p3.<someone_in_your_groups_last_name>.jpg.
When the project is done, each person in the group should also send Andy a
private email with no one else CC'd ranking your coworkers on the project
on a scale from 1 (low) to 5 (high) in terms of how good a coworker they
were on the project. If you never want to work with them again, give them
a 1. If this person would be a first choice for a partner on a future
project then give them a 5. If they did what was expected but nothing
particularly good or bad then give them a 3. By default your score should
be 3 unless you have a particular reason to increase or decrease the
number. If you are giving a score other than 3 you need to say why. Please
confine your responses to 1, 2, 3, 4, or 5, with no 1/3s or .5s. Each
person's score on the project will be based on the overall score for the
group modified by these rankings.
Given UIC moving to on-line courses, we will not be able to
have in-class demonstrations, so we will do the presentations on-line
through Blackboard. You can do this by screen sharing a demo of your
application or by talking over the video you created.
Project 3 groups:
1 Vijay Vemu, Kevin Kowalski, Samuel Kajah
2 Ho Chon, Brandon Graver, Nicholas Abbasi
3 Angela Timochina
4 Matt Jankowski, Charly Sandoval, Amber Little
5 Desiree Murray, Amy Ngo
6 Syed Hadi, Josh Rowan, Sean Stiely
7 Abdul Latif, Imaad Sohrab, Jaoudat Karime
8 Ansul Goenka, Parikshit Solunke
9 Aashish Agrawal, Ivan Madrid, Richard Miramonte
10 Prachal Patel, Zohar Sajith
11 ?
Note that group numbers are changing
for project 3 and there will be new accounts on the shiny.evl.uic.edu
server.
These new accounts will be in the form p3g#
Current list of packages installed on the evl shiny server:
> installed.packages()[,1:2]
last revision 4/20/2020 - updated groups
4/11/2020 - fixed some inconsistencies in the B requirements