CS 424 Project 3

Team member choice due 4/7/20 at 8:59 pm Chicago time
Project final version due 4/27/20 at 8:59 pm Chicago time

We are going to make use of the internet movie database (IMDB) and visualize information related to genres and plot keywords over 100 years of film.

The current IMDB does not make a lot of data available for free, but the IMDB from December 2017 did, so we are going to make use of those data files which are available from ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/ among other sites.

Once you have downloaded the files, you can start filtering them using tools of your choice (grep, Python scripting, etc.). Just be sure to document the process so it can be duplicated and automated. This means either keeping track of every tool and command you use, or writing a script that does all of it for you, so that you could start by downloading these files and end up with the files you are going to use in your application.
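One way to keep the filtering repeatable is to put every step into a single script. The sketch below is only an illustration, not a required approach. The rule in is_probably_film (dropping blank lines and lines that start with a double quote), the latin-1 encoding, and all function names are assumptions made for this example; check them against the actual files and adjust.

```python
from typing import Callable, Iterable, List


def filter_lines(lines: Iterable[str], keep: Callable[[str], bool]) -> List[str]:
    """Return only the lines for which keep(line) is true."""
    return [line for line in lines if keep(line)]


def is_probably_film(line: str) -> bool:
    # ASSUMPTION (hypothetical rule): blank lines and entries whose title
    # starts with a double quote are not films. Verify against the real data.
    return bool(line.strip()) and not line.startswith('"')


def run_step(src_path: str, dst_path: str, keep: Callable[[str], bool],
             encoding: str = "latin-1") -> int:
    """Apply one filtering step to a file and return how many lines were kept.

    The encoding default is an assumption; change it if the files differ.
    """
    with open(src_path, encoding=encoding) as src:
        kept = filter_lines(src, keep)
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(kept)
    return len(kept)
```

Chaining several run_step calls in one script, each with its own rule, gives a record of the process that someone else can rerun from the downloaded files.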

The IMDB has more than movies, so you can start by removing a bunch of other things:

Note that the name of a film plus its year is the unique identifier in the IMDB (films with the same name that came out in the same year have modified years to differentiate them).
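Since the name plus year is the key used to join the files, it helps to parse it in one place. The sketch below assumes the key looks like "Name (1999)" and that a modified year looks like "(1999/II)"; that exact format, and the names KEY_RE and parse_film_key, are assumptions for illustration, so confirm the format in the data files first.

```python
import re

# ASSUMPTION: keys look like "Name (1999)" or "Name (1999/II)".
KEY_RE = re.compile(
    r"^(?P<name>.+) \((?P<year>\d{4})(?:/(?P<suffix>[IVXLC]+))?\)$"
)


def parse_film_key(key):
    """Split a film key into (name, year, suffix); return None if it does not match."""
    m = KEY_RE.match(key.strip())
    if m is None:
        return None
    return m.group("name"), int(m.group("year")), m.group("suffix")
```

Keeping the suffix as part of the key, rather than discarding it, preserves the distinction between same-named films from the same year when the files are joined.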

Also note that the certificates (G, PG, etc.) in the US have changed quite a bit over the years. Be sure to keep track of all of them (aside from the ones we have removed) and not just the current ones.

For a C you need to:

Show an initial set of visualizations when your application starts up, giving an overview of the entire (remaining) dataset: number of films released each year, number of films released each month of the year (i.e. number of films released in September across the whole database), distribution of running times, distribution of certificates, distribution of genres, top n keywords.
Allow the user to see a tabular version of each visualization to get the actual numbers
Show the average number of films released per year, the average number of films released per month, the average running time
Allow the user to choose a decade and see number of films released each month, distribution of running times, distribution of certificates, distribution of genres, top n keywords for that decade compared to the overall distribution for the entire data set
Allow the user to choose a year and see number of films released each month, distribution of running times, distribution of certificates, distribution of genres, top n keywords for that year compared to the overall distribution for the entire data set
Allow the user to bring up information on the dashboard about who wrote the project, what libraries are being used to visualize it, where the data came from, etc.
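The counts, averages, and decade comparison described above can be sketched as follows. This is a minimal illustration in Python, not the Shiny implementation; the record fields (year, month, runtime, certificate) and all function names are hypothetical, and "average per year" is taken here to mean total films divided by the number of distinct years, which is one reasonable reading.

```python
from collections import Counter


def count_by(films, field):
    """Count films by the value of one field, skipping missing values."""
    return Counter(f[field] for f in films if f.get(field) is not None)


def share(counts):
    """Turn counts into proportions so a subset can be compared to the whole."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def in_decade(films, decade_start):
    """Films released in the ten years starting at decade_start (e.g. 1990)."""
    return [f for f in films if decade_start <= f["year"] < decade_start + 10]


def average_per_year(films):
    # ASSUMPTION: average = total films / number of distinct release years.
    years = {f["year"] for f in films}
    return len(films) / len(years) if years else 0.0


def average_runtime(films):
    times = [f["runtime"] for f in films if f.get("runtime") is not None]
    return sum(times) / len(times) if times else 0.0
```

Comparing share(count_by(in_decade(films, 1990), "certificate")) with share(count_by(films, "certificate")) puts a decade and the full dataset on the same scale, which is what a side-by-side chart needs.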

For a B you need to be able to combine data from the different files together and:

Allow the user to choose a genre and see the number of films released in that genre each year, number in each decade, percentage of films released in that genre each year, percentage in each decade, number of films in that genre released each month, percentage of films released in that genre each month, distribution of running times of films in that genre, distribution of certificates for films in that genre, top n keywords in that genre.
Allow the user to filter the visuals based on a keyword selected from a list of the top n (n >= 10), a certificate, or a running time (break the running times up into an appropriate set of categories). The user should then see the visuals update to show the number of films that satisfy those criteria released each year, each decade, percentage of films released each year, percentage each decade, number of films released each month, percentage of films released each month, distribution of running times, distribution of certificates, distribution of genres, top n keywords associated with these films, range.
Allow the user to use combinations of these selections (genre + keyword + running time + certificate) to see films that satisfy all four
Allow the user to see a tabular version of each visualization to get the actual numbers
Show the number of films that satisfy the current criteria out of the total in the database
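The combined selection (genre + keyword + running time + certificate) and the "n out of total" readout can be sketched like this. It is an illustration under assumptions: the field names, the function names, and the running time cut points (60, 90, 120 minutes) are all made up for the example, and choosing appropriate categories is still up to you.

```python
def runtime_bucket(minutes):
    """Map a running time in minutes to a coarse category (cut points are assumptions)."""
    if minutes < 60:
        return "under 60"
    if minutes < 90:
        return "60-89"
    if minutes < 120:
        return "90-119"
    return "120 or more"


def matches(film, genre=None, keyword=None, certificate=None, runtime=None):
    """True if the film satisfies every criterion that was actually given."""
    return (
        (genre is None or genre in film["genres"])
        and (keyword is None or keyword in film["keywords"])
        and (certificate is None or film["certificate"] == certificate)
        and (runtime is None or runtime_bucket(film["runtime"]) == runtime)
    )


def select(films, **criteria):
    return [f for f in films if matches(f, **criteria)]


def summary(films, **criteria):
    """Text for the 'films that satisfy the current criteria out of the total' readout."""
    return f"{len(select(films, **criteria))} of {len(films)} films"
```

Because a criterion left as None is ignored, the same function covers a single filter and any combination of the four.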

For an A you need to add:

Add a list of the top 10 films that satisfy the current criteria ordered based on the ratings.list
Allow the user to specify more than one genre at a time and update all of the B and C charts and tables appropriately
Allow the user to specify more than one keyword at a time and update all of the B and C charts and tables appropriately
Allow the user to specify more than one certificate at a time and update all of the B and C charts and tables appropriately
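A sketch of the multi-select matching and the top 10 list follows. Whether several selected genres should mean "any of them" or "all of them" is not specified above; this example assumes "any" and says so in the code. The field names (including rating, standing in for whatever value you take from ratings.list) and the function names are hypothetical.

```python
def matches_any(film_values, selected):
    """True if nothing is selected, or the film has at least one selected value.

    ASSUMPTION: multiple selections are combined with OR; use a subset test
    instead if you decide that a film must have every selected value.
    """
    return not selected or bool(set(film_values) & set(selected))


def select_multi(films, genres=(), keywords=(), certificates=()):
    return [
        f for f in films
        if matches_any(f["genres"], genres)
        and matches_any(f["keywords"], keywords)
        and matches_any([f["certificate"]], certificates)
    ]


def top_rated(films, n=10):
    """The n highest-rated films, best first; films without a rating are skipped."""
    rated = [f for f in films if f.get("rating") is not None]
    return sorted(rated, key=lambda f: f["rating"], reverse=True)[:n]
```

Calling top_rated(select_multi(films, genres=["Drama", "War"])) then yields the list for the current criteria.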

Graduate Students need to:

Allow the user to manually type in a keyword and see the same visualizations from the B range
Allow users to select or type in multiple keywords for up to three simultaneous searches and see the results visualized to compare them. Come up with some interesting searches.
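Comparing up to three keyword searches amounts to running each search separately and lining up the results. The sketch below assumes that a search with several keywords means a film must carry all of them, and it compares films per year only; both choices, along with the field and function names, are assumptions for illustration.

```python
from collections import Counter


def search(films, keywords):
    """Films that carry every keyword in the search (ASSUMPTION: AND semantics)."""
    wanted = set(keywords)
    return [f for f in films if wanted <= set(f["keywords"])]


def compare_searches(films, searches, limit=3):
    """Map each search (labelled by its keywords) to films-per-year counts."""
    if len(searches) > limit:
        raise ValueError(f"at most {limit} simultaneous searches are supported")
    return {
        " + ".join(kws): dict(Counter(f["year"] for f in search(films, kws)))
        for kws in searches
    }
```

Each entry in the result can then feed one series in a shared chart, so the searches are drawn on the same axes.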

In all of these cases you need to make sure that your visualizations are well constructed with good color and font choices and proper labeling, and that they effectively reveal the truth about the data to the user.

Note that as part of the web page part of the grade you will need to use your interface to show your findings, so make sure that the way your interface displays information is clear.

Again, your app will be evaluated running full screen with touch interaction on the classroom wall and should not require scrolling.

For this project you should host your solution using the evl shiny server. There is a new set of logins for this project in the form p3gX where X is your group number. You should use this new login for project 3 to preserve the time stamps on everything from project 2.

There are two deadlines for this project. By the first deadline you should have implemented the initial screen layout of your application and have the basic functionality in place, allowing the user to perform an example of each of the various 'C' functions on the target platform. This will make sure that your group is on track and that you can focus on making a good interface and set of visualizations, not just functional ones. Personally, I think you should have the entire C functionality done at that point if you are going for an A on the project as a whole. You should make this version of the interface available on your group project page.

As part of the final turn in you should create a set of web pages that describe your work on the project. This should include:

1 page with a link to your visualization solution and a description of how to use your application and the things you can do with it.
1 page on the data you used, including where you got it and what you did to it.
1 page with links to a zip file containing your well commented source code, additionally needed data files, and any instructions necessary to run it. These instructions should start from the assumption that the reader has a web browser on their computer and tell the reader everything else they need to know and do to get it running using RStudio.
1 page on what interesting things you found using your application. This one is particularly important. Show that you can use your application to find interesting things in the dataset and show the screenshots to prove it.
1 page on the roles of the different team members

all of which should have plenty of screenshots with meaningful captions. Web pages like this can be very helpful later on in helping you build up a portfolio of your work when you start looking for a job so please put some effort into it.

Be sure to document any external libraries, tools, etc. that you make use of - give credit where credit is due for everything that you didn't create yourself.

You should also create a 2-3 minute YouTube video showing the use of your application including narration with decent audio quality. That video should be in a very obvious place on your main project web page. You will most likely find it useful to do some editing after capturing the video to tighten the video up. It's also a good idea to have a video like this available as a backup during your presentation just in case of gremlins. You may want to shoot this video on the wall itself.

The web page, including screen snapshots and video, needs to be done by the deadline, so be sure to leave enough time to get that work done.

I will be linking your web page to the course notes so please send Andy and Sai a nice jpg image of your visualization for the web. This should be named p3.<someone_in_your_groups_last_name>.jpg.

When the project is done, each person in the group should also send Andy a private email with no one else CC'd ranking your coworkers on the project on a scale from 1 (low) to 5 (high) in terms of how good a coworker they were on the project. If you never want to work with them again, give them a 1. If this person would be a first choice for a partner on a future project then give them a 5. If they did what was expected but nothing particularly good or bad then give them a 3. By default your score should be 3 unless you have a particular reason to increase or decrease the number. If you are giving a score other than 3 you need to say why. Please confine your responses to 1, 2, 3, 4, or 5, with no 1/3s or .5s please. Each person's score on the project will be based on the overall score for the group modified by these rankings.

Given UIC moving to on-line courses, we will not be able to have in-class demonstrations, so we will do the presentations on-line through Blackboard. You can do this by screen sharing a demo of your application or by talking over the video you created.

2020 Project 3 - Saturday Night at the Movies