Week 2

Introduction to R, Shiny, Jupyter, and GitHub



For the projects in this class we are going to use R to do the heavy lifting on the data side and Shiny to handle the front end so we can create interactive data visualization applications for the web. We will also be using R with Jupyter notebooks to show the process of using visualization in an investigation.




Currently there are two main freely available languages for doing this kind of work: R and Python, each with its advantages and disadvantages. For now I think R is still ahead given its level of support, but that may change in the next few years. The basics of using either language are the same, the techniques are the same, and after learning one it should not be difficult to move to the other, or to other solutions such as SAS.

Here are links to some web pages that were created with R and Shiny to show various ways they can be used to present data.
http://shiny.rstudio.com/gallery/


You should start by downloading R (4.1.2) (https://www.r-project.org/) from a mirror such as https://repo.miserver.it.umich.edu/cran/ and installing it,

and then download and install the free version of RStudio Desktop (2021.09) from https://rstudio.com/products/rstudio/download/

There is a very good set of video tutorials with source code, taking about 2-3 hours to complete, that will give you a solid grounding in the basics:
https://shiny.rstudio.com/tutorial/

There is also Swirl for learning R - http://swirlstats.com/ - a very nice way to learn the basics of the language.
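A minimal way to get started with swirl from the R console (it walks you through picking a lesson interactively):

install.packages("swirl")  # one-time install
library(swirl)
swirl()                    # launches the interactive tutorials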

We will be using R for all of the projects so it's worth taking the time to learn it now.

Some useful links

Start RStudio and you should see something similar to a modern IDE with an area for viewing and editing code, a terminal, an area for seeing the current state of the R environment, and a multi-purpose area.

A lot of the power of R comes from the variety of packages that can be installed into it.

To see what packages are currently installed, in the terminal at the > prompt, type installed.packages()[,1:2]

To see which packages are out of date type old.packages()

To update all old packages type update.packages(ask = FALSE)

Some useful libraries to install with install.packages() include:

e.g. install.packages("shiny")
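For reference, a single command that installs the packages used later in these notes (shiny for the projects, plus ggplot2, lubridate, and reshape2 for the examples below):

install.packages(c("shiny", "ggplot2", "lubridate", "reshape2"))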

I will be going through a demonstration of using these tools to visualize some local temperature data in class. For over 15 years we have been collecting temperature data in the various rooms of the lab, trying to understand why the temperature used to change so dramatically at times, and trying to keep both the people and the machines here happy.

The relevant files are located here

What this code produces is interactive, hosted on the shiny site, here

and a snapshot below:



The visualization was designed as a compromise between an HD display and our wide classroom wall, and to be touch-screen compatible in either case. That is also why the version you download has the menu at the upper left slightly lowered, so people could reach it to interact with it at the wall. The projects will involve developing and demonstrating visualizations for this classroom wall.

evl Classroom


Let's play a bit with the evl temperature data using R and RStudio. As you go through this tutorial take a screen snapshot including each of the visualizations. You will be turning these in through Gradescope.

one nice way to do this is to copy and paste the relevant commands below into the RStudio console (in the lower left panel of RStudio) one by one to see their effect on the current environment.


download a copy of the evlWeatherForR.zip file from the 'relevant files' link above and unzip it.
set the working directory to the evlWeatherForR directory
setwd("dir-you-want-to-go-to")

test if that's correct with
getwd()

to get a listing of the directory use
dir()

read in one file
evl2006 <- read.table(file = "history_2006.tsv", sep = "\t", header = TRUE)

be careful to avoid smart quotes if you copy and paste. You should now see evl2006 in the Global Environment at the upper right.

take a look at it
evl2006

and then some commands to get an overall picture of the data:
str(evl2006)
summary(evl2006)
head(evl2006)
tail(evl2006)
dim(evl2006)

hmmm almost all the fields are integers, including the hour and the temperature for the 7 different rooms, but Date is a factor - what is a factor? Use the Help viewer in RStudio. R assumes that this is categorical data and assigns an integer value to each unique string. (Note that with R 4.0 and later, read.table no longer converts strings to factors by default, so Date may show up as a character column instead; either way we will convert it to a proper date next.)
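A tiny illustration of how factors behave - hypothetical values, safe to try in the console:

colors <- factor(c("red", "blue", "red"))  # three strings, two unique values
levels(colors)      # "blue" "red" - the unique strings, sorted alphabetically
as.integer(colors)  # 2 1 2 - the integer codes assigned to each value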

convert the dates to internal format and remove the original dates
newDates <- as.Date(evl2006$Date, "%m/%d/%Y")
evl2006$newDate<-newDates
evl2006$Date <- NULL

we didn't need to use the intermediary newDates, but it can be safer when you are starting out so you can check your results before you overwrite things you didn't mean to. Right now our data sets are small enough that we shouldn't need to worry about running out of memory.
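For reference, the same conversion as a single line (an alternative to the three steps above - don't run it now, since Date has already been removed):

# evl2006$newDate <- as.Date(evl2006$Date, "%m/%d/%Y")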
if we use str(evl2006) again, now we have newDate in date format

try doing some simple graphs and stats

plot all temps in room 4 from 2006 using the built in plotting
plot(evl2006$newDate, evl2006$S4, xlab = "Month", ylab = "Temperature")

we can set the y axis to a fixed range for all temps in a given room
plot(evl2006$newDate, evl2006$S4, xlab = "Month", ylab = "Temperature", ylim=c(65, 90))

the built in plotting is nice for quick views, but it isn't very powerful or very nice looking, so now let's get the noon temp and plot it for one / all of the rooms using ggplot

if ggplot2 is not already installed then let's install it
install.packages("ggplot2")

and then let's load it in
library(ggplot2)


we want the rows of evl2006 where Hour is 12
noons <- subset(evl2006, Hour == 12)

we can list all the noons for a particular room
noons$S2

note that you can see similar information in the environment panel at the upper right

or plot them
ggplot(noons, aes(x=newDate, y=S2)) + geom_point(color="blue") +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + geom_line()

we can set the min and max and add some aesthetics

ggplot(noons, aes(x=newDate, y=S2)) + geom_point(color="blue") +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + geom_line() + coord_cartesian(ylim = c(65,90))

add a smooth line through the data

ggplot(noons, aes(x=newDate, y=S2)) + geom_point(color="blue") +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + geom_line() + coord_cartesian(ylim = c(65,90)) + geom_smooth()

no points just lines and the smooth curve

ggplot(noons, aes(x=newDate, y=S2)) +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + geom_line() + coord_cartesian(ylim = c(65,90)) + geom_smooth()

just the smooth curve

ggplot(noons, aes(x=newDate, y=S2)) +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + coord_cartesian(ylim = c(65,90)) + geom_smooth()

we can show smooth curves for all of the rooms at noon at the same time

ggplot(noons, aes(x=newDate)) +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + coord_cartesian(ylim = c(65,90)) + geom_smooth(aes(y=S2)) + geom_smooth(aes(y=S1)) + geom_smooth(aes(y=S3)) + geom_smooth(aes(y=S4)) + geom_smooth(aes(y=S5)) + geom_smooth(aes(y=S6))+ geom_smooth(aes(y=S7))

show smooth curves for all of the rooms at all hours at the same time

ggplot(evl2006, aes(x=newDate)) +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + coord_cartesian(ylim = c(65,85)) + geom_smooth(aes(y=S2)) + geom_smooth(aes(y=S1)) + geom_smooth(aes(y=S3)) + geom_smooth(aes(y=S4)) + geom_smooth(aes(y=S5)) + geom_smooth(aes(y=S6))+ geom_smooth(aes(y=S7))


we can play with the style of the points and make them blue
ggplot(evl2006, aes(x=newDate, y=S4)) + geom_point(color="blue") +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F")

we can create a bar chart of all the temps for a given room
ggplot(evl2006, aes(x=factor(S4)))  + geom_bar(stat="count", width=0.7, fill="steelblue")

or just the noon temps (note that we only see temps that actually occur in the data, so some temps may be 'missing' on the x axis, e.g. 79)
ggplot(noons, aes(x=factor(S5)))  + geom_bar(stat="count", fill="steelblue")

we can do a better bar chart that treats the temperatures as numbers so there won't be any missing, and we can get control over the range of the x axis. (column 6 of noons should be S5 here, matching the chart above - you can check with names(noons)[6])
temperatures <- as.data.frame(table(noons[,6]))
temperatures$Var1 <- as.numeric(as.character(temperatures$Var1))

we can get a summary of the temperature data
summary(temperatures)

ggplot(temperatures, aes(x=Var1, y=Freq)) + geom_bar(stat="identity", fill="steelblue") + labs(x="Temperature (F)", y = "Count") + xlim(60,90)



and then we could create a box and whisker plot of those values to see their distribution
ggplot(temperatures, aes(x = "", y = temperatures[,1])) + geom_boxplot() + labs(y="Temperature (F)", x="") + ylim(55,90)
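Note that the boxplot above shows the distribution of the distinct temperature values, not how often each one occurred. A quick sketch of boxplotting the raw noon readings themselves (again using room S5):

ggplot(noons, aes(x = "", y = S5)) + geom_boxplot() + labs(y="Temperature (F)", x="") + ylim(55,90)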


in this example we had several temperature values for a given time in each row (called "wide data"), but sometimes you get "long data" where, in this case, each of the temperature values would be in its own row with an identifier saying which room it is. The reshape2 library can convert between wide and long data. In this case we can load it and convert the noons data with

library(reshape2)
longNoons <- melt(data=noons, id.vars=c("Hour", "newDate"))

Now we use ggplot2 in a slightly different way, for example we can plot all of the noon temperatures for the 7 rooms with this one command, where 'variable' is the new column that melt created to hold the various room identifiers.

ggplot(longNoons) + geom_line(aes(x=newDate, y=value, color=variable))

or break them up into 7 separate plots
ggplot(longNoons) + geom_line(aes(x=newDate, y=value, color=variable)) + facet_wrap(~variable)
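reshape2 converts in the other direction as well. A quick sketch of going back from long to wide with dcast (it will guess that the 'value' column holds the measurements):

wideNoons <- dcast(longNoons, Hour + newDate ~ variable)  # one column per room again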


So we have a lot of options here. Shiny allows us to give a user access to these operations interactively on the web using a GUI.


Some things to be careful of:

- be careful of smart quotes - they are bad
- be careful of commas, especially in the Shiny code
- remember to set your working directory in RStudio
- try clearing out your RStudio session regularly with rm(list=ls()) and re-running your code to make sure it is self-contained
- be careful of groupings to get your lines to connect the right way in charts (see the sketch after this list)
- be careful what format your data is in - certain operations can only be performed on certain data types
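On the groupings point, a minimal sketch using the longNoons data from earlier - without a group or colour aesthetic, geom_line connects every point with a single line:

# ungrouped: one line zig-zags through all seven rooms' points
ggplot(longNoons) + geom_line(aes(x=newDate, y=value))
# grouped by room: seven separate lines (all black; adding color=variable also implies grouping)
ggplot(longNoons) + geom_line(aes(x=newDate, y=value, group=variable))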






Here is another data set to play with on Thursday in class as part of a data scavenger hunt using Jupyter

While we used R in RStudio last time, today we are going to use R in a Jupyter Notebook. RStudio and Shiny are nice for creating interactive applications on the web and the IDE is very helpful for seeing information about the data you have loaded and the functions available to you. Jupyter is better for showing the sequence of an investigation through a data set.


You should first install Anaconda (Individual Edition, Graphical Installer, with Python 3.8) to make the package management somewhat easier - https://www.anaconda.com/products/individual#Downloads

It includes Python 3.8.X in case you don't already have it.

Then launch Anaconda-Navigator and follow the tutorial here through step 4: https://docs.anaconda.com/anaconda/navigator/tutorials/r-lang/ - then create a new R notebook by going to Home in Anaconda Navigator, launching Jupyter Notebook, and then using the New dropdown in the upper right to create a new R notebook. Note that choosing a version of Python other than 3.7 during the creation of the new environment might cause the install to fail with incompatible packages.

Note that Anaconda and Jupyter use a separate set of R libraries from the ones in RStudio. There are ways to link the two, but for now this keeps things simple. RStudio will likely be running R 4.1.X while Anaconda will be running R 3.6.X.

A blank notebook should load.

If instead the notebook keeps trying to load the kernel for about 30 seconds, you see issues with IRkernel in the terminal window, and eventually you get a connection error, then we need to add a library to R. Go back to Anaconda, click on the R environment that you created, and choose 'Open Terminal'. In the terminal window that appears, type 'R' at the prompt to launch the Anaconda version of R. Then at the R prompt type install.packages('IRkernel'), let it install, and then type IRkernel::installspec(). Now if you launch a new Jupyter instance and create a new R notebook in it, the kernel should load.

(if this isn't the first time you have installed or upgraded R, you may get a message that there is a prior installation of IRkernel that can't be removed. In that case you will likely need to go and remove the older version manually. In R you can ask where the packages are stored with .libPaths(), then go to that location in the file system, delete the older version, and repeat the install command)

In the future you may need to re-launch this terminal window to update some of the packages in R.

Try to find interesting trends and changes in those trends. One of the main ideas here is to get a feel for how people use visualization interactively to look for patterns and events and outliers in the data. In this case we will start with some familiar concepts of utility usage - electricity, water, and natural gas.

When you are done, download your Jupyter notebook as a notebook and as an HTML file, print the HTML file to a PDF, and add this PDF to your submission on Gradescope.

a quick reference:


Here is some background information on the data:


Some dates that you should be able to find in the data



Here is how to start playing with the data:

if you want to play with a downloaded version of the data then set the correct path to the data (as we did above) and then
utility <- read.table(file = "utilitydata2021.tsv", sep = "\t", header = TRUE)

or read the data from the web with
utility <- read.table(file = "https://www.evl.uic.edu/aej/424/utilitydata2021.tsv", sep = "\t", header = TRUE)

sometimes there is missing data - let's check
complete.cases(utility)

at this point you should get a whole bunch of TRUEs - if all of the rows are TRUE, we are good. If not, you can preview the complete rows and then keep only those:
utility[complete.cases(utility), ]
utility <- utility[complete.cases(utility), ]


let's convert the Year and Month columns into a date (if lubridate isn't installed yet, run install.packages("lubridate") first)
library(lubridate)

what will it look like if we concatenate the Year and Month columns (with "01" standing in for the day):
paste(utility$Year, utility$Month, "01", sep="-")

let's create a new field using that concatenation to create a date:
utility$newDate <- ymd(paste(utility$Year, utility$Month, "01", sep="-"))

and then let's use ggplot2 to take a look at some things like temperature, which should have a familiar cyclical pattern and reasonable values for Chicago

library(ggplot2)
ggplot(utility, aes(x=newDate, y=Temp_F)) + geom_point() + geom_line()

that should look pretty regular. Now, depending on your screen, the plot may look a little small, which can make things hard to read, but we can make the plot wider and taller for all future graphs with a command like
options(repr.plot.width=20, repr.plot.height=4)

and then I could take a look at some of the utility data like natural gas to see if it follows the same pattern while adding some better labels
ggplot(utility, aes(x=newDate, y=Gas_Th_per_Day)) + geom_point() + geom_line() + labs(title = "Monthly Natural Gas Usage", x = "Year", y ="Ave Therms per day")

hmmm ... here there is a general pattern, and a time when that general pattern changes somewhat

I could draw both the temperature and gas usage lines together and color the temperature in red and the amount of gas used in blue. The gas usage is also a much smaller number than the temperature, so let's scale it up by a factor of 10 before we display it. Since natural gas is used for cooking and heating, does the pattern make sense? (note that the colour settings go outside aes() - inside aes() they would be treated as data values to map rather than literal colors)
ggplot(utility, aes(x=newDate, y=10*Gas_Th_per_Day)) + geom_line(colour="blue") + geom_line(aes(y=Temp_F), colour="red") + labs(title = "Monthly Natural Gas Usage vs Temperature", x = "Year", y ="Temp(F) & 10*Ave Therms per day")

or take a look at electricity and set the y axis lower limit to 0 and the upper limit to 60
ggplot(utility, aes(x=newDate, y=E_kWh_per_Day)) + geom_line() + geom_point() + coord_cartesian(ylim = c(0,60))  + labs(title = "Monthly Electricity Usage", x = "Year", y ="Ave kWh per day")

since electricity is used for air conditioning, among other things, there should be an overall pattern that is similar to the natural gas usage, but like the natural gas usage there are shifts in the overall patterns. Hmmm ... can electricity usage fall below zero? Yes it can.

we could just look at June since that is a warm month
junes <- subset(utility, Month == 6)
ggplot(junes, aes(x=newDate, y=E_kWh_per_Day)) + geom_point(color="blue") + geom_line(size=2) + coord_cartesian(ylim = c(0,60)) + labs(title = "June Electricity Usage", x = "Year", y ="Ave kWh per day")

maybe I want to compare June electricity usage to the temperature in June over all these years to see if there is a direct correlation
ggplot(data=junes, aes(x=newDate, y=E_kWh_per_Day, colour="E_kWh_per_Day")) + geom_point() + geom_line(size=2) + coord_cartesian(ylim = c(0,80)) + geom_line(aes(y=Temp_F, colour="Temp_F"),size=2) + labs(title = "June Electricity Usage Compared to Temperature", x = "Year", y ="kWh per day & Temperature (F)")
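To put a number on that relationship, we can also compute the correlation directly with R's built-in cor() - a value near 1 means the two rise and fall together:

cor(junes$Temp_F, junes$E_kWh_per_Day)  # correlation between June temperature and electricity use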


For the rest of the class you should investigate the water, gas, and electricity data to see what you can find, and create a report in your group's Jupyter Notebook documenting your findings. Remember doing this in science class? Same thing, just with data.


All of this from data on a 1-month time scale. As this data becomes available on smaller timescales, down to 30 minutes for ComEd data, it becomes easier and easier to track people's behavior, even down to knowing what room a person is likely in based on real-time utility usage (knowing when a gas oven turns on, or a PC starts using more power). We will talk more about privacy issues later in the course.





GitHub

If you don't have git on your machine you can download it here - https://git-scm.com/
The Windows installer has a lot of options, but if you stick to the defaults it should work fine. This will give you a nice command line version and a primitive GUI version, so you may want to look into other GUI clients if that is your style.

We will be using GitHub for turning in the projects - https://github.com/

If you don't have a GitHub account yet you should sign up for one.

Some Tutorials

GitHub has their own getting started page which is a nice intro - https://guides.github.com/activities/hello-world/


This is a nice intro for the basics - https://rubygarage.org/blog/most-basic-git-commands-with-examples

Quick Command List - https://rogerdudler.github.io/git-guide/

At minimum you will be using git to turn in all of the code and data files for your projects, but I would also recommend regularly updating your files on GitHub so there is some external proof of when you submitted, as well as having backup copies at various checkpoints. Note that GitHub can also be a nice place to store a copy of your website files to prove they were done on time, and you can host your project website there as well if you prefer. That said, I would not rely solely on GitHub for backing up your projects - keep multiple backups in multiple places.

For example I could take my evl weather R project above and move it up to a new git repository:

- I already have a Git account so I can use my favorite web browser to go to github.com and sign in and use the + in the upper right corner to add a New repository
- I can give it a name like Week2 and make it private for now. Then I click the green button to create the repository
- Keep this tab open in your browser - we will come back to it.

- Go to the terminal and change the current directory to the one for your R project, e.g. cd Documents/CS424/evlWeatherForR

- If you look at the files in that directory you should see an app.R file, a jpg, some tsv files, etc. Type git init to create a new .git folder.

- Now we need to link this local repository to GitHub. Go back to GitHub, look at the Code tab, and click on the green Code button to get the https link, which will look something like https://github.com/YourGitAccountName/YourGitProjectName.git - back in the terminal you can use the command git remote add origin https://github.com/YourGitAccountName/YourGitProjectName.git - so in my case git remote add origin https://github.com/andyevl/Week2.git

- You can then add all the files to the list to be tracked with git add --all

- You can then commit the changes with an initial comment (feel free to modify the comment) with git commit -m 'starter project'

- You can check the status with git status. Status is helpful to tell you which files are being tracked, and which are untracked, which have been modified, etc.

   - Another helpful command, given the October 2020 switch from 'master' to 'main' as the default primary branch name, is git branch. This gives you the name of the branch you are on. After October 2020 it should be 'main'. If it is 'master', then a handy command is git branch -m master main to rename the branch to main in order to match GitHub's new default.

- Now we can finally push our code up onto GitHub with git push origin main

- This will be slightly unhappy since it wants a comment but you can just save out of the editor.

- If you now go back to GitHub and refresh the page you should see a lot more files there with their modification dates.

- Whenever you want to update the repository at GitHub with your latest version, you would go to the top level of your project directory and type this sequence to update only those files that have changed (or were added to the repository):

- git add --all
- git commit -m 'new comment for this version'
- git push origin main

GitHub has limits on file sizes (100 MB uploaded from the command line and 25 MB uploaded from the browser) and on the maximum size of a repository (2 GB). The R code for the projects will be small, but the data files will get larger as the projects go on, and you will need to make sure that you can upload your data files to GitHub, which will very likely mean trimming down the size of your data files, breaking them into pieces, and possibly storing them in a binary format (see the sketch below), all of which will also help your dashboards run faster. This is another reason to get a running version of your code onto GitHub early, so you don't need to deal with breaking up your data file(s) near the project deadline. Git LFS (large file storage) exists but has been unreliable in past classes, so it is NOT an option for this course.
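On the binary format point, a minimal sketch using R's built-in serialization (the .rds file name here is hypothetical, reusing the utility data frame from the Jupyter exercise above); .rds files are compressed, so they are typically much smaller than the equivalent .tsv and faster to read:

saveRDS(utility, "utilitydata2021.rds")    # write a compressed binary copy (hypothetical file name)
utility <- readRDS("utilitydata2021.rds")  # read it back into a data frame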



For the rest of this week I'd like people to get R and RStudio installed, and if you have not used R before then start with Swirl - https://swirlstats.com/

If you are familiar with R then take a look at the Shiny video tutorials at https://shiny.rstudio.com/tutorial/ - they will be really helpful in getting ready for the course projects.



By the end of this week you should get the evl weather example above running in your local copy of RStudio, then create a shinyapps.io account and move the app up there so you can see it running on the web (see the sketch below for one way to deploy). Add another page to your Gradescope submission containing the URL of your shinyapps.io site.
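One way to deploy is from the R console with the rsconnect package - a sketch, assuming you have created a shinyapps.io account (the name, token, and secret placeholders below are hypothetical; copy the real values from the Tokens page of your shinyapps.io dashboard):

install.packages("rsconnect")
# hypothetical placeholders - use the values from your own shinyapps.io account
rsconnect::setAccountInfo(name = "youraccount", token = "YOURTOKEN", secret = "YOURSECRET")
rsconnect::deployApp("path/to/evlWeatherForR")  # the directory containing app.R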


Coming Next Time

The Basics

last revision 1/25/2022 - added a note about Anaconda compatibility issues if a version of python other than 3.7 is chosen during the creation of the R environment
1/20/2022 - tweaked the Jupyter example to be a bit nicer about labelling