Week 2

Introduction to R and Shiny



For the projects in this class we are going to use R to do the heavy lifting on the data side and Shiny to handle the front end so we can create interactive data visualization applications for the web.




Here are links to some web pages that I created with R and Shiny to illustrate various ways they can be used to present data.
http://shiny.rstudio.com/gallery/


You should start by downloading R (2017-11-30 Kite-Eating Tree) from a CRAN mirror such as https://cran.cnr.berkeley.edu/
and then R-Studio (1.1.383) from https://www.rstudio.com/products/rstudio/download/

There is a very good set of video tutorials with source code that takes about 2-3 hours to complete:
https://shiny.rstudio.com/tutorial/ - it will give you a solid grounding in the basics.

There is also Swirl for learning R - http://swirlstats.com/

Some useful links


Some useful libraries to install with install.packages():

e.g. install.packages("shiny")
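For example, the other packages that show up later in these notes can be installed the same way (this is just the set used below, not an exhaustive list):

install.packages("ggplot2")     # plotting
install.packages("lubridate")   # date handling for the scavenger hunt data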

I will be going through a demonstration of using these tools to visualize some local temperature data in class. For the last 12 years we have been collecting temperature data in the various rooms of the lab, trying to understand why the temperature changes so dramatically at times.

The relevant files are located here

An interactive version of what this code produces is here

and a snapshot below:



Let's play a bit with the evl temp data using R and RStudio

One nice way to do this is to load the following text into RStudio as a new text file (sitting in the upper left in the standard layout) and then copy and paste the relevant lines into the console (lower left) one by one to see their effect on the current environment. I have made a copy of this file available here



set the working directory (the folder containing the .tsv data files)
setwd("path/to/your/data")
getwd() to check that it worked
dir() to see the directory listing

note that tab works for auto-complete

read in one file
evl2006 <- read.table(file = "history_2006.tsv", sep = "\t", header = TRUE)    # be careful of smart quotes if you copy and paste

can do a quick plot of the raw data - note here that the dates are basically just strings
plot(evl2006$Date, evl2006$S1, xlab = "Month", ylab = "Temperature")

convert the dates to the internal format and remove the original dates
newDates <- as.Date(evl2006$Date, "%m/%d/%Y")
evl2006$newDate<-newDates
evl2006$Date <- NULL
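A quick sanity check that the conversion worked:

class(evl2006$newDate)    # should now be "Date"
range(evl2006$newDate)    # earliest and latest reading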


It's good to start out by taking a look at the range of the data and doing some quick plots to make sure the data is what you expect.

evl2011 <- read.table(file = "history_2011.tsv", sep = "\t", header = TRUE)
newDates <- as.Date(evl2011$Date, "%m/%d/%Y")
evl2011$newDate<-newDates
evl2011$Date <- NULL
plot(evl2011$newDate, evl2011$S1, xlab = "Month", ylab = "Temperature")

str() to get the structure, summary() for per-column summaries, head() and tail() to see the first and last rows, dim() for the dimensions
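For example, on the data frame we just loaded:

str(evl2011)       # column names and types
summary(evl2011)   # min / max / quartiles for each column
head(evl2011)      # first few rows
tail(evl2011)      # last few rows
dim(evl2011)       # number of rows and columns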


Some years (like 2011) have some bad values, which in this case show up as "32". We know it wasn't freezing in the room then, so we should remove those invalid readings.

summary(evl2011)
allData <- subset(evl2011, S1 != "32" & S2 != "32" & S3 != "32" & S4 != "32" & S5 != "32" & S6 != "32" & S7 != "32" )
summary(allData)
plot(allData$newDate, allData$S1, xlab = "Month", ylab = "Temperature")
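As a quick check on how much data the filtering removed:

nrow(evl2011) - nrow(allData)    # number of rows that contained a bad "32" reading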



try doing some graphs and stats


plot all temps in room 4 using the built-in plotting
plot(evl2006$newDate, evl2006$S4, xlab = "Month", ylab = "Temperature")

... and set the y axis range so that plots of different rooms / years are directly comparable
plot(evl2006$newDate, evl2006$S4, xlab = "Month", ylab = "Temperature", ylim=c(65, 90))


but the built-in plotting isn't very powerful or very nice looking, so now let's get the noon temps and plot them for one / all of the rooms using ggplot
install.packages("ggplot2")
library(ggplot2)


we want values from evl2006 where Hour is 12
noons <- subset(evl2006, Hour == 12)

we can list all the noons for a particular room
noons$S2

or plot them
ggplot(noons, aes(x=newDate, y=S2)) + geom_point(color="blue")

or plot them with some better labels
ggplot(noons, aes(x=newDate, y=S2)) + geom_point(color="blue") +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F")

or plot them and add a line through them in order

ggplot(noons, aes(x=newDate, y=S2)) + geom_point(color="blue") +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + geom_line()

we can set the min and max values of the y axis

ggplot(noons, aes(x=newDate, y=S2)) + geom_point(color="blue") +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + geom_line() + coord_cartesian(ylim = c(65,90))

add a smooth line through the data

ggplot(noons, aes(x=newDate, y=S2)) + geom_point(color="blue") +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + geom_line() + coord_cartesian(ylim = c(65,90)) + geom_smooth()

no points, just the line and the smooth curve

ggplot(noons, aes(x=newDate, y=S2)) +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + geom_line() + coord_cartesian(ylim = c(65,90)) + geom_smooth()

just the smooth curve

ggplot(noons, aes(x=newDate, y=S2)) +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + coord_cartesian(ylim = c(65,90)) + geom_smooth()

we can show smooth curves for all of the rooms at noon at the same time

ggplot(noons, aes(x=newDate)) +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + coord_cartesian(ylim = c(65,90)) + geom_smooth(aes(y=S2)) + geom_smooth(aes(y=S1)) + geom_smooth(aes(y=S3)) + geom_smooth(aes(y=S4)) + geom_smooth(aes(y=S5)) + geom_smooth(aes(y=S6)) + geom_smooth(aes(y=S7))

show smooth curves for all of the rooms at all hours at the same time

ggplot(evl2006, aes(x=newDate)) +  labs(title="Room Temperature in room ???", x="Day", y = "Degrees F") + coord_cartesian(ylim = c(65,85)) + geom_smooth(aes(y=S2)) + geom_smooth(aes(y=S1)) + geom_smooth(aes(y=S3)) + geom_smooth(aes(y=S4)) + geom_smooth(aes(y=S5)) + geom_smooth(aes(y=S6)) + geom_smooth(aes(y=S7))
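A tidier way to get the same kind of plot (just a sketch, assuming the tidyr package is installed - it isn't used elsewhere in these notes) is to reshape the data into "long" format so each row is one reading for one sensor, and then let ggplot draw one smooth curve per room and add a legend automatically:

library(tidyr)
library(ggplot2)

# turn the S1..S7 columns into a Room / Temp pair of columns
evl2006long <- pivot_longer(evl2006, cols = S1:S7, names_to = "Room", values_to = "Temp")

ggplot(evl2006long, aes(x = newDate, y = Temp, color = Room)) +
  labs(title = "Room Temperatures", x = "Day", y = "Degrees F") +
  coord_cartesian(ylim = c(65, 85)) +
  geom_smooth()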






we can create a bar chart of all the temps for a given room
ggplot(evl2006, aes(x=factor(S4)))  + geom_bar(stat="count", width=0.7, fill="steelblue")

or just the noon temps (note that it only lists temperatures that actually occurred, so some values may be 'missing' along the x axis), e.g. for room S6
ggplot(noons, aes(x=factor(S6)))  + geom_bar(stat="count", fill="steelblue")

we can do a better bar chart that treats the temperatures as numbers
temperatures <- as.data.frame(table(noons[,5]))                     # table() counts how often each temperature occurs (column 5 is one of the room sensors)
temperatures$Var1 <- as.numeric(as.character(temperatures$Var1))    # the table labels come back as a factor, so convert them back to numbers

ggplot(temperatures, aes(x=Var1, y=Freq)) + geom_bar(stat="identity", fill="steelblue") +
 labs(x="Temperature (F)", y = "Count") + xlim(60,90)

we can get a summary of the temperature data
summary(temperatures)

and then could create a box and whisker plot of those values
ggplot(temperatures, aes(x = "", y = Var1)) + geom_boxplot() + labs(y="Temperature (F)", x="") + ylim(55,90)
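Note that the boxplot above summarizes the distinct temperature values; if you want the box and whiskers to reflect how often each temperature actually occurred, one option (a sketch, using room S2 as the example) is to plot the raw noon readings directly:

ggplot(noons, aes(x = "", y = S2)) + geom_boxplot() + labs(y="Temperature (F)", x="") + ylim(55,90)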


so we have a lot of options here.


Shiny lets us give users a way to do these kinds of things interactively on the web.
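As a rough sketch of what that looks like (just an illustration - it assumes the noons data frame built above is already loaded and that you have a reasonably recent version of ggplot2), a tiny Shiny app for browsing the noon temperatures might be:

library(shiny)
library(ggplot2)

# assumes 'noons' (the noon subset built above) exists in the environment
ui <- fluidPage(
  titlePanel("EVL noon temperatures"),
  selectInput("room", "Room sensor:", choices = paste0("S", 1:7)),
  plotOutput("tempPlot")
)

server <- function(input, output) {
  output$tempPlot <- renderPlot({
    # .data[[...]] lets ggplot use the column picked in the drop-down
    ggplot(noons, aes(x = newDate, y = .data[[input$room]])) +
      geom_point(color = "blue") +
      labs(x = "Day", y = "Degrees F") +
      coord_cartesian(ylim = c(65, 90))
  })
}

shinyApp(ui = ui, server = server)

The video tutorials linked above walk through this ui / server structure in much more detail.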


Some things to be careful of:

- be careful of smart quotes - R doesn't like them
- be careful of commas, especially in the shiny code
- remember to set your working directory in R Studio
- try clearing out your RStudio session regularly with rm(list=ls()) and re-running your code to make sure it is self-contained
- be careful of groupings to get your lines to connect the right way in charts
- be careful what format your data is in - certain operations can only be performed on certain data types




Here is another data set to play with on Thursday in class as part of a data scavenger hunt

You should form a group of 3 or 4 people and try to find interesting trends and changes in those trends. One of the main ideas here is to get a feel for how people use visualization interactively to look for patterns and events and outliers in the data. In this case we will start with some familiar concepts of utility usage - electricity, water, and natural gas.

Create a web page linked off of one of the student pages with the names of all the members of your group, and document your findings with screen snapshots from the application and text describing what you think you found. By the end of class email the location of your group's page to andy.

Here is some background information:

Here is how to start playing with the data:

utility <- read.table(file = "utilitydata.tsv", sep = "\t", header = TRUE)

the last row is all NAs - let's remove any rows that have incomplete data
complete.cases(utility) shows the last row is FALSE, all the others TRUE
utility[complete.cases(utility), ]
utility <- utility[complete.cases(utility), ]
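Base R's na.omit() does the same thing in one step, if you prefer:

utility <- na.omit(utility)    # drop any rows containing NA values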

let's convert the Year and Month columns into a proper date
library(lubridate)
paste(utility$Year, utility$Month, "01", sep="-")
utility$newDate <- ymd(paste(utility$Year, utility$Month, "01", sep="-"))

library(ggplot2)
ggplot(utility, aes(x=newDate, y=Temp_F)) + geom_point(color="blue") + geom_line()
or
ggplot(utility, aes(x=newDate, y=Gas_Th_per_Day)) + geom_point(color="blue") + geom_line()

and set the y axis range so it starts at 0
ggplot(utility, aes(x=newDate, y=E_kWh_per_Day)) + geom_point(color="blue") + geom_line() + coord_cartesian(ylim = c(0,80))

we could just look at June
junes <- subset(utility, Month == 6)
ggplot(junes, aes(x=newDate, y=E_kWh_per_Day)) + geom_point(color="blue") + geom_line() + coord_cartesian(ylim = c(0,80))

maybe I want to compare it to the temperature in June to see if there is a direct correlation
ggplot(junes, aes(x=newDate, y=E_kWh_per_Day)) + geom_point(color="blue") + geom_line() + coord_cartesian(ylim = c(0,80)) + geom_line(aes(y=Temp_F))
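If you want a number to back up what the overlaid lines suggest, you could also compute the correlation directly, or plot one variable against the other with a fitted line (a quick sketch):

cor(junes$E_kWh_per_Day, junes$Temp_F)
ggplot(junes, aes(x=Temp_F, y=E_kWh_per_Day)) + geom_point(color="blue") + geom_smooth(method="lm")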


Some things with long-term effects you might have found evidence for in the data:
But you can also see shorter term events like possible long trips that reduced all utility usage, or conversely times when a bunch of people visited and usage went up. Several people also found the massive water use in the summer of 2002 when we were trying to fight off an infection among the fish in our pond.

All of this from data on a 1-month time scale. As this data becomes available on smaller timescales it becomes easier and easier to track people's behaviour, even down to knowing which room a person is likely in based on real-time utility usage. We will talk more about privacy issues later in the course.



For the rest of this week I'd like people to get R and RStudio installed, and if you have not used R before then start with Swirl http://swirlstats.com/

If you are familiar with R then take a look at the shiny video tutorials at http://shiny.rstudio.com/tutorial/

Feel free to work with partners and in groups in class to get familiar with the tools before moving on to work individually on Project 1.



By the end of this week you should get the evl temperature example above running in your local copy of RStudio, then create a shinyapps.io account and move the app up there. Add an obvious, descriptive link to this visualization to the public webpage you created last week - hint: breaking up your webpage into tabs or sections based on the week in class might be a good idea.
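If you have not published to shinyapps.io before, the usual pattern looks roughly like the sketch below; the account name, token, and secret are placeholders that you copy from your shinyapps.io dashboard, and the path is whatever folder your app lives in:

install.packages("rsconnect")
library(rsconnect)

rsconnect::setAccountInfo(name = "your-account-name",
                          token = "YOUR_TOKEN",
                          secret = "YOUR_SECRET")

rsconnect::deployApp("path/to/your/app")    # the folder containing app.R (or ui.R / server.R)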


Coming Next Time

The Basics

last revision 1/30/18 - added some possible things that you might have found evidence for in your scavenger hunt