Week 8

Social Network Visualization & Common Data Transforms



Social Networks

Very nice overview of the display side of this work at: https://www.cmu.edu/joss/content/articles/volume1/Freeman.html

and a nice general introduction here: http://faculty.ucr.edu/~hanneman/nettext/index.html


Goal


Two major representations:

Graphs (more common)
Matrix
(as an aside, for more info about Bob and friends, see: http://www.imdb.com/title/tt0064100/ )



5 phases in how these images have been generated


Jacob Moreno 1930s

"We have first to visualize . . . A process of charting has been devised by the sociometrists, the sociogram, which is more than merely a method of presentation. It is first of all a method of exploration. It makes possible the exploration of sociometric facts. The proper placement of every individual and of all interrelations of individuals can be shown on a sociogram. It is at present the only available scheme which makes structural analysis of a community possible."

 "The fewer the number of lines crossing, the better the sociogram."

The most famous of his graphs is the friendship network among elementary school students.



Boys are represented as triangles, girls as circles. An arrow from A to B shows that person A considers person B to be a friend. If both people consider each other to be friends, there is a dash in the middle of the line.

Another example shows both liking someone (red) and disliking someone (black) for the players on an American football team. Note that no one likes 5RB and several people actively dislike him. How well are they likely to block for him when he has the ball?



He would often try to position the points on the page in relation to their actual location in physical space. If he had no particular reason to put the nodes in particular locations he would default to a circle.

Moreno introduced five important ideas about the proper construction of images of social networks.



Lundberg and Steele 1930s emphasized the sociometric status of each node by making nodes with high status larger and placing them near the center of the graph.

Northway 1940s created the target sociogram where nodes in the center are chosen more often than nodes further out and all the points in the same ring are chosen the same number of times. She emphasized that lines should be short. Here is her target sociogram of a first grade class.



Clearly this is all pretty straightforward when there aren't that many nodes, but as the number of nodes and edges increases, the visualizations get crowded and confusing very quickly.

The visualization of steroid use in baseball is a nice, more modern example: http://www.orgnet.com/steroids.html


These days, research also involves looking at how these social networks change over time.

Another example is the xkcd character interaction maps - https://xkcd.com/

With more modern technology, users should have the ability to move the nodes around and to collapse and expand hierarchies.



Stanley Milgram 1967 - the small-world phenomenon - networks that exhibit this property are composed of a number of densely knit clusters of nodes, but at the same time these clusters are well connected, so the average path length between any two randomly chosen nodes is short (famously, about six steps).

More on the small world experiment:
https://en.wikipedia.org/wiki/Small-world_experiment
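If you want to play with this idea in R once igraph is installed, the Watts-Strogatz model generates small-world networks, and igraph can report the clustering and average path length directly. This is just a quick sketch (not part of the homework), and the parameter values are only illustrative:

library(igraph)

set.seed(1)

# Watts-Strogatz model: a ring lattice with a small fraction of edges randomly rewired
g <- sample_smallworld(dim = 1, size = 1000, nei = 5, p = 0.05)

transitivity(g)    # clustering stays fairly high (densely knit neighborhoods)
mean_distance(g)   # but the average shortest path between nodes stays short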




There are many tools for looking at Facebook friends.

Here is a nice look at Facebook connections across the US - https://petewarden.com/2010/02/06/how-to-split-up-the-us/



Enron emails - http://www.visualcomplexity.com/vc/project.cfm?id=393


An example of dynamic social network visualization is Vizster - http://hci.stanford.edu/jheer/projects/vizster/


There has been lots of work trying to come up with a good Twitter visualization -
https://twittertoolsbook.com/10-awesome-twitter-analytics-visualization-tools/

A sentiment analysis visualization for Twitter - https://www.csc2.ncsu.edu/faculty/healey/tweet_viz/tweet_app/

Also https://www.trendsmap.com/

And a big list of them - https://blog.bufferapp.com/free-twitter-tools




There is a big list of analysis tools at https://en.wikipedia.org/wiki/Social_network_analysis_software

and another list of visualization software at https://www.kdnuggets.com/software/visualization.html




And some more datasets are available at the Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/





R has the ability to read in and visualize these kinds of networks.

Here is some sample data from https://web.archive.org/web/20200205021609/http://www.users.miamioh.edu/shermalw/sociometryfiles/socio_are.htmlx  in tabular form

I have created a set of csv files for the nodes and the edges located at
https://www.evl.uic.edu/aej/424/social/social_nodes.csv
https://www.evl.uic.edu/aej/424/social/social_edges.csv

You should create a new Jupyter notebook and generally follow (and adapt) the instructions on this page:

http://pablobarbera.com/big-data-upf/html/02a-networks-intro-visualization.html

to visualize the DIRECTED network that I gave you using igraph. In particular, you should try different graph layouts to try to make the network clearer. You should also try to color-code the girls and the boys differently, based on your best guesses from the names, to see if that highlights any patterns - feel free to create a less binary coding.

Note that when you draw these networks they will be somewhat different each time they are drawn due to having a different random seed value each time. This would be annoying to a user, so you can also set a seed value for the random number generator with set.seed(#) to make the layout more predictable.
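As a rough sketch of the kind of code involved (assuming the edges file has from/to style columns and the nodes file has one row per person - check the actual column names with names() after reading the files):

library(igraph)

# read the node and edge tables from the class web site
nodes <- read.csv("https://www.evl.uic.edu/aej/424/social/social_nodes.csv", stringsAsFactors = FALSE)
edges <- read.csv("https://www.evl.uic.edu/aej/424/social/social_edges.csv", stringsAsFactors = FALSE)

# build a DIRECTED graph, attaching the node table as vertex attributes
g <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE)

# fix the random seed so the layout comes out the same every time the cell is run
set.seed(424)

# try several layouts to see which makes the structure clearest
plot(g, layout = layout_with_fr(g), vertex.size = 8, edge.arrow.size = 0.3)
plot(g, layout = layout_with_kk(g), vertex.size = 8, edge.arrow.size = 0.3)
plot(g, layout = layout_in_circle(g), vertex.size = 8, edge.arrow.size = 0.3)

# to color-code by your guessed genders you could add a vertex attribute, e.g.
# ('guess' here is a hypothetical column that you would create yourself)
# V(g)$color <- ifelse(V(g)$guess == "girl", "tomato", "skyblue")
# plot(g, layout = layout_with_fr(g), vertex.size = 8, edge.arrow.size = 0.3)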

igraph installs really easily in RStudio. In Anaconda, as of Jan 2022, it seems best to install it by clicking on your R environment, changing the view on the right from installed packages to all packages, and then searching for igraph. You should then be able to check the box next to r-igraph and click Apply to let Anaconda install the igraph package.

If you have issues getting it installed under Anaconda, you can do your work in RStudio and take screen captures and upload those.

As usual, print out your notebook and turn in the PDF through Gradescope.



Common Data Transforms




What should I do when I get a new dataset?

Data Mining, by Jiawei Han and Micheline Kamber

Data Cleaning

Data Integration - how can I combine different datasets from different sources?

Data Transformation - aggregation leads us into the more general concept of data reduction

Miles and Huberman (1994):

Data reduction is not something separate from analysis. It is part of analysis. The researcher's decisions on which data chunks to code and which to pull out, which evolving story to tell are all analytic choices. Data reduction is a form of analysis that sharpens, sorts, focuses, discards, and organizes data in such a way that final conclusions can be drawn and verified.

Data Reduction - produces a reduced dataset that yields similar analytical results




Data Discretization and Concept Hierarchy Generation - reduces the number of possible values for a given attribute by replacing data values with ranges or higher-level concepts (e.g. numerical age could become 0s, 10s, 20s, 30s, ... 80s, 90s, 100s, or young, middle-age, old; addresses could become city, state, or country).
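In R, this kind of discretization can be done with cut(). A minimal sketch with made-up ages (the values, breaks, and labels are just for illustration):

# a hypothetical set of numeric ages
age <- c(3, 17, 24, 38, 41, 55, 62, 71, 89)

# replace exact ages with decade ranges: [0,10), [10,20), ...
decade <- cut(age, breaks = seq(0, 110, by = 10), right = FALSE)

# or replace them with higher-level concepts
stage <- cut(age, breaks = c(0, 30, 60, Inf),
             labels = c("young", "middle-age", "old"), right = FALSE)

table(decade)
table(stage)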

Here is a nice classification example from thematic cartography, using data on the percentage of the population that was foreign-born in each Florida county in 1990. Here is the original data:

[Table: Foreign-Born Population in Florida Counties]

Start by ordering the data in whatever order seems appropriate, in this case by increasing % foreign-born. Then there are several typical ways to categorize the data.

Equal Intervals: We take the range of data and divide it by the number of classes (5 in this case.) This is really easy to compute but doesn't take into account the distribution of the data.

Quantiles: Starting with the number of classes, 5 in this case, an equal number of data points are placed into each class. This is also easy to compute, and lets you easily see the top n% of the data, but again it fails to
take into account the distribution of the data.

Mean-Standard Deviation: Starting with the mean and standard deviation of the data, data points are placed in appropriate classes, e.g. (less than mean - 2 standard deviations, mean - 2 standard deviations to mean - 1 standard deviation, mean +/- 1 standard deviation, mean + 1 standard deviation to mean + 2 standard deviations, greater than mean + 2 standard deviations). This works well with data that follows a normal distribution, but not in cases like the one shown above.

And a quick refresher on standard deviation from Wikipedia:
https://en.wikipedia.org/wiki/Standard_deviation
For a normal distribution, roughly 68% of the values fall within 1 standard deviation of the mean and roughly 95% fall within 2 standard deviations.
 
Maximum Breaks: Starting with the number of classes and the differences between adjacent data points, the largest breaks are used to define the classes. Maximum breaks is coarse in that it only takes into account the breaks and not the distribution between the breaks. Natural breaks tries to finesse this by making the classification more subjective.
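As a rough sketch of how maximum breaks could be computed in R for a numeric vector (the data values and number of classes here are made up):

# made-up data values, sorted
x <- sort(c(2, 3, 4, 9, 10, 11, 25, 26, 40, 41))
k <- 4                               # number of classes

# gaps between adjacent data values
gaps <- diff(x)

# the k-1 largest gaps define the class boundaries
largest <- sort(order(gaps, decreasing = TRUE)[1:(k - 1)])
breaks <- (x[largest] + x[largest + 1]) / 2

breaks   # class boundaries placed in the middle of the biggest jumps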

[Figures: Distribution of Foreign-Born in Florida Counties; Graphical Distribution of Foreign-Born in Florida Counties]

Here is a table summarizing the benefits of each approach.
[Table: Benefits of different binning strategies]

And here is the overview from Information Graphics: A Comprehensive Illustrated Reference.

[Figure: Examples of Different Intervals]

Here is some data from the US Census to play with - population estimates for the 50 US states in 2014

Alabama 4849377
Alaska 736732
Arizona 6731484
Arkansas 2966369
California 38802500
Colorado 5355866
Connecticut 3596677
Delaware 935614
Florida 19893297
Georgia 10097343
Hawaii 1419561
Idaho 1634464
Illinois 12880580
Indiana 6596855
Iowa 3107126
Kansas 2904021
Kentucky 4413457
Louisiana 4649676
Maine 1330089
Maryland 5976407
Massachusetts 6745408
Michigan 9909877
Minnesota 5457173
Mississippi 2994079
Missouri 6063589
Montana 1023579
Nebraska 1881503
Nevada 2839099
New Hampshire 1326813
New Jersey 8938175
New Mexico 2085572
New York 19746227
North Carolina 9943964
North Dakota 739482
Ohio 11594163
Oklahoma 3878051
Oregon 3970239
Pennsylvania 12787209
Rhode Island 1055173
South Carolina 4832482
South Dakota 853175
Tennessee 6549352
Texas 26956958
Utah 2942902
Vermont 626562
Virginia 8326289
Washington 7061530
West Virginia 1850326
Wisconsin 5757564
Wyoming 584153
And here is the same data sorted by increasing population:

Wyoming 584153
Vermont 626562
Alaska 736732
North Dakota 739482
South Dakota 853175
Delaware 935614
Montana 1023579
Rhode Island 1055173
New Hampshire 1326813
Maine 1330089
Hawaii 1419561
Idaho 1634464
West Virginia 1850326
Nebraska 1881503
New Mexico 2085572
Nevada 2839099
Kansas 2904021
Utah 2942902
Arkansas 2966369
Mississippi 2994079
Iowa 3107126
Connecticut 3596677
Oklahoma 3878051
Oregon 3970239
Kentucky 4413457
Louisiana 4649676
South Carolina 4832482
Alabama 4849377
Colorado 5355866
Minnesota 5457173
Wisconsin 5757564
Maryland 5976407
Missouri 6063589
Tennessee 6549352
Indiana 6596855
Arizona 6731484
Massachusetts 6745408
Washington 7061530
Virginia 8326289
New Jersey 8938175
Michigan 9909877
North Carolina 9943964
Georgia 10097343
Ohio 11594163
Pennsylvania 12787209
Illinois 12880580
New York 19746227
Florida 19893297
Texas 26956958
California 38802500



And some statistics:

max 38802500
average + 2 SD 20522107
average + 1 SD 13443035
average 6363963
average - 1 SD -715109
min 584153

(Note that the average minus one standard deviation is negative; the distribution is strongly skewed by a few very populous states, which is one reason mean-standard deviation breaks fit this data poorly.)

And a CSV version is located at https://www.evl.uic.edu/aej/424/us_state_populations.csv


As of Jan 2022, the mapping package albersusa is not happy running in Jupyter, so for this part you should use RStudio and take screen snapshots of your progress to combine into a PDF.

Read in the state population data and show the different breakpoints (equal intervals, quantiles, mean-standard deviation, maximum breaks, and what you think are the natural breaks). Note that R has a quantile function and a standard deviation function, as well as the summary function. For the natural breaks you should plot the data (hist can be helpful here), then decide where you think the breaks should be and defend your decision.

https://www.r-bloggers.com/r-tutorial-series-summary-and-descriptive-statistics/
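A rough sketch of one way to start, assuming the second column of the CSV holds the population values (check the actual column names after reading it in):

# read the state population data from the class web site
pop <- read.csv("https://www.evl.uic.edu/aej/424/us_state_populations.csv")
x <- sort(pop[[2]])          # assuming the populations are in the second column

summary(x)                   # min, quartiles, mean, max
sd(x)                        # standard deviation

k <- 5                       # number of classes

# equal intervals: the range split into k equal-width pieces
seq(min(x), max(x), length.out = k + 1)

# quantiles: the same number of states in each class
quantile(x, probs = seq(0, 1, length.out = k + 1))

# mean-standard deviation breaks
mean(x) + (-2:2) * sd(x)

# look at the distribution to help decide on natural breaks
hist(x, breaks = 20)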

We can also color in the states based on this data. The visuals making use of albersusa at the link below show a similar process, and albersusa is a convenient way to map data on the 50 US states, including Alaska and Hawaii. There are similar versions of albers that also include US territories. (Note that albersusa needs to be installed with a special command: devtools::install_github("hrbrmstr/albersusa"), and if you don't have devtools installed then you need to install it first with install.packages("devtools").)

On Windows you may also first need to install Rtools from this page https://cran.r-project.org/bin/windows/Rtools/history.html and be sure to check the version number of R from the Anaconda terminal (at this point you will likely need the older Rtools35.exe)

The albersusa section and in particular the Fill (choropleth) subsection of this link will be helpful.

https://cfss.uchicago.edu/notes/vector-maps/


You can read in the albersusa library:
library(albersusa)

You can look at the data:
usa_sf()

and see that it contains various information, including population data from several years; pop_2014 matches the dataset I provided above.

You can show a blank map:
ggplot(data = usa_sf()) +geom_sf()

You can show a map colored by the 2014 population with:
ggplot(data = usa_sf()) + geom_sf(aes(fill = pop_2014))

You could break that range of data into 6 equal-width intervals, with a bad default color scheme:
usa_sf() %>% mutate(pop_cut = cut_interval(pop_2014, n = 6)) %>%  ggplot() + geom_sf(aes(fill = pop_cut))

Or if we add in the RColorBrewer library

install.packages("RColorBrewer")
library(RColorBrewer)


we can see the visualization with a nicer color scheme
usa_sf() %>% mutate(pop_cut = cut_interval(pop_2014, n = 6)) %>%  ggplot() + geom_sf(aes(fill = pop_cut)) + scale_fill_brewer(palette = "Blues")

Or break it into 6 bins with roughly the same number of states in each bin:
usa_sf() %>% mutate(pop_cut = cut_number(pop_2014, n = 6)) %>%  ggplot() + geom_sf(aes(fill = pop_cut)) + scale_fill_brewer(palette = "Greens")


Make maps of this 2014 population data using cut_number and cut_interval with values of 2, 3, 4, 5, and 6, and a nice color scheme of your choice, and explain what you see in each case in terms of how the data is being broken up and what that is showing on the map.
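One way to avoid pasting nearly the same line ten times is a small loop over the number of bins. This is just a sketch using cut_interval (swap in cut_number for the other set of maps); note that ggplot objects are not automatically displayed inside a loop, so print() is needed:

library(albersusa)
library(dplyr)
library(ggplot2)

for (n in 2:6) {
  p <- usa_sf() %>%
    mutate(pop_cut = cut_interval(pop_2014, n = n)) %>%
    ggplot() +
    geom_sf(aes(fill = pop_cut)) +
    scale_fill_brewer(palette = "Blues") +
    ggtitle(paste("cut_interval with n =", n))
  print(p)   # force the plot to display inside the loop
}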

Then, as usual, submit your PDF via Gradescope.




Provenance - data moves through several forms and filters on its way to being visualized and analyzed. It's important to keep track of who has done what to the data at each step so the validity of the final product can be ascertained, and so that if any issues arise with the original data collection or the intermediate steps, it's easy to find which data products are affected.

You wouldn't just grab data off the web and assume that it's correct, would you? Would you?

A nice overview is given in https://legacy.cs.indiana.edu/ftp/techreports/TR618.pdf, where they give a list of applications for provenance.

The metadata that moves along with a dataset should give these details, and as the information moves from raw data through various stages of processing, the metadata should be updated in sufficient detail.


Different computer hardware, operating systems, and versions of libraries can also affect the data products. Documenting these is important to be able to replicate the data manipulations. Containers allow us to create and package virtual versions of the machine with a given set of software, making it easy to do this replication.
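In R, one lightweight way to capture part of this environment information alongside your results is to save the session details (a minimal sketch; the output file name is arbitrary):

# record the R version, operating system, and package versions used for this analysis
sessionInfo()

# save that information next to your data products so the environment can be reconstructed later
writeLines(capture.output(sessionInfo()), "provenance_session_info.txt")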


Coming Next Time

Medical Visualization and Scientific Visualization


last revision: 3/3/2022 - more specifics on the data binning homework/classwork

2/24/2022 - updated dead link on sociogram example to wayback machine version