Week 9

Social Network Visualization & Common Data Transforms

Social Networks

Very nice overview of the display side of this work at: http://www.cmu.edu/joss/content/articles/volume1/Freeman.html

and a nice general introduction here: http://faculty.ucr.edu/~hanneman/nettext/index.html


Two major representations:

Graphs (more common)
(as an aside, for more info about Bob and friends, see: http://www.imdb.com/title/tt0064100/ )

5 phases in how these images have been generated

Jacob Moreno 1930s

"We have first to visualize . . . A process of charting has been devised by the sociometrists, the sociogram, which is more than merely a method of presentation. It is first of all a method of exploration. It makes possible the exploration of sociometric facts. The proper placement of every individual and of all interrelations of individuals can be shown on a sociogram. It is at present the only available scheme which makes structural analysis of a community possible."

 "The fewer the number of lines crossing, the better the sociogram."

The most famous of his graphs shows the friendship network among elementary school students

Boys are represented as triangles, girls as circles. An arrow from A to B shows that person A considers person B to be a friend. If both people consider each other to be friends then there is a dash in the middle of the line.
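
A minimal sketch of the data behind a sociogram like this one, assuming a handful of made-up students and choices (none of these names come from Moreno's data). Each directed choice is an arrow; a pair who choose each other is the mutual tie that the drawing marks with a dash.

```python
# Hypothetical friendship choices as (chooser, chosen). Not Moreno's data.
choices = [
    ("Ann", "Bea"), ("Bea", "Ann"),
    ("Cal", "Ann"), ("Dov", "Cal"),
    ("Cal", "Dov"), ("Eli", "Bea"),
]

def mutual_pairs(edges):
    """Return the set of unordered pairs where each person chose the other."""
    edge_set = set(edges)
    return {frozenset((a, b)) for a, b in edge_set if (b, a) in edge_set}

def one_way(edges):
    """Return the directed choices that are not reciprocated, sorted."""
    edge_set = set(edges)
    return sorted((a, b) for a, b in edge_set if (b, a) not in edge_set)

mutual = mutual_pairs(choices)
unreciprocated = one_way(choices)

for pair in sorted(sorted(p) for p in mutual):
    print("mutual:", " <-> ".join(pair))
for a, b in unreciprocated:
    print("one-way:", a, "->", b)
```

Storing the choices as directed pairs keeps the "who chose whom" information that an undirected friendship list would lose.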

Another example shows both liking someone (red) and disliking someone (black) for the players on an American football team. Note that no one likes 5RB and several people actively dislike him. How well are they likely to block for him when he has the ball?

He would often try to position the points on the page in relation to their actual location in physical space. If he had no particular reason to put the nodes in particular locations he would default to a circle.
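
Moreno's rule about line crossings and his default circle layout can be tried out together. The sketch below is only an illustration: the four nodes and three edges are made up, and the crossing test is a standard segment-orientation check, not anything taken from Moreno. It places nodes evenly on a circle in a given order and counts how many pairs of edges cross.

```python
import math
from itertools import combinations

def circle_layout(nodes, radius=1.0):
    """Place nodes evenly around a circle, in the order given."""
    n = len(nodes)
    return {
        node: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, node in enumerate(nodes)
    }

def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def crossings(edges, pos):
    """Count pairs of edges whose line segments properly cross."""
    count = 0
    for (a, b), (c, d) in combinations(edges, 2):
        if len({a, b, c, d}) < 4:
            continue  # edges sharing an endpoint are not counted
        pa, pb, pc, pd = pos[a], pos[b], pos[c], pos[d]
        if (_orient(pa, pb, pc) * _orient(pa, pb, pd) < 0 and
                _orient(pc, pd, pa) * _orient(pc, pd, pb) < 0):
            count += 1
    return count

# Hypothetical network, laid out in two different orders around the circle.
edges = [("A", "C"), ("B", "D"), ("A", "B")]
bad = crossings(edges, circle_layout(["A", "B", "C", "D"]))
good = crossings(edges, circle_layout(["A", "C", "B", "D"]))
print(bad, good)
```

The same edges give one crossing in the first ordering and none in the second, which is the kind of improvement Moreno's rule asks for.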

Moreno introduced five important ideas about the proper construction of images of social networks:

Lundberg and Steele 1930s emphasized the sociometric status of each node by making nodes with high status larger and placing them near the center of the graph.

Northway 1940s created the target sociogram where nodes in the center are chosen more often than nodes further out and all the points in the same ring are chosen the same number of times. She emphasized that lines should be short. Here is her target sociogram of a first grade class.
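
As a rough sketch of the idea behind both status-based sizing and the target sociogram, the code below counts how often each person is chosen and groups people into rings, most-chosen first. The names and choices are hypothetical, and treating "status" as a plain count of times chosen is a simplifying assumption.

```python
from collections import Counter, defaultdict

# Hypothetical choices as (chooser, chosen); not Northway's data.
choices = [
    ("Ann", "Bea"), ("Cal", "Bea"), ("Dov", "Bea"),
    ("Bea", "Ann"), ("Eli", "Ann"),
    ("Ann", "Cal"),
]
people = ["Ann", "Bea", "Cal", "Dov", "Eli"]

# Status here is simply the number of times each person is chosen.
times_chosen = Counter(chosen for _, chosen in choices)

# Everyone chosen the same number of times shares a ring.
by_count = defaultdict(list)
for person in people:
    by_count[times_chosen[person]].append(person)

# Ring 0 is the center of the target (chosen most often).
rings = [by_count[c] for c in sorted(by_count, reverse=True)]
for ring_index, members in enumerate(rings):
    print("ring", ring_index, members)
```

The same count could drive node size instead of ring position, which is closer to the Lundberg and Steele approach.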

Stanley Milgram 1967 - small-world phenomenon - Networks that exhibit this property are composed of a number of densely knit clusters of nodes, but these clusters are also well connected to one another, so the average path length between any two randomly chosen nodes is short (about 6 in the familiar "six degrees of separation" formulation).
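
Average path length is easy to compute on a small network. Here is a minimal sketch using breadth-first search; the six-node graph (two tight triangles joined by a single bridge) is made up for illustration and is far too small to say anything about real social networks.

```python
from collections import deque
from itertools import combinations

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def average_path_length(graph):
    """Mean shortest-path length over all connected pairs of nodes."""
    lengths = [shortest_path_length(graph, a, b)
               for a, b in combinations(graph, 2)]
    lengths = [d for d in lengths if d is not None]
    return sum(lengths) / len(lengths)

# Hypothetical undirected network: two clusters joined by the c-d bridge.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
apl = average_path_length(graph)
print(round(apl, 2))
```

Running one search per pair is fine at this size; for large networks a single search from each node would be the more sensible design.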

More on the small world experiment:

These days research involves looking at how these social networks change over time

Clearly this is all pretty straightforward when there aren't that many nodes, but as the number of nodes and edges increases the visualizations get crowded and confusing very quickly.

Users should have the ability to move the nodes around, collapse and unroll hierarchies

several tools to look at Facebook friends

Here is a nice look at Facebook connections across the US - http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html

Enron emails - http://www.visualcomplexity.com/vc/project.cfm?id=393

nice interactive java demo - vizster - http://hci.stanford.edu/jheer/projects/vizster/

here is an interesting one for countries (focusing on borders and languages) from the CIA world factbook data - http://moritz.stefaner.eu/projects/relation-browser/

twitter visualization - http://twittertoolsbook.com/10-awesome-twitter-analytics-visualization-tools/

also http://trendsmap.com

orgnet - http://www.orgnet.com/cases.html
the steroids one is a nice example: http://www.orgnet.com/steroids.html
and the twitter one: http://www.orgnet.com/twitter.html
and the co-authorship one: http://www.orgnet.com/SN.html

big list of analysis tools at http://en.wikipedia.org/wiki/Social_network_analysis_software

http://www.graphviz.org/ graphviz - a nice example of a non-interactive tool
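
Graphviz reads a plain-text graph description in its DOT language, so a sociogram can be generated from an edge list with a few lines of code. This is a minimal sketch: the edge list and the helper name to_dot are made up for illustration, and only the basic "digraph" syntax is used.

```python
def to_dot(edges, name="sociogram"):
    """Build DOT source for a directed graph from (source, target) pairs."""
    lines = ["digraph %s {" % name]
    for source, target in edges:
        lines.append('    "%s" -> "%s";' % (source, target))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical friendship choices.
edges = [("Ann", "Bea"), ("Bea", "Ann"), ("Cal", "Ann")]
dot_source = to_dot(edges)
print(dot_source)
```

Saving the output to a file such as graph.dot and running "dot -Tpng graph.dot -o graph.png" should produce an image, with Graphviz choosing the node positions.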

a new piece of software in 2010 is NodeXL - http://nodexl.codeplex.com/

Here is a good overview of dynamic social network visualization:
The Art and Science of Dynamic Network Visualization
Skye Bender-deMoll, skyebend@stanford.edu, Daniel A. McFarland, mcfarland@stanford.edu

and here is another diagram from the xkcd comic at http://xkcd.com/
character interaction maps - http://xkcd.com/

and some more datasets at the Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/

Common Data Transforms

What should I do when I get a new dataset?

Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber

Data Cleaning

Data Integration - How can I combine different data sets from different sources?
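
One common snag in integration is that two sources spell the same key differently. The sketch below assumes two tiny hypothetical sources keyed by state name (with inconsistent case and stray spaces) and joins them on a normalized key; the helper names normalize and integrate are made up for illustration.

```python
# Two hypothetical sources that disagree on how the key is written.
population = {"Illinois": 12880580, "Indiana": 6596855, "Iowa": 3107126}
postal_code = {"illinois ": "IL", "IOWA": "IA", "Ohio": "OH"}

def normalize(name):
    """Canonical key: trimmed and lower-case."""
    return name.strip().lower()

def integrate(left, right):
    """Outer join of two dicts on normalized keys; missing values are None."""
    left_n = {normalize(k): v for k, v in left.items()}
    right_n = {normalize(k): v for k, v in right.items()}
    merged = {}
    for key in sorted(set(left_n) | set(right_n)):
        merged[key] = (left_n.get(key), right_n.get(key))
    return merged

combined = integrate(population, postal_code)
for state, (pop, code) in combined.items():
    print(state, pop, code)
```

Keeping the rows that appear in only one source (as None) makes the gaps visible instead of silently dropping them.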

Data Transformation
aggregation leads us into the more general concept of data reduction

Miles and Huberman (1994):

Data reduction is not something separate from analysis. It is part of analysis. The researcher’s decisions—which data chunks to code and which to pull out, which evolving story to tell—are all analytic choices. Data reduction is a form of analysis that sharpens, sorts, focuses, discards, and organizes data in such a way that “final” conclusions can be drawn and verified.

Data Reduction - gives you a reduced dataset that gives you similar analytical results

Data Discretization and Concept Hierarchy Generation - reduces the number of possible values for a given attribute
by replacing data values with ranges or higher level concepts (e.g. numerical age could become 0s, 10s, 20s, 30s, ... 80s, 90s, 100s or young, middle-age, old; addresses could become city or state or country.)
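
A short sketch of both ideas for the age example: binning into decades, and mapping to a higher-level concept. The cut-offs of 40 and 65 for middle-age and old are assumptions chosen only for illustration.

```python
def decade_bin(age):
    """Map a numeric age to its decade label, e.g. 37 -> '30s'."""
    return "%ds" % (age // 10 * 10)

def life_stage(age, middle_from=40, old_from=65):
    """Map an age to a higher-level concept; the cut-offs are assumptions."""
    if age < middle_from:
        return "young"
    if age < old_from:
        return "middle-age"
    return "old"

ages = [7, 23, 37, 41, 68, 102]
decades = [decade_bin(a) for a in ages]
stages = [life_stage(a) for a in ages]
print(decades)
print(stages)
```

Six distinct ages become six decade labels but only three concept labels, which is the reduction in possible values the text describes.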

Here is a nice clustering example from thematic cartography on data of the percentage of the population that was foreign-born in Florida in 1990. Here is the original data:
Foreign Born in Florida Counties

Start by ordering the data in whatever order seems appropriate, in this case by increasing % foreign-born,  then there are several typical ways to try and categorize the data.

Equal Intervals: We take the range of data and divide it by the number of classes (5 in this case.) This is really easy to compute but doesn't take into account the distribution of the data.
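
A minimal sketch of equal-interval classification, using a small made-up skewed sample rather than the Florida data. It assumes the values are not all identical (otherwise the interval width would be zero).

```python
def equal_interval_classes(values, k=5):
    """Assign each value a class 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum would land in class k, so clamp it into the top class.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Hypothetical skewed sample.
values = [1, 2, 2, 3, 4, 5, 7, 9, 14, 31]
classes = equal_interval_classes(values)
counts = [classes.count(c) for c in range(5)]
print(classes)
print(counts)
```

With skewed data most points pile into the lowest class and one class ends up empty, which is the weakness noted above.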

Quantiles: Starting with the number of classes, 5 in this case, an equal number of data points are placed into each class. This is also easy to compute, and lets you easily see the top n% of the data, but again it fails to take into account the distribution of the data.
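
A minimal sketch of quantile classification on the same kind of small made-up sample. Each value is ranked and the ranks are cut into 5 groups of equal size.

```python
def quantile_classes(values, k=5):
    """Assign classes so each holds (as nearly as possible) the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        classes[i] = rank * k // len(values)
    return classes

# Hypothetical skewed sample.
values = [1, 2, 2, 3, 4, 5, 7, 9, 14, 31]
classes = quantile_classes(values)
counts = [classes.count(c) for c in range(5)]
print(classes)
print(counts)
```

Every class gets two points, but note that the two values of 2 end up in different classes and that 14 and 31 share one: ranking ignores how far apart the values actually are.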

Mean-Standard Deviation: Starting with the mean and standard deviation of the data, data points are placed in appropriate classes, e.g. (less than mean -2 standard deviations, mean -2 standard deviations to mean -1 standard deviation, mean -1 standard deviation to mean +1 standard deviation, mean +1 standard deviation to mean +2 standard deviations, greater than mean +2 standard deviations). This works well with data that follows a normal distribution, but not in cases like the one shown above.
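
A minimal sketch of mean-standard deviation classification on a small made-up skewed sample. Using the population standard deviation (statistics.pstdev) is an assumption; the sample standard deviation would shift the cut points slightly.

```python
import statistics

def mean_sd_classes(values):
    """Classes 0..4: below -2 SD, -2 to -1 SD, within 1 SD, +1 to +2 SD, above +2 SD."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption
    cuts = [mean - 2 * sd, mean - sd, mean + sd, mean + 2 * sd]
    # A value's class is the number of cut points it reaches or exceeds.
    return [sum(v >= c for c in cuts) for v in values]

# Hypothetical skewed sample.
values = [1, 2, 2, 3, 4, 5, 7, 9, 14, 31]
classes = mean_sd_classes(values)
print(classes)
```

Nine of the ten points fall in the middle class and only the outlier stands apart, which shows why this method suits roughly normal data and not skewed data.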

and a quick refresher on standard deviation from Wikipedia:
for a normal distribution: 68% are within 1 standard deviation, 95% are within 2 standard deviations

Maximum Breaks: Starting with the number of classes and the differences between adjacent data points, the largest breaks are used to define the classes. Maximum breaks is coarse in that it only takes into account the breaks and not the distribution between the breaks. Natural breaks tries to finesse this by making the classifications more subjective.
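
A minimal sketch of maximum breaks on a small made-up sample: sort the values, find the gaps between neighbors, and cut at the 4 largest gaps to get 5 classes. How ties between equal gaps are resolved here is an arbitrary choice.

```python
def maximum_breaks_classes(values, k=5):
    """Split the sorted values at the k-1 largest gaps between neighbors."""
    ordered = sorted(values)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Positions after which a new class starts: the k-1 biggest gaps.
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    groups, start = [], 0
    for i in cut_after:
        groups.append(ordered[start:i + 1])
        start = i + 1
    groups.append(ordered[start:])
    return groups

# Hypothetical skewed sample.
values = [1, 2, 2, 3, 4, 5, 7, 9, 14, 31]
groups = maximum_breaks_classes(values)
for g in groups:
    print(g)
```

The result is one large class and four classes of a single point each, which illustrates the coarseness mentioned above: only the gaps matter, not how the points are spread between them.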

Distribution of Foreign Born in Florida Counties

and here is how each of those would be visualized:
Graphical Distribution of Foreign Born in Florida Counties

Here is a table summarizing the benefits of each approach.
Benefits of different binning

and here is the overview from Information Graphics - a Comprehensive Illustrated Reference

Examples of Different Intervals

Here is some data from the US Census to play with - population estimates for the 50 US states in 2014

Alabama 4849377
Alaska 736732
Arizona 6731484
Arkansas 2966369
California 38802500
Colorado 5355866
Connecticut 3596677
Delaware 935614
Florida 19893297
Georgia 10097343
Hawaii 1419561
Idaho 1634464
Illinois 12880580
Indiana 6596855
Iowa 3107126
Kansas 2904021
Kentucky 4413457
Louisiana 4649676
Maine 1330089
Maryland 5976407
Massachusetts 6745408
Michigan 9909877
Minnesota 5457173
Mississippi 2994079
Missouri 6063589
Montana 1023579
Nebraska 1881503
Nevada 2839099
New Hampshire 1326813
New Jersey 8938175
New Mexico 2085572
New York 19746227
North Carolina 9943964
North Dakota 739482
Ohio 11594163
Oklahoma 3878051
Oregon 3970239
Pennsylvania 12787209
Rhode Island 1055173
South Carolina 4832482
South Dakota 853175
Tennessee 6549352
Texas 26956958
Utah 2942902
Vermont 626562
Virginia 8326289
Washington 7061530
West Virginia 1850326
Wisconsin 5757564
Wyoming 584153

and some statistics

max 38802500

ave +2 SD 20522107
ave +1 SD 13443035
average 6363963
ave - 1 SD -715109

min 584153
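
The statistics above can be recomputed from the listed values. The sketch below copies the 50 populations in the same alphabetical order as the list; the notes do not say whether the standard deviation is the population or the sample version, so both are printed for comparison.

```python
import statistics

# 2014 estimates for the 50 states, in the alphabetical order listed above.
populations = [
    4849377, 736732, 6731484, 2966369, 38802500,
    5355866, 3596677, 935614, 19893297, 10097343,
    1419561, 1634464, 12880580, 6596855, 3107126,
    2904021, 4413457, 4649676, 1330089, 5976407,
    6745408, 9909877, 5457173, 2994079, 6063589,
    1023579, 1881503, 2839099, 1326813, 8938175,
    2085572, 19746227, 9943964, 739482, 11594163,
    3878051, 3970239, 12787209, 1055173, 4832482,
    853175, 6549352, 26956958, 2942902, 626562,
    8326289, 7061530, 1850326, 5757564, 584153,
]

mean = statistics.mean(populations)
pop_sd = statistics.pstdev(populations)    # population standard deviation
sample_sd = statistics.stdev(populations)  # sample standard deviation

print("count  ", len(populations))
print("max    ", max(populations))
print("min    ", min(populations))
print("average", round(mean))
print("SD (population)", round(pop_sd))
print("SD (sample)    ", round(sample_sd))
print("ave - 1 SD     ", round(mean - pop_sd))
```

Because a few very large states pull the average up, one standard deviation below the average is already negative, so the lower mean-standard deviation classes would be empty for this data.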

and a csv version is located here

Provenance - data moves through several forms and filters on its way to being visualized and analyzed. It's important to keep track of who has done what to the data at each step so the validity of the final product can be ascertained, and if any issues arise with the original data collection or the intermediate steps then it's easy to find which data products are affected.

You wouldn't just grab data off the web and assume that it's correct, would you? Would you?

A nice overview is given in http://www.cs.indiana.edu/pub/techreports/TR618.pdf
where they give this list of applications for provenance:
The metadata that moves along with a dataset should give these details, and as the information moves from raw data through various stages of processing the metadata should be updated in sufficient detail.
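
One way to picture metadata travelling with a dataset is a small record that logs who did what at each processing step. This is only a sketch under stated assumptions: the class names Step and Dataset, the field names, and the people and steps shown are all hypothetical, not a standard provenance format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Step:
    who: str
    what: str
    when: str

@dataclass
class Dataset:
    rows: list
    source: str
    history: list = field(default_factory=list)

    def apply(self, who, what, transform):
        """Return a new Dataset with the transform applied and logged."""
        step = Step(who, what, datetime.now(timezone.utc).isoformat())
        return Dataset(transform(self.rows), self.source, self.history + [step])

# Hypothetical pipeline: raw download, then cleaning, then rescaling.
raw = Dataset(rows=[4849377, None, 736732], source="hypothetical download")
cleaned = raw.apply("alice", "dropped missing values",
                    lambda rows: [r for r in rows if r is not None])
scaled = cleaned.apply("bob", "converted to millions",
                       lambda rows: [r / 1e6 for r in rows])

print("source:", scaled.source)
for step in scaled.history:
    print(step.who, "-", step.what, "-", step.when)
```

Each step returns a new dataset and leaves the earlier ones untouched, so if a problem turns up in the cleaning step it is easy to see which later products inherited it.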

Coming Next Time

Project 2 Presentations

last revision: 8/24/14