(as an aside, for more info about Bob and friends, see: http://www.imdb.com/title/tt0064100/ )
There have been 5 phases in how these images have been generated:
1930s - images drawn by hand
1950s - computers used to algorithmically compute where the points and lines should be
1970s - computers drawing images on plotters
1980s - personal computers displaying images on (low resolution) monitors in colour
1990s - interaction, spring systems (force-directed placement), common platforms (e.g. Java) on the WWW
Jacob Moreno 1930s
"We have first to visualize . . . A process of
charting has been devised by the sociometrists, the sociogram,
which is more than merely a method of presentation. It is first
of all a method of exploration. It makes possible the
exploration of sociometric facts. The proper placement of every
individual and of all interrelations of individuals can be shown
on a sociogram. It is at present the only available scheme which
makes structural analysis of a community possible."
"The fewer the number
of lines crossing, the better the sociogram."
The most famous of his graphs is the friendship network among elementary school students. Boys are represented as triangles, girls as circles. An arrow from A to B shows that person A considers person B to be a friend. If both people consider each other to be a friend, then there is a dash in the middle of the line.
Another example shows both liking someone (red) and disliking someone (black) for the players on an American football team. Note that no one likes 5RB and several people actively dislike him. How well are they likely to block for him when he has the ball?
He would
often try to position the points on the page in relation to
their actual location in physical space. If he had no particular
reason to put the nodes in particular locations he would default
to a circle.
Moreno introduced five important ideas about the proper construction of images of social networks:
he drew graphs
he drew directed graphs
he used colors to draw multigraphs
he varied the shapes of points to communicate characteristics of social actors
he showed that variations in the locations of points could be used to stress important structural features of the data.
Lundberg and Steel 1930s emphasized the sociometric
status of each node by making nodes with high status larger and
placing them near the center of the graph
Northway 1940s created the target sociogram where
nodes in the center are chosen more often than nodes further out
and all the points in the same ring are chosen the same number
of times. She emphasized that lines should be short. Here is her
target sociogram of a first grade class.
Clearly this is all pretty straightforward when there aren't that many nodes, but as the number of nodes and edges increases the visualizations become crowded and confusing very quickly.
With more modern technology, users should have the ability to move the nodes around and to collapse and expand hierarchies.
Stanley Milgram 1967 - small-world phenomenon - Networks that exhibit this property are composed of a number of densely knit clusters of nodes, but at the same time these clusters are well connected, so that the path length between any two randomly chosen nodes is only about 6 on average.
I have created a set of csv
files for the nodes and the edges located at
https://www.evl.uic.edu/aej/424/social/social_nodes.csv
https://www.evl.uic.edu/aej/424/social/social_edges.csv
You should create a new
Jupyter notebook and generally follow (and adapt) the
instructions on this page:
to visualize the DIRECTED network that I gave you using igraph. In particular you should try different graph layouts to try to make the network clearer. You should also try to color code the girls and the boys differently, based on your best guesses from the names, to see if that highlights any patterns - feel free to create a less binary coding.
As usual, print out your notebook and turn in the PDF through gradescope.
Common Data Transforms
What should I do when I get a new dataset?
look at the meta-data
(ideally the rules for this dataset should have been set up
before the data was collected and written down including the
formats, bounds, null values)
is the data for each
attribute within bounds?
are there values that are
missing?
are attributes that are
supposed to be unique really unique?
look at the data
distribution of the good data - are there any odd outliers?
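Those checks can be sketched in a few lines of Python (all the records, bounds, and the 'age' attribute here are made up for illustration):

```python
# Sketch: basic sanity checks on a new dataset (made-up values).
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 212},    # out of bounds
    {"id": 3, "age": 40},     # duplicate id
]

AGE_MIN, AGE_MAX = 0, 120     # bounds from the (hypothetical) metadata

missing = [r for r in records if r["age"] is None]
out_of_bounds = [r for r in records
                 if r["age"] is not None and not (AGE_MIN <= r["age"] <= AGE_MAX)]
ids = [r["id"] for r in records]
duplicate_ids = len(ids) != len(set(ids))

print(len(missing), len(out_of_bounds), duplicate_ids)  # 1 1 True
```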
Data Mining by Jiawei Han and Micheline Kamber
Data Cleaning
missing values
values out of range
inconsistent formats
Data Integration - How can I combine different data sets from
different sources
different date formats ( e.g. Jan-10-90 or 01.10.90
or 10.01.90 or 01/10/1990 or ...)
different units of
measure (metric vs imperial, lat/lon vs UTM)
different coverage areas
(some data at neighborhood level, county level, state level,
some collected per month and some per year)
Data Transformation
smoothing - emphasizes
longer trends over shorter duration changes
generalization -
replaces detailed concept with a general one (e.g. for each
data value, replace a 9 digit zip code with a state name, or
a specific age with an age range 20-30)
normalization - depending on how the data will be visualized you may need to transform it to a given range (e.g. 0.0 to 1.0)
aggregation - combines
/ summarizes data (e.g. add up all data for M, T, W, Th, F
and store the weekly total, or the monthly total, or the
yearly total, or average the data for all zip codes in a
state and store the state average)
aggregation leads us into the
more general concept of data reduction
Miles and Huberman (1994):
Data reduction is not something separate from
analysis. It is part of analysis. The researcher's decisions
on which data chunks to code and which to pull out, which
evolving story to tell are all analytic choices. Data
reduction is a form of analysis that sharpens, sorts, focuses,
discards, and organizes data in such a way that final
conclusions can be drawn and verified.
Data Reduction - produces a reduced dataset that yields similar analytical results
reduce the number of dimensions
attribute subset selection
- one attribute (e.g. age) may be derived from another or
directly correlated to another, or might be irrelevant in
the work you are doing so those attributes can be removed
data cube aggregation - if you think of all the data you collected as forming a multidimensional cube with each attribute being an edge of the cube, then you can collapse various dimensions down by aggregating the values (e.g. the example above taking data for each day of the week and storing only the weekly total)
dimensionality reduction - encoding used to reduce dataset size - may be lossy or lossless - e.g. using principal component analysis or wavelets; the math increases rapidly here.
reduce the amount of collected/generated data
replace data by a model
that generates the data values,
clustering (e.g. replace
all of the attribute values collected between depths 10 and
15 with the average of those attribute values)
sampling (keep every nth
value, or one random value within each cluster)
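The clustering and sampling strategies above can be sketched as follows (the depth/value readings are invented):

```python
# Sketch: reduce measurements by averaging within a depth cluster,
# and by sampling every nth value. Readings are (depth, value) pairs.
readings = [(9.5, 3.1), (10.2, 3.0), (12.8, 3.4), (14.9, 3.2), (16.0, 2.9)]

# clustering: replace readings between depths 10 and 15 with their average
in_cluster = [v for depth, v in readings if 10 <= depth <= 15]
cluster_avg = sum(in_cluster) / len(in_cluster)

# sampling: keep every 2nd reading
sampled = readings[::2]

print(round(cluster_avg, 2), len(sampled))  # 3.2 3
```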
Data Discretization and Concept Hierarchy Generation - reduces the number of possible values for a given attribute by replacing data values with ranges or higher level concepts (e.g. a numerical age could become 0s, 10s, 20s, 30s, ... 80s, 90s, 100s or young, middle-age, old; addresses could become city or state or country).
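Discretizing an age into a decade bucket or a coarser concept label might look like this (the cutoffs and labels are chosen arbitrarily for illustration):

```python
# Sketch: discretize a numerical age into a decade or a concept label.
def decade(age):
    return f"{(age // 10) * 10}s"   # e.g. 34 -> "30s"

def life_stage(age):
    # arbitrary cutoffs, purely for illustration
    if age < 30:
        return "young"
    elif age < 60:
        return "middle-age"
    return "old"

print(decade(34), life_stage(34))  # 30s middle-age
```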
Here is a nice clustering example from thematic
cartography on data of the percentage of the population that was
foreign-born in Florida in 1990. Here is the original data:
Start by ordering the data in whatever order seems
appropriate, in this case by increasing % foreign-born,
then there are several typical ways to try and categorize the
data.
Equal Intervals:
We take the range of data and divide it by the number of classes (5 in this case). This is really easy to compute but doesn't take into account the distribution of the data.
Quantiles:
Starting with the number of classes, 5 in this case, an equal
number of data points are placed into each class. This is also
easy to compute, and lets you easily see the top n% of the data,
but again it fails to take into account the distribution of the data.
Mean-Standard Deviation: Starting with the mean and standard deviation of the data, data points are placed in the appropriate classes, e.g. (less than mean - 2 standard deviations; mean - 2 to mean - 1 standard deviations; mean - 1 to mean + 1 standard deviations; mean + 1 to mean + 2 standard deviations; greater than mean + 2 standard deviations). This works well with data that follows a normal distribution, but not in cases like the one shown above.
and a quick refresher on standard deviation from Wikipedia: https://en.wikipedia.org/wiki/Standard_deviation
for a normal distribution: 68% are within 1 standard deviation, 95% are within 2 standard deviations
Maximum Breaks:
Starting with the number of classes and the differences between adjacent data points, the largest breaks are used to define the classes. Maximum breaks is coarse in that it only takes into account the breaks and not the distribution between the breaks. Natural breaks tries to finesse this by making the classifications more subjective.
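With a small made-up dataset, the equal-interval, quantile, and maximum-breaks schemes can each be computed in a few lines (natural breaks remain a judgment call, as noted above):

```python
# Sketch: equal-interval, quantile, and maximum-breaks class limits
# for 5 classes on a small made-up dataset.
data = sorted([1.1, 1.3, 2.0, 2.2, 4.5, 4.8, 5.0, 9.7, 9.9, 15.0])
k = 5

# Equal intervals: divide the data range into k equal-width classes.
lo, hi = data[0], data[-1]
width = (hi - lo) / k
equal_breaks = [lo + width * i for i in range(1, k)]

# Quantiles: an equal number of data points in each class.
n = len(data)
quantile_breaks = [data[(n * i) // k] for i in range(1, k)]

# Maximum breaks: split at the k-1 largest gaps between adjacent values.
gaps = sorted(range(n - 1), key=lambda i: data[i + 1] - data[i], reverse=True)[:k - 1]
max_breaks = [(data[i] + data[i + 1]) / 2 for i in sorted(gaps)]

print(equal_breaks, quantile_breaks, max_breaks)
```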
Here is a table summarizing the benefits of each approach.
and here is the overview from
Information Graphics - a Comprehensive Illustrated Reference
Here is some data
from the US Census to play with - population estimates for the
50 US states in 2014
As usual you should create a new R based Jupyter notebook, read in the state population data, and show the different breakpoints (equal intervals, quantiles, mean-standard deviation, maximum breaks, and what you think are the natural breaks). Note that R has a quantile function and a standard deviation function, as well as the summary function. For the natural breaks you should plot the data and then decide on where you think the breaks should be.
You should then take this further and color in the states based on this data. The visuals making use of albersusa at the link below show a similar process. (Note that albersusa needs to be installed with a special command: devtools::install_github("hrbrmstr/albersusa"), and if you don't have devtools installed then you need to install it with install.packages("devtools").) You may want to install these from the terminal window that you can bring up in the Anaconda R environment (click on the triangle in your R environment, select Open Terminal, and then start R with 'R'). On Windows you may also first need to install Rtools from this page https://cran.r-project.org/bin/windows/Rtools/history.html and be sure to check the version number of R from the Anaconda terminal (at this point you will likely need the older Rtools35.exe).
As usual you should print out
your Jupyter notebook when you are done and submit the PDF via
gradescope.
Provenance - data moves through several forms and filters on its way to being visualized and analyzed. It's important to keep track of who has done what to the data at each step so the validity of the final product can be ascertained, and so that if any issues arise with the original data collection or the intermediate steps, it's easy to find which data products are affected.
You wouldn't just grab data off the web and assume that it's correct, would you? Would you?
Data Quality: Lineage
can be used to estimate data quality and data reliability
based on the source data and transformations. It can also
provide proof statements on data derivation.
Audit Trail: Provenance
can be used to trace the audit trail of data, determine
resource usage, detect errors in data generation, help
determine who gets the patent.
Replication Recipes:
Detailed provenance information can allow repetition of data
derivation, help maintain its currency, and be a recipe for
replication.
Attribution: Pedigree
can establish the copyright and ownership of data, enable
its citation, and determine liability in case of erroneous
data.
Informational: A generic
use of lineage is to query based on lineage metadata for
data discovery. It can also be browsed to provide a context
to interpret data.
The metadata that moves along with a dataset should give these details, and as the information moves from raw data through various stages of processing the metadata should be updated in sufficient detail.
Different computer hardware, operating systems, and versions of libraries can also affect the data products. Documenting these is important for being able to replicate the data manipulations.
Containers allow us to create and package virtual versions of
the machine with a given set of software, making it easy to do
this replication.
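One lightweight way to carry this along is to append a provenance record to the metadata at every processing step (the field names and values here are invented for illustration):

```python
# Sketch: append a provenance entry each time the data is transformed.
from datetime import datetime, timezone

provenance = [{
    "step": "collection",
    "who": "field team",
    "when": "2020-01-15",
    "detail": "raw sensor export, metric units",
}]

def record_step(who, detail):
    provenance.append({
        "step": f"transform-{len(provenance)}",
        "who": who,
        "when": datetime.now(timezone.utc).date().isoformat(),
        "detail": detail,
    })

record_step("analyst", "dropped rows with out-of-range depth values")
record_step("analyst", "aggregated daily values to weekly totals")
print(len(provenance))  # 3
```

If an upstream problem is later discovered, walking this list identifies exactly which downstream products were touched by the bad step.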
Coming Next Time
Medical Visualization and Scientific Visualization
last
revision: 3/2/2020 - updated how to make Anaconda's version of R
happier