Week 3

The Basics

The purpose of visualization is to make it easy for the user to see the patterns, the similarities, the differences in the data.

This involves the variation in the data itself, the variation in the representation of that data, and the ability of a human being to perceive variation.

In general you do not want to let the computer use its default values. Unless you are using a specific program for a specific field the default values will not be right for your work. This is especially true of programs like Word and Excel (though both have improved a lot in the last couple years in this regard).

Tables and simple graphs are going to come up fairly often in visual analytics. Even with all the fancy new visualization options we have there are still very good reasons to use simple tables and charts when possible.

Let's start with tables - the format of a table can greatly enhance or reduce its readability.

Similar numeric data should have the same number of significant digits and be right justified if a whole number, or aligned to the decimal point if a real number - this makes it easier to see values getting bigger or smaller. Either use scientific notation, or don't, but don't mix scientific notation and standard notation. For big numbers in standard notation (e.g. 123482656) it can help to use commas to separate out the thousands, millions, billions places (e.g. 123,482,656).
Sans-serif fonts like Helvetica are easier to read on a computer screen than serif fonts like Times, though as screens gain resolution this is becoming less of an issue.
If you have a table with just Yes / No values for every cell then only show the most appropriate one (e.g. 'Y') and leave the other one blank or use a '-'. It's hard to scan a table with entries of Yes / No or Y / N quickly because the words/letters look too similar. Color can also be helpful in highlighting the values you want the viewer to pay attention to.
Have meaningful column headings and row labels. If you need multiple rows for the heading then break the words intelligently; don't let the program break up words where it feels like it
Standardize and use consistent abbreviations that are familiar to your target audience
Tables should not cross a page boundary. If you have a really long table then replicate the column headings at the top of the next page/slide. A reader should not have to look to a previous page/slide to see what the column headings are
If you have a big table it can be useful to alternate the background colour on adjacent rows with slightly different colours. This makes it easier to trace across a row.
Leave enough space between the table boundaries and the text

Here is a table from the US Environmental Protection Agency from 2002 - the Total Emissions column of data is centered, making it very hard to compare the values within it, and the Source Sector column is also centered, making it harder to read.

National Carbon_Monoxide Emissions in 2002
Source Sector	Total Emissions
Electricity Generation	652,314
Fires	14,520,530
Fossil Fuel Combustion	1,499,367
Industrial Processes	2,414,055
Miscellaneous	33,786
Non Road Equipment	22,414,896
On Road Vehicles	62,957,908
Residential Wood Combustion	2,704,197
Road Dust	0
Solvent Use	3,294
Waste Disposal	2,018,496

A better version of the table would be the following where both the sources and the amount of emissions are easier to see and quickly grasp:

National Carbon_Monoxide Emissions in 2002

Source Sector                 Total Emissions
Electricity Generation                652,314
Fires                              14,520,530
Fossil Fuel Combustion              1,499,367
Industrial Processes                2,414,055
Miscellaneous                          33,786
Non Road Equipment                 22,414,896
On Road Vehicles                   62,957,908
Residential Wood Combustion         2,704,197
Road Dust                                   0
Solvent Use                             3,294
Waste Disposal                      2,018,496
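
As a rough illustration of the alignment rules above, here is a short Python sketch that prints the same EPA figures with the sector names left justified and the totals right justified with comma separators. The function name format_table and the column widths are assumptions made up for this example, and the output only lines up in a monospace font; the same idea applies in R or a spreadsheet.

```python
# Emission totals copied from the EPA table above.
emissions = [
    ("Electricity Generation", 652314),
    ("Fires", 14520530),
    ("Fossil Fuel Combustion", 1499367),
    ("Industrial Processes", 2414055),
    ("Miscellaneous", 33786),
    ("Non Road Equipment", 22414896),
    ("On Road Vehicles", 62957908),
    ("Residential Wood Combustion", 2704197),
    ("Road Dust", 0),
    ("Solvent Use", 3294),
    ("Waste Disposal", 2018496),
]


def format_table(rows, name_width=28, value_width=15):
    """Return text lines: names left justified, numbers right justified."""
    header = f"{'Source Sector':<{name_width}}  {'Total Emissions':>{value_width}}"
    lines = [header]
    for name, value in rows:
        # The ',' format option inserts the thousands separators.
        lines.append(f"{name:<{name_width}}  {value:>{value_width},}")
    return lines


for line in format_table(emissions):
    print(line)
```

Because every number ends in the same column, bigger values are physically longer, which is what makes the column easy to scan.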

Here is a made-up table - it's hard to see any pattern in the Yes/No values.

Yes	No	Yes
Yes	No	No
No	Yes	Yes
No	No	No
Yes	Yes	No

A better version (if all of the cells are filled with one of two values) would be:

Yes	-	Yes
Yes	-	-
-	Yes	Yes
-	-	-
Yes	Yes	-
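
The substitution above is easy to automate. This is a minimal Python sketch, assuming the table is held as a list of rows; the function name highlight_yes and its parameters are hypothetical names chosen for the example.

```python
# The made-up Yes/No table from above, one inner list per row.
rows = [
    ["Yes", "No", "Yes"],
    ["Yes", "No", "No"],
    ["No", "Yes", "Yes"],
    ["No", "No", "No"],
    ["Yes", "Yes", "No"],
]


def highlight_yes(table, keep="Yes", placeholder="-"):
    """Keep only the value of interest and replace everything else."""
    return [
        [cell if cell == keep else placeholder for cell in row]
        for row in table
    ]


for row in highlight_yes(rows):
    print("\t".join(row))
```

Which value to keep depends on what you want the viewer to notice; passing keep="No" would highlight the opposite pattern.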

A different better version of the table using colour to help highlight the pattern would be:

Yes	No	Yes
Yes	No	No
No	Yes	Yes
No	No	No
Yes	Yes	No

Here is a table from the Nielsen Games page:
https://www.nielsen.com/us/en/insights/reports/2018/us-games-360-report-2018.html

The Usage Min % column is hard to read because it's left justified.

Original table for game console usage

This version below is easier to read because the right column of numbers is right justified. The decimal points align and bigger numbers look bigger. I also moved the text off the grid lines to make them more readable.

Revised table for game console usage

for some more recent related data you can check out:

https://www.nielsen.com/us/en/insights/news/2015/game-consoles-in-2015-one-stop-shop-for-games-and-entertainment.html

Be careful of significant digits

Your table should not show more accuracy than the accuracy of the data collection. The computer will happily compute an average out to an alarming number of digits, but if you only took measurements to one decimal point then that's as far as you should show any derived (average, min, max, median, etc) values.

Programs may also reduce your significant digits by eliminating trailing zeros (turning 4.20 into 4.2) so you will want to force all the data of the same type collected in the same way to have the same number of significant digits.
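
Here is a small Python sketch of both points, assuming a set of hypothetical readings that were each taken to one decimal place (the values and the helper name fixed_decimals are made up for the example). The average is computed at full precision but displayed at the precision of the measurements, and a fixed number of decimals is forced so trailing zeros are not dropped.

```python
from statistics import mean

# Hypothetical measurements, each recorded to one decimal place.
readings = [4.2, 3.9, 4.6, 4.0, 4.3, 4.1, 4.4]

average = mean(readings)
print(average)             # many more digits than were ever measured
shown = f"{average:.1f}"   # display limited to the measurement precision
print(shown)


def fixed_decimals(values, decimals=1):
    """Format every value with the same number of decimals, keeping zeros."""
    return [f"{value:.{decimals}f}" for value in values]


print(fixed_decimals(readings))
```

Note that the full-precision average is still kept in the variable; only the displayed text is shortened.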

For presentations, your tables should only show as much accuracy as needed to get your point across. If two values differ by 100 then you don't need to show those values to the third decimal place. The additional detail in the numbers gets in the way of seeing the bigger trend. You can keep another slide hidden in the slide morgue after the end of your talk that has all the explicit details in case someone is interested.

Here is another table from the same Nielsen page. Again left justifying the numbers makes things harder to read, but there is also an issue of significant digits. We can presume since they have been in the survey business a long time that they do have faith in their data out to that degree of significance, and very likely that number of digits is necessary to disambiguate data further down the table, but since they are just presenting the top 10, the extra digits get in the way.

Original table for videogame playing time

The next version makes it easier to see the overall relationships. Another possible change would be to convert the data on minutes per week into data on hours per week. It's hard to have an intuitive sense of '546 minutes'. If you are telling a friend how long a movie you saw last night was do you say it was 140 minutes long, or do you say 2 hours and 20 minutes long?
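
The minutes-to-hours conversion is a one-liner in most languages. A minimal Python sketch, where the function name and the output wording are just assumptions for the example:

```python
def minutes_to_hours_text(minutes):
    """Turn a whole number of minutes into an 'hours and minutes' string."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins:02d} min"


print(minutes_to_hours_text(546))
print(minutes_to_hours_text(140))
```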

Keep your audience in mind when creating a table. You will want to keep all of your data in its highest resolution form, but when you present it, present just the right amount of detail for the people you will be speaking to. More technical people will want more detail; less technical people will want the information at a higher level. Some people want to see detailed trends, others just overall trends. Don't reuse your charts for different audiences, create new ones targeted towards the specific audience.

I should point out that if I was creating these tables myself for these notes then I would use white text on a black background, since these web pages have a black background

Revised table for videogame playing time

A bit more on text. You have several general choices of font styles to use

sans-serif (e.g. Verdana, Tahoma, Helvetica) good for on-screen text - e.g. 72 dpi
serif (e.g. Georgia) - good for printed text - e.g. 150-300 dpi. As screen resolutions increase serif fonts become more appropriate.
monospace (e.g. Courier) - good for certain occasions when you need exact alignment of the text, usually while coding
fantasy / cute / brush strokes / cursive / dripping blood - just say no, unless you are creating a party invitation

And one font, comic sans, deserves some mention on its own. Here is one good link (with profanity) about comic sans. In the summer of 2011 there were quite a few blog posts devoted to a 100 page US Army PowerPoint presentation using comic sans e.g. this one.

Scientists do this kind of thing as well. How long does it take for you to read the title screen here? - https://www.youtube.com/watch?v=nLacmrM5xQw

Since we are focusing on interactive computer-based visualizations, you should start with a sans-serif font like Helvetica and only change it if you have a very good reason, or you work with a graphic designer who picks an appropriate font since they are trained to know when to bend and break the rules.

Here is a nice infographic on type - https://s3.amazonaws.com/buzzfeed-media/Images/2011/08/2OVMi.png

And if you are really into this kind of thing, the 2007 documentary 'Helvetica' is much more interesting than you might expect - https://www.imdb.com/title/tt0847817/

Hellvetica is a font that intentionally breaks kerning rules - once you start to see kerning there are a fair number of signs out there that look like this - https://hellveticafont.com/

Familiar words are recognized by shape

lower case words are read faster than words in upper case
individual letters and nonsense words like UA1416 are read faster in upper case

O lny srmat poelpe can raed tihs.
I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, t he olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rgh it pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt! if you can raed tihs psas it on !!"

Choosing the right font makes it easy for people to recognize the words by shape and read efficiently.

Simple charts

Avoid 3D - really. 3D may make a chart 'exciting' but it also makes it much harder to see relationships and can actively distort the data. Drop shadows are OK if you really want a 3D look, but if in doubt you should remove any bling
Avoid fully saturated colours (e.g. 255 0 0 red). Look around you, most of the world is not bright primary colours. Pastels and colour mixtures are easier on the eyes (note the difference in Microsoft Excel's default colour palettes at the bottom of these notes). ggplot2 has a wide range of colors by name - http://sape.inf.usi.ch/quick-reference/ggplot2/colour
Choose an appropriate value for the axis to cross depending on what you want the user to see in the chart (e.g. should the smallest value be 0, which may minimize differences in the data, or closer to the low point in your dataset, which will emphasize the differences in the data, but may hide the overall picture)
The backdrop colour of the chart should match the backdrop of the page/slide you will put it on. For a paper the backdrop should be white. For my notes the backdrop should be black
If it is a line chart then make sure all the lines are visible against the backdrop - a yellow line is not visible against a white page and dark blue is not visible against black. Do not just accept the default colours. If certain lines are related to other lines then their colours should be related to each other
If it is a scatter plot of points then make sure the points are visible against the backdrop. If there is more than one point icon (e.g. circles and triangles, or red circles and blue circles) then make sure the viewer can see the difference from where they will be sitting in a room, or on the device they are most likely using. It can be very hard to see colour differences on small icons.
If you are labeling the data points, columns etc. with their actual values (which is often a good idea) then make sure the text is readable against any lines in the background and follow the rules for text above.
Make sure all axes are well labelled in terms of what the axis is and the units / categories
Make sure there is a meaningful title
Make sure there is a legend explaining what the various lines / points / colours mean. The legend should be ordered in a logical way - e.g. if the lines / points / colours are based on a value like temperature then the legend should be ordered from low to high (or high to low depending on the field). If the legend shows several independent things then the legend should be ordered in the same order as the general trend in the graph (i.e. things that are generally at the top of the graph should be at the top of the legend, things that are generally at the bottom of the graph should be at the bottom of the legend) to make it easier for the user to map between them.
If certain data values relate to one another then they should be next to each other or have a similar color, so it's easier for the viewer to see the relationship. Conversely, things that are not related should have different colors.
If you have a series of graphs showing related results then try to keep the same axis extents, colours, patterns etc on all the charts so it's easy to compare across them
It can be very important in certain fields to show error bars on your charts to show how accurate the measurements are
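
To make the point about where the axis crosses a bit more concrete, here is a small Python sketch that compares how tall two bars are drawn relative to each other for different axis starting values. The two data values and the function name bar_height_ratio are hypothetical, chosen only for illustration.

```python
def bar_height_ratio(smaller, larger, baseline=0.0):
    """Ratio of the drawn bar heights when the value axis starts at baseline."""
    return (larger - baseline) / (smaller - baseline)


# Hypothetical values, e.g. two measurements in the same units.
low, high = 280.0, 300.0

from_zero = bar_height_ratio(low, high)            # axis starts at 0
from_270 = bar_height_ratio(low, high, 270.0)      # axis starts near the data
print(round(from_zero, 3), from_270)
```

With the axis at 0 the second bar is only slightly taller than the first, which matches the real difference but may hide it. With the axis at 270 the second bar is drawn three times as tall, which emphasizes the difference but exaggerates it if the viewer misses the axis labels.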

Charts are pretty ubiquitous in visualization and visual analytics, usually used in combination with several other representations of the same data (e.g. geographic) so it's important to get the charts right. It's also important to think about the differences between a static chart used in print or on the web, and a dynamic chart that the user can manipulate by hovering over elements with a mouse or clicking on an item in the chart, or having the chart dynamically update based on interactions with other elements of the visualization.

Here is an example charting the population of the USA from 2000 to 2007. First up is an overly dynamic 3D bar chart with a hard to read set of population numbers and a trend that is made even more pronounced by the 3D viewpoint. Please do not create charts like this.

Here is a less exciting but much more useful version where the data is shown in a 2D bar chart and the population values have commas to make it easier to see what the numbers actually are. Another good possibility would be to make the vertical column "Population (in Millions)" and then have 270, 275, 280 etc as the vertical values.

Chromatic Colour

hue:

distinguishes between colours (e.g. red, blue, green)

saturation:

how far is the colour from a grey of equal intensity
vivid colours (bright red, royal blue) are highly saturated, further from grey
pastel colours (pink, sky blue) are lightly saturated, closer to grey

brightness:

perceived intensity of a luminous object
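
If you want to see these three quantities as numbers, Python's standard colorsys module converts RGB into hue, saturation, and value (HSV). This is only a sketch: HSV is one of several colour models, its 'value' is a rough stand-in for perceived brightness rather than a true perceptual measure, and the helper name describe and the RGB triple used for pink are assumptions for the example.

```python
import colorsys


def describe(rgb255):
    """Return (hue in degrees, saturation 0-1, value 0-1) for an 8-bit RGB triple."""
    r, g, b = (channel / 255 for channel in rgb255)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return round(h * 360), round(s, 2), round(v, 2)


vivid_red = describe((255, 0, 0))      # fully saturated red
pink = describe((255, 192, 203))       # a pastel, much closer to grey
print(vivid_red)
print(pink)
```

The two colours have a similar hue and the same value, but the pastel has a much lower saturation.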

Unfortunately we can not generate all the colours that the eye can see using an RGB CRT or LCD or LED or OLED at this point. We also can not generate all the colours that the eye can see using photographic film (though it can display a larger part of the visible spectrum than our current displays)

Some advice on the use of colour:

Use colour conservatively
Limit the number of colours
Colour can speed recognition, or hinder it depending on what is coloured and how it's coloured. Colour must support the task(s)
Colour can help in grouping related items
Colour can help in dense information displays
Colour coding should appear with minimal user effort and be under the user's control
Keep colour blindness in mind (see above)
Be consistent in the use of colour
Think about what certain colours commonly mean / represent (and this varies from culture to culture)
Be careful what colours are used together

e.g. bright red on bright blue is really really annoying

A really good place to get advice on what colors and sets of colors to use is https://colorbrewer2.org/

We will be talking about colour more during the class and how the choice of colours depends on the data you are representing. Are the colors showing different categories like the videogame consoles where the colors should explicitly not suggest that one color (or console) is 'more' or 'less' than another? Then pick a qualitative color scheme. Are the colors showing sequential or numerical data like the amount of rain expected or the temperature on a weather map where the colors should explicitly suggest that one color is more or less than another? Then pick a sequential color scheme. Are you trying to highlight the data that is more or less than a particular value like which areas are getting more or less than their average amount of rainfall? Then pick a diverging color scheme.

3 kinds of lies: lies, damn lies, and statistics (quote attributed to several different people)

Here is a comparison of 3 graphics of the same data.

The first is from Time Magazine (4/9/79) via Tufte

Oil Prices Represented as Barrels

The second is from the Sunday Times (12/16/79) via Tufte

Oil Prices with odd Y-axis

The modern graphic below from inflationdata.com is a much more truthful representation of the data. Both scales are linear and in easy to understand units. The source of the data is cited. Contextual information is given at interesting points in the graph. The chart on the left shows oil prices. The chart on the right shows gasoline prices, which is something people can relate to more.

Nice graphic, so of course we ask how would you enhance this visualization if it was software-based?

Here is another way to view the price of gasoline - geographically - as gasoline prices in the US as of January 2014 by county from gasbuddy.com. In general prices are pretty similar within each state showing some variety on a zip code basis. While this older 2014 map has issues with color blindness I prefer the more obvious county boundaries compared to the current online map.

https://www.gasbuddy.com/gaspricemap

The line chart tells one truth over time. The maps tell a different truth geographically.

Back to the Lie Factor:

the representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented
lie factor = (size of effect shown in graphic) / (size of effect in data), so a truthful graphic has a lie factor close to 1
clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.
show data variation not design variation
the number of information carrying dimensions depicted should not exceed the number of dimensions in the data
graphics must not quote data out of context
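
As a quick worked example of the lie factor, here is a Python sketch using the fuel economy numbers quoted further down in these notes (standards rising from 18 to 27.5, drawn with lines that differ in size by 783%). The function names are assumptions for the example, and the 783% figure is taken as given rather than measured here.

```python
def percent_change(first, second):
    """Size of an effect, as a percentage change from first to second."""
    return (second - first) / first * 100


def lie_factor(graphic_effect_pct, data_effect_pct):
    """Effect shown in the graphic divided by the effect in the data."""
    return graphic_effect_pct / data_effect_pct


data_effect = percent_change(18, 27.5)   # change in the mileage standard
graphic_effect = 783.0                   # change in drawn line size, as quoted
factor = lie_factor(graphic_effect, data_effect)
print(round(data_effect), round(factor, 1))
```

A lie factor near 1 means the graphic is showing the data honestly; the further it is from 1, the more the design is distorting the effect.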

another one from the New York Times, 8/9/78 via Tufte:
Fuel Economy as a Perspective Road

The mileage standards rise from 18 to 27.5 which is a 53% increase, but the difference in the sizes of the lines representing those values from the New York Times is 783% which is almost 15 times larger ... dramatic, but not truthful. If we graph it without the extra perspective we see the following:

and another one from the Los Angeles Times (8/5/79) via Tufte:

Shrinking Family Doctor Graphic

Here a 1D value is represented by a 2D image. The widths of the images are proportional to the values being represented, and the heights of the images are also proportional to the values, which makes the visual differences much greater than the differences in the actual data. If you need to use a series of 2D images to represent a series of 1D values then the 2D areas of those images should be proportional to the values.
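
One way to keep the areas proportional is to scale the width and height of each image by the square root of the value ratio, instead of by the ratio itself. A minimal Python sketch, where the function names and the sample sizes are assumptions for the example:

```python
import math


def area_proportional_size(base_width, base_height, base_value, value):
    """Scale an image so its AREA is proportional to value / base_value."""
    scale = math.sqrt(value / base_value)
    return base_width * scale, base_height * scale


def naive_area_ratio(base_value, value):
    """Area ratio you get if width and height are BOTH scaled by the value ratio."""
    return (value / base_value) ** 2


# A value one quarter of the baseline, drawn from a 100 x 100 image.
width, height = area_proportional_size(100, 100, 1.0, 0.25)
print(width, height, (width * height) / (100 * 100))
print(naive_area_ratio(1.0, 0.25))
```

With the square root scaling the drawn area is one quarter of the original, matching the data. With the naive scaling the drawn area shrinks to one sixteenth, which is the kind of exaggeration seen in the graphic above.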

A more classical (boring) way to look at the same data is below:

The two line charts here show two different ways to treat the X axis. In one case on the left '1964', '1975', and '1990' are treated as categorical values, like 'Xbox' and 'Wii', and are equally spaced apart. On the right those values are treated as numbers and are spaced apart according to their values as numbers. The one on the right is more appropriate as it gives a truer indication of the rate of change. (Note that a similar issue came up when we were looking at the temperature data last week where the temperatures could be considered categories or numbers.)
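
The difference between the two treatments is just a difference in where each year lands along the axis. Here is a small Python sketch that computes both sets of positions on a 0 to 1 scale; the function names are hypothetical, and a charting library would normally do this for you once the column has the right type.

```python
years = [1964, 1975, 1990]


def categorical_positions(labels):
    """Equally spaced positions, ignoring what the labels mean."""
    count = len(labels)
    return [index / (count - 1) for index in range(count)]


def numeric_positions(values):
    """Positions proportional to the numeric values themselves."""
    lowest, highest = min(values), max(values)
    return [(value - lowest) / (highest - lowest) for value in values]


print(categorical_positions(years))
print([round(p, 3) for p in numeric_positions(years)])
```

Treated as categories, 1975 sits exactly halfway along the axis. Treated as numbers it sits before the midpoint, because the first gap is 11 years and the second is 15, so a line over the second interval is less steep than the categorical chart suggests.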

here is a nice one from http://www.datavis.ca/gallery/lie-factor.php
Doctor's Income

There are some graphical embellishments but basically we have two bar charts showing two roughly linear series of data ... so what's wrong?

Below is a more truthful version of the data where the X-axis is spread out linearly:

Doctor's Income as a Simple Chart

and finally what Tufte considered one of the worst

what is this chart telling us? It is telling us the percentage of college students that were under 25 from 1972 through 1976. That's only 5 values.

Year	Percentage
1972	72.0
1973	70.8
1974	67.2
1975	66.4
1976	67.0

On the left is a line chart from http://www.fao.org/worldfoodsituation/en/ showing food prices over the last four years with the years overlapping, which can help show common seasonal variations. On the right is a chart showing prices and inflation adjusted prices over the last 60 years. Pick the correct axis to investigate or highlight the trends you are interested in.

Here is another county-based map - Pop vs Soda vs Coke from http://www.popvssoda.com/

Pop-Soda-Coke Distribution in the US

There is also a version of this data on a state-by-state basis. What trends would be hidden by a state-by-state view?

https://www.e-education.psu.edu/geog160/sites/www.e-education.psu.edu.geog160/files/image/Chapter03/Coke_Pop_Soda.png

While the gratuitous use of 3D is usually something to avoid, sometimes it can be very useful. Here is an interesting (flash based) map of US population from Time Magazine that uses elevation to try and show the extreme differences in population density across the US. This gives a much more visceral feeling to the magnitudes that can be hard to convey with color.

http://web.archive.org/web/20101130190242/http://www.time.com/time/covers/20061030/where_we_live/

Time Multimedia - This is Where We Live

and an interactive version showing population for the whole planet: https://pudding.cool/2018/10/city_3d/

similar maps were helpful when graphing the spread of COVID.

another issue is how big to make your visualizations

1920 x 1080 is still pretty safe at the desktop level. We have higher resolution 4K etc displays but it can be hard to make use of all that resolution.

Given the movement to smartphones, the most common resolutions are back to the resolution that desktop computers were at in the mid 2000s, and now with wearables becoming more popular, they are back to the resolutions of desktop computers in the 1980s (though with much better colours).

It's very important to design for the specific platform in terms of physical size, resolution, colour representation, and where and when that platform will be used (indoors or outdoors, in bright sunlight or at night, etc.)

Responsive web design techniques and toolkits are usually pretty good about moving and scaling content but you need to make sure your visualizations are responsive as well, and that any interfaces to them are usable at that scale.

https://gs.statcounter.com/screen-resolution-stats

For the rest of class on Thursday, say 30-45 minutes, I want people to revisit the Jupyter utility data scavenger hunt from last Thursday, starting a new R notebook for your exploration. This time also think about legends, proper colors, and proper axes in the way the visualizations are presented, given the discussion above. Look for evidence of some of the events listed in the Week 2 notes if you didn't do that in Week 2. Use markdown cells to add comments to the different parts of your notebook. Make sure your charts work with a color blindness checker and show a screenshot of at least one of your charts as seen through a color blindness checker in your notebook (you can add images to cells of type markdown).

Again you should print out a copy of your notebook to a PDF and turn that in via Gradescope.

This is also a good week to finish up the R and Shiny tutorials if you have not gone through them yet and start working with the data for Project 1.

Coming Next Time

Information Visualization

last revision 1/25/22