Week 3

The Basics



The purpose of visualization is to make it easy for the user to see the patterns, the similarities, the differences in the data.

This involves the variation in the data itself, the variation in the representation of that data, and the ability of a human being to perceive variation.

In general you do not want to let the computer use its default values. Unless you are using a specific program for a specific field the default values will not be right for your work. This is especially true of programs like Word and Excel (though both have improved a lot in the last couple years in this regard).

Tables and simple graphs are going to come up fairly often in visual analytics. Even with all the fancy new visualization options we have there are still very good reasons to use simple tables and charts when possible.


Let's start with tables - the format of a table can greatly enhance or reduce the readability.


Here is a table from the US Environmental Protection Agency from 2002 - the Total Emissions column of data is centered making it very hard to compare the values within, the source sector is centered making it harder to read.

National Carbon_Monoxide Emissions in 2002
       Source Sector         Total Emissions
  Electricity Generation         652,314
           Fires               14,520,530
  Fossil Fuel Combustion        1,499,367
   Industrial Processes         2,414,055
       Miscellaneous             33,786
    Non Road Equipment         22,414,896
     On Road Vehicles          62,957,908
Residential Wood Combustion     2,704,197
         Road Dust                  0
        Solvent Use               3,294
      Waste Disposal            2,018,496


A better version of the table would be the following where both the sources and the amount of emissions are easier to see and quickly grasp:

National Carbon_Monoxide Emissions in 2002
Source Sector                Total Emissions
Electricity Generation               652,314
Fires                             14,520,530
Fossil Fuel Combustion             1,499,367
Industrial Processes               2,414,055
Miscellaneous                         33,786
Non Road Equipment                22,414,896
On Road Vehicles                  62,957,908
Residential Wood Combustion        2,704,197
Road Dust
Solvent Use                            3,294
Waste Disposal                     2,018,496
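If you are producing a table like this from code rather than by hand, the same layout rules apply: left justify the text, right justify the numbers, and add thousands separators. Here is a minimal sketch using Python's format specifiers; the function name format_table is just something made up for this example, and the values are the ones from the EPA table above.

```python
# Emissions values copied from the EPA table above.
emissions = [
    ("Electricity Generation", 652314),
    ("Fires", 14520530),
    ("Fossil Fuel Combustion", 1499367),
    ("Industrial Processes", 2414055),
    ("Miscellaneous", 33786),
    ("Non Road Equipment", 22414896),
    ("On Road Vehicles", 62957908),
    ("Residential Wood Combustion", 2704197),
    ("Road Dust", 0),
    ("Solvent Use", 3294),
    ("Waste Disposal", 2018496),
]


def format_table(rows, name_header="Source Sector", value_header="Total Emissions"):
    """Return text lines with names left justified and numbers right justified."""
    name_width = max(len(name_header), *(len(name) for name, _ in rows))
    value_width = max(len(value_header), *(len(f"{value:,}") for _, value in rows))
    lines = [f"{name_header:<{name_width}}  {value_header:>{value_width}}"]
    for name, value in rows:
        # '<' left justifies the text, '>' right justifies the number,
        # and ',' inserts the thousands separators.
        lines.append(f"{name:<{name_width}}  {value:>{value_width},}")
    return lines


for line in format_table(emissions):
    print(line)
```

In a monospaced font every number then ends in the same column, so bigger numbers visibly stick out further to the left.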


Here is a made-up table - it's hard to see any pattern in the Yes/No values.


Yes
No
Yes
Yes
No
No
No
Yes
Yes
No
No
No
Yes
Yes
No


A better version (if all of the cells are filled with one of two values) would be:

Yes
-
Yes
Yes
-
-
-
Yes
Yes
-
-
-
Yes
Yes
-


A different, better version of the table using colour to help highlight the pattern would be:

Yes
No
Yes
Yes
No
No
No
Yes
Yes
No
No
No
Yes
Yes
No


Here is a table from the Nielsen Games page:
https://www.nielsen.com/us/en/insights/reports/2018/us-games-360-report-2018.html

The Usage Min % column is hard to read because it's left justified.

Original table for game console usage


This version below is easier to read because the right column of numbers is right justified. The decimal points align and bigger numbers look bigger. I also moved the text off the grid lines to make them more readable.

Revised table for game console usage

For some more recent related data you can check out:

https://www.nielsen.com/us/en/insights/news/2015/game-consoles-in-2015-one-stop-shop-for-games-and-entertainment.html


Be careful of significant digits

Your table should not show more accuracy than the accuracy of the data collection. The computer will happily compute an average out to an alarming number of digits, but if you only took measurements to one decimal place then that's as far as you should show any derived (average, min, max, median, etc.) values.

Programs may also reduce your significant digits by eliminating trailing zeros (turning 4.20 into 4.2) so you will want to force all the data of the same type collected in the same way to have the same number of significant digits.
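As a small sketch of both points in Python: the readings below are hypothetical measurements taken to one decimal place, and format_fixed is a helper name made up for this example.

```python
from statistics import mean

# Hypothetical measurements, each recorded to one decimal place.
readings = [4.2, 3.9, 4.0, 4.4, 4.1, 4.3, 4.0]


def format_fixed(value, decimals):
    """Format a number with a fixed number of decimal places, keeping trailing zeros."""
    return f"{value:.{decimals}f}"


average = mean(readings)
print(average)                   # the raw average carries far more digits than the data
print(format_fixed(average, 1))  # reported to the same precision as the measurements

# Fixed-point formatting keeps the trailing zero that default printing drops.
print(4.20)                      # prints 4.2
print(format_fixed(4.20, 2))     # prints 4.20
```

Keep the unrounded values in your data and only round at the point where you display them.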

For presentations, your tables should only show as much accuracy as needed to get your point across. If two values differ by 100 then you don't need to show those values to the third decimal place. The additional detail in the numbers gets in the way of seeing the bigger trend. You can keep another slide hidden in the slide morgue after the end of your talk that has all the explicit details in case someone is interested.

Here is another table from the same Nielsen page. Again, left justifying the numbers makes things harder to read, but there is also an issue of significant digits. We can presume, since they have been in the survey business a long time, that they do have faith in their data out to that degree of significance, and very likely that number of digits is necessary to disambiguate data further down the table, but since they are just presenting the top 10, the extra digits get in the way.

Original table for videogame playing time


The next version makes it easier to see the overall relationships. Another possible change would be to convert the data on minutes per week into data on hours per week. It's hard to have an intuitive sense of '546 minutes'. If you are telling a friend how long a movie you saw last night was, do you say it was 140 minutes long, or do you say 2 hours and 20 minutes long?
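The conversion itself is a one-liner; here is a small sketch (the function name is made up for this example) using the two numbers mentioned above.

```python
def minutes_to_hours_text(total_minutes):
    """Turn a count of minutes into an 'H hours and M minutes' string."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours and {minutes} minutes"


print(minutes_to_hours_text(546))        # 9 hours and 6 minutes
print(minutes_to_hours_text(140))        # 2 hours and 20 minutes
print(f"{546 / 60:.1f} hours per week")  # 9.1 hours per week
```

For a table column, the single decimal version (9.1 hours) is usually the better choice since it still right justifies cleanly.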

Keep your audience in mind when creating a table. You will want to keep all of your data in its highest resolution form, but when you present it, present just the right amount of detail for the people you will be speaking to. More technical people will want more detail; less technical people will want the information at a higher level. Some people want to see detailed trends, others just overall trends. Don't reuse your charts for different audiences; create new ones targeted towards the specific audience.


I should point out that if I were creating these tables myself for these notes then I would use white text on a black background, since these web pages have a black background.

Revised table for videogame playing time





A bit more on text. You have several general choices of font styles to use.

And one font, Comic Sans, deserves some mention on its own. Here is one good link (with profanity) about Comic Sans. In the summer of 2011 there were quite a few blog posts devoted to a 100 page US Army PowerPoint presentation using Comic Sans, e.g. this one.

Scientists do this kind of thing as well. How long does it take for you to read the title screen here? - https://www.youtube.com/watch?v=nLacmrM5xQw

Since we are focusing on interactive computer-based visualizations, you should start with a sans-serif font like Helvetica and only change it if you have a very good reason, or you work with a graphic designer who picks an appropriate font since they are trained to know when to bend and break the rules.

Here is a nice infographic on type -
https://s3.amazonaws.com/buzzfeed-media/Images/2011/08/2OVMi.png

And if you are really into this kind of thing, the 2007 documentary 'Helvetica' is much more interesting than you might expect - https://www.imdb.com/title/tt0847817/

Hellvetica is a font that intentionally breaks kerning rules - once you start to see kerning there are a fair number of signs out there that look like this - https://hellveticafont.com/

Familiar words are recognized by shape

O lny srmat poelpe can raed tihs.
I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, t he olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rgh it pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt! if you can raed tihs psas it on !!"

Choosing the right font makes it easy for people to recognize the words by shape and read efficiently.






Simple charts


Charts are pretty ubiquitous in visualization and visual analytics, usually used in combination with several other representations of the same data (e.g. geographic), so it's important to get the charts right. It's also important to think about the differences between a static chart used in print or on the web, and a dynamic chart that the user can manipulate by hovering over elements with a mouse or clicking on an item in the chart, or having the chart dynamically update based on interactions with other elements of the visualization.


Here is an example charting the population of the USA from 2000 to 2007. First up is an overly dynamic 3D bar chart with a hard to read set of population numbers and a trend that is made even more pronounced by the 3D viewpoint. Please do not create charts like this.

Overly 3D chart of US population

 


Here is a less exciting but much more useful version where the data is shown in a 2D bar chart and the population values have commas to make it easier to see what the numbers actually are. Another good possibility would be to label the vertical axis "Population (in Millions)" and then have 270, 275, 280 etc. as the vertical values.

More readable 2D chart of US population

Here are a couple variants using lines with the actual data points highlighted, so you know what data was collected and what was interpolated. The big difference is in the Y-axis. One chart suggests there is slow steady growth; the other suggests rapid steady growth. When you choose the values for the Y-axis here you are making a statement about what the user will see as their first impression - you can't escape that.

Vertical scale suggesting slow growth
Vertical scale suggesting rapid growth
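To put a number on how much the choice of Y-axis changes the first impression, here is a rough sketch that computes how high each data point is drawn for two different axis ranges. The population figures are approximate, rounded values used only for illustration (not the exact numbers behind the charts above), and the two axis ranges are assumptions as well.

```python
# Approximate, rounded US population in millions; illustrative values only.
population = {2000: 282, 2007: 301}


def drawn_height(value, axis_min, axis_max):
    """Fraction of the plot height a value reaches on a linear axis."""
    return (value - axis_min) / (axis_max - axis_min)


# First range starts the axis at zero; the second zooms in on the data.
for axis_min, axis_max in [(0, 350), (280, 305)]:
    first = drawn_height(population[2000], axis_min, axis_max)
    last = drawn_height(population[2007], axis_min, axis_max)
    print(f"axis {axis_min}-{axis_max}: 2007 is drawn {last / first:.1f}x as high as 2000")
```

The data is identical in both cases; only the axis range differs, and with it the visual size of the change.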


Bar charts are good for showing discrete (or categorical) values while line charts are good for showing trends. Both are applicable here since the population of the US has a continuous value from 2000 to 2007, but it was measured at discrete intervals (years).

In all these cases a simple interaction is to allow the user to hover or click on one of the boxes and see the actual data values, or dynamically change the Y axis.


And now let's go back to the video game console data from above.

First let's see a couple charts from the older version of Microsoft Excel which just wasn't very good at making charts. While technically correct, the colours seem random and hurt your eyes, etc. Please do not create charts like this.


The newer versions of Excel are much better at dealing with colours and layout, but have also included lots of 3D bling that should be avoided. 3D distorts the data and adds in unnecessary details that make it harder to see what's really going on. Please do not create charts like this.

Instead we can display the data without the 3D. By default Excel will pick the colours for the various data values as seen above. If the data values are unrelated then the colours should be unrelated, but here we could also use the colour to group consoles by manufacturer (blue for Sony, red for Nintendo, green for Microsoft, and grey for Other, with the more saturated colours for their latest releases).
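One possible way to generate such a colour scheme in code is to fix the hue per manufacturer and vary only the saturation. This is only a sketch: the saturation and brightness levels are assumed values, the console list is illustrative, and console_colour is a name made up for this example.

```python
import colorsys

# One hue per manufacturer (as fractions of the 360 degree colour wheel):
# blue for Sony, red for Nintendo, green for Microsoft.
MANUFACTURER_HUE = {"Sony": 240 / 360, "Nintendo": 0.0, "Microsoft": 120 / 360}


def console_colour(manufacturer, latest):
    """Return '#rrggbb': same hue per manufacturer, more saturated for the latest release."""
    if manufacturer not in MANUFACTURER_HUE:
        return "#808080"  # neutral grey for 'Other'
    saturation = 0.9 if latest else 0.4  # assumed saturation levels
    r, g, b = colorsys.hsv_to_rgb(MANUFACTURER_HUE[manufacturer], saturation, 0.85)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


# Illustrative console list; which release counts as 'latest' is an assumption.
consoles = [
    ("Xbox 360", "Microsoft", True),
    ("Xbox", "Microsoft", False),
    ("Wii", "Nintendo", True),
    ("Other", "Other", False),
]
for name, maker, latest in consoles:
    print(f"{name:10s} {console_colour(maker, latest)}")
```

You would still want to run the resulting palette through a colour blindness checker, since red and green at the same brightness are exactly the pair that causes trouble.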

The pie chart makes it easier to see how each console compares to the whole, but the bar chart makes it easy to see how they compare to each other. A bar chart makes it easier to estimate the actual amount compared to a pie chart if you don't have the actual values displayed. Bar charts are better the more categories you have as the slices of the pie get smaller and harder to discern with more and more categories. In an analysis tool you may need both views simultaneously, and then additional visualizations to see the values over time.

Another option for the same data is a stacked bar chart, which makes it easier to estimate numbers from the chart and can make better use of space than a pie chart.

A line chart would not be appropriate to show this data because the data is categorical (an Xbox 360 is not 'more' or 'less' than a Wii; it's just a different category), there is no natural ordering between the categories, and there is no continuous space between an Xbox 360 and a Wii for a line to show a trend.


Here are a couple other pie chart examples. A good one comes from:
https://flowingdata.com/2008/09/19/pie-i-have-eaten-and-pie-i-have-not-eaten/


A bad one comes from our local Fox News affiliate:
https://flowingdata.com/2009/11/26/fox-news-makes-the-best-pie-chart-ever/



There are many different kinds of charts


A really good book to look at for an introduction to this sort of thing is Edward Tufte's 'The Visual Display of Quantitative Information.'

Another good reference is Robert Harris' Information Graphics - A Comprehensive Illustrated Reference. Here is a nice overview of different kinds of charts

Another nice interactive icon based list is https://datavizcatalogue.com/

There are a variety of diagrams people have come up with to try and help people choose what kind of chart to use in different situations - a good list is available at https://policyviz.com/2014/10/06/graphic_continuum_inspiration/ - and just as many critiques.

In general we will try and stick to very common chart types in this class, ones you are most likely to encounter, and ones that are most likely to be easily understood by people you want to present to. As you move deeper into general data visualization, and especially data visualization in particular fields you will find very particular types of charts being used that are better for that particular type of data and usage.




Color

Before you create a chart you should know whether it will eventually appear in colour or greyscale. Color is more prevalent today than in the past since more people are getting their information in digital form, but some conferences and journals will still only print in greyscale, and some people still get their information through greyscale photocopies.
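One rough way to check in advance whether your colours will survive greyscale is to compare their luminance. The sketch below uses the WCAG relative luminance formula as a stand-in for what a greyscale print would show, which is only an approximation of any real printer or photocopier. The palette values and the MIN_GAP threshold are assumptions chosen for illustration.

```python
def relative_luminance(rgb):
    """WCAG relative luminance of an sRGB colour given as 0-255 integers."""
    def linear(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(channel) for channel in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Hypothetical chart palette; names and RGB values are for illustration only.
palette = {"red": (214, 39, 40), "green": (44, 160, 44), "blue": (31, 119, 180)}

MIN_GAP = 0.05  # assumed threshold; tune it by looking at a real greyscale print

# Sort from darkest to lightest and compare neighbours.
names = sorted(palette, key=lambda name: relative_luminance(palette[name]))
for darker, lighter in zip(names, names[1:]):
    gap = relative_luminance(palette[lighter]) - relative_luminance(palette[darker])
    verdict = "ok" if gap >= MIN_GAP else "too similar in greyscale"
    print(f"{darker} vs {lighter}: luminance gap {gap:.3f} ({verdict})")
```

Colours that are easy to tell apart by hue can land very close together here, which is exactly the case that disappears on a photocopy.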

It would be good if the colours you choose also work for people who are colour blind.

8 percent of men
1 percent of women

Are you colour blind? You can see at https://www.color-blindness.com/ishiharas-test-for-colour-deficiency-38-plates-edition/

Here is an image of a color wheel seen with Protanope colour blindness.

  

You should at least make sure that your data doesn't blend together or disappear for people who are colour blind. A really good way is to avoid using green in your charts since red/green is the most common form of colour blindness, but that can be pretty limiting. Photoshop can be used to check images (View menu, Proof Setup, Color Blindness), and a good web site to check your graphics is: https://www.toptal.com/designers/colorfilter

And of course there are many apps for showing what colour blind people see using your smartphone's camera. One fairly nice one is Chromatic Vision Simulator for iOS and Android.

Try it out on a weather map like this one: https://www.wunderground.com/maps/temperature/us-current




Chromatic Colour

hue:
    distinguishes between colours (e.g. red, blue, green)

saturation:
    how far is the colour from a grey of equal intensity
    vivid colours (bright red, royal blue) are highly saturated, further from grey
    pastel colours (pink, sky blue) are lightly saturated, closer to grey

brightness:
    perceived intensity of a luminous object
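Python's standard colorsys module can pull these components out of an RGB colour, which is a handy way to get a feel for the terms. A small sketch follows; the RGB values are rough stand-ins for the colours named above, and note that the 'value' in HSV is only a crude proxy for perceived brightness.

```python
import colorsys

# RGB values (0-1 range) chosen as rough examples of the colours named above.
examples = {
    "bright red": (1.0, 0.0, 0.0),
    "pink": (1.0, 0.75, 0.8),
    "mid grey": (0.5, 0.5, 0.5),
}

for name, (r, g, b) in examples.items():
    hue, saturation, value = colorsys.rgb_to_hsv(r, g, b)
    print(f"{name:10s} hue={hue * 360:5.1f} deg  saturation={saturation:.2f}  value={value:.2f}")
```

Bright red comes out fully saturated, pink has the same general hue but low saturation, and grey has no saturation at all.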

Unfortunately we cannot generate all the colours that the eye can see using an RGB CRT or LCD or LED or OLED at this point. We also cannot generate all the colours that the eye can see using photographic film (though it can display a larger part of the visible spectrum than our current displays).


Some advice on the use of colour:


A really good place to get advice on what colors and sets of colors to use is https://colorbrewer2.org/

We will be talking about colour more during the class and how the choice of colours depends on the data you are representing. Are the colors showing different categories like the videogame consoles where the colors should explicitly not suggest that one color (or console) is 'more' or 'less' than another? Then pick a qualitative color scheme. Are the colors showing sequential or numerical data like the amount of rain expected or the temperature on a weather map where the colors should explicitly suggest that one color is more or less than another? Then pick a sequential color scheme. Are you trying to highlight the data that is more or less than a particular value like which areas are getting more or less than their average amount of rainfall? Then pick a diverging color scheme.
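To make the difference between sequential and diverging schemes concrete, here is a toy sketch that maps a value to a colour both ways. The endpoint colours and the rainfall numbers are assumptions for illustration, and simple linear interpolation in RGB is a shortcut; for real work you would take the colours from a designed palette such as the ones at ColorBrewer.

```python
WHITE = (255, 255, 255)
BLUE = (33, 102, 172)   # assumed endpoint colour for 'more'
BROWN = (140, 81, 10)   # assumed endpoint colour for 'less'


def lerp_colour(start, end, t):
    """Linearly interpolate between two RGB tuples, with t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))


def sequential_colour(value, low, high):
    """Light-to-dark ramp: a larger value gives a darker blue."""
    t = min(max((value - low) / (high - low), 0.0), 1.0)
    return lerp_colour(WHITE, BLUE, t)


def diverging_colour(value, midpoint, max_deviation):
    """White at the midpoint, brown below it, blue above it."""
    t = min(max((value - midpoint) / max_deviation, -1.0), 1.0)
    return lerp_colour(WHITE, BLUE if t > 0 else BROWN, abs(t))


# Hypothetical rainfall values in mm.
print(sequential_colour(30, low=0, high=60))
print(diverging_colour(20, midpoint=35, max_deviation=30))
```

A qualitative scheme has no such function at all: it is just a fixed list of distinct hues assigned to categories in no particular order.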




3 kinds of lies: lies, damn lies, and statistics (quote attributed to several different people)

Here is a comparison of 3 graphics of the same data.

The first is from Time Magazine (4/9/79) via Tufte

Oil Prices Represented as Barrels

The second is from the Sunday Times (12/16/79) via Tufte

Oil Prices with odd Y-axis

The modern graphic below from inflationdata.com is a much more truthful representation of the data. Both scales are linear and in easy to understand units. The source of the data is cited. Contextual information is given at interesting points in the graph. The chart on the left shows oil prices. The chart on the right shows gasoline prices, which is something people can relate to more.

Better Visualization of Oil Prices

Nice graphic, so of course we ask how would you enhance this visualization if it was software-based?


Here is another way to view the price of gasoline - geographically - as gasoline prices in the US as of January 2014 by county from gasbuddy.com. In general prices are pretty similar within each state showing some variety on a zip code basis. While this older 2014 map has issues with color blindness I prefer the more obvious county boundaries compared to the current online map.


https://www.gasbuddy.com/gaspricemap

The line chart tells one truth over time. The maps tell a different truth geographically.



Back to the Lie Factor:



Another one from the New York Times, 8/9/78 via Tufte:
Fuel Economy as a Perspective Road

The mileage standards rise from 18 to 27.5 which is a 53% increase, but the difference in the sizes of the lines representing those values from the New York Times is 783% which is almost 15 times larger ... dramatic, but not truthful. If we graph it without the extra perspective we see the following:
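As a quick check of the arithmetic in the fuel economy example: Tufte's lie factor is the size of the effect shown in the graphic divided by the size of the effect in the data. The mileage values below are the ones quoted above; the line lengths of 0.6 and 5.3 inches are the measurements Tufte gives for this graphic in his book, so treat them as an assumption if you don't have it at hand.

```python
def relative_change(start, end):
    """Change from start to end as a fraction of start."""
    return (end - start) / start


def lie_factor(data_start, data_end, drawn_start, drawn_end):
    """Tufte's lie factor: effect shown in the graphic / effect in the data."""
    return relative_change(drawn_start, drawn_end) / relative_change(data_start, data_end)


# Mileage standards from the text; drawn line lengths in inches as reported by Tufte.
print(f"change in data:    {relative_change(18, 27.5):.0%}")
print(f"change in graphic: {relative_change(0.6, 5.3):.0%}")
print(f"lie factor:        {lie_factor(18, 27.5, 0.6, 5.3):.1f}")
```

A truthful graphic has a lie factor close to 1.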




And another one from the Los Angeles Times (8/5/79) via Tufte:

Shrinking Family Doctor Graphic

Here a 1D value is represented by a 2D image. The widths of the images are proportional to the values being represented, and the heights of the images are also proportional to the values, which makes the visual differences much greater than the differences in the actual data. If you need to use a series of 2D images to represent a series of 1D values then the 2D areas of those images should be proportional to the values.
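In code, making the area proportional to the value means scaling the width and height by the square root of the value ratio rather than by the ratio itself. A minimal sketch, where the 100 x 100 pixel base image and the values are hypothetical:

```python
import math


def scaled_size(base_width, base_height, base_value, value):
    """Width and height such that the image AREA is proportional to value."""
    scale = math.sqrt(value / base_value)
    return base_width * scale, base_height * scale


# Hypothetical example: a 100 x 100 pixel image stands for a value of 100.
width, height = scaled_size(100, 100, base_value=100, value=50)
print(f"half the value -> {width:.1f} x {height:.1f} px, "
      f"area ratio {(width * height) / (100 * 100):.2f}")

# Scaling both sides by the value ratio itself (the mistake described above):
naive_width, naive_height = 100 * 0.5, 100 * 0.5
print(f"naive scaling  -> {naive_width:.1f} x {naive_height:.1f} px, "
      f"area ratio {(naive_width * naive_height) / (100 * 100):.2f}")
```

With the naive scaling a value that is half as large is drawn with only a quarter of the area.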

A more classical (boring) way to look at the same data is below:

The two line charts here show two different ways to treat the X axis. In one case on the left '1964', '1975', and '1990' are treated as categorical values, like 'Xbox' and 'wii' and are equally spaced apart. On the right those values are treated as numbers and are spaced apart according to their values as numbers. The one on the right is more appropriate as it gives a truer indication of the rate of change. (note that a similar issue came up when we were looking at the temperature data last week where the temperatures could be considered categories or numbers)
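The difference between the two treatments comes down to where each year is placed along the X axis. A small sketch using the three years from the example (the function names are made up for this illustration):

```python
years = [1964, 1975, 1990]


def categorical_positions(labels):
    """Evenly spaced positions in [0, 1], ignoring the numeric meaning of the labels."""
    return [i / (len(labels) - 1) for i in range(len(labels))]


def numeric_positions(values):
    """Positions in [0, 1] proportional to the values themselves."""
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]


print(categorical_positions(years))                     # 1975 sits exactly in the middle
print([round(p, 3) for p in numeric_positions(years)])  # 1975 sits left of the middle
```

Since 1964 to 1975 is 11 years and 1975 to 1990 is 15 years, even spacing makes the first segment of the line look flatter and the second steeper than they really are.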


Here is a nice one from http://www.datavis.ca/gallery/lie-factor.php
Doctor's Income

There are some graphical embellishments but basically we have two bar charts showing two roughly linear series of data ... so what's wrong?



Below is a more truthful version of the data where the X-axis is spread out linearly:

Doctor's Income as a Simple Chart


And finally what Tufte considered one of the worst:



What is this chart telling us? It is telling us the percentage of college students that were under 25 from 1972 through 1976. That's only 5 values.

Year  Percentage
1972        72.0
1973        70.8
1974        67.2
1975        66.4
1976        67.0




On the left is a line chart from http://www.fao.org/worldfoodsituation/en/ showing food prices over the last four years with the years overlapping, which can help show common seasonal variations. On the right is a chart showing prices and inflation adjusted prices over the last 60 years. Pick the correct axis to investigate or highlight the trends you are interested in.

vs

Here is another county-based map - Pop vs Soda vs Coke from
http://www.popvssoda.com/

Pop-Soda-Coke Distribution in the US

There is also a version of this data on a state-by-state basis. What trends would be hidden by a state-by-state view?

https://www.e-education.psu.edu/geog160/sites/www.e-education.psu.edu.geog160/files/image/Chapter03/Coke_Pop_Soda.png



While the gratuitous use of 3D is usually something to avoid, sometimes it can be very useful. Here is an interesting (flash based) map of US population from Time Magazine that uses elevation to try and show the extreme differences in population density across the US. This gives a much more visceral feeling to the magnitudes that can be hard to convey with color.

http://web.archive.org/web/20101130190242/http://www.time.com/time/covers/20061030/where_we_live/


Time Multimedia - This is Where We Live

And an interactive version showing population for the whole planet: https://pudding.cool/2018/10/city_3d/

Similar maps were helpful when graphing the spread of COVID.



Another issue is how big to make your visualizations

1920 x 1080 is still pretty safe at the desktop level. We have higher resolution 4K etc displays but it can be hard to make use of all that resolution.

Given the movement to smartphones, the most common resolutions are back to the resolution that desktop computers were at in the mid 2000s, and now with wearables becoming more popular, they are back to the resolutions of desktop computers in the 1980s (though with much better colours).

It's very important to design for the specific platform in terms of physical size, resolution, colour representation, and where and when that platform will be used (indoors or outdoors, in bright sunlight or at night, etc.)

Responsive web design techniques and toolkits are usually pretty good about moving and scaling content but you need to make sure your visualizations are responsive as well, and that any interfaces to them are usable at that scale.

https://gs.statcounter.com/screen-resolution-stats



For the rest of class on Thursday I want people to revisit the Jupyter utility data scavenger hunt from last Thursday, starting a new notebook for your exploration, this time also thinking about legends, proper colors, and proper axes in the way the visualizations are presented. Make sure your charts work with a color blindness checker.

Again you should print out a copy of your notebook to a PDF and turn that in via Gradescope.


Coming Next Time

Information  Visualization

last revision 12/10/20