Lecture 2

The Basics, Part I



Here is an image from Information Anxiety, p. 286. This time the focus is on over-designing the graphic. Trying to make the graphic 'exciting' makes it harder to get information from it.

over-angled graphic of US weather map

Here are some more bad ones: http://www.math.yorku.ca/SCS/Gallery/say-something.html
Naturalness is an important design principle: a representation works better when its properties match the properties of the thing being represented. Representations that make use of spatial and perceptual relationships make more effective use of our brains. If a representation uses arbitrary symbols instead, we need mental transformations, mental comparisons and other mental processes to decode it, which forces us to think reflectively. In experiential cognition we perceive and react efficiently; in reflective cognition we use our decision-making skills.

3 kinds of lies: lies, damn lies, and statistics (quote attributed to several different people)

Here is a page with nice examples: http://www.math.yorku.ca/SCS/Gallery/lie-factor.html

Here is a comparison of 3 graphics of the same data. 


The first two both have a high lie factor. Part of the lie in the first figure is not taking inflation into account, but the figure itself 'lies' by using 3-dimensional figures to represent a change in a single dimension. The extra dimensions make the difference seem larger - similar to starting a graph with the axis not at the origin. It also uses foreshortening - pushing the past further back, making it seem smaller than the present in the front.

In the second figure the use of a line graph makes the data more truthful, but look at the labelling of the price axis - it's not a linear scale. Also the second chart isn't really giving the price of crude oil - it's giving the change in price after setting the price in 1972 to 100.

The modern graphic below from inflationdata.com is a much more truthful representation of the data. Both scales are linear and in easy to understand units. The source of the data is cited.




First - Pop vs Soda vs Coke from http://www.popvssoda.com/



- the representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented

        lie factor = (size of effect shown in graphic) / (size of effect in data)

- clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.

- show data variation not design variation

- the number of information carrying dimensions depicted should not exceed the number of dimensions in the data

- graphics must not quote data out of context
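
The lie factor from the list above can be worked out with a few lines of code. This is only a sketch: it assumes that 'size of effect' means the relative change between two values, and the function names and the numbers in the example are made up for illustration.

```python
def effect_size(first: float, second: float) -> float:
    """Relative change from the first value to the second (assumed definition)."""
    return abs(second - first) / abs(first)


def lie_factor(graphic_first: float, graphic_second: float,
               data_first: float, data_second: float) -> float:
    """Size of effect shown in the graphic divided by size of effect in the data."""
    return (effect_size(graphic_first, graphic_second)
            / effect_size(data_first, data_second))


# Made-up example: the data doubles (10 -> 20), but the chart draws it as a
# 3D box whose every edge doubles, so the drawn volume grows from 1 to 8.
print(lie_factor(1.0, 8.0, 10.0, 20.0))  # 7.0
```

A lie factor of 1 means the graphic shows the change at its true size; the further the value is from 1, the more the graphic exaggerates or hides the change.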


The purpose of visualization is to make it easy for the user to see the patterns, the similarities, the differences in the data. This involves both the variation in the data itself and the ability of a human being to perceive variation.

In general you do not want to let the computer use its default values. Unless you are using a specific program for a specific field, the default values will not be right for your work. This is especially true of programs like Word and Excel (though both have improved a lot in the last couple of years in this regard).

Let's start with tables - the format of a table can greatly enhance or reduce the readability.


Here is a table from the EPA - the Total Emissions column of data is centered, making it very hard to compare the values within.

National Carbon_Monoxide Emissions in 2002
Source Sector                Total Emissions
Electricity Generation           652,314
Fires                          14,520,530
Fossil Fuel Combustion          1,499,367
Industrial Processes            2,414,055
Miscellaneous                    33,786
Non Road Equipment             22,414,896
On Road Vehicles               62,957,908
Residential Wood Combustion     2,704,197
Road Dust                           0
Solvent Use                       3,294
Waste Disposal                  2,018,496


A better version of the table would be the following where both the sources and the amount of emissions are easier to see and quickly grasp:

National Carbon_Monoxide Emissions in 2002
Source Sector                Total Emissions
Electricity Generation               652,314
Fires                             14,520,530
Fossil Fuel Combustion             1,499,367
Industrial Processes               2,414,055
Miscellaneous                         33,786
Non Road Equipment                22,414,896
On Road Vehicles                  62,957,908
Residential Wood Combustion        2,704,197
Road Dust                                  0
Solvent Use                            3,294
Waste Disposal                     2,018,496
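
If you are producing a plain-text table like this from a program, the alignment can be done in code. Here is a minimal sketch using the EPA numbers from the table above; the function name `format_table` and the two-space column gap are my own choices, not anything from the EPA page.

```python
ROWS = [
    ("Electricity Generation", 652314),
    ("Fires", 14520530),
    ("Fossil Fuel Combustion", 1499367),
    ("Industrial Processes", 2414055),
    ("Miscellaneous", 33786),
    ("Non Road Equipment", 22414896),
    ("On Road Vehicles", 62957908),
    ("Residential Wood Combustion", 2704197),
    ("Road Dust", 0),
    ("Solvent Use", 3294),
    ("Waste Disposal", 2018496),
]


def format_table(rows, label_header="Source Sector",
                 value_header="Total Emissions"):
    """Left-justify the labels and right-justify the numbers."""
    values = [f"{value:,}" for _, value in rows]  # add thousands separators
    label_width = max([len(label_header)] + [len(label) for label, _ in rows])
    value_width = max([len(value_header)] + [len(text) for text in values])
    lines = [f"{label_header:<{label_width}}  {value_header:>{value_width}}"]
    for (label, _), text in zip(rows, values):
        lines.append(f"{label:<{label_width}}  {text:>{value_width}}")
    return "\n".join(lines)


print(format_table(ROWS))
```

Because every number ends in the same column, the digits line up by place value and the longer numbers really are the bigger ones.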


Here is a made-up table - it's hard to see any pattern in the Yes/No values.


Yes
No
Yes
Yes
No
No
No
Yes
Yes
No
No
No
Yes
Yes
No


A better version of the table would be:

Yes
-
Yes
Yes
-
-
-
Yes
Yes
-
-
-
Yes
Yes
-


A different, better version of the table using colour to help highlight the pattern would be:

Yes
No
Yes
Yes
No
No
No
Yes
Yes
No
No
No
Yes
Yes
No


Here is a table from the Nielsen Games page:
http://blog.nielsen.com/nielsenwire/media_entertainment/top-pc-game-titles-and-consoles-october-2008/

The rightmost column of numbers is hard to read because it's left-justified.


This version is easier to read because the right column of numbers is right-justified. The decimal points line up and bigger numbers look bigger.

Be careful of significant digits

Your table should not show more accuracy than the accuracy of the data collection. The computer will happily compute an average out to an alarming number of digits, but if you only took measurements to one decimal place then that's as far as you should show any derived (average, min, max, median, etc.) values.

Programs may also reduce your significant digits by eliminating trailing zeros (turning 4.20 into 4.2) so you will want to force all the data of the same type collected in the same way to have the same number of significant digits.
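
Both problems (too many digits in derived values, and dropped trailing zeros) can be handled by formatting to a fixed number of decimal places. A small sketch, with made-up readings and a hypothetical helper name `format_fixed`:

```python
from statistics import mean


def format_fixed(value: float, decimals: int) -> str:
    """Format a value with exactly `decimals` digits after the decimal point."""
    return f"{value:.{decimals}f}"


# Made-up readings, each measured to one decimal place.
readings = [4.2, 3.9, 4.6]

average = mean(readings)
print(average)                   # many more digits than were ever measured
print(format_fixed(average, 1))  # cut back to the precision of the readings
print(format_fixed(4.2, 2))      # keeps the trailing zero for 2-decimal data
```

The idea is to pick the number of decimals once per kind of measurement and apply it to every value of that kind, raw or derived.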

For presentations, your tables should only show as much accuracy as needed to get your point across. If two values differ by 100 then you don't need to show those values to the third decimal place. The additional detail in the numbers gets in the way of seeing the bigger trend.


Here is another table from the same Nielsen page. Again left-justifying the numbers makes things harder to read, but there is also an issue of significant digits. We can presume, since they have been in the survey business a long time, that they do have faith in their data out to that degree of significance, and very likely that number of digits is necessary to disambiguate data further down the table, but since they are just presenting the top 10, the extra digits get in the way.



The next version makes it easier to see the overall relationships. Another possible change would be to convert the data on minutes per week into data on hours per week. If you are telling a friend how long the movie you saw last night was, do you say it was 140 minutes long, or do you say 2 hours and 20 minutes long?
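
The minutes-to-hours conversion is easy to do when the table is generated. A minimal sketch; both function names are hypothetical, and whether you want 'hours and minutes' or decimal hours depends on your audience.

```python
def hours_and_minutes(total_minutes: int) -> str:
    """Turn a number of minutes into the way people say it out loud."""
    hours, minutes = divmod(total_minutes, 60)
    hour_word = "hour" if hours == 1 else "hours"
    minute_word = "minute" if minutes == 1 else "minutes"
    return f"{hours} {hour_word} and {minutes} {minute_word}"


def decimal_hours(total_minutes: int, decimals: int = 1) -> str:
    """Turn a number of minutes into hours with a fixed number of decimals."""
    return f"{total_minutes / 60:.{decimals}f}"


print(hours_and_minutes(140))  # 2 hours and 20 minutes
print(decimal_hours(140))      # 2.3
```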

Keep your audience in mind when creating a table. You will want to keep all of your data in its highest resolution form, but when you present it, present just the right amount of detail for the people you will be speaking to. More technical people will want more detail; less technical people will want the information at a higher level. Some people want to see detailed trends, others just overall trends. Don't reuse your charts for different audiences; create new ones targeted towards the specific audience.


I should point out that if I were creating these tables myself then I would use white text on a black background, since these web pages have a black background.


Simple charts


Here is an example charting the population of the USA over the last 8 years. First up is an overly dynamic 3D chart with a hard-to-read set of population numbers and a trend that is made even more pronounced by the 3D viewpoint. Please do not create charts like this.


Here is a less exciting but much more useful version where the data is shown in 2D and the population values have commas to make it easier to see what the numbers actually are. Another good possibility would be to label the vertical axis "Population (in Millions)" and then have 270, 275, 280 etc. as the vertical values.
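
Both labelling options are one-liners if you build the labels yourself. A small sketch with hypothetical helper names; the tick range simply follows the 270, 275, 280 example above.

```python
def with_commas(value: int) -> str:
    """Write a large integer with thousands separators."""
    return f"{value:,}"


def million_ticks(low: int, high: int, step: int) -> list[str]:
    """Axis labels in whole millions, from low to high inclusive."""
    return [str(value // 1_000_000) for value in range(low, high + 1, step)]


print(with_commas(275_000_000))  # 275,000,000
print(million_ticks(270_000_000, 280_000_000, 5_000_000))  # ['270', '275', '280']
```

If you use the millions version, remember to put the unit in the axis title so the short labels are not ambiguous.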

Here are a couple of variants using lines with the actual data points highlighted. The big difference is in the Y-axis. One chart suggests there is slow steady growth; the other suggests rapid steady growth.
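
One rough way to put a number on the Y-axis effect is to ask how much of the chart's height the change takes up. This is just a sketch: the function name is hypothetical and the population values (in millions) and axis ranges are made up, not read off the charts.

```python
def fraction_of_axis(first: float, last: float,
                     axis_min: float, axis_max: float) -> float:
    """Fraction of the plotted height covered by the change from first to last."""
    return (last - first) / (axis_max - axis_min)


# Made-up values: growth from 280 to 300 (millions).
start, end = 280.0, 300.0

# Axis starting at zero: the line rises through a small part of the chart.
print(fraction_of_axis(start, end, 0.0, 350.0))

# Axis cropped tightly around the data: the same growth fills most of the chart.
print(fraction_of_axis(start, end, 275.0, 305.0))
```

The data is identical in both cases; only the axis range changes how steep the growth looks.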



And now let's go back to the video game console data from above.

First let's see a couple of charts from the older version of Excel. The older Excel just wasn't very good at making charts - the colours hurt your eyes, the odd grey background shouldn't be there, etc. It's best just to avoid using the older Excel to make charts. It takes too much time to fix everything that is wrong.


The latest version of Excel is much better at dealing with colours and layout, but has also included lots of 3D bling that should be avoided. 3D distorts the data and adds in unnecessary details that make it harder to see what's really going on. Please do not create charts like this.

Instead we can display the data without the 3D. By default Excel will pick the colours for the various data values as seen above. If the data values are unrelated then that's fine, but here we could also use colour to group the consoles by manufacturer (blue for Sony, red for Nintendo, green for Microsoft, and grey for Other, with the more saturated colours for their latest releases).
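
If you are generating the chart from code rather than from Excel, the same colour scheme can be written down as a small rule. This is a sketch only: the function name, the exact hues and the 0.4 saturation for older consoles are my assumptions, and the consoles listed are just examples, not necessarily the ones in the Nielsen table.

```python
import colorsys

# Hue (0-1) per manufacturer: blue, red and green as described above.
MANUFACTURER_HUE = {"Sony": 240 / 360, "Nintendo": 0.0, "Microsoft": 120 / 360}


def console_colour(manufacturer: str, latest: bool) -> str:
    """Hex colour: hue from the manufacturer, saturation from the release."""
    if manufacturer not in MANUFACTURER_HUE:
        return "#808080"  # grey for Other
    saturation = 1.0 if latest else 0.4  # assumed values, tune to taste
    red, green, blue = colorsys.hsv_to_rgb(
        MANUFACTURER_HUE[manufacturer], saturation, 1.0)
    return "#{:02x}{:02x}{:02x}".format(
        round(red * 255), round(green * 255), round(blue * 255))


# Example consoles (assumed for illustration).
for name, maker, latest in [
    ("PlayStation 3", "Sony", True),
    ("PlayStation 2", "Sony", False),
    ("Wii", "Nintendo", True),
    ("Xbox 360", "Microsoft", True),
    ("PC", "Other", False),
]:
    print(name, console_colour(maker, latest))
```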


Before you create a chart you should know whether it will eventually appear in colour or greyscale. Colour is more prevalent today than in the past but some conferences and journals will still only print in greyscale.
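
A quick sanity check for greyscale printing is to convert each colour to a single brightness value and see whether any two end up too close together. The sketch below uses the standard Rec. 601 luma weights as an approximation; the function names and the `min_gap` threshold of 40 are arbitrary assumptions, not a published rule.

```python
def grey_level(hex_colour: str) -> float:
    """Approximate greyscale brightness (0-255) of a '#rrggbb' colour."""
    red, green, blue = (int(hex_colour[i:i + 2], 16) for i in (1, 3, 5))
    return 0.299 * red + 0.587 * green + 0.114 * blue  # Rec. 601 luma weights


def too_similar_in_grey(colour_a: str, colour_b: str,
                        min_gap: float = 40.0) -> bool:
    """True if the two colours would print as nearly the same grey."""
    return abs(grey_level(colour_a) - grey_level(colour_b)) < min_gap


print(grey_level("#ff0000"), grey_level("#00ff00"), grey_level("#0000ff"))
print(too_similar_in_grey("#ff0000", "#0000ff"))
```

This only tells you about brightness; it says nothing about how the colours look to someone who is colour blind.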

It would be good if the colours you choose also work for people who are colour blind. You should at least make sure that your data doesn't blend together or disappear for people who are colour blind. The colours I chose in the last couple of graphs are OK, but an even better way is to avoid using green in your charts, since red/green is the most common form of colour blindness. A good site to check your graphics is: http://colorfilter.wickline.org/


Coming Next Time

The Basics, Part II


last revision 2/3/09 (added some more notes on colour)