Week 8

Privacy & Uncertainty



There is more and more data in digital form out there. Much of it is private, so there aren't many large public datasets to investigate. Most likely the data that you will be visualizing and analyzing will be from a specific project with specific rules on how it can be used.



Privacy


AOL search data release / fiasco regarding privacy concerns

20 million AOL queries (10 million unique) from 650,000 users, covering March 1st to May 31st, 2006
2GB of uncompressed tab-delimited files

G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search," The First International Conference on Scalable Information Systems, Hong Kong, June 2006.

The authors do give a warning about sexually explicit language in the queries, but not about the credit card numbers and social security numbers:

"CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA!  Please be aware that these queries are not filtered to remove any content.  Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material.  There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE.  This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms.  If you are offended by sexually explicit language you should not read through this data.  Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, unedited and randomly sampled, and that AOL is not the author of this data."


An article in the NY Times talks about using the data to figure out who user 4417749 is:
http://www.nytimes.com/2006/08/09/technology/09aol.html

http://en.wikipedia.org/wiki/AOL_search_data_scandal

The data is still available on the internet; once something appears on the internet, a copy of it is going to be stored somewhere. However, there are serious ethical issues about using it, and disagreements about the circumstances under which its use could be ethical.



"Privacy" refers to our right to control access to ourselves and to our personal information. It means that we have the right to control the degree, the timing, and the conditions for sharing our bodies, thoughts, and experiences with others.

 "Confidentiality" refers to agreements made about how information that has been provided will be protected. These agreements may include descriptions about whether identifiers will be retained, who will have access to identifiable data, and what methods will be used to safeguard data, such as encrypted storage or locked files.

There is currently no consensus in the research community about whether online communications in open forums constitute private or public behavior. E-mail, on the other hand, is not private.

Another major issue is that it is at best difficult, if not impossible, to verify someone's age on the internet. In many cases data cannot be collected from or about minors without their parents' consent, and asking/forcing people to click an "I am 18 or older" button is not enough of a guarantee.

In 2006 Netflix released data to see if other groups could improve its recommendation system, with a $1,000,000 prize for any group that could achieve a 10% improvement; the prize was awarded in 2009. I used the data for one of the class projects that year. For a subset of the overall Netflix database, the data included the movie or TV show title, the ID of the person who rated it, the rating they gave it, and the date of the rating.

Netflix was going to do a second contest. According to the New York Times:
"The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months."

Netflix then decided to cancel the second contest over privacy concerns.

Here is a paper looking at how much knowledge is needed to identify someone from the Netflix contest data (roughly 100,000,000 ratings from 500,000 users); Section 5 goes into the Netflix example in detail.

"An adversary may have auxiliary information about some subscriber’s movie preferences: the titles of a few of the movies that this subscriber watched, whether she liked them or not, maybe even approximate dates when she watched them. Anonymity of the Netflix dataset thus depends on the answer to the following question: How much does the adversary need to know about a Netflix subscriber in order to identify her record in the dataset, and thus learn her complete movie viewing history?"

"Very little auxiliary information is needed for de-anonymize an average subscriber record from the Netflix Prize dataset. With 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 14-day error, 99% of records be uniquely identified in the dataset. For 68%, two ratings and dates (with a 3-day error) are sufficient."

http://arxiv.org/PS_cache/cs/pdf/0610/0610105v2.pdf

More on the controversy here:
http://www.consumeraffairs.com/news04/2010/03/netflix_contest.html



Malte Spitz of the German Green party decided to publish six months of his own mobile phone data, collected from August 2009 to February 2010. However, just to get access to that information, he had to file a suit against Deutsche Telekom.

http://www.zeit.de/digital/datenschutz/2011-03/data-protection-malte-spitz

http://www.zeit.de/datenschutz/malte-spitz-vorratsdaten


London bike sharing data gets too personal:
http://qz.com/199209/londons-bike-share-program-unwittingly-revealed-its-cyclists-movements-for-the-world-to-see/




This is a major concern with medical data. Even after removing a patient's name and social security number, the remaining data may be enough to identify him or her if the patient has a rare condition, or even a less rare condition in a sparsely populated area.

This means that 'raw' data may be unavailable, or that the raw data may need to be anonymized - e.g. instead of knowing what town a person lives in, maybe a zip code, or a county, or a state is given. Maybe instead of a particular age (e.g. 42) an age range is given (40-45).
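As a rough illustration of that kind of generalization, here is a minimal Python sketch. The field names, bucket widths, and zip truncation are just assumptions for the example, not a standard recipe:

    # a minimal sketch of generalizing quasi-identifiers before sharing data
    # (field names and bucket widths are made up for illustration)

    def generalize_record(record):
        anonymized = dict(record)
        # drop the exact town, keeping only the coarser state field
        anonymized.pop("town", None)
        # keep only the first 3 digits of the zip code
        if "zip" in anonymized:
            anonymized["zip"] = anonymized["zip"][:3] + "XX"
        # replace an exact age with a 5-year range
        if "age" in anonymized:
            low = (anonymized["age"] // 5) * 5
            anonymized["age"] = "%d-%d" % (low, low + 4)
        return anonymized

    print(generalize_record({"town": "Oak Park", "state": "IL",
                             "zip": "60302", "age": 42}))
    # -> {'state': 'IL', 'zip': '603XX', 'age': '40-44'}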

General ways to safeguard data:

To de-identify medical data under the HIPAA "Safe Harbor" rule, the following 18 identifiers must be removed (a small scrubbing sketch follows the list):
  1. names
  2. all geographic subdivisions smaller than a state
  3. all elements of dates (except year) for all dates directly related to the individual; for individuals over 89 years old the year must also be removed
  4. telephone number
  5. fax number
  6. e-mail address
  7. social security number
  8. medical record number
  9. health plan beneficiary numbers
  10. account numbers
  11. certification / license number
  12. vehicle identifiers and serial numbers (e.g. license plate numbers)
  13. device identifiers and serial numbers (i.e. for anything placed in the body)
  14. URLs
  15. IP addresses
  16. biometric identifiers (finger print, voice print)
  17. full face photographic images or comparable images
  18. any other unique identifying number, characteristic, or code
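Here is the scrubbing sketch mentioned above: a rough idea of what stripping direct identifiers might look like in practice. The field names, the pattern list, and the free-text scrub are all assumptions for illustration, not a complete HIPAA implementation:

    import re

    # field names assumed for illustration; a real schema will differ
    DIRECT_IDENTIFIERS = {"name", "phone", "fax", "email", "ssn",
                          "medical_record_number", "account_number", "ip_address"}

    # very rough patterns for identifiers that can hide inside free text
    SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    PHONE_PATTERN = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

    def deidentify(record):
        # drop fields that are direct identifiers
        cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        # scrub obvious identifier patterns out of free-text notes
        if "notes" in cleaned:
            text = SSN_PATTERN.sub("[REDACTED]", cleaned["notes"])
            cleaned["notes"] = PHONE_PATTERN.sub("[REDACTED]", text)
        return cleaned

    print(deidentify({"name": "J. Doe", "ssn": "123-45-6789",
                      "diagnosis": "asthma",
                      "notes": "call 312-555-1212 to confirm"}))
    # -> {'diagnosis': 'asthma', 'notes': 'call [REDACTED] to confirm'}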


ENRON data

http://www.cs.cmu.edu/~enron/

The raw data has about 620,000 emails from 158 people; the cleaned-up data has about 200,000 messages.



Government data including a lot on money and elections

http://www.opensecrets.org/action/tools.php



US Census Data - http://www.census.gov/data/data-tools.html

IMDB data - http://www.imdb.com/interfaces#plain
Netflix Contest Data - no longer officially available online


DBpedia - http://en.wikipedia.org/wiki/DBpedia
Wikipedia Page Traffic statistics - http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596

Visible Male and Female (we'll talk about these in week 10) - http://en.wikipedia.org/wiki/Visible_Human_Project

Wireless trace data
Crawdad - Community Resource for Archiving Wireless Data At Dartmouth http://crawdad.cs.dartmouth.edu



Lots of webpages are front ends for databases that can be harvested with simple tools like wget, even if the data isn't explicitly made available for download.
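For example, if a site exposes records through a URL with a numeric parameter, a few lines of Python can walk through it. The URL, parameter name, and ID range below are entirely hypothetical, and you should check a site's terms of service and robots.txt before doing anything like this:

    import time
    import urllib.request

    # hypothetical database-backed page; the real URL and parameter will differ
    BASE_URL = "http://example.com/records?id=%d"

    for record_id in range(1, 6):
        with urllib.request.urlopen(BASE_URL % record_id) as response:
            html = response.read().decode("utf-8", errors="replace")
        with open("record_%d.html" % record_id, "w") as f:
            f.write(html)
        time.sleep(1)   # be polite: don't hammer the server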

Right now it's difficult to gain access to data and integrate it through common formats. It used to be really hard to do that with plain textual information ... then the web took the Memex ideas from the 1940s and the hypertext ideas from the 1960s and '70s and made it work ... effortlessly.

How can we link data as easily? Here is a TED talk on that topic
http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html






Uncertainty visualization is an important area that has been researched in particular domains, but not in general and not thoroughly.

There are many different ways, coming from different fields, to try to formalize uncertainty.

One way comes from a Microsoft Research paper:
http://research.microsoft.com/pubs/64267/avi2008-uncertainty.pdf



level 1: Measurement Precision
       - imprecise measurements - might have explicit range of imprecision

level 2: Completeness
       - a sampling strategy is often used since it's impossible to collect / simulate / compute / visualize 'all the data'
       - missing values
       - aggregation / summarization - detailed data is replaced by higher level concepts

       known knowns - information you know
       known unknowns - information you know exists but you don't have
       unknown unknowns - information you don't even know that you are missing - scary ones

level 3: Inferences - adding meaning to the data and using it to make decisions
       - modelling
       - prediction
       - extrapolation

disagreement
    - measurement - multiple measurements of the same value do not agree
    - completeness - overlapping but not identical datasets
    - inference - multiple models generate different results from the same input data or
        multiple people come to different conclusions from the same data

credibility
       - measuring instrument or source of data may lack credibility based on past performance
       - different investigators may rate different sources more or less credible
       - different investigators may rate other investigators as more or less credible


Here are some real world examples of uncertainty visualization.

Here is a rather nice visualization of the possible path of a hurricane


and another from http://www.click2houston.com/hurricanetracker/index.html



Hurricane Irene

Here the previous path is known with a high degree of certainty, and based on that path and many other variables the potential future paths are shown.
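A hedged sketch of how one might draw that kind of "cone of uncertainty" with matplotlib, using a made-up track and a forecast spread that simply grows with lead time (the real cones are built from historical forecast-error statistics):

    import numpy as np
    import matplotlib.pyplot as plt

    # made-up storm track: longitude vs latitude
    lon = np.array([-80, -79, -78, -77, -76, -75, -74, -73])
    lat = np.array([ 25,  26,  27,  28,  29,  30,  32,  34])

    past = 4                                        # first 4 points are observed
    spread = 0.5 * np.arange(len(lon) - past + 1)   # uncertainty grows with lead time

    # observed track: drawn as a solid, certain line
    plt.plot(lon[:past], lat[:past], "k-o", label="observed track")

    # forecast track: center line plus a widening band of possible positions
    plt.plot(lon[past-1:], lat[past-1:], "r--o", label="forecast center")
    plt.fill_between(lon[past-1:], lat[past-1:] - spread, lat[past-1:] + spread,
                     color="red", alpha=0.2, label="possible positions")

    plt.xlabel("longitude")
    plt.ylabel("latitude")
    plt.legend()
    plt.show()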



Here is a visualization of travel times from Circle out I-290 to Harlem for Fridays over the past 5 years, and for this particular Friday, which did not follow the typical pattern. The average travel time and the 68 percent region show me the typical pattern, and the line for today shows me how well it has been matching that pattern so far.


For comparison here is what a typical Sunday looks like.


Here a large amount of collected data is used to show the most likely travel times for a given day, which allows the user to make some predictions. Having the known values from earlier in the current day tells the user how the current day compares to the typical day and should allow the user to make even better predictions.
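A minimal sketch of building that kind of chart, assuming a matrix of historical Friday travel times (rows = past Fridays, columns = time of day) and treating the mean plus or minus one standard deviation as the 68 percent region. The data here is faked for the example:

    import numpy as np
    import matplotlib.pyplot as plt

    # fake data: 250 past Fridays x 48 half-hour slots of travel time (minutes)
    rng = np.random.default_rng(0)
    hours = np.linspace(0, 24, 48)
    typical = 20 + 15 * np.exp(-(hours - 17) ** 2 / 4)      # evening rush peak
    fridays = typical + rng.normal(0, 4, size=(250, 48))

    mean = fridays.mean(axis=0)
    std = fridays.std(axis=0)

    # the 68 percent region: mean +/- one standard deviation
    plt.fill_between(hours, mean - std, mean + std, alpha=0.3, label="68% region")
    plt.plot(hours, mean, label="average Friday")

    # today's (partial) measurements, only known up to the current time
    today = typical[:30] + rng.normal(0, 6, size=30)
    plt.plot(hours[:30], today, "r", label="today so far")

    plt.xlabel("hour of day")
    plt.ylabel("travel time (minutes)")
    plt.legend()
    plt.show()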

Similar charts can be found on the EPA site we looked at a few weeks ago - http://www.epa.gov/air/airtrends/2008/ and in particular http://www.epa.gov/air/airtrends/2008/report/ParticlePollution.pdf and http://www.epa.gov/air/airtrends/2008/report/NO2COSO2.pdf

Page 9 of this 2001 paper http://www.spatial.maine.edu/~worboys/SIE565/papers/pang%20viz%20uncert.pdf from Alex Pang at UC Santa Cruz has a nice example of using uncertainty values to modify a volumetric transfer function. We looked at transfer functions during the lecture on medical visualization where colour and transparency were used to better understand the features in a volumetric dataset. Here colour and transparency are used to show the data values and the uncertainty about those data values.
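As a rough sketch of the idea (not the paper's actual transfer function), one might map the data value to colour and the uncertainty to opacity, so that uncertain regions fade out:

    import numpy as np

    def transfer_function(value, uncertainty):
        # Map a normalized data value and its uncertainty (both in [0, 1])
        # to an RGBA colour: value drives a blue-to-red ramp, uncertainty
        # drives transparency (uncertain samples become nearly invisible).
        value = np.clip(value, 0.0, 1.0)
        uncertainty = np.clip(uncertainty, 0.0, 1.0)
        red = value
        green = np.zeros_like(value)
        blue = 1.0 - value
        alpha = 1.0 - uncertainty
        return np.stack([red, green, blue, alpha], axis=-1)

    # a certain high value renders as opaque red, an uncertain one almost vanishes
    print(transfer_function(np.array([0.9, 0.9]), np.array([0.1, 0.8])))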

This paper from 2003 http://www.cs.unh.edu/~sdb/rhodes/eg03.pdf looks at the rendering of isosurfaces that show uncertainty using hue and texture. They use hue to directly show the magnitude of the uncertainty on an otherwise colorless model, or, if colour is already being used, they overlay a texture pattern instead.

We can go back to the visual variables from the Thematic Cartography and Geovisualization book that we talked about in week 3 of the course (a small sketch combining a couple of them follows the list below).

size - The width of a line could be used to show uncertainty in a path, and the size of a dot could be used to show uncertainty in a position. There is a danger here that the user might interpret the thicker line or the bigger dot as indicating 'more' rather than more uncertainty, so it may be a good idea to combine this with saturation, lightness, or transparency so the point or line is thicker, but also less saturated, lighter, or more transparent.

saturation - Saturation of a colour can be used to show uncertainty in a point, path, or area. A fully saturated hue could show certainty while a less saturated hue shows uncertainty, but the use of more than three levels of saturation is discouraged. Lightness could be used for similar purposes but, like size, one must be careful that the user doesn't mistake darker for 'more'.

transparency - as with saturation, a shape or hue could be more opaque to show certainty and more transparent to show uncertainty

crispness - for boundaries, a crisp edge suggests a known / reliable boundary while a fuzzy edge suggests uncertainty. Similarly, a high-resolution edge suggests reliability while a low-resolution edge suggests uncertainty.
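Here is the sketch mentioned above: combining size and transparency for point positions, with made-up coordinates and uncertainty values, so that uncertain points are drawn bigger but fainter and big does not read as 'more':

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 30)
    y = rng.uniform(0, 10, 30)
    uncertainty = rng.uniform(0, 1, 30)       # 0 = certain, 1 = very uncertain

    # size grows with uncertainty (a bigger blob of possible positions) ...
    sizes = 30 + 300 * uncertainty
    # ... while opacity shrinks, so the big blobs also look washed out
    colors = np.zeros((30, 4))
    colors[:, 2] = 1.0                        # blue
    colors[:, 3] = 1.0 - 0.8 * uncertainty    # per-point alpha

    plt.scatter(x, y, s=sizes, c=colors, edgecolors="none")
    plt.title("position uncertainty: larger + more transparent = less certain")
    plt.show()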


Another example comes from chapter 23 of the Thematic Cartography and Geovisualization book that we frequently turn to. In this case the visualizations are from a 2001 study.

People using the Legend did the worst
Focus outperformed value as the best method



Here is another nice reference page from Information Graphics: A Comprehensive Illustrated Reference that deals with numeric data and the common box and whisker (box plot, 5 number summary) format.


Wikipedia has a similar description: box and whisker - http://en.wikipedia.org/wiki/Box_and_whisker

and there is a nice simple example from Statistics Canada here - http://www.statcan.gc.ca/edu/power-pouvoir/ch12/5214889-eng.htm

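As a small illustration, here is a minimal matplotlib sketch that draws box and whisker plots from some made-up travel-time samples:

    import numpy as np
    import matplotlib.pyplot as plt

    # made-up travel-time samples for three different days
    rng = np.random.default_rng(2)
    friday = rng.normal(35, 8, 200)
    saturday = rng.normal(25, 5, 200)
    sunday = rng.normal(22, 4, 200)

    # each box shows the median, the quartiles, and whiskers,
    # with outliers drawn as individual points
    plt.boxplot([friday, saturday, sunday])
    plt.xticks([1, 2, 3], ["Fri", "Sat", "Sun"])
    plt.ylabel("travel time (minutes)")
    plt.show()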



Coming Next Time

Social Network Visualization


last revision 10/14/2014