Project due at
This will be the fourth group project. This project is going to make use of a larger base dataset (at over 50 gigabytes) and supplement it with a dynamic real-time data stream. The data will consist of tweets. This project will focus more on the data processing side - how do you find meaning in the wide variety of tweets, and condense a large source dataset into something that can be used in an interactive visual analytics tool.
This project will be more open-ended than the previous three, giving each group more options in how you want to try to find meaning in the data.
The twitter7 dataset at Stanford contains 475 million tweets from 17 million users from June through December 2009. It's broken up into seven files where each file is a few gigabytes of compressed text and up to 8 gigabytes when uncompressed. At 26 gigabytes compressed, the complete dataset starts to give you an idea of the size of datasets that are common today.
The files are composed of lines like the following:

think of me, if ever. Today is not a happy day.

Each entry gives the date and time of the tweet, the sender, and the tweet itself (the tweets are mostly in English, but also appear in a variety of other languages and character sets). As you can see there are some easy ways to compress the data down, even in textual form.
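The first processing step is just getting these records into a usable shape. As a sketch (assuming the twitter7 convention of three-line records tagged T for the timestamp, U for the sender's URL, and W for the tweet text), a minimal parser might look like this:

```python
from datetime import datetime

def parse_twitter7(lines):
    """Parse twitter7-style records: T <timestamp>, U <user url>, W <text>."""
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("T"):
            record["time"] = datetime.strptime(line[2:], "%Y-%m-%d %H:%M:%S")
        elif line.startswith("U"):
            # keep just the screen name from the profile URL
            record["user"] = line[2:].rsplit("/", 1)[-1]
        elif line.startswith("W"):
            record["text"] = line[2:]
            yield record  # W is the last field of each record
            record = {}

# A tiny stand-in for one of the real files (user name is made up).
sample = """T\t2009-06-11 00:00:03
U\thttp://twitter.com/sampleuser
W\tthink of me, if ever. Today is not a happy day.
""".splitlines()

tweets = list(parse_twitter7(sample))
```

Generating records lazily like this matters at this scale: you can stream an 8 gigabyte file through the parser without ever holding more than one record in memory.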
With lots of people talking about lots of different things using twitter, there is a lot of background noise and the popularity of different topics can change rapidly. You should use this seven months of data to better understand the signal to noise ratio in twitter - what were the most common topics, and how did they relate to events in the news? How did the number of tweets vary with time of day or day of the year? Can you categorize users by how often they tweet? Can you link users and #hashtags into networks?
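Several of these questions reduce to simple counting passes over the parsed data. As a sketch (assuming tweets have already been parsed into (time, user, text) tuples, which is not a format the assignment prescribes), hourly volume and hashtag frequency can be tallied in a single pass:

```python
import re
from collections import Counter
from datetime import datetime

HASHTAG = re.compile(r"#\w+")

def tally(tweets):
    """One pass over (time, user, text) tuples: hourly volume and hashtag counts."""
    by_hour = Counter()
    hashtags = Counter()
    for when, user, text in tweets:
        by_hour[when.hour] += 1
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
    return by_hour, hashtags

# Made-up sample tweets standing in for the parsed dataset.
tweets = [
    (datetime(2009, 6, 11, 0, 0, 3), "alice", "good morning #twitter"),
    (datetime(2009, 6, 11, 0, 5, 0), "bob", "so tired #Twitter #fail"),
    (datetime(2009, 6, 11, 13, 0, 0), "carol", "lunch time"),
]
by_hour, hashtags = tally(tweets)
```

The same pattern extends to the network questions: counting (user, hashtag) pairs instead of single terms gives you the edge weights for a user-to-hashtag graph.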
Here is one starting point:
The first part of the project is to organize that data into a more usable form and give the user a visual overview of this seven months of tweets from different perspectives, and let the user interact to investigate the data in further detail.
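Since the raw files are far too large to load into an interactive tool, one workable approach (a sketch of the idea, not a required design - the file names and day-level granularity here are illustrative) is to stream each compressed file once and write out a small summary that the visualization loads instead:

```python
import gzip
import json
from collections import Counter

def summarize(path):
    """Stream a gzipped twitter7 file once, counting tweets per day.

    Assumes the T/U/W record format; only the T (timestamp) lines are needed.
    """
    per_day = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("T"):
                per_day[line[2:12]] += 1  # "2009-06-11" from "T\t2009-06-11 00:00:03"
    return per_day

# Tiny demo file standing in for one of the multi-gigabyte originals.
with gzip.open("sample.gz", "wt") as f:
    f.write("T\t2009-06-11 00:00:03\nU\thttp://twitter.com/a\nW\thi\n"
            "T\t2009-06-12 01:00:00\nU\thttp://twitter.com/b\nW\tyo\n")

per_day = summarize("sample.gz")
with open("per_day.json", "w") as out:
    json.dump(per_day, out)  # the visual tool loads this summary, not the raw data
```

Doing this preprocessing once, offline, is what makes the interactive part of the tool feel responsive.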
Given your experience with that dataset, the second part of the project will investigate the stream of tweets flying by. The static data from 2009 only gives the time, user, and content of the tweet, but you can filter the dynamic stream in more ways. One resource available for processing is the live tweet stream, which lets you receive tweets based on location, or keywords, or sender. Given the large number of tweets being sent, you will not be able to deal with the entire data set but will only be able to look at a small sample of the stream. Filtering by keywords or location allows you to focus on the tweets you are more likely to be interested in.

In part two you are going to look at what people are talking about in different locations. We will look at tweets from New York, Chicago, and Los Angeles, or if you want a wider spread you could try London, Toronto, and Chicago. Sticking with English-speaking cities makes it easier to compare topics across cities. As with part one you should also investigate changes related to time of day and day of the week, and see how the topics in the dynamic stream relate to events in the news. In part 2 you should collect data on tweets from each of the three locations for at least a week. This way you will have the general background knowledge from last year, current knowledge from the previous week, and be able to use that to help classify and visualize the data currently flowing by.
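However you connect to the live stream, the filtering logic itself is simple. This sketch stands in for the real stream with a plain list of dicts (the field names and the bounding-box coordinates are illustrative assumptions, not the actual stream format):

```python
def matches(tweet, keywords=(), boxes=()):
    """Keep a streamed tweet if it mentions a keyword or falls in a bounding box.

    `tweet` is assumed to be a dict with "text" and optional "coords" (lon, lat);
    a real stream delivers richer records, but the filtering idea is the same.
    """
    text = tweet.get("text", "").lower()
    if any(k in text for k in keywords):
        return True
    coords = tweet.get("coords")
    if coords:
        lon, lat = coords
        for west, south, east, north in boxes:
            if west <= lon <= east and south <= lat <= north:
                return True
    return False

# Rough bounding box around New York City (west, south, east, north).
NYC = (-74.3, 40.5, -73.7, 41.0)

# Made-up sample of streamed tweets.
stream = [
    {"text": "Stuck in traffic", "coords": (-73.98, 40.75)},
    {"text": "World Series tonight!", "coords": None},
    {"text": "Nice day out here", "coords": (-87.6, 41.9)},
]
kept = [t for t in stream if matches(t, keywords=("world series",), boxes=(NYC,))]
```

Running one such filter per city for a week, and writing each day's matches to disk as they arrive, gives you the three per-city collections part two asks for.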
The visualizations and interfaces for Part 1 and 2 should be well integrated.
trendingtopics is a good place to look for topics which are
gaining in popularity.
http://tweetstats.com/ is also a nice place to
look for some initial data and there is a list of other possible
starting points at
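One simple way to flag a topic as gaining in popularity (a sketch of the idea, not what those sites actually compute - the thresholds here are arbitrary) is to compare a term's share of a recent window against its share of a longer baseline:

```python
from collections import Counter

def trending(recent, baseline, min_count=5, ratio=3.0):
    """Terms whose share of `recent` is at least `ratio` times their baseline share."""
    recent_total = sum(recent.values()) or 1
    base_total = sum(baseline.values()) or 1
    hot = []
    for term, count in recent.items():
        if count < min_count:
            continue  # ignore terms too rare to trust
        base_share = baseline.get(term, 0.5) / base_total  # smooth unseen terms
        if (count / recent_total) / base_share >= ratio:
            hot.append(term)
    return sorted(hot)

# Made-up term counts: a week-long baseline vs. the last hour.
baseline = Counter({"weather": 100, "traffic": 80, "music": 60})
recent = Counter({"earthquake": 30, "weather": 12, "traffic": 9})

hot = trending(recent, baseline)
```

The smoothing value for unseen terms matters: without it, any brand-new term would divide by zero, and with too small a value everything new looks like a trend.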
you need to:

- Visualize and allow the user to interact with and investigate the static dataset from Part 1

you need to add:

- Enhance the interface from Part I to visualize and allow the user to interact with and investigate the dynamic dataset from Part 2, compare the tweets from the three chosen cities, and relate them to the knowledge you gained from the static dataset.

you need to add:

- something that impresses me.
- use your tool to do some analysis of the tweets. What do you find?
As with projects 2 and 3, each Friday each member of the team should post a description of what they did on the project to the project web page. This time your group should also produce a 2 minute youtube video showing off the capabilities of your tool. The easiest way to do this is to use a screen-capture tool while interacting with the application, though you will most likely find it's useful to do some editing afterwards to tighten the video up. It's also a good idea to have a video like this available as a backup during your presentation.

Your group should also create a web page describing your work on the application. This time instead of embedding the processing applet you should have a link so people can download your application (and the necessary data files) to run it. Please make sure that your application is Mac / Windows / Linux compatible. If you can get your app to run online through a browser then do include that version as well. The web page should describe the contribution of each team member (ie who worked on which interface elements, who handled converting the data into a more usable form, etc.) The image of your visualization for the web page should be named p4.<someone_in_your_groups_last_name>.jpg.
Each of you should also send me a private email ranking your coworkers on the project on a scale from 1 (low) to 5 (high) in terms of how good a coworker they were on the project. If you never want to work with them again, give them a 1. If this person would be a first choice for a partner on a future project then give them a 5. If they did what was expected but nothing particularly good or bad then give them a 3. By default your score should be 3 unless you have a particular reason to increase or decrease the number. Please limit your responses to 1, 2, 3, 4, 5 and no 1/3ds or .5s please. I will average out all these scores for projects 2 through 4 and keep track of them.

Each group will also present their application in class and describe its features to the rest of the class. This allows everyone to see a variety of solutions to the same problem, and a variety of implementations.
Since we have six groups to go through in 75 minutes, each group will present for 8 minutes plus 4 minutes for questions from the other groups. The presentation time should be evenly split among all members of the group. Each group sitting in the audience will be allowed one question for the group currently presenting, so come up with a good one. When answering the questions the group presenting should be concise so that all of the other groups have a chance to ask questions.
last revision 11/3/10
- corrected the file sizes