Project 4

Three Little Birds

Project due at 11:59pm Monday 11/22/10 Chicago time

This will be the third group project. This project makes use of a larger base dataset (over 50 gigabytes) and supplements it with a dynamic real-time data stream. The data consists of tweets. This project focuses more on the data processing side: how do you find meaning in the wide variety of tweets, and condense a large source dataset into something that can be used in an interactive visual analytics tool?

This project will be more open-ended than the previous three, giving each group more options in how you want to try to find meaning in this data.

The twitter7 dataset at Stanford contains 475 million tweets from 17 million users, covering June through December 2009. It's broken up into seven files, where each file is 1-3 gigabytes of compressed text and up to 8 gigabytes when uncompressed. At 26 gigabytes compressed, the complete dataset starts to give you an idea of the kind of datasets that are common today.

The files are composed of three-line records like the following:
T       2009-12-01 00:00:31
U       http://twitter.com/username
W       I wonder how often you think of me, if ever. Today is not a happy day.

This gives you the date and time of the tweet, the sender, and the tweet itself (tweets are mostly in English, but also appear in a variety of other languages and character sets). As you can see, there are some easy ways to compress the data down, even in textual form.
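As a first step you will want to turn those raw records into something structured. Here is a minimal sketch in Python; it assumes each record is a T line followed by a U line (giving the sender's URL) and a W line, with the tag separated from the value by whitespace - adjust to match the actual files.

```python
from datetime import datetime

def parse_twitter7(lines):
    """Yield (time, user, text) tuples from twitter7-style T/U/W records."""
    record = {}
    for line in lines:
        parts = line.split(None, 1)  # split tag from value on any whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        tag, rest = parts[0], parts[1].strip()
        if tag == "T":
            record["time"] = datetime.strptime(rest, "%Y-%m-%d %H:%M:%S")
        elif tag == "U":
            # keep just the username from a URL like http://twitter.com/username
            record["user"] = rest.rstrip("/").rsplit("/", 1)[-1]
        elif tag == "W":
            record["text"] = rest
            if "time" in record and "user" in record:
                yield record["time"], record["user"], record["text"]
            record = {}
```

Since the files arrive compressed, you can feed the parser lines straight from `gzip.open(path, "rt")` without ever uncompressing to disk.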

There are lots of people talking about lots of different things on Twitter, so there is a lot of background noise, and the popularity of different topics can change rapidly. You should use these seven months of data to better understand the signal-to-noise ratio in Twitter: what were the most common topics, and how did they relate to events in the news? How did the number of tweets vary with time of day or day of the year? Can you categorize users by how often they tweet? Can you link @usernames and #hashtags into networks?

Here is one starting point:

The first part of the project is to organize that data into a browsable form, give the user a visual overview of these seven months of data from different perspectives, and let the user interact to investigate the data in further detail.

Given your experience with that dataset, the second part of the project will investigate the stream of tweets flying by. The static dataset from 2009 only gives the time, user, and content of each tweet, but we can filter the dynamic stream in more ways.

One of the libraries available for processing the stream is tweetstream, which lets you receive tweets based on location, keywords, or sender. Given the large number of tweets being sent, you won't be able to deal with the entire data set; you will only be able to look at a small sample of the stream. Filtering by keywords or locations lets you focus on the tweets you are most likely to be interested in. In part two you are going to look at what people are talking about in different locations. We will look at tweets from New York, Chicago, and Los Angeles, or if you want a wider variety you could try London, Toronto, and Chicago. Sticking with predominantly English-speaking cities makes it easier to compare topics across cities.
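One simple way to bucket streamed tweets into the three cities is a bounding-box test on the tweet's coordinates. The boxes below are rough illustrations, not authoritative city limits, and the sketch assumes each tweet arrives with a (longitude, latitude) pair:

```python
# Rough (west, south, east, north) bounding boxes in degrees.
# These coordinates are illustrative only - refine them for your project.
CITY_BOXES = {
    "Chicago":     (-88.0, 41.6, -87.5, 42.1),
    "New York":    (-74.3, 40.5, -73.7, 40.9),
    "Los Angeles": (-118.7, 33.7, -118.1, 34.3),
}

def classify_city(lon, lat):
    """Return the city whose bounding box contains (lon, lat), or None."""
    for city, (west, south, east, north) in CITY_BOXES.items():
        if west <= lon <= east and south <= lat <= north:
            return city
    return None
```

The same (west, south, east, north) tuples can double as the location filter you hand to the streaming library, so the collector and the classifier agree on what counts as each city.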

As with part one, you should investigate changes related to the time of day and day of the week, and see how the topics in the dynamic stream relate to events in the news. In part 2 you should collect data on tweets coming from each of the three locations for at least a week. This way you have the general background knowledge from last year's static dataset, current knowledge from the previous week, and can use both to help classify and compare the data currently flowing by.

The visualizations and interfaces for Parts 1 and 2 should be well integrated.

trendingtopics is a good place to look for topics which are currently gaining in popularity, and also a nice place to look for some initial data; a list of other possible starting points is available as well.

For a C you need to

For a B you need to add

For an A you need to add

As with projects 2 and 3, each Friday each member of the team should post a description of what they did on the project to the project web site.

This time your group should also produce a 2-minute YouTube video showing off the capabilities of your tool. The easiest way to do this is to use a screen-capture tool while interacting with the application, though you will most likely find it useful to do some editing afterwards to tighten the video up. It's also a good idea to have a video like this available as a backup during your presentation.

As with Projects 1-3, you should create a web page describing your work on the application. This time, instead of embedding the Processing application, you should provide a link so people can download your application (and the necessary data files) and run it. Please make sure that your application is Mac / Windows / Linux compatible. If you can get your app to run online through a browser, then include that version as well. The web page should describe the contribution of each team member (i.e. who worked on which interface elements, who worked on converting the data into a more usable form, etc.)

Please send me a 1024 x 768 jpg image of your visualization for the web.
This should be named p4.<someone_in_your_groups_last_name>.jpg.

Each person in the group should also send me a private email ranking your coworkers on the project on a scale from 1 (low) to 5 (high) in terms of how good a coworker they were. If you never want to work with someone again, give them a 1. If the person would be a first choice for a partner on a future project, give them a 5. If they did what was expected but nothing particularly good or bad, give them a 3. By default your score should be 3 unless you have a particular reason to raise or lower it. Please confine your responses to whole numbers 1 through 5 - no thirds or .5s. I will average these scores for projects 2 through 4 and keep them in mind when assigning final grades.

Each group will present their work and describe its features to the rest of the class. This allows everyone to see a variety of solutions to the problem, and a variety of implementations.

Since we have six groups to go through in 75 minutes, each group will talk for 8 minutes, plus 4 minutes for questions from the other groups. The presentation time should be evenly split among all members of the group. Each group sitting in the audience will be allowed one question for the group currently presenting, so come up with a good one. When answering questions, the presenting group should be concise so all of the other groups have a chance to ask theirs.

last revision 11/3/10 - corrected the file sizes