Analyzing URL Chatter on Twitter

From Knoesis wiki
Revision as of 16:51, 26 May 2010 by Pablo (Talk | contribs)


This is a class project by User:Pavan, supervised by User:Pablo. It has been shown that social networks play a crucial role in the spread of information.<ref name=contagion>Kristina Lerman, Rumi Ghosh. Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks. ICWSM'10.</ref> In this project we explore mentions of Web documents on Twitter to analyze the thematic, spatial, and temporal aspects of content recommendation.

Project Description

This project analyzes the URLs that appear in tweets in relation to the tweets' themes. The data (tweets) crawled for the Twitris project is used (see #Data_Analysed).

Introduction (Objectives and Motivation)

Twitter is a free social networking and microblogging service. It enables users to post their thoughts on an event, what they see, what they do, etc. in messages of up to 140 characters, termed tweets. The Twitris project, developed by the Knoesis Center, is a Semantic Web application (using tweets) that facilitates browsing for news and information, using social perceptions as the fulcrum. The Twitris project does:

  • Crawling of tweets
  • Spatio Temporal Thematic Analysis
  • Browsing using social signals as the fulcrum.

This project is an extension of the Twitris project that uses the tweets containing a URL for analysis. The data crawled for the Twitris project is reused, and some of the functionality is adapted from the Twitris system. The objective of this project is twofold: first, to analyze the association between Web documents and themes extracted from tweets; second, to evaluate this association over the time dimension. In other words, we are interested in what is being said about a document. This requires that we first extract document mentions (URLs) from tweets. A URL is an address for a document or resource on the World Wide Web. A document/resource is published, read, and searched for, which gives three perspectives for analyzing a URL: the publisher perspective, the user perspective, and the search engine perspective. The publisher gets to know where and how the published document is being viewed, the user can choose the URLs that are most talked about regarding a theme of interest, and the search engine can use the analysis for better search. Since microblogging restricts input to 140 characters, the use of URL shortening services is very common; a shortened URL is typically between 25 and 30 characters.


During the course of the project many challenges were faced; some were dealt with and some are left as future work. The first and foremost challenge was with the regex used to extract the URLs, which still has much scope for improvement. URLs in tweets are typed in many ways and often informally. For example, in " a wiki page" the URL is trailed by words without a space in between. There are also URLs where the protocol is not mentioned, so the protocol or domain has to be completed. Filtering valid URLs is a further challenge yet to be solved. The extraction code was implemented on Hadoop in order to take advantage of parallelism.
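The regex-based extraction described above can be sketched as follows. This is a minimal illustration, not the project's actual pattern (which is not shown on this page); the class and pattern here are assumptions that merely demonstrate the approach, including completing the protocol when a user typed a bare "www." URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of regex-based URL extraction from tweet text.
// The pattern is illustrative only: it accepts URLs with an
// http/https protocol or a bare "www." prefix, then consumes
// characters up to whitespace or common delimiters.
public class UrlExtractorSketch {
    private static final Pattern URL_PATTERN = Pattern.compile(
        "(?:https?://|www\\.)[\\w\\-]+(?:\\.[\\w\\-]+)+[^\\s\"'<>]*",
        Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String tweet) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(tweet);
        while (m.find()) {
            String url = m.group();
            // Complete the protocol when the user omitted it.
            if (url.toLowerCase().startsWith("www.")) {
                url = "http://" + url;
            }
            urls.add(url);
        }
        return urls;
    }
}
```

As noted above, no single regex captures every informal way users type URLs, so a pattern like this trades some recall for simplicity.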

There are many ways to refer to the same document on the Web: through URL redirection, short URLs, etc., users may use different URLs that point to the same document. Resolving all these "aliases" is important if we want to know which document the user is referring to in a tweet. Two approaches come to mind:

  • 1. For every URL extracted, we make an HTTP connection and get the landing document.
  • 2. Since HTTP connections take a long time, we only resolve short URLs. The first step then is to recognize URLs as long or short. There are many services to shorten a URL but no global service to do the converse, so we kept a list of known "shorteners" and only resolved those.
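The second approach can be sketched as below. The shortener list and class name are illustrative (the project's actual list is not reproduced here); the resolution step requests the short URL without following the redirect, so the Location header of the response reveals the original URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Set;

// Sketch of approach 2: only URLs whose host is a known shortener
// are resolved; all others are assumed to be long (original) URLs.
public class UrlResolverSketch {
    // Illustrative subset of known shortening services.
    private static final Set<String> SHORTENERS =
        Set.of("bit.ly", "tinyurl.com", "is.gd", "ow.ly", "goo.gl");

    public static boolean isShort(String url) {
        try {
            return SHORTENERS.contains(new URL(url).getHost().toLowerCase());
        } catch (MalformedURLException e) {
            return false; // unparsable URLs are not resolved
        }
    }

    public static String resolve(String url) throws IOException {
        if (!isShort(url)) {
            return url; // long URLs are kept as-is
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false); // we want the redirect itself
        conn.setRequestMethod("HEAD");          // no need to fetch the body
        String location = conn.getHeaderField("Location");
        conn.disconnect();
        return location != null ? location : url;
    }
}
```

Using a HEAD request keeps each resolution cheap, which matters given the hundreds of thousands of short URLs in the dataset.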

Software Architecture

Code is available at svn/classprojects/CitizenSensor/URLExtractorWeb/.

The front end is done using HTML, CSS, XML and JavaScript. The [Exhibit Timeline] JavaScript component is used to show our analysis; it takes an XML file as input and provides a graphical, interactive object on the Web page.

The backend is composed of a data layer, a processing layer and an API layer. See also: Lotter

  • Data layer: connects and queries the DB.
  • Processing layer: Contains classes that extract information from the raw data so that we can then store the processed information in the DB.
    • URLExtractor: The Twitris data was used for extracting the URLs. Java regular expressions (regex) are used for extraction, and this still has scope for improvement: since users type URLs in various forms, the regex won't achieve 100 percent accuracy.
    • URLResolver: Takes the extracted URLs and determines whether each URL is short or long. If short, it sends an HTTP request to the shortening service's domain, which responds with a redirect; the Location header of the response gives the original URL, which is then stored in the DB.
    • EntityExtractor: Uses a dictionary to extract named-entity mentions.
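Dictionary-based extraction, as done by EntityExtractor, can be sketched as below. The dictionary here is a toy stand-in (the project uses a Freebase-derived dictionary of over 8 million entities), and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Sketch of dictionary-based named-entity extraction: every
// dictionary phrase that occurs in the lowercased tweet text is
// reported as an entity mention.
public class EntityExtractorSketch {
    public static List<String> extract(String tweet, Set<String> dictionary) {
        String text = tweet.toLowerCase();
        List<String> mentions = new ArrayList<>();
        for (String entity : dictionary) {
            // Word-boundary check avoids matching inside other words,
            // e.g. "iran" inside "tyranny".
            if (text.matches(".*\\b" + Pattern.quote(entity) + "\\b.*")) {
                mentions.add(entity);
            }
        }
        return mentions;
    }
}
```

Scanning the full dictionary per tweet is shown here for clarity; at the scale of millions of tweets a trie or hash-based lookup over tweet n-grams would be the practical choice.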

  • API layer: Servlets that form a programmatic interface to the front end. It takes requests from the front end, calls the DB connector for the data, and responds with XML. Four servlet classes are used for different purposes.
    • Tagger: This service takes in the text of a tweet and calls the appropriate sequence of classes from the processing layer to extract URLs and entities. It returns an RDFa-annotated HTML snippet.
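To illustrate what Tagger's output could look like, the sketch below wraps an entity mention in an RDFa-annotated span. The vocabulary, attribute choices, and entity URI are assumptions for illustration, not the project's actual markup.

```java
// Illustrative sketch of an RDFa-annotated HTML snippet like the
// one Tagger returns: the entity mention is wrapped in a span whose
// RDFa attributes link it to an entity URI. The "about"/"property"
// choices here are assumptions, not the project's actual vocabulary.
public class TaggerSketch {
    public static String annotate(String tweet, String entity, String entityUri) {
        String html = tweet.replace(entity,
            "<span about=\"" + entityUri + "\" property=\"rdfs:label\">"
            + entity + "</span>");
        return "<p>" + html + "</p>";
    }
}
```

Because the annotation is plain RDFa, any consumer that parses RDFa can recover the tweet-to-entity links from the returned snippet.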

Data Analysed

You can find a description of how we performed the data analysis on our internal project report page.


  • Total tweets used: 3,188,262
  • URLs extracted: 1,477,356
  • Unique short URLs: 827,750
  • Unique long (original) URLs: 649,165

Top URLs

Top 30 URLs


  • Freebase entities: 8,712,104
  • Entities extracted from 3,188,262 tweets: 25,393,165
  • Unique entities extracted: 244,383 (these should be filtered, since during extraction we do not exclude URL fragments such as "htt", "bit", and "ly", which occur most frequently)
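The filtering step suggested above can be sketched with a simple stoplist of URL fragments. The stoplist contents and class name are illustrative; in practice the list could be derived from the protocol prefixes and shortener domains observed in the extracted URLs.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of post-extraction filtering: drop extracted "entities"
// that are really fragments of URLs rather than real mentions.
public class EntityFilterSketch {
    // Illustrative stoplist of URL fragments seen in the data.
    private static final Set<String> URL_FRAGMENTS =
        Set.of("htt", "http", "www", "bit", "ly", "com", "tinyurl");

    public static List<String> filter(List<String> entities) {
        return entities.stream()
            .filter(e -> !URL_FRAGMENTS.contains(e.toLowerCase()))
            .collect(Collectors.toList());
    }
}
```

An alternative would be to strip URLs from the tweet text before running entity extraction, which removes the fragments at the source instead of filtering them afterwards.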

Top Entities

  • rt 574012
  • obama 371041
  • health 328798
  • iran 262052
  • care 257012
  • healthcare 239117
  • health care 224927
  • halloween 221393
  • flu 199835
  • michael 194474

Top 30 Entities


  • Timeline Perspective: The main interface of Twitris was able to show data from a fixed time point spread over geography (Google Maps). We added a timeline perspective to Twitris so that data can also be shown spread over time. timeline demo
  • Twitris on the LOD: We extracted mentions of entities that are already in the LOD cloud and added links from the tweets to those entities, thereby making Twitris part of the LOD cloud. lod demo
  • Flexible query interface: We installed Cuebee to allow users to query the information extracted from tweets with total freedom to explore the geographic, temporal, and thematic dimensions. flexible querying demo