Analyzing URL Chatter on Twitter

Project Description

This project analyzes the URLs that appear in tweets in relation to the tweets' themes. The tweet data crawled for the Twitris project is used.

Introduction (Objectives and Motivation)

Twitter is a free social networking and microblogging service. It enables users to share their thoughts on an event, what they see, what they do, and so on, in messages of up to 140 characters, which are termed tweets. The Twitris project, developed by the Kno.e.sis Center, is a Semantic Web application (built on tweets) that facilitates browsing for news and information, using social perceptions as the fulcrum. The Twitris project performs:

  • Crawling of tweets
  • Spatio-temporal thematic analysis
  • Browsing using social signals as the fulcrum

This project is an extension to Twitris that analyzes the tweets containing a URL. The data crawled for the Twitris project is reused, and some of the functionality is adapted from the Twitris system.

Since Twitter is a microblogging service that allows only 140 characters per message, that space has to be managed carefully. Hence there are services that transform a long URL into a short one, typically 25 to 30 characters long. These are short URLs.

A URL is an address of a document or resource on the World Wide Web. A document/resource is published, read, and searched for, which gives three perspectives for analyzing a URL: the publisher perspective, the user perspective, and the search-engine perspective. The publisher learns where and how the published document is being viewed, the user can choose the URLs that are most talked about for a theme of interest, and the search engine can use the analysis to improve search.

The objective of this project is twofold: first, to analyze the association between Web documents and themes extracted from tweets; second, to evaluate this association over the time dimension.

Challenges

During the course of the project many challenges were faced; a few were dealt with and a few are left as future work. The first and foremost challenge was the regex used to extract the URLs, which still has much scope for improvement. URLs in tweets are typed in many informal ways. For example, in "http://wiki.knoesis.org/index.phpis a wiki page" the URL is trailed by words with no space in between (see the sketch below). There are also URLs where the protocol is not mentioned, so the protocol or the domain has to be completed. Filtering valid URLs is also a challenge yet to be solved. The extraction code was implemented on Hadoop in order to take advantage of parallelism.
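
A minimal Java sketch of the over-capture problem; the pattern here is an illustrative assumption, not the project's actual regex:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Naive extractor: matches an http(s) URL up to the next whitespace.
    public class UrlExtractorDemo {
        private static final Pattern URL =
            Pattern.compile("https?://[\\w.-]+(?:/\\S*)?");

        public static List<String> extract(String tweet) {
            List<String> urls = new ArrayList<String>();
            Matcher m = URL.matcher(tweet);
            while (m.find()) {
                urls.add(m.group());
            }
            return urls;
        }

        public static void main(String[] args) {
            // The trailing-text case from above: "\S*" swallows the glued-on
            // word, so the match ends in ".../index.phpis".
            System.out.println(extract("http://wiki.knoesis.org/index.phpis a wiki page"));
        }
    }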

There are many ways to refer to the same document on the Web: through URL redirection, short URLs, and the like, users may use different URLs that all point to the same document. Resolving all these "aliases" is important if we want to know which document a user is referring to in a tweet. Two approaches come to mind:

  • 1. For every URL extracted, make an HTTP connection and fetch the landing document.
  • 2. Since HTTP connections take a long time, resolve only the short URLs. The first step then is to recognize whether a URL is long or short. There are many services for shortening a URL but no global service for the converse, so we kept a list of known "shorteners" and resolved only those.

Software

Code is available at svn/classprojects/CitizenSensor/URLExtractorWeb/.

UrlExtraction

The Twitris data was used for extracting the URLs. Java regular expressions (regex) are used for the extraction, and this is still within the scope for improvement: since users type URLs in various forms, the regex won't provide 100 percent accuracy.
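
As noted under Challenges, the extraction ran on Hadoop to take advantage of parallelism. A minimal mapper sketch under that assumption (class name, input layout, and pattern are illustrative, not the project's code):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Assumes one tweet per input line; emits every URL the pattern matches.
    public class UrlExtractorMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private static final Pattern URL =
            Pattern.compile("https?://[\\w.-]+(?:/\\S*)?");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = URL.matcher(value.toString());
            while (m.find()) {
                context.write(new Text(m.group()), NullWritable.get());
            }
        }
    }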

UrlResolving

The resolver takes the extracted URLs and verifies whether each URL is short or long. If short, it sends an HTTP request to the shortening service's domain, which sends back a redirect response; the Location header of the response gives the original URL, which is then stored in the DB.
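
A sketch of this resolution step using java.net.HttpURLConnection (the class name and the example short URL are assumptions):

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class UrlResolverDemo {
        // Ask the shortener for its redirect without following it,
        // then read the original URL from the Location header.
        public static String resolve(String shortUrl) throws IOException {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(shortUrl).openConnection();
            conn.setInstanceFollowRedirects(false); // keep the 3xx visible
            conn.setRequestMethod("HEAD");          // headers are enough
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            return (status / 100 == 3 && location != null) ? location : shortUrl;
        }

        public static void main(String[] args) throws IOException {
            System.out.println(resolve("http://bit.ly/example")); // hypothetical
        }
    }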

FrontEnd

The front end is built using HTML, CSS, XML and JavaScript. The Exhibit Timeline JavaScript library is used to show our analysis: it takes an XML file as input and provides a graphical, interactive object on the Web page.
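
The exact XML consumed by the front end is not reproduced here; a plausible sketch, assuming the SIMILE Timeline event format (all values hypothetical):

    <data>
      <event start="Jan 12 2010 00:00:00 GMT"
             title="http://example.org/article"
             link="http://example.org/article">
        Tweets mentioning this URL
      </event>
    </data>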

BackEnd

The backend architecture is composed of a data layer and an API layer. The data layer connects to and queries the DB. The API layer consists of servlets that form a programmatic interface for the front end: it takes requests from the front end, calls the DB connector for the data, and responds with XML. Four servlet classes are used for different purposes.
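
A minimal sketch of one such servlet, assuming a "theme" request parameter and the event XML shown above (names are illustrative, not the project's four classes):

    import java.io.IOException;
    import java.io.PrintWriter;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class UrlTimelineServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String theme = req.getParameter("theme"); // assumed parameter name
            resp.setContentType("text/xml");
            PrintWriter out = resp.getWriter();
            out.println("<?xml version=\"1.0\"?>");
            out.println("<data>");
            // The real data layer would be queried here instead.
            out.printf("  <event title=\"%s\"/>%n", theme);
            out.println("</data>");
        }
    }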

Data Analysed

The data analysis on Unix mainly used the following commands (the full commands and process are still to be written up; an illustrative pipeline follows the list):

  • 1. join
  • 2. sort
  • 3. cut
  • 4. split
  • 5. cat
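
One pipeline in that style (file names and column layout are assumptions): attach the resolved long URL to each extracted short URL, then count the distinct long URLs.

    # tweet_urls.txt: <short_url> <tweet_id>   resolved.txt: <short_url> <long_url>
    sort -k1,1 tweet_urls.txt > tweet_urls.sorted
    sort -k1,1 resolved.txt   > resolved.sorted
    # Join on the short URL, keep the long-URL column, count distinct values.
    join tweet_urls.sorted resolved.sorted | cut -d' ' -f3 | sort -u | wc -l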

Pig was also used for finding the number of occurrences of the entities (the script itself is still to be posted; a sketch follows).
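
Since the actual script is not yet posted, here is how such a count could look in Pig Latin (paths and schema are assumptions):

    -- Count how often each extracted entity occurs.
    entities = LOAD 'extracted_entities.tsv' USING PigStorage('\t')
               AS (tweet_id:chararray, entity:chararray);
    grouped  = GROUP entities BY entity;
    counts   = FOREACH grouped GENERATE group AS entity,
               COUNT(entities) AS occurrences;
    ordered  = ORDER counts BY occurrences DESC;
    STORE ordered INTO 'entity_counts';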

URLs

  • Tweets : 3,188,262
  • URLs extracted : 1,560,142
  • Unique short URLs : 827,750
  • Unique long (original) URLs : 645,026

Entities

  • Freebase entities : 8,712,104
  • Entities extracted from 8,712,104 tweets : 30,688,428
  • Unique entities extracted : 244,383 (these still need filtering: during extraction we do not filter out URL fragments, so tokens such as "htt", "bit", and "ly" are among the most frequently occurring "entities")

Results

  • Timeline perspective: the main interface of Twitris was able to show data from a fixed time point spread over geography (Google Maps). We added a timeline perspective to Twitris so that data can also be shown spread over time.
  • Twitris on the LOD: we extracted mentions of entities that are already on the LOD cloud and added links from the tweets to those entities, thereby making Twitris part of the LOD cloud.
  • Flexible query interface: we installed Cuebee to allow users to query the information extracted from tweets with full freedom to explore the geographical, time, or thematic dimensions.

Future work

  • 1. Make sure a URL is not short by resolving it recursively
  • 2. Theme/entity extraction from the tweets, in addition to using the presently available themes
  • 3. Link tweets to the Linked Open Data cloud:
  • 3.1. Expose RDF describing tweets and their relationships with entities, themes, time, geospatial data, and URLs
  • 3.2. Link to URIs in the cloud (depends on item 2)
  • 4. Compare themes/entities extracted from tweets with entities extracted from search query logs