Analyzing URL Chatter on Twitter

From Knoesis wiki
Revision as of 17:11, 21 November 2009 by Pavan (Talk | contribs) (Project Description)

Project Description

This project analyzes the URLs that appear in tweets, together with the themes of those tweets. It uses the data (tweets) crawled for the Twitris project.

Introduction (Objectives and Motivation)

Twitter is a free social networking and microblogging service. It lets users post their thoughts about an event, what they see, what they do and so on, in messages of around 140 characters called tweets. Twitris, developed by the Knoesis Center, is a Semantic Web application that uses tweets to facilitate browsing for news and information, using social perceptions as the fulcrum. The Twitris project does the following:

  • Crawling of tweets
  • Spatio Temporal Thematic Analysis
  • Browsing using social signals as the fulcrum.

This project is an extension of the Twitris project that analyzes the tweets containing a URL. It uses the data crawled for Twitris and adapts some functionality from the Twitris system.

Because Twitter is a microblogging service that allows only 140 characters per message, the space must be used carefully. Hence there are services that transform a long URL into a short one, typically between 25 and 30 characters. These are called short URLs.

A URL is the address of a document or resource on the World Wide Web. A document or resource is owned and published, read, and searched for, which gives three perspectives for analyzing a URL: the publisher's, the user's and the search engine's. The publisher learns where and how a published document is being viewed, the user can choose the URLs that are most talked about for a theme of interest, and the search engine can use the analysis to improve search.

The project analyzes the URLs in the tweets over time and across the themes extracted by the Twitris project.

Challenges

During the course of the project we faced many challenges; some were dealt with and others are left as future work. The first and foremost challenge was the regex used to extract the URLs, which still has a lot of scope for improvement. URLs in tweets are typed informally and in many different ways. Informal URLs are hard to extract, and many are tweeted just for fun and point to hosts that do not exist. Filtering the URLs is also a challenge yet to be solved.

Once the URLs are extracted, each one has to be recognized as long or short. There are many services that shorten a URL but no global service that does the reverse. The extraction code was also implemented on Hadoop for better performance.

Contributions

UrlExtraction

The Twitris data was used for extracting the URLs. Extraction uses a regex, which still has scope for improvement: because users type URLs in various forms, the regex will not provide 100 percent accuracy.
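As an illustration of regex-based extraction, here is a minimal sketch in Java. The pattern, the class name UrlExtractor and the trailing-punctuation cleanup are assumptions made for this example, not the project's actual regex. It only catches URLs that start with http://, https:// or www., so informally typed URLs (for example a bare "bit.ly/abc") are missed, which is the kind of accuracy gap described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlExtractor {
    // Illustrative pattern (an assumption, not the project's regex):
    // "http://", "https://" or "www." followed by non-whitespace characters.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");

    // Returns every URL-like token found in the tweet text, in order.
    static List<String> extract(String tweet) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(tweet);
        while (m.find()) {
            // Drop punctuation that usually belongs to the sentence, not the URL.
            String url = m.group().replaceAll("[.,;:!?)\\]\"']+$", "");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }
}
```

Calling UrlExtractor.extract on the text of a tweet returns the list of candidate URLs, which can then be passed on to the resolver.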

UrlResolving

The resolver takes the extracted URLs and checks whether each one is short or long. If it is short, the resolver sends an HTTP request to the shortening service's domain, which responds with a redirect. The Location header of the response gives the original URL, which is then stored in the DB.
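The resolving step can be sketched as follows. This is not the project's code: the class name UrlResolver, the list of shortener hosts, the use of a HEAD request and the timeouts are all assumptions made for illustration. The sketch turns off automatic redirect following so that the Location header of the redirect response can be read directly.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;
import java.util.Set;

final class UrlResolver {
    // Assumed sample of shortener hosts; a real list would be much longer.
    private static final Set<String> SHORT_HOSTS =
            Set.of("bit.ly", "tinyurl.com", "is.gd");

    // A URL counts as short here if its host is a known shortening service.
    static boolean isShort(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null
                    && SHORT_HOSTS.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
    }

    // Resolves one level of redirection; long URLs are returned unchanged.
    static String resolve(String url) throws IOException {
        if (!isShort(url)) {
            return url;
        }
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // we want the redirect itself
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            int code = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            boolean redirect = code >= 300 && code < 400 && location != null;
            return redirect ? location : url;
        } finally {
            conn.disconnect();
        }
    }
}
```

Deciding by host name is only one possible way to tell short from long URLs; it depends on keeping the host list up to date.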

FrontEnd

The front end is built with HTML, CSS, XML and JavaScript. A timeline JavaScript library is used to show our analysis. We found the SIMILE Timeline from MIT to be the most suitable for displaying it. SIMILE Timeline takes XML as input and renders it as a graph on the web page.
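To make the XML input concrete, here is a small sketch of how a per-day tweet count for one URL could be turned into Timeline XML. The class TimelineXml, the method names and the choice of one event per day are assumptions for illustration. The data/event element names with start and title attributes follow the event format commonly shown in SIMILE Timeline examples and should be checked against the Timeline documentation for the version in use.

```java
import java.util.Map;

final class TimelineXml {
    // Escapes the characters that are not allowed in XML attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds one <event> per day; the map goes from a date string to the
    // number of tweets that mentioned the URL on that day (hypothetical data).
    static String toXml(String url, Map<String, Integer> tweetsPerDay) {
        StringBuilder sb = new StringBuilder("<data>\n");
        for (Map.Entry<String, Integer> e : tweetsPerDay.entrySet()) {
            sb.append("  <event start=\"").append(escape(e.getKey()))
              .append("\" title=\"").append(escape(url)).append("\">")
              .append(e.getValue()).append(" tweets</event>\n");
        }
        return sb.append("</data>\n").toString();
    }
}
```

Passing a LinkedHashMap keeps the events in insertion order, which is convenient when the days are already sorted.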

BackEnd

The back end has two layers: one that connects to and queries the DB, and another made up of servlets that form the interface to the front end. The servlet layer takes requests from the front end, calls the DB connector for the data and responds with XML. Four servlet classes are used for different purposes.

Status

Week 1

Programming Language: Java

  • Extracting the URLs from the tweets - Done
  • Recognizing the short/tiny URLs and transforming them into long URLs - Done

Week 2

Programming Language: Java

  • Creating a table for the URLs and the tweets to store the URLs - Done
  • Analyzing the URLs with the presently available themes - Done

Week 3

Languages: SQL, Java (Servlets), JavaScript (jQuery), HTML, XML

  • Queries for performing the related operations
  • Working with the Timeline JavaScript to integrate it with the project

Week 4

  • Integrating the code to show the desired results

Future work

  • 1. Make sure the resolved URL is not itself short by resolving it recursively
  • 2. Theme/entity extraction from the tweets rather than using the presently available themes
  • 3. Provisions in the DB to record the popularity of a URL for a particular theme
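The first future-work item could be approached along these lines. This is only a sketch under assumptions: the class RecursiveResolver, the hop limit and the idea of passing the single-hop resolver in as a function are all introduced here for illustration. The loop keeps resolving until the URL stops changing, a URL repeats (a redirect loop) or the hop limit is reached.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

final class RecursiveResolver {
    // singleHop resolves one level of redirection and returns its input
    // unchanged when the URL is already long (hypothetical contract).
    static String resolveFully(String url,
                               UnaryOperator<String> singleHop,
                               int maxHops) {
        Set<String> seen = new HashSet<>();
        String current = url;
        // seen.add returns false for a URL we already visited, ending loops.
        for (int i = 0; i < maxHops && seen.add(current); i++) {
            String next = singleHop.apply(current);
            if (next == null || next.equals(current)) {
                break; // nothing left to resolve
            }
            current = next;
        }
        return current;
    }
}
```

Taking the single-hop step as a parameter keeps the loop independent of the network code, so it can be exercised with a simple lookup table.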