Analyzing URL Chatter on Twitter


Revision as of 17:45, 17 December 2009

Project Description

This project analyzes the URLs found in tweets in relation to the extracted themes. The tweet data crawled for the Twitris project is used.

Introduction (Objectives and Motivation)

Twitter is a free social networking and microblogging service. It enables users to post their thoughts on an event, what they see, what they do, etc., in messages of around 140 characters, termed tweets. The Twitris project, developed by the Knoesis Center, is a Semantic Web application (using tweets) that facilitates browsing for news and information, using social perceptions as the fulcrum. The Twitris project does:

  • Crawling of tweets
  • Spatio Temporal Thematic Analysis
  • Browsing using social signals as the fulcrum.

This project is an extension of the Twitris project that uses the tweets containing a URL for analysis. The data crawled for the Twitris project is reused, and some of the functionality is adapted from the Twitris system.

Since Twitter is a microblogging service that provides only 140 characters of input, the space must be managed appropriately. Hence there are services that transform a long URL into a short one, anywhere between 25 and 30 characters. These are called short URLs.

A URL is the address of a document or resource on the World Wide Web. A document/resource is owned/published, read, and searched, which gives us three perspectives for analyzing a URL: the publisher's, the user's, and the search engine's. The publisher learns where and how the document he has published is being viewed, the user can choose the URLs that are most talked about regarding his theme of interest, and the search engine can use the analysis to improve search.

The objective of this project is twofold: first, to analyze the association between Web documents and themes extracted from tweets; second, to evaluate this association over the time dimension.

Challenges

During the course of the project many challenges were faced; some were dealt with and some are left as further work. The first and foremost challenge was the regex used to extract the URLs, which still has a lot of scope for improvement. URLs in tweets are typed in many ways and informally, for example with no space before the next word, or with the protocol missing. Filtering valid URLs is also a challenge yet to be solved. The extraction code was implemented on Hadoop in order to take advantage of parallelism.
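To illustrate, a regex-based extractor in Java might look like the sketch below. The pattern shown is a simplified assumption, not the exact expression used in the project, and it will still miss or mangle some informally typed URLs (exactly the difficulty described above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Simplified, illustrative pattern: matches http(s)://... or www.-prefixed
    // hosts up to the next whitespace. It does not handle trailing punctuation
    // or a missing protocol, which the project lists as open problems.
    private static final Pattern URL_PATTERN =
        Pattern.compile("(?i)\\b((?:https?://|www\\.)\\S+)");

    public static List<String> extract(String tweet) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(tweet);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

Running such an extractor over each tweet yields the candidate URLs that the later resolution step works on.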

There are many ways to refer to the same document on the Web: through URL redirection, short URLs, etc., users may use different URLs that point to the same document. Resolving all these "aliases" is important if we want to know which document the user is referring to in a tweet. Two approaches come to mind:

  • 1. For every URL extracted, we make an HTTP connection and get the landing document.
  • 2. Since HTTP connections take a long time, we only resolve short URLs. The first step, then, is to recognize URLs as long or short. There are many services to shorten a URL but no global service to do the converse, so we kept a list of known "shorteners" and only resolved those.
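The first step of approach 2 can be sketched as a lookup against a maintained list of shortener domains. The domains below are illustrative examples, not the project's actual list:

```java
import java.net.URI;
import java.util.Set;

public class ShortUrlDetector {
    // Hypothetical, non-exhaustive list of known shortener domains.
    private static final Set<String> SHORTENERS =
        Set.of("bit.ly", "tinyurl.com", "t.co", "is.gd", "ow.ly");

    // A URL is treated as "short" only if its host is a known shortener.
    public static boolean isShort(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && SHORTENERS.contains(host.toLowerCase());
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: not resolvable either way
        }
    }
}
```

Only URLs passing this check are sent through the (slow) HTTP resolution step.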

Contributions

UrlExtraction

The Twitris data was used for extracting the URLs. Java regular expressions (regex) are used for the extraction, and this is still under the scope for improvement: since users type URLs in various forms, the regex won't provide 100 percent accuracy.

UrlResolving

The resolver takes the extracted URLs and verifies whether each URL is short or long. If short, it sends an HTTP request to the shortening service's domain, which responds with a redirect. The Location header of the response gives the original URL, which is then stored in the DB.
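The redirect-reading step might look like the following sketch. The class and method names are our own, and the network call is simplified (no timeouts, retries, or error handling); the redirect decision itself is factored into a small pure method:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlResolver {
    // Pure decision logic: a 3xx response with a Location header
    // yields the long URL; anything else leaves the input unchanged.
    public static String pickTarget(int status, String location, String original) {
        if (status >= 300 && status < 400 && location != null) {
            return location;
        }
        return original;
    }

    // Network step (sketch): ask the shortener WITHOUT following the
    // redirect, so the Location header still carries the original URL.
    public static String resolve(String shortUrl) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(shortUrl).openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setRequestMethod("HEAD");
        String target = pickTarget(conn.getResponseCode(),
                                   conn.getHeaderField("Location"), shortUrl);
        conn.disconnect();
        return target;
    }
}
```

Disabling `setInstanceFollowRedirects` is the key detail: with it enabled, the connection would silently follow the redirect and the Location header would be lost.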

FrontEnd

The front end is built using HTML, CSS, XML, and JavaScript. The [Exhibit Timeline] JavaScript library is used to show our analysis: it takes an XML file as input and renders a graphical, interactive object on the Web page.
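For illustration, the timeline input might resemble the event XML used by the SIMILE Timeline library; the attribute names follow that library's documented format, while the dates, URL, and description below are made-up examples, not project data:

```xml
<data>
  <event start="Oct 15 2009 00:00:00 GMT"
         title="http://example.com/article">
    Example: tweets associating this URL with a theme on this date
  </event>
</data>
```

The front end hands such a file to the timeline widget, which lays the events out on an interactive time axis.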

BackEnd

The backend architecture is composed of a data layer and an API layer. The data layer connects to and queries the DB. The API layer consists of servlets forming a programmatic interface for the frontend: it takes requests from the front end, calls the DB connector for the data, and responds with XML. Four servlet classes are used for different purposes.
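As a sketch of the API layer's output, a helper like the following could assemble the XML a servlet writes back to the front end. The element and attribute names here are hypothetical, not the project's actual schema:

```java
import java.util.List;
import java.util.Map;

public class UrlThemeXmlWriter {
    // Builds the XML payload an API-layer servlet might return:
    // themes, each listing the URLs associated with it.
    public static String toXml(Map<String, List<String>> themeToUrls) {
        StringBuilder sb = new StringBuilder("<themes>");
        for (Map.Entry<String, List<String>> e : themeToUrls.entrySet()) {
            sb.append("<theme name=\"").append(e.getKey()).append("\">");
            for (String url : e.getValue()) {
                sb.append("<url>").append(url).append("</url>");
            }
            sb.append("</theme>");
        }
        sb.append("</themes>");
        return sb.toString();
    }
}
```

A servlet's doGet would call the data layer, pass the result map to such a writer, and send the string back with an XML content type.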

Status

Week 1

Programming Language: Java

  • Extracting the URLs from the tweets - Done
  • Recognizing the short/tiny URLs and transforming them into long URLs - Done

Week 2

Programming Language: Java

  • Creating a table to store the URLs and the corresponding tweets - Done
  • Analyzing the URLs with the presently available themes - Done

Week 3

Languages: SQL, Java (Servlets), JavaScript (jQuery), HTML, XML

  • Writing queries for the related operations
  • Working with the Timeline JavaScript library to integrate it with the project

Week 4

  • Integrating the code to show the desired results

Future work

  • 1. Make sure the URL is not short by checking it recursively
  • 2. Theme/entity extraction from the tweets, in addition to using the presently available themes
  • 3. Link tweets to the Linked Open Data cloud.
  • 3.1. Expose RDF describing tweets and their relationships with entities, themes, time, geospatial data, and URLs
  • 3.2. Link to URIs in the cloud (depends on item 2)
  • 4. Compare themes/entities extracted from tweets with entities extracted from search query logs