Location Prediction of Twitter Users

Revision as of 17:45, 23 June 2014

Introduction

With the advent of social media, many applications such as brand management, personalization and recommendation systems, real-time event detection, and crisis management rely on insights obtained from user-generated content on Twitter. The geographical location of a Twitter user is key to these applications. Recent studies (cite) have shown that less than 4% of tweets are tagged with latitude and longitude information. Existing approaches to predicting the location of a Twitter user are statistical and need large training datasets to create predictive models. In this work, we leverage Wikipedia to determine the local entities of a city and use these entities to predict a user's geographic location.

The existing approaches to predicting the location of a Twitter user can be broadly grouped into two categories:

  1. approaches based on the network of the user, and
  2. approaches based on the content of the user's tweets

Network-based approaches use follower-followee information and users' interactions to predict a user's location. This approach is feasible only when a user has other users in his/her network who have published their actual location. Content-based approaches, on the other hand, rely on a large training dataset to determine the spatial distribution of words across the geographic area of interest. For example, (cite) found that the word "phillies" was tweeted most often by users in Philadelphia. The disadvantage of this approach is that it needs a clean training dataset containing representative tweets from all cities; this collection process can be tedious and time-consuming. Our approach is based exclusively on the content of a user's tweets. Furthermore, we use Wikipedia to identify local entities, eliminating the need for a training dataset.
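The content-based baseline described above can be illustrated with a minimal sketch. The tweet corpus and city labels here are hypothetical toy data, not from any real study; the sketch only shows the idea of estimating a word's spatial distribution from city-labeled tweets and reading off the city where it is tweeted most.

```python
from collections import Counter, defaultdict

# Toy city-labeled tweet corpus (hypothetical data for illustration).
tweets = [
    ("philadelphia", "go phillies great game tonight"),
    ("philadelphia", "phillies win again"),
    ("new_york", "yankees game tonight"),
    ("new_york", "traffic in midtown again"),
]

# Spatial distribution: for each word, count how often it occurs per city.
word_city_counts = defaultdict(Counter)
for city, text in tweets:
    for word in text.split():
        word_city_counts[word][city] += 1

# The city where a word is tweeted most is its strongest location signal.
top_city, _ = word_city_counts["phillies"].most_common(1)[0]
print(top_city)  # phillies occurs most in the philadelphia-labeled tweets
```

Note that this baseline only works as well as the labeled corpus behind it, which is exactly the data-collection burden our Wikipedia-based approach avoids.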

Architecture

Our approach comprises three primary components:

  1. Knowledge Base Generator extracts local entities for each city from Wikipedia and scores them based on their relevance to the city
  2. User Profile Generator extracts Wikipedia entities from the tweets of a user
  3. Location Predictor uses the outputs of the Knowledge Base Generator and the User Profile Generator to predict the location of a user
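How the three components fit together can be sketched as follows. The knowledge base, entity names, cities, and relevance scores below are all hypothetical placeholders; a real User Profile Generator would run an entity linker over tweet text, whereas this sketch simply matches underscore-joined tokens against the knowledge base.

```python
# Hypothetical Knowledge Base Generator output: local Wikipedia entities per
# city with made-up relevance scores (real scores would be computed from
# Wikipedia, as described above).
knowledge_base = {
    "Dayton": {"Wright_State_University": 0.9, "Oregon_District": 0.7},
    "Columbus": {"Ohio_State_University": 0.9, "Short_North": 0.6},
}

def extract_entities(user_tweets, known_entities):
    # User Profile Generator (simplified): keep tokens that match an entity
    # in the knowledge base. A real system would use an entity linker.
    return [tok for tweet in user_tweets
            for tok in tweet.split() if tok in known_entities]

def predict_location(user_tweets, knowledge_base):
    # Location Predictor: sum the relevance scores of the user's entities for
    # each city and return the highest-scoring city.
    known = {e for ents in knowledge_base.values() for e in ents}
    entities = extract_entities(user_tweets, known)
    scores = {city: sum(ents.get(e, 0.0) for e in entities)
              for city, ents in knowledge_base.items()}
    return max(scores, key=scores.get)

user_tweets = ["lunch near Oregon_District today",
               "class at Wright_State_University"]
print(predict_location(user_tweets, knowledge_base))  # Dayton
```

Summing per-city relevance scores is one simple aggregation choice for the Location Predictor; the key point is that the knowledge base replaces the labeled tweet corpus that statistical approaches require.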