Difference between revisions of "Location Prediction of Twitter Users"

From Knoesis wiki

Revision as of 17:39, 23 June 2014

=Introduction=

With the advent of social media, many applications, such as brand management, personalization and recommendation systems, real-time event detection, and crisis management, rely on insights obtained from user-generated content on Twitter. The geographic location of a Twitter user is key to these applications. Recent studies (cite) have shown that less than 4% of tweets are tagged with latitude and longitude information. Existing approaches to predicting the location of a Twitter user are statistical and need large training datasets to build their models. In this work, we leverage Wikipedia to determine the local entities of a city and use these entities to predict a user's geographic location.

Existing approaches to predicting the location of a Twitter user can be broadly grouped into two categories:

# those based on the network of the user, and
# those based on the content of the user's tweets.

Network-based approaches use follower-followee information and the interactions between a user and their network to predict the user's location. This approach is feasible only when other users in the network have published their actual location. Content-based approaches, on the other hand, rely on a large training dataset to determine the spatial distribution of words across the geographic area of interest. For example, (cite) found that the word "phillies" was tweeted the most by users in Philadelphia. The disadvantage of this approach is that it needs a clean training dataset containing representative tweets from all cities, and collecting such data can be tedious and time-consuming. Our approach is based exclusively on the content of a user's tweets. Furthermore, we use Wikipedia to identify local entities, eliminating the need for a training dataset.
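
To make the idea of a "spatial distribution of words" concrete, the following is a minimal sketch of the kind of content-based baseline described above, not the Wikipedia-based approach of this work. It assumes a training set of (city, tweet text) pairs; the function names <code>word_city_distribution</code> and <code>predict_city</code>, the scoring rule (summing per-word city probabilities), and the toy tweets are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def word_city_distribution(labeled_tweets):
    """Estimate P(city | word) from (city, tweet_text) pairs."""
    counts = defaultdict(Counter)  # word -> Counter mapping city -> tweet count
    for city, text in labeled_tweets:
        # Count each word at most once per tweet.
        for word in set(text.lower().split()):
            counts[word][city] += 1
    return {
        word: {city: n / sum(cities.values()) for city, n in cities.items()}
        for word, cities in counts.items()
    }


def predict_city(tweets, distribution):
    """Sum P(city | word) over a user's words; return the top city or None."""
    scores = Counter()
    for text in tweets:
        for word in text.lower().split():
            for city, p in distribution.get(word, {}).items():
                scores[city] += p
    return scores.most_common(1)[0][0] if scores else None


# Toy, invented training data (an assumption for illustration only).
training = [
    ("Philadelphia", "phillies game tonight"),
    ("Philadelphia", "go phillies"),
    ("Boston", "red sox game tonight"),
]
dist = word_city_distribution(training)
guess = predict_city(["watching the phillies"], dist)
```

Even in this toy setting, a word such as "game" is split evenly between cities, while "phillies" points to a single city. Estimating such distributions reliably for every city is what requires the large, clean training set noted above.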