Location Prediction of Twitter Users

From Knoesis wiki
Revision as of 13:59, 8 July 2014 by Revathy (Talk | contribs) (Results)



Introduction

With the advent of social media, many applications such as brand management, personalization and recommendation systems, real-time event detection, and crisis management rely on insights obtained from user-generated content on Twitter. The geographic location of a Twitter user is key to these applications. Recent studies (cite) have shown that less than 4% of tweets are tagged with latitude and longitude information. Existing approaches to predicting the location of a Twitter user are statistical and need large training data sets to build models of user location. In this work, we leverage Wikipedia to identify the local entities of a city and use these entities to predict a user's geographic location.

The existing approaches to predicting the location of a Twitter user can be broadly grouped into two categories:

  1. Network based solutions
  2. Content based solutions
Network based approaches consider the follower-followee information and their interactions with a user to predict the user's location. This approach is feasible only when a user has other users in his/her network who have published their actual locations. Content based approaches, on the other hand, rely on a large training dataset to determine the spatial distribution of words across the geographic area of interest. For example, (cite) found that the word phillies was tweeted most often by users in Philadelphia. The disadvantage of this approach is that it requires a clean training data set containing representative tweets from every city, and this collection process can be tedious and time consuming. Our approach is based exclusively on the contents of a user's tweets, and it uses Wikipedia to identify local entities, eliminating the need for a training dataset.

Architecture

Our approach comprises three primary components:

  1. Knowledge Base Generator extracts local entities for each city from Wikipedia and scores them based on their relevance to the city
  2. User Profile Generator extracts the Wikipedia entities from the tweets of a user
  3. Location Predictor uses the output of Knowledge Base Generator and User Profile Generator to predict the location of a user

Knowledge Base Generator

Wikipedia is a large, publicly available encyclopedia. Links to other Wikipedia pages are an important feature of every Wikipedia page; their aim is to deepen the reader's understanding of the page at hand. For instance, the Wikipedia page of New York City mentions the Statue of Liberty and contains a link to the Wikipedia page of Statue of Liberty. Our approach is based on the assumption that these internal links represent entities that are local to the city, varying in their degree of localness.

We use the following four measures to score the local entities of a city with respect to that city:

  • Pointwise Mutual Information

In information theory, pointwise mutual information measures the association between a pair of outcomes of two random variables. We use this idea to determine the association between a city and its local entities.
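As a sketch, PMI can be estimated from co-occurrence counts. The counting scheme below (how often the city, the entity, and both occur across a corpus of pages) and all parameter names are illustrative assumptions, since the exact probability estimates are not specified above.

```python
import math

def pmi(count_city, count_entity, count_both, total):
    """PMI of a city and a candidate local entity from occurrence counts.

    total is the corpus size used to turn counts into probabilities;
    the counting choices here are illustrative assumptions.
    """
    p_city = count_city / total
    p_entity = count_entity / total
    p_both = count_both / total
    return math.log(p_both / (p_city * p_entity))

# Entities that co-occur with the city far more often than chance
# would predict receive a high (positive) PMI:
score = pmi(count_city=120, count_entity=80, count_both=40, total=10_000)
```

A strongly associated pair yields a positive score, while a pair that co-occurs less often than chance yields a negative one.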

  • Betweenness Centrality

We build a directed graph for each city from its internal links: the links correspond to the nodes of the graph, and for a link from the Wikipedia page of one local entity to that of another, we draw an edge from the former to the latter. For example, in the graph of New York City, an edge from Statue of Liberty to Manhattan indicates a link from the Wikipedia page of Statue of Liberty to the Wikipedia page of Manhattan. The betweenness centrality of each node (representing a local entity) gives its importance relative to the rest of the nodes in the graph.
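The graph construction and centrality computation can be sketched as follows. This is a minimal Brandes-style implementation over an unweighted adjacency-list graph; the dictionary representation and the toy entities beyond those named in the text (e.g. Central Park) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict, deque

def betweenness_centrality(graph):
    """Betweenness centrality of every node in a directed graph given as
    {node: [successor, ...]}. Uses Brandes' shortest-path accumulation
    over unweighted edges; fine for the small sketches here."""
    nodes = set(graph) | {w for succs in graph.values() for w in succs}
    bc = {v: 0.0 for v in nodes}
    for s in nodes:
        sigma = defaultdict(float)   # number of shortest paths from s
        sigma[s] = 1.0
        dist = {s: 0}
        preds = defaultdict(list)    # predecessors on shortest paths
        order = []                   # nodes in non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)   # dependency of s on each node
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# In this toy fragment of the New York City graph, Manhattan lies on the
# only path from Statue of Liberty to Central Park, so it scores highest:
toy = {"Statue of Liberty": ["Manhattan"],
       "Manhattan": ["Central Park"],
       "Central Park": []}
scores = betweenness_centrality(toy)
```

Nodes that bridge many shortest paths between other local entities receive high scores, matching the intuition that such entities are central to the city's link structure.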

  • Semantic Overlap Measures

We use the hyperlink structure of Wikipedia to compute the semantic relatedness of a city and its local entities. We use the following set based measures to compute the semantic overlap between a city and its local entities:

  1. Jaccard Index is a symmetric, set based measure that defines the similarity of two sets in terms of their overlap, normalized by their combined size. We use this measure to find the similarity between a city and each of its local entities.
  2. Tversky Index is an asymmetric, set based similarity measure. While the Jaccard Index treats the city and the local entity symmetrically, a local entity generally represents only a part of the city. We therefore also use the Tversky Index, a unidirectional measure of the similarity of the local entity to the city.
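Both measures can be sketched over sets of outgoing Wikipedia links. The link sets below and the Tversky weights (alpha=1, beta=0, which reduces the index to the fraction of the entity's links contained in the city's links) are illustrative assumptions, not the tuned values used in the work.

```python
def jaccard(a, b):
    """Symmetric overlap of two link sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def tversky(entity_links, city_links, alpha=1.0, beta=0.0):
    """Asymmetric Tversky index of the entity's links with respect to the
    city's links. With alpha=1, beta=0 it reduces to |A ∩ B| / |A|, i.e.
    how much of the entity is contained in the city; these weights are
    illustrative assumptions."""
    common = len(entity_links & city_links)
    only_entity = len(entity_links - city_links)
    only_city = len(city_links - entity_links)
    return common / (common + alpha * only_entity + beta * only_city)

# Hypothetical link sets for Statue of Liberty and New York City:
statue = {"Manhattan", "Liberty Island", "France"}
nyc = {"Manhattan", "Liberty Island", "Brooklyn", "Queens", "Bronx"}
```

Because the entity's link set is small relative to the city's, the asymmetric Tversky score (2/3 here) exceeds the symmetric Jaccard score (2/6), reflecting that the entity is largely contained in the city.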

User Profile Generator

In order to use the local entities from our knowledge base to predict a user's location, we need to map the entities from the user's tweets to Wikipedia articles. Linking entities in tweets to Wikipedia articles has been well researched; it involves mapping named entities mentioned in tweets to the corresponding real-world entities in Wikipedia. We use Zemanta [1] for this task, because of its relatively superior performance and the rate limit extension (10,000 requests per day) it provides for research purposes.

Location Predictor

To predict the location of a user, we compute a score for each city that shares local entities with the user's tweets: each shared local entity contributes the product of its score with respect to the city and its frequency of occurrence in the user's tweets. Ranking the city scores in descending order yields the top k predicted cities for the user.
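The scoring step can be sketched as below. The knowledge-base layout (city mapped to entity scores), the example scores, and the choice to combine the per-entity products by summation are our reading of the scheme and are assumptions, not the authors' exact implementation.

```python
from collections import Counter

def predict_cities(tweet_entities, knowledge_base, k=3):
    """Rank cities by the sum, over local entities that also appear in the
    user's tweets, of (entity score w.r.t. the city) * (entity frequency).

    knowledge_base maps city -> {entity: score}; summing the per-entity
    products is an assumption about how scores are combined.
    """
    freq = Counter(tweet_entities)
    scores = {}
    for city, entity_scores in knowledge_base.items():
        total = sum(s * freq[e] for e, s in entity_scores.items() if e in freq)
        if total > 0:
            scores[city] = total
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical knowledge-base scores:
kb = {"New York City": {"Statue of Liberty": 0.9, "Manhattan": 0.8},
      "Philadelphia": {"Phillies": 0.9}}
entities = ["Statue of Liberty", "Manhattan", "Statue of Liberty"]
top = predict_cities(entities, kb, k=1)
```

A user who repeatedly mentions entities local to one city accumulates a high score for that city, so it appears at the top of the ranking.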

Evaluation

We conducted our experiments on the test data set created by Cheng et al. This data set, created in 2010, contains 5,119 active users from the continental United States, each with 1,000+ tweets. Their locations are given as latitude and longitude coordinates, which are generally more reliable than the free-text location field of a Twitter profile. To create the knowledge base, we used all cities listed in the 2012 US Census with a population estimate greater than 5,000, and we extracted the hyperlink structure of Wikipedia from the XML dump [2]. The final knowledge base contains 4,661 cities and 500,714 local entities.

Evaluation Metrics

We evaluated our approach using two metrics: Accuracy and Average Error Distance. Accuracy (ACC) is the percentage of users located within 100 miles of their actual location. The error distance for a user is the distance between the user's actual location and the location estimated by our algorithm; Average Error Distance (AED) is the average error distance across all users in the dataset.
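Both metrics can be computed from (actual, predicted) coordinate pairs. The great-circle (haversine) formula used here is one standard choice for the distance computation; the text above does not specify which distance was used, so treat it as an assumption.

```python
import math

def error_distance_miles(actual, predicted):
    """Great-circle (haversine) distance in miles between two
    (latitude, longitude) pairs given in degrees."""
    (lat1, lon1), (lat2, lon2) = actual, predicted
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def acc_and_aed(pairs, threshold=100.0):
    """ACC (% of users within `threshold` miles of their actual location)
    and AED (mean error distance) over (actual, predicted) pairs."""
    errors = [error_distance_miles(a, p) for a, p in pairs]
    acc = 100.0 * sum(e <= threshold for e in errors) / len(errors)
    aed = sum(errors) / len(errors)
    return acc, aed

# One exact hit and one cross-country miss give 50% accuracy:
pairs = [((40.71, -74.01), (40.71, -74.01)),    # New York -> New York
         ((40.71, -74.01), (34.05, -118.24))]   # New York -> Los Angeles
acc, aed = acc_and_aed(pairs)
```

The 100-mile threshold matches the ACC definition above and is passed as a parameter so other radii can be evaluated as well.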

Results

The following table shows the results of our approach:

Table 1: Location Prediction using Local Entities
  Method   ACC     AED (miles)   ACC@2   ACC@3   ACC@5
  PMI      38.48   599.40        49.85   56.06   64.15
  BC       47.91   478.14        57.39   62.18   66.98
  JC       53.21   433.62        67.41   73.56   78.84
  TI       54.48   429.00        68.72   74.68   79.99

(PMI: Pointwise Mutual Information; BC: Betweenness Centrality; JC: Jaccard Index; TI: Tversky Index. ACC@k is the accuracy when the user's actual city appears among the top k predicted cities.)


References

People