Location prediction of Twitter users using Wikipedia as a knowledge-base

Introduction

With the advent of social media, many applications such as brand management, personalization and recommendation systems, real-time event detection, and crisis management are based on insights obtained from user-generated content on Twitter. The geographical location of a Twitter user is key to these applications. Recent studies<ref>Morstatter, Fred, et al. "Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose." ICWSM. 2013.</ref> <ref>Saurabh Khanwalkar, Marc Seldin, Amit Srivastava, Anoop Kumar, and Sean Colbath. Content-based geo-location detection for placing tweets pertaining to trending news on map.</ref> have shown that fewer than 4% of tweets are tagged with latitude and longitude information. Existing approaches to predicting the location of a Twitter user are statistical and need large training data sets to build their models. In this work, we leverage Wikipedia to determine the local entities of a city and use these entities to predict a user's geographic location.

The existing approaches to predict the location of a Twitter user can be broadly grouped in two categories:

Network based solutions
Content based solutions

Network-based approaches use a user's follower-followee relationships and interactions to predict the user's location. Such an approach is feasible only when other users in the user's network have published their actual locations. Content-based approaches, on the other hand, rely on a large training dataset to determine the spatial distribution of words across the geographic area of interest. For example, (cite) found that the word "phillies" was tweeted most often by users in Philadelphia. The disadvantage of this approach is that it needs a clean training data set containing representative tweets from all cities, and collecting such data can be tedious and time-consuming. Our approach is based exclusively on the content of a user's tweets. Furthermore, we use Wikipedia to identify local entities, eliminating the need for a training dataset.

Architecture

Our approach comprises three primary components:

Knowledge Base Generator: extracts local entities for each city from Wikipedia and scores them based on their relevance to the city
User Profile Generator: extracts the Wikipedia entities from the tweets of a user
Location Predictor: uses the output of the Knowledge Base Generator and the User Profile Generator to predict the location of a user

Knowledge Base Generator

Wikipedia is a large, publicly available encyclopedia. Links to other Wikipedia pages (internal links) are an important feature of every Wikipedia page; they help the reader understand the topic of the page. For instance, the Wikipedia page of New York City mentions the Statue of Liberty and links to the Wikipedia page of the Statue of Liberty. Our approach is based on the assumption that the internal links of a city's page represent entities that are local to the city, although they vary in their degree of localness.
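As an illustration of how candidate local entities could be collected, the following minimal sketch pulls internal link targets out of a page's wikitext. The regular expression, the function name internal_links, and the rule for skipping namespaced links are assumptions made for this example; the text does not describe the actual extraction code.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]].
# Only the target title (before any '#' or '|') is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def internal_links(wikitext):
    """Return the set of page titles linked from a page's wikitext."""
    links = set()
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        # Simplification (assumption): treat any title containing ':' as a
        # namespaced link such as File: or Category: and skip it.
        if ":" in title:
            continue
        links.add(title)
    return links

# Hypothetical wikitext fragment for the New York City page.
sample = (
    "New York City is home to the [[Statue of Liberty]] and the borough of "
    "[[Manhattan|Manhattan Island]]. [[File:NYC.jpg|thumb]] "
    "See also [[Brooklyn#History]]."
)
candidate_entities = internal_links(sample)
```

In this sketch, each title in candidate_entities would become one candidate local entity of the city.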

We use the following four measures to score the local entities of a city, with respect to the city:

Pointwise Mutual Information

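In information theory, pointwise mutual information (PMI) measures the association between two specific outcomes by comparing how often they occur together with how often they would be expected to occur together if they were independent. We use this idea to determine the association between a city and its local entities.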
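The standard definition is PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below computes it from raw counts. How the counts for a city and an entity are obtained from Wikipedia is not stated here, so the argument names and the example numbers are purely illustrative assumptions.

```python
import math

def pmi(joint_count, city_count, entity_count, total):
    """PMI = log2( p(city, entity) / (p(city) * p(entity)) ) from raw counts."""
    if joint_count == 0:
        return float("-inf")  # the pair was never observed together
    p_joint = joint_count / total
    p_city = city_count / total
    p_entity = entity_count / total
    return math.log2(p_joint / (p_city * p_entity))

# Hypothetical counts: the pair co-occurs far more often than chance predicts.
strong = pmi(joint_count=40, city_count=100, entity_count=50, total=1000)
# Hypothetical counts: co-occurrence matches what independence predicts.
neutral = pmi(joint_count=5, city_count=100, entity_count=50, total=1000)
```

A positive value indicates that the entity and the city co-occur more often than expected under independence; a value near zero indicates no particular association.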

Betweenness Centrality

We build a directed graph for each city using its internal links. Each internal link of the city's page corresponds to a node of the graph. For a link from the Wikipedia page of one local entity to the page of another, we draw an edge from the former to the latter. For example, in the graph of New York City, an edge from Statue of Liberty to Manhattan indicates a link from the Wikipedia page of Statue of Liberty to the Wikipedia page of Manhattan. The betweenness centrality of each node (representing a local entity) gives the importance of that node relative to the rest of the nodes in the graph.
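The following sketch shows one way to compute betweenness centrality on such a graph, using Brandes' algorithm for unweighted directed graphs. The text does not say which implementation or normalization was used; the function below returns unnormalized scores, and the toy graph is a hypothetical example.

```python
from collections import deque

def betweenness_centrality(graph):
    """Unnormalized betweenness for an unweighted directed graph.

    `graph` maps each node to the list of nodes it links to.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    centrality = {v: 0.0 for v in nodes}

    for source in nodes:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {v: [] for v in nodes}
        num_paths = {v: 0 for v in nodes}
        distance = {v: -1 for v in nodes}
        num_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    num_paths[w] += num_paths[v]
                    predecessors[w].append(v)

        # Accumulate path dependencies, farthest nodes first.
        dependency = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += num_paths[v] / num_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    return centrality

# Hypothetical link graph among local entities of New York City.
nyc_graph = {
    "Statue of Liberty": ["Manhattan"],
    "Brooklyn": ["Manhattan"],
    "Manhattan": ["Central Park"],
}
scores = betweenness_centrality(nyc_graph)
```

In this toy graph, Manhattan lies on the shortest paths from both Statue of Liberty and Brooklyn to Central Park, so it receives the highest score.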

Semantic Overlap Measures

We use the hyperlink structure of Wikipedia to compute the semantic relatedness of a city and its local entities. We use the following set based measures to compute the semantic overlap between a city and its local entities:

Jaccard Index is a symmetric, set based measure that defines the similarity of two sets in terms of their overlap and is normalized for their sizes. We use this measure to find the similarity between a city and its local entities.
Tversky Index is an asymmetric, set based similarity measure. The Jaccard Index treats a city and a local entity symmetrically, but a local entity generally represents only a part of the city. We therefore also use the Tversky Index as a unidirectional measure of the similarity of the local entity to the city.
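Both measures can be sketched with plain set operations. The example assumes that each page is represented by the set of pages it links to, and it uses Tversky weights alpha = 1 and beta = 0; neither the set construction nor the weights are specified above, so treat them as illustrative choices.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; symmetric in its arguments."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def tversky(x, y, alpha, beta):
    """|X ∩ Y| / (|X ∩ Y| + alpha*|X - Y| + beta*|Y - X|); asymmetric when alpha != beta."""
    common = len(x & y)
    denominator = common + alpha * len(x - y) + beta * len(y - x)
    return common / denominator if denominator else 0.0

# Hypothetical link sets for a city page and one of its local entities.
city_links = {"Manhattan", "Brooklyn", "Statue of Liberty", "Central Park", "Hudson River"}
entity_links = {"Manhattan", "Hudson River", "France"}

ji = jaccard(city_links, entity_links)
# With alpha=1, beta=0 only the entity's unshared links are penalized:
# "how much of the entity is covered by the city".
ti_entity_to_city = tversky(entity_links, city_links, alpha=1.0, beta=0.0)
ti_city_to_entity = tversky(city_links, entity_links, alpha=1.0, beta=0.0)
```

Swapping the arguments of jaccard leaves the result unchanged, while swapping the arguments of tversky changes it, which is what makes the latter a directional measure.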

User Profile Generator

In order to use the local entities from our knowledge base to predict a user's location, we need to map the entities mentioned in the user's tweets to Wikipedia articles. Linking entities in tweets to Wikipedia articles has been well researched; the task involves mapping the named entities mentioned in a tweet to the corresponding real-world entities in Wikipedia. We use Zemanta [1] for this task. We chose Zemanta because of its relatively superior performance and the rate limit extension (10,000 requests per day) provided for research purposes.

Location Predictor

To predict the location of a user, we compute a score for each city whose local entities appear in the user's tweets. Each such entity contributes the product of its score with respect to the city and its frequency of occurrence in the user's tweets. The cities are then ranked by score in descending order, and the top k cities are predicted for the user.
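A minimal sketch of this scoring step follows. It assumes that the per-entity products are summed to obtain the city score, which is one plausible reading of the description rather than a confirmed detail; the function name predict_cities and the toy knowledge base are hypothetical.

```python
from collections import Counter

def predict_cities(user_entities, knowledge_base, k=1):
    """Return the top-k cities ranked by summed (entity score * frequency).

    `user_entities` lists the Wikipedia entities found in a user's tweets,
    with repeats. `knowledge_base` maps city -> {entity: score}.
    """
    frequency = Counter(user_entities)
    city_scores = {}
    for city, entity_scores in knowledge_base.items():
        total = 0.0
        matched = False
        for entity, count in frequency.items():
            if entity in entity_scores:
                # Assumption: contributions of overlapping entities are summed.
                total += entity_scores[entity] * count
                matched = True
        if matched:  # only cities with overlapping entities are scored
            city_scores[city] = total
    ranked = sorted(city_scores, key=city_scores.get, reverse=True)
    return ranked[:k]

# Hypothetical knowledge base and user profile.
kb = {
    "New York City": {"Statue of Liberty": 0.9, "Manhattan": 0.8},
    "San Francisco": {"Golden Gate Bridge": 0.9, "Manhattan": 0.1},
    "Chicago": {"Willis Tower": 0.9},
}
tweets_entities = ["Manhattan", "Manhattan", "Statue of Liberty", "Golden Gate Bridge"]
top2 = predict_cities(tweets_entities, kb, k=2)
```

Here New York City scores 0.8 * 2 + 0.9 = 2.5 and San Francisco scores 0.9 + 0.1 * 2 = 1.1, while Chicago has no overlapping entity and is not ranked.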

Evaluation

We conducted our experiments on the test data set created by Cheng et al. This data set was created in 2010 and contains 5119 active users from the continental United States, with 1000+ tweets per user. The users' locations are listed as latitude and longitude coordinates, which are generally more reliable than the location information in Twitter profiles. To create the knowledge-base, we used all the cities listed in the 2012 US Census with a population estimate greater than 5000. We extracted the hyperlink structure of Wikipedia using the XML dump [2]. The resulting knowledge-base contains 4661 cities and 500714 local entities.

Evaluation Metrics

We evaluated our approach using two metrics: Accuracy and Average Error Distance. Accuracy (ACC) is the percentage of users identified within 100 miles of their actual location. Error distance is the distance between the actual location of a user and the location estimated by our algorithm. Average Error Distance (AED) is the mean error distance across all users in the dataset.
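Both metrics can be sketched as follows. The example assumes great-circle (haversine) distance and an Earth radius of 3958.8 miles; the distance formula actually used in the evaluation is not stated, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def evaluate(actual, predicted, radius_miles=100.0):
    """Return (ACC in percent, AED in miles) for paired location lists."""
    errors = [haversine_miles(a, p) for a, p in zip(actual, predicted)]
    acc = 100.0 * sum(e <= radius_miles for e in errors) / len(errors)
    aed = sum(errors) / len(errors)
    return acc, aed

# Hypothetical users: one predicted near the true location, one far away.
actual = [(40.7128, -74.0060), (37.7749, -122.4194)]     # New York, San Francisco
predicted = [(40.7357, -74.1724), (34.0522, -118.2437)]  # Newark, Los Angeles
acc, aed = evaluate(actual, predicted)
```

Passing a different radius_miles gives the accuracy at other distance thresholds.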

Results

Table 1 shows the results of our approach, based on ranking of local entities using Pointwise Mutual Information (PMI), Betweenness Centrality (BC), Jaccard Index (JI) and Tversky Index (TI). The results show that the local entities ranked using Tversky Index are the most accurate in predicting the location of a user. Our approach also performs better than two other approaches tested on the same dataset. Cheng et al. <ref>Cheng, Zhiyuan, James Caverlee, and Kyumin Lee. "You are where you tweet: a content-based approach to geo-locating twitter users." Proceedings of the 19th ACM international conference on Information and knowledge management. ACM, 2010.</ref> reported 51% accuracy and an average error distance of 535.564 miles. Chang et al. <ref>Chang, Hau-wen, et al. "@ Phillies Tweeting from Philly? Predicting Twitter User Locations with Spatial Word Usage." Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012). IEEE Computer Society, 2012.</ref> reported 49.9% accuracy and an average error distance of 509.3 miles.

Table 1: Location Prediction using Local Entities
Method	ACC (%)	AED (in Miles)	ACC@2	ACC@3	ACC@5
PMI	38.48	599.40	49.85	56.06	64.15
BC	47.91	478.14	57.39	62.18	66.98
JI	53.21	433.62	67.41	73.56	78.84
TI	54.48	429.00	68.72	74.68	79.99

Figure 1: Accuracy of Location Prediction Algorithm

Figure 1 shows the accuracy of our algorithm at different radii (in miles) around the actual location. As shown, we can locate 27% of the users within 10 miles of their actual location.

We also applied our algorithm to users in the 100 most populous cities of the United States. The test dataset contains 2172 users from these cities. We were able to locate 54.65% of these users exactly at the city level, and 60.63% of them within 50 miles of their actual location.

Figure 2: Local Entities of San Francisco

Figure 2 shows the local entities of San Francisco scored using Tversky Index.

Source code is available at [3].

References
