Difference between revisions of "Location Prediction of Twitter Users"

From Knoesis wiki

Revision as of 18:04, 23 June 2014

Introduction

With the advent of social media, many applications such as brand management, personalization and recommendation systems, real-time event detection, and crisis management are based on insights obtained from user-generated content on Twitter. The geographical location of a Twitter user is key to these applications. Recent studies (cite) have shown that fewer than 4% of tweets are tagged with latitude and longitude information. Existing approaches to predicting the location of a Twitter user are statistical and need large training datasets to build their models. In this work, we leverage Wikipedia to determine the local entities of a city and use these entities to predict a user's geographic location.

The existing approaches to predicting the location of a Twitter user can be broadly grouped into two categories:

  1. based on the Twitter network of the user and
  2. based on the content of the tweets of a user

Network-based approaches use follower-followee information and a user's interactions with others in the network to predict the user's location. This approach is feasible only when other users in the user's network have published their actual location. Content-based approaches, on the other hand, rely on a large training dataset to determine the spatial distribution of words across the geographic area of interest. For example, (cite) found that the word "phillies" was tweeted most often by users in Philadelphia. The disadvantage of this approach is that it needs a clean training dataset containing representative tweets from all cities, and collecting such data can be tedious and time-consuming. Our approach is based exclusively on the content of a user's tweets. Furthermore, we use Wikipedia to identify local entities, eliminating the need for a training dataset.

Architecture

Our approach comprises three primary components:

  1. Knowledge Base Generator extracts local entities for each city from Wikipedia and scores them based on their relevance to the city
  2. User Profile Generator extracts the Wikipedia entities from the tweets of a user
  3. Location Predictor uses the output of Knowledge Base Generator and User Profile Generator to predict the location of a user

Knowledge Base Generator

Wikipedia is a large, publicly available encyclopedia. Links to other Wikipedia pages (internal links) are an important feature of every Wikipedia page; their purpose is to help readers understand the page they appear on. For instance, the Wikipedia page of New York City mentions the Statue of Liberty and links to the Statue of Liberty's Wikipedia page. Our approach is based on the assumption that the internal links of a city's page represent entities that are local to the city, to varying degrees. We use the following four measures to score the local entities of a city with respect to the city:

  • Pointwise Mutual Information

In information theory, the pointwise mutual information (PMI) of two events measures how much more often they occur together than would be expected if they were independent. We use this idea to determine the association between a city and its local entities.
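As an illustration, PMI can be written as log2( p(city, entity) / (p(city) · p(entity)) ). The sketch below is one possible way to estimate it, not necessarily the estimator used in this work: it assumes the probabilities are estimated from counts of Wikipedia pages that mention the city, the entity, or both. The function name `pmi` and all counts are hypothetical.

```python
import math


def pmi(count_city, count_entity, count_both, total_pages):
    """Pointwise mutual information of a city and an entity, in bits.

    Assumption (not from the source): probabilities are estimated as the
    fraction of pages that mention the city, the entity, or both.
    """
    if min(count_city, count_entity, count_both) <= 0:
        # The pair never co-occurs, so the log ratio is undefined;
        # return negative infinity by convention.
        return float("-inf")
    p_city = count_city / total_pages
    p_entity = count_entity / total_pages
    p_both = count_both / total_pages
    return math.log2(p_both / (p_city * p_entity))


# Hypothetical counts: the pair co-occurs 50 times more often than chance.
score = pmi(count_city=100, count_entity=50, count_both=25, total_pages=10000)
print(round(score, 3))
```

A positive score means the entity appears with the city more often than independence would predict; a score near zero means the two are roughly independent.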

  • Betweenness Centrality

We build a directed graph for each city using its internal links. Each internal link (that is, each local entity) corresponds to a node of the graph. For each link from the Wikipedia page of one local entity to the Wikipedia page of another, we draw an edge from the former to the latter. For example, an edge from Statue of Liberty to Manhattan indicates a link from the Wikipedia page of Statue of Liberty to the Wikipedia page of Manhattan. The betweenness centrality of each node (representing a local entity) indicates the importance of that node relative to the rest of the nodes in the graph.
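The graph construction and scoring described above can be sketched as follows. This is a minimal, unnormalized implementation of Brandes' algorithm for unweighted directed graphs, written in plain Python. The source does not say which algorithm or library was used, so treat that choice as an assumption. Apart from the Statue of Liberty to Manhattan link, the edges in the example are hypothetical.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    graph: dict mapping each node to a list of distinct successor nodes.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    centrality = {v: 0.0 for v in nodes}

    for source in nodes:
        # Breadth-first search counting shortest paths from source.
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in nodes}   # predecessors on shortest paths
        sigma = {v: 0 for v in nodes}    # number of shortest paths
        dist = {v: -1 for v in nodes}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Accumulate dependencies, farthest nodes first.
        delta = {v: 0.0 for v in nodes}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    return centrality


# Hypothetical link graph among local entities of New York City.
links = {
    "Statue of Liberty": ["Manhattan"],
    "Manhattan": ["Central Park", "Brooklyn Bridge"],
    "Brooklyn Bridge": ["Manhattan"],
    "Central Park": [],
}
scores = betweenness_centrality(links)
print(scores["Manhattan"])
```

In this toy graph, Manhattan lies on every shortest path that passes through an intermediate node, so it receives the highest score while the other entities score zero.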

  • Semantic Overlap Measures

We use the hyperlink structure of Wikipedia to compute the semantic relatedness of a city and its local entities, using the following set-based measures of semantic overlap.

  1. Jaccard Index

The Jaccard Index is a symmetric, set-based measure that defines the similarity of two sets in terms of their overlap, normalized by their sizes. We use this measure to find the similarity between a city and its local entities.
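For two sets A and B, the Jaccard Index is |A ∩ B| / |A ∪ B|. The sketch below assumes each page is represented by the set of pages it links to. The source mentions only "the hyperlink structure", so this representation and the example link sets are hypothetical.

```python
def jaccard_index(a, b):
    """Jaccard Index of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical sets of outgoing links for a city page and an entity page.
city_links = {"Manhattan", "Brooklyn", "Queens", "Hudson River"}
entity_links = {"Manhattan", "Hudson River", "France"}
print(jaccard_index(city_links, entity_links))  # 2 shared / 5 total = 0.4
```

Because the measure is symmetric, swapping the two arguments gives the same value.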

  2. Tversky Index

The Tversky Index is an asymmetric, set-based similarity measure. The Jaccard Index treats a city and a local entity symmetrically, but a local entity generally represents only a part of the city. We therefore also use the Tversky Index, which gives a unidirectional measure of the similarity of the local entity to the city.
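In its general form, the Tversky Index of sets X and Y is |X ∩ Y| / (|X ∩ Y| + α|X − Y| + β|Y − X|). The source does not state which weights were used. The sketch below assumes α = 1 and β = 0, which measures how much of the entity's link set is contained in the city's link set. The link sets are hypothetical.

```python
def tversky_index(x, y, alpha=1.0, beta=0.0):
    """Tversky Index of x with respect to y.

    alpha weights elements only in x, beta weights elements only in y.
    The defaults (alpha=1, beta=0) are an assumption for illustration;
    alpha=beta=1 reduces to the Jaccard Index.
    """
    x, y = set(x), set(y)
    common = len(x & y)
    denominator = common + alpha * len(x - y) + beta * len(y - x)
    return common / denominator if denominator else 0.0


# Hypothetical link sets for a local entity and its city.
entity = {"Manhattan", "New York Harbor", "France"}
city = {"Manhattan", "New York Harbor", "Brooklyn", "Queens", "Bronx"}

print(tversky_index(entity, city))  # entity -> city: 2 / (2 + 1)
print(tversky_index(city, entity))  # city -> entity: 2 / (2 + 3)
```

The two directions give different values, which is the asymmetry the measure is chosen for: the entity is largely contained in the city, but the city is not largely contained in the entity.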

User Profile Generator

Location Predictor

Evaluation

References

People