Revision as of 19:09, 23 July 2012

Examining the Predictive Power of Different User Groups in Predicting 2012 U.S. Republican Presidential Primaries

Lu Chen, Wenbo Wang, and Amit P. Sheth

Introduction

Existing studies using social data to predict election results have focused on obtaining measures/indicators (e.g., mention counts or sentiment of a party or candidate) from social data to perform the prediction. They treat all users equally, and ignore the fact that social media users engage in elections in different ways and with different levels of involvement. A recent study <ref>Mustafaraj, E. and Finn, S. and Whitlock, C. and Metaxas, P.T.: Vocal minority versus silent majority: Discovering the opinions of the long tail. In: Proceedings of the IEEE 3rd International Conference on Social Computing, pp. 103--110 (2011)</ref> has shown that social media users from different groups (e.g., "silent majority" vs. "vocal minority") differ significantly in the content they generate and in their tweeting behavior. However, the effect of these differences on predicting election results has not yet been explored. For example, in our study, 56.07% of Twitter users who participate in the discussion of the 2012 U.S. Republican Primaries post only one tweet. Identifying the voting intent of these users could be more challenging than for users who post more tweets. Will such differences lead to different prediction performance? Furthermore, the users participating in the discussion may have different political preferences. Is it the case that a prediction based on right-leaning users will be more accurate than one based on left-leaning users, since these are the Republican Primaries? Exploring these questions can expand our understanding of social media based prediction, and shed light on using user sampling to further improve prediction performance.

Here, we study different groups of social media users who engage in the discussion of elections, and compare the predictive power of these user groups. Specifically, we chose the 2012 U.S. Republican Presidential Primaries on Super Tuesday among four candidates: Newt Gingrich, Ron Paul, Mitt Romney and Rick Santorum. We collected 6,008,062 tweets from 933,343 users talking about these four candidates in an eight-week period before the elections. All users are characterized along four dimensions: engagement degree, tweet mode, content type, and political preference. We first investigated the user categorization on each dimension, and then compared different groups of users on the task of predicting the results of Super Tuesday races in 10 states. Instead of using tweet volume or the overall sentiment of the tweet corpus as the predictor, we estimated the "vote" of each user by analyzing his/her tweets, and predicted the results based on "vote-counting". The results were evaluated in two ways: (1) the accuracy of predicting winners, and (2) the error rate between the predicted votes and the actual votes for each candidate.
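The vote-counting idea above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the `sentiment` function stands in for a tweet-level, target-dependent sentiment classifier (the paper uses its own method), and the rule of assigning each user to the candidate with the highest net positive score is an assumption made here for concreteness.

```python
from collections import Counter

CANDIDATES = ["gingrich", "paul", "romney", "santorum"]

def estimate_vote(user_tweets, sentiment):
    """Estimate a user's 'vote' as the candidate with the highest net
    positive sentiment over the user's tweets (illustrative rule).
    `sentiment(tweet, candidate)` should return +1, -1, or 0."""
    score = Counter()
    for tweet in user_tweets:
        for cand in CANDIDATES:
            if cand in tweet.lower():
                score[cand] += sentiment(tweet, cand)
    if not score or max(score.values()) <= 0:
        return None  # voting intent unclear; user contributes no vote
    return score.most_common(1)[0][0]

def predict_votes(tweets_by_user, sentiment):
    """Vote-counting: one estimated vote per user, then tally."""
    votes = Counter()
    for tweets in tweets_by_user.values():
        vote = estimate_vote(tweets, sentiment)
        if vote is not None:
            votes[vote] += 1
    return votes
```

The per-user vote makes every user count once, regardless of how many tweets he/she posts, which is exactly why this predictor behaves differently from raw tweet volume.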

User Categorization

Using the Twitter Streaming API, we collected tweets that contain the words "gingrich", "romney", "ron paul", or "santorum" from January 10th to March 5th (Super Tuesday was March 6th). In total, the dataset comprises 6,008,062 tweets from 933,343 users. In this section, we discuss user categorization along four dimensions, and study the participation behaviors of different user groups.

Categorizing Users by Engagement Degree

We use the number of tweets posted by a user to measure his/her engagement degree. The fewer tweets a user posts, the more challenging it is to predict the user's voting intent. An extreme example is predicting the voting intent of a user who posted only one tweet. Thus, we want to examine the predictive power of user groups with various engagement degrees. Specifically, we divided users into the following five groups: users who post only one tweet (very low), 2-10 tweets (low), 11-50 tweets (medium), 51-300 tweets (high), and more than 300 tweets (very high). Table I shows the distribution of users and tweets over the five engagement categories. We found that more than half of the users in the dataset belong to the very low group, which contributes only 8.71% of the tweet volume, while the very highly engaged group contributes 23.73% of the tweet volume with only 0.23% of all users. This raises the question of whether tweet volume is a proper predictor, given that a small group of users can produce a large share of the tweets.
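The five engagement groups can be computed directly from per-user tweet counts; a minimal sketch using the boundaries stated above:

```python
def engagement_group(num_tweets):
    """Bucket a user into one of the paper's five engagement levels
    by the number of tweets he/she posted."""
    if num_tweets <= 1:
        return "very low"
    if num_tweets <= 10:
        return "low"
    if num_tweets <= 50:
        return "medium"
    if num_tweets <= 300:
        return "high"
    return "very high"
```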

To further study the behaviors of users at different engagement levels, we examined the usage of hashtags and URLs in different user groups (see Table II). We found that users who are more engaged in the discussion use more hashtags and URLs in their tweets. Since hashtags and URLs are frequently used on Twitter as means of promotion, e.g., hashtags can be used to create trending topics, the usage of hashtags and URLs reflects users' intent to attract attention to the topics they discuss. The more engaged users show stronger such intent and are more involved in the election event. Specifically, only 22.95% of the tweets created by very lowly engaged users contain hashtags, while this proportion increases to 39.4% in the very high engagement group. In addition, the average number of hashtags per tweet (among tweets that contain hashtags) is 1.43 in the very low engagement group, compared with 2.68 for the very highly engaged users. Users who are more engaged also use more URLs, and generate fewer text-only tweets (containing no hashtag or URL). Later, we will see whether and how such differences among engagement groups lead to different prediction results.
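The two hashtag quantities reported above (share of tweets with at least one hashtag, and average hashtags per hashtag-bearing tweet) are straightforward to compute per group. A sketch under the assumption that each tweet is already tokenized and hashtags are the tokens starting with "#" (in practice, the hashtag entities returned by the Twitter API would be more reliable):

```python
def hashtag_stats(tweets):
    """For one user group: return (share of tweets containing a hashtag,
    average number of hashtags per tweet among tweets that have any).
    `tweets` is a list of token lists."""
    counts = [sum(1 for tok in t if tok.startswith("#")) for t in tweets]
    tagged = [n for n in counts if n > 0]
    share = len(tagged) / len(tweets) if tweets else 0.0
    avg = sum(tagged) / len(tagged) if tagged else 0.0
    return share, avg
```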

[[File:ElectionPrediction Table1&2.PNG|center]]

Categorizing Users by Tweet Mode

There are two main ways of producing a tweet: creating the tweet oneself (original tweet) or forwarding another user's tweet (retweet). Original tweets are considered to reflect the user's attitude; the reasons for retweeting, however, vary, e.g., to inform or entertain one's followers, or to be friendly to the tweet's creator, so retweets do not necessarily reflect the user's own thoughts. This may lead to different prediction performance between users who post more original tweets and users who mostly retweet, since the voting intent of the latter is more difficult to recognize.

Based on users' preference in generating their tweets, i.e., tweet mode, we classified users as original tweet-dominant, original tweet-prone, balanced, retweet-prone, or retweet-dominant. A user is classified as original tweet-dominant if less than 20% of all his/her tweets are retweets; each user in the retweet-dominant group has more than 80% of his/her tweets as retweets. In Table III, we illustrate the categorization, the user distribution over the five categories, and the tweet mode of users in different engagement groups. It is interesting to find that the original tweet-dominant group accounts for the largest proportion of users in every engagement group, and this proportion declines with increasing degree of user engagement (e.g., 55.32% of very lowly engaged users are original tweet-dominant, while only 31.89% of very highly engaged users are). It is also worth noting that a significant number of users (34.71% of all users) belong to the retweet-dominant group, whose voting intent might be difficult to detect.
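The five-point tweet-mode scale can be sketched as a function of a user's retweet share. The 20% and 80% cut-offs come from the text; the intermediate 40%/60% boundaries are an assumption here, chosen to split the scale evenly:

```python
def tweet_mode(num_retweets, num_total):
    """Classify a user on the five-point tweet-mode scale by the
    fraction of his/her tweets that are retweets."""
    r = num_retweets / num_total
    if r < 0.2:
        return "original tweet-dominant"
    if r < 0.4:
        return "original tweet-prone"   # assumed boundary
    if r <= 0.6:
        return "balanced"               # assumed boundary
    if r <= 0.8:
        return "retweet-prone"
    return "retweet-dominant"
```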

[[File:ElectionPrediction Table3.PNG|center]]

Categorizing Users by Content Type

Based on content, tweets can be classified into two classes -- opinion and information (i.e., subjective and objective). Studying the difference between the users who post more information and the users who are keen to express their opinions could provide us with another perspective in understanding the effect of using these two types of content in electoral prediction.

We first identified whether a tweet expresses a positive or negative opinion about an election candidate using the method proposed in a recent study<ref>Chen, L. and Wang, W. and Nagarajan, M. and Wang, S. and Sheth, A.P.: Extracting Diverse Sentiment Expressions with Target-dependent Polarity from Twitter. In: Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (2012)</ref>. Tweets that are positive or negative about any candidate are considered opinion tweets, and tweets that are neutral about all candidates are considered information tweets. We also used a five-point scale to classify users by whether they post more opinion or more information in their tweets: opinion-dominant, opinion-prone, balanced, information-prone, and information-dominant. Table IV shows the user distribution among all users, and among the users in different engagement groups, categorized by content type.

Users in the very low engagement group have only one tweet, so they belong to either the opinion-dominant (39%) or the information-dominant (61%) group. As users' engagement degree increases from low to very high, the proportions of opinion-dominant, opinion-prone, and information-dominant users dramatically decrease, from 11.09% to 0.05%, 11.75% to 0.42%, and 27.40% to 0.66%, respectively. In contrast, the proportions of balanced and information-prone users grow. In the high and very high engagement groups, the balanced and information-prone users together account for more than 95% of all users. This shows a tendency for more engaged users to post a mixture of the two content types, with similar proportions of opinion and information, or a larger proportion of information.
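Given tweet-level opinion/information labels, the content-type scale can be sketched the same way as the tweet-mode scale. The text does not state the band boundaries for this dimension, so the 20/40/60/80% cut-offs below are assumed by analogy with the tweet-mode categorization:

```python
def content_type(num_opinion, num_total):
    """Classify a user on the five-point content-type scale by the
    share of his/her tweets that are opinion tweets (assumed bands)."""
    s = num_opinion / num_total
    if s > 0.8:
        return "opinion-dominant"
    if s > 0.6:
        return "opinion-prone"
    if s >= 0.4:
        return "balanced"
    if s >= 0.2:
        return "information-prone"
    return "information-dominant"
```

Note that a user with a single tweet lands at either end of the scale (share 1.0 or 0.0), consistent with the observation above about the very low engagement group.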

[[File:ElectionPrediction Table4.PNG|center]]

Demonstration

Extracted Key Phrases for Several Events

The key phrase extraction work has been integrated into Twitris.

Event Ontology

Figure 2: An event ontology which defines concepts and relationships concerned with an event.

Accomplishments

  • I got the idea for this work while taking the Semantic Web course. I had been focusing on text mining and natural language processing, which can be seen as learning implicit knowledge from data and using this knowledge to derive target information from text, and had not worked or thought much about formal or explicit knowledge (e.g., data represented as RDF or OWL). Now I realize that my research can benefit from both, and this project is a starting point.

  • I have always been interested in research on relationships, including their representation, extraction, etc. In this project, I am starting to work on the relationships involved in the context of an event. Based on this work, I might look deeper into this area in the future.


References