Continuous Semantic Crawling Events

From Knoesis wiki
Revision as of 08:08, 25 December 2012 by Pavan (Talk | contribs) (Approach)

Abstract

The need to tap into the "wisdom of the crowd" via social networks in real time has already been demonstrated during critical events such as the Arab Spring and the recently concluded US Elections. As Twitter becomes a platform of choice for streaming event-related information in real time, we face several challenges related to filtering, real-time monitoring, and tracking the dynamic evolution of an event. We present a novel approach to continuously track an evolving event on Twitter by leveraging hashtags that are filtered using evolving background knowledge (Wikipedia). Our approach (1) collects evolving hashtags by adapting tag co-occurrence information; (2) exploits the semantics of events for selecting hashtags by monitoring and leveraging the corresponding Wikipedia event pages; and (3) filters tweets using hashtags that are determined to be semantically relevant to the event. We evaluated our approach on two recent events: the United States Presidential Elections 2012 and Hurricane Sandy. The results demonstrate that Wikipedia can be leveraged to determine, rank, and evolve a small, high-quality set of event-related hashtags in real time to filter a stream of event-relevant tweets.

Hashtag Analysis

We performed a preliminary analysis of hashtags before architecting a solution to this problem. The analysis addresses two questions:

  • How many hashtags contribute to retrieving the event-related tweets?
  • Can these hashtags be detected automatically?

To answer these questions, we used the datasets for two events from the Twitris<ref>A. Jadhav, H. Purohit, P. Kapanipathi, P. Ananthram, A. Ranabahu, V. Nguyen, P.N. Mendes, A.G. Smith, M. Cooney, and A. Sheth. Twitris 2.0: Semantically empowered system for understanding perceptions from social data. Semantic Web Challenge, 2010.</ref> system. The two events are (1) Occupy Wall Street (OWS) and (2) the Colorado Shooting (CMS). The details of the datasets are provided in the table below.

Dataset for Analysis from Twitris

Event   Tweets    Hashtags (Distinct)   Start Date   End Date
CMS     122062    192512 (12350)        7/20/12      9/10/12
OWS     6077378   15963209 (191602)     9/29/11      9/20/12
Total   6199440   16155721

How many hashtags contribute to retrieving the event-related tweets?

We analyzed the frequency of hashtags in the event-relevant tweets and found that the hashtag frequencies follow a power law<ref>Zipf, G.K. Human Behavior and the Principle of Least Effort, 1949</ref>, as shown in the figure below. Although many hashtags are involved in each event, as shown in the table above, far fewer are needed to index the whole dataset. In other words, the distinct hashtags, taken in descending order of frequency, that suffice as search queries to retrieve the whole dataset (Hashtag Queries) number (1) 7763 for CMS and (2) 21314 for OWS. Most of the remaining hashtags co-occur with one of these Hashtag Queries. However, less than 1% of these Hashtag Queries have a significant impact on retrieving the tweets: on average, more than 85% of the tweets can be retrieved using the top one percent of the Hashtag Queries. We refer to these hashtags as Impacting Hashtags.

Figure: Power Law of Hashtag Frequencies
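
The selection of Hashtag Queries described above can be sketched roughly as follows. This is only an illustration under our own assumptions: each tweet is represented as a collection of hashtags, a hashtag's frequency is the number of tweets that contain it, and a hashtag becomes a query only if it retrieves at least one tweet not yet retrieved by a more frequent hashtag. The function names `hashtag_queries` and `coverage` are hypothetical and do not come from Twitris.

```python
from collections import defaultdict


def hashtag_queries(tweets):
    """Return hashtags, in descending frequency order, that together
    retrieve every tweet carrying at least one hashtag."""
    index = defaultdict(set)  # hashtag -> ids of tweets containing it
    for i, tags in enumerate(tweets):
        for tag in tags:
            index[tag].add(i)

    uncovered = {i for i, tags in enumerate(tweets) if tags}
    queries = []
    # Most frequent first; ties are broken alphabetically for determinism.
    for tag in sorted(index, key=lambda t: (-len(index[t]), t)):
        if not uncovered:
            break
        if index[tag] & uncovered:  # the tag retrieves something new
            queries.append(tag)
            uncovered -= index[tag]
    return queries


def coverage(tweets, queries):
    """Fraction of hashtagged tweets retrieved by the given queries."""
    wanted = set(queries)
    tagged = [set(tags) for tags in tweets if tags]
    if not tagged:
        return 0.0
    return sum(1 for tags in tagged if tags & wanted) / len(tagged)
```

On a real dataset, `coverage(tweets, queries[:max(1, len(queries) // 100)])` would give the share of tweets retrieved by the top one percent of the Hashtag Queries, which is the quantity reported above as exceeding 85% on average.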

Can these hashtags be detected automatically?

We employed the tag co-occurrence technique to analyze the Impacting Hashtags. The tag co-occurrence networks for both events are shown in the figures below. We found that each Impacting Hashtag that is relevant to the event co-occurs with at least one other Impacting Hashtag.

Ows-cluster.png
Cms-cluster.png
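
A tag co-occurrence network of this kind can be built with a few lines of code. The sketch below is a minimal illustration, assuming that two hashtags are linked whenever they appear in the same tweet and that the edge weight is the number of such tweets. The helper `isolated_tags` (a hypothetical name) lists the Impacting Hashtags that do not co-occur with any other Impacting Hashtag.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tweets):
    """Count how often each unordered pair of hashtags shares a tweet."""
    edges = Counter()
    for tags in tweets:
        # Sorting gives every pair one canonical (a, b) orientation.
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges


def isolated_tags(edges, impacting):
    """Impacting hashtags with no edge to another impacting hashtag."""
    impacting = set(impacting)
    linked = set()
    for a, b in edges:
        if a in impacting and b in impacting:
            linked.update((a, b))
    return impacting - linked
```

Under the observation above, `isolated_tags` would be expected to return mostly hashtags that are frequent but not relevant to the event.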

Intuitively, the figures above suggest that the more relevant hashtags for the event lie towards the center and are better clustered than the hashtags at the periphery. To formalize this, we used the Average Clustering Coefficient (AvgCC)<ref>S. Wasserman and K. Faust. Social network analysis: Methods and applications, volume 8. Cambridge university press, 1994.</ref> of the hashtag co-occurrence networks. We computed the AvgCC repeatedly while growing the network by 0.1% of the top hashtags at a time. This analysis shows that the top hashtags are better clustered with each other than the networks obtained after adding hashtags with lower frequencies. Therefore, starting with a popular hashtag for an event, we can find the other popular hashtags more easily than the less frequent ones. The AvgCC analysis is shown in the figure below.

Avgcc.png
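
The AvgCC analysis can be reproduced in outline with the sketch below. It assumes the usual definition of the local clustering coefficient, 2T / (k(k - 1)) for a node with k neighbours and T links among them, with nodes of degree below two counted as zero. The function names and the `step` parameter are ours, and the network is rebuilt for every prefix of the ranking, which is simple but slow on large datasets.

```python
from collections import Counter
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph.
    adj maps each node to the set of its neighbours."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # such nodes contribute a coefficient of zero
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def induced_adjacency(tweets, keep):
    """Co-occurrence network restricted to the hashtags in keep."""
    keep = set(keep)
    adj = {tag: set() for tag in keep}
    for tags in tweets:
        for a, b in combinations(set(tags) & keep, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def avgcc_curve(tweets, step=0.001):
    """List of (number of top hashtags, AvgCC), growing the network
    by a fraction step of all distinct hashtags at a time."""
    freq = Counter(tag for tags in tweets for tag in set(tags))
    ranked = sorted(freq, key=lambda t: (-freq[t], t))
    size = max(1, round(step * len(ranked)))
    return [
        (k, average_clustering(induced_adjacency(tweets, ranked[:k])))
        for k in range(size, len(ranked) + 1, size)
    ]
```

The default `step=0.001` corresponds to the 0.1% increment mentioned above; a larger value is convenient for small test inputs.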

Approach

Building on the hashtag analysis in the previous section, we present a novel approach to detect hashtags in real time in order to continuously monitor an event. To detect semantically relevant hashtags in real time, we need evolving background knowledge that is updated with the latest happenings of the event. We therefore use Wikipedia, treated as a graph structure that is continuously updated by the crowd as the event changes. The figure below shows the architecture of our approach. We use tag co-occurrence in streaming mode to detect candidate tags, which must then be filtered further to determine whether they are event-relevant.

The whole approach can be explained in two phases: (1) processing background knowledge (Event Wiki Processor) and (2) determining the semantic similarity of hashtags (Hashtag Analyzer). Once the background knowledge has been processed from the Wikipedia event page, a stream is set up with manually input hashtags as the initial filtering hashtag set. The system then adopts an expand-and-reduce paradigm to find hashtags to add to the filtering hashtag set, as shown in Figure 5. First, we expand our choice of hashtags (candidate tags) by applying the tag co-occurrence technique to the input hashtags in the stream; we then reduce these candidate tags to only the relevant ones by determining each tag's semantic similarity with the event. The semantic similarity is determined using the background knowledge of the corresponding event on Wikipedia. Finally, the hashtags used for filtering are updated so that the stream delivers more timely and relevant tweets. As shown in Figure 5, this process tracks the evolving event. In addition, the hashtags in the filtering hashtag set are periodically checked for semantic relevance to the event, so that outdated hashtags that crawl tweets irrelevant to the event are removed.
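
One iteration of the expand-and-reduce paradigm might look like the sketch below. The similarity measure is an assumption made purely for illustration: each hashtag is represented by the words of the tweets that carry it, the event by weighted terms taken from its Wikipedia page, and the two are compared by cosine similarity against a fixed threshold. The page does not specify the actual measure, and all names here (`expand`, `tag_profile`, `update_filter`, `threshold`) are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term -> weight mappings."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand(tweets, filter_tags):
    """Expand: count hashtags co-occurring with a filtering hashtag.
    Each tweet is a (list of words, set of hashtags) pair."""
    candidates = Counter()
    for _words, tags in tweets:
        if tags & filter_tags:
            candidates.update(tags - filter_tags)
    return candidates


def tag_profile(tweets, tag):
    """Bag of words of all tweets that carry the given hashtag."""
    profile = Counter()
    for words, tags in tweets:
        if tag in tags:
            profile.update(words)
    return profile


def update_filter(tweets, filter_tags, event_terms, threshold=0.3):
    """Reduce: keep candidate and current hashtags whose profile is
    similar enough to the event terms (assumed similarity measure)."""
    to_score = set(expand(tweets, filter_tags)) | set(filter_tags)
    return {
        tag for tag in to_score
        if cosine(tag_profile(tweets, tag), event_terms) >= threshold
    }
```

Because the current filtering hashtags are re-scored together with the candidates, the same call also covers the periodic relevance check: a hashtag whose tweets no longer resemble the event terms drops out of the returned set.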

Processing Wikipedia Event Page

Filtering Semantically Relevant Hashtags

Continuous-flow.png

Evaluation

References

<references />