Revision as of 06:20, 25 December 2012

Abstract

The need to tap into the "wisdom of the crowd" via social networks in real time has already been demonstrated during critical events such as the Arab Spring and the recently concluded US elections. As Twitter becomes a platform of choice for streaming event-related information in real time, we face several challenges related to filtering, real-time monitoring, and tracking the dynamic evolution of an event. We present a novel approach to continuously track an evolving event on Twitter by leveraging hashtags that are filtered using evolving background knowledge (Wikipedia). Our approach (1) collects evolving hashtags by adapting tag co-occurrence information; (2) exploits the semantics of events for selecting hashtags by monitoring and leveraging the corresponding Wikipedia event pages; and (3) filters tweets using hashtags that are determined to be semantically relevant to the event. We evaluated our approach on two recent events: the United States Presidential Elections 2012 and Hurricane Sandy. The results demonstrate that Wikipedia can be leveraged to determine, rank, and evolve a small set of high-quality event-related hashtags in real time to filter the event-relevant tweet stream.

Hashtag Analysis

We performed a preliminary analysis of hashtags before architecting a solution to this problem. The analysis addresses two questions:

How many hashtags contribute to retrieving the event-related tweets?
Can these hashtags be detected automatically?

To answer these questions, we used datasets for two events from the Twitris system: (1) Occupy Wall Street (OWS) and (2) Colorado Shooting (CMS). Details of the datasets are provided in the table below.

Dataset for Analysis from Twitris
Event	Tweets	Hashtags (Distinct)	Start Date	End Date
CMS	122062	192512 (12350)	7/20/12	9/10/12
OWS	6077378	15963209 (191602)	9/29/11	9/20/12
Total	6199440	16155721

How many hashtags contribute to retrieving the event-related tweets?

We analyzed the frequency of hashtags in the event-relevant tweets and found that the hashtag frequencies follow a power law <ref>Zipf, G.K. Human Behavior and the Principle of Least Effort, 1949</ref>, as shown in the figure below. Although many hashtags are involved in each event, as shown in the table above, far fewer are needed to index the whole dataset. In other words, taking the distinct hashtags in descending order of frequency, the number sufficient as search queries to retrieve the whole dataset (Hashtag Queries) is (1) 7763 for CMS and (2) 21314 for OWS. The majority of the remaining hashtags co-occur with one of these Hashtag Queries. However, less than 1% of these Hashtag Queries have a significant impact on retrieving the tweets: on average, more than 85% of the tweets can be retrieved using the top one percent of Hashtag Queries. We refer to these hashtags as Impacting Hashtags.
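The selection described above can be illustrated with a minimal Python sketch. It is not the code used for the analysis; it assumes each tweet is represented as a set of hashtags, and the function names `hashtag_queries` and `coverage` are hypothetical. Hashtags are visited in descending order of frequency, and a hashtag is kept as a Hashtag Query only if it retrieves at least one tweet that no earlier query retrieved.

```python
from collections import Counter
import math


def hashtag_queries(tweets):
    """Return hashtags, most frequent first, that together retrieve every tagged tweet.

    `tweets` is a list of sets of hashtags (an assumed representation).
    """
    freq = Counter(tag for tags in tweets for tag in tags)
    # Descending frequency; ties broken alphabetically for determinism.
    ranked = sorted(freq, key=lambda t: (-freq[t], t))

    # Inverted index: hashtag -> ids of the tweets that contain it.
    index = {}
    for i, tags in enumerate(tweets):
        for tag in tags:
            index.setdefault(tag, set()).add(i)

    covered, queries = set(), []
    for tag in ranked:
        newly_retrieved = index[tag] - covered
        if newly_retrieved:  # keep the tag only if it adds unseen tweets
            queries.append(tag)
            covered |= newly_retrieved
    return queries


def coverage(tweets, queries, fraction):
    """Share of tagged tweets retrieved by the top `fraction` of the queries."""
    tagged = [tags for tags in tweets if tags]
    if not tagged or not queries:
        return 0.0
    k = max(1, math.ceil(len(queries) * fraction))
    top = set(queries[:k])
    return sum(1 for tags in tagged if tags & top) / len(tagged)


if __name__ == "__main__":
    sample = [{"ows", "nyc"}, {"ows"}, {"ows", "occupy"},
              {"occupy"}, {"nyc", "police"}, {"p2"}]
    queries = hashtag_queries(sample)
    print(queries, coverage(sample, queries, 0.25))
```

On real data, calling `coverage(tweets, queries, 0.01)` would give the share of tweets retrieved by the top one percent of Hashtag Queries, which is the quantity reported above.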

Can these hashtags be detected automatically?

We employed a tag co-occurrence mechanism to analyze the Impacting Hashtags. The tag co-occurrence networks for both events are shown in the figures below.
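The text does not spell out how the co-occurrence networks were built, so the following Python sketch shows only one common construction, under stated assumptions: two hashtags are linked when they appear in the same tweet, and the edge weight is the number of tweets in which they appear together. The names `cooccurrence_edges` and `neighbors` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tweets):
    """Count, for every unordered pair of hashtags, the tweets containing both.

    `tweets` is an iterable of hashtag collections (an assumed representation).
    """
    edges = Counter()
    for tags in tweets:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges


def neighbors(edges, seed, min_weight=1):
    """Hashtags co-occurring with `seed` at least `min_weight` times, heaviest first."""
    linked = Counter()
    for (a, b), weight in edges.items():
        if weight < min_weight:
            continue
        if a == seed:
            linked[b] = weight
        elif b == seed:
            linked[a] = weight
    return linked.most_common()


if __name__ == "__main__":
    sample = [["ows", "nyc"], ["ows", "nyc", "occupy"], ["occupy", "ows"], ["p2"]]
    edges = cooccurrence_edges(sample)
    print(neighbors(edges, "ows"))
```

Starting from a known event hashtag as the seed, `neighbors` lists candidate hashtags ranked by how often they co-occur with it; the `min_weight` threshold is an illustrative way to drop incidental pairings.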

Approach

Evaluation


Difference between revisions of "Continuous Semantic Crawling Events"
