Hierarchical Interest Graph

From Knoesis wiki

Revision as of 16:34, 16 October 2013

Abstract

Industry and researchers have identified numerous ways to monetize microblogs for personalization and recommendation. A common challenge across these different works is identification of user interests. Although techniques have been developed to address this challenge, a flexible approach that spans multiple levels of granularity in user interests has not been forthcoming.

In this work, we focus on exploiting hierarchical semantics of concepts to infer richer user interests, expressed as a Hierarchical Interest Graph. To create such graphs, we utilize a user's Twitter data to first ground potential user interests in structured background knowledge such as the Wikipedia Category Graph. We then use an adaptation of spreading activation theory to assign a user interest score (or weight) to each category in the hierarchy. The Hierarchical Interest Graph comprises not only the user's explicit interests determined from Twitter, but also implicit interest categories inferred from the background knowledge source. We demonstrate the effectiveness of our approach through a user study, which shows that on average approximately eight of the top ten weighted categories in the graph are relevant to a given user's interests.

Approach

The goal of our approach is to construct a Hierarchical Interest Graph (HIG) for a Twitter user. To accomplish this, our system performs the following steps:

  1. The Hierarchy Pre-processor converts the Wikipedia Category Graph (WCG) into a hierarchy.
  2. The User Interests Generator spots and scores the Primitive Interests from a user's tweets.
  3. The Interest Hierarchy Generator maps the scored Primitive Interests onto the Wikipedia Hierarchy to infer the user's HIG.

Step 1 of our approach is performed if and when there are updates to the Wikipedia Category Graph. Steps 2 and 3 are performed on the set of tweets of each user to generate the corresponding HIG. The figure below illustrates our approach.

(Figure: Overview of the approach)

Implementation

The system is implemented in Java. It is modular, allowing different ontologies, entity resolution services (Zemanta, Alchemy, and DBpedia Spotlight), and spreading activation functions (Bell, Bell Log, and Priority Intersect) to be plugged in. Our present implementation uses Wikipedia and Zemanta with all three activation functions. Each of these is explained in the sections below.

Hierarchy Preprocessor

In order to generate a Hierarchical Interest Graph, it was necessary to transform the Wikipedia Category Graph into a hierarchy. First, creating the Wikipedia Category Graph itself was non-trivial; in the section below we detail the challenges involved. We then explain the transformation, which comprises assigning a hierarchical level to each category and removing the non-hierarchical links.

Preparing Wikipedia Category Graph

In total, the Wikipedia Category Graph comprises 884,838 categories with 2,074,173 category-subcategory relationships among them. Representing and creating it as a directed graph was difficult. We started with the DBpedia category system, which is freely available for download. Once we constructed a directed graph of the category system, we realized that around 4K categories were disconnected from the graph. We confirmed the disconnections by extracting the category system directly from the then-latest Wikipedia dump<ref>http://en.wikipedia.org/wiki/Wikipedia:Database_download</ref>, dated April 2013. In order to keep the category graph up to date, for each of the disconnected categories we used the Wikipedia API to retrieve its hierarchy. We found that approximately 43% of the disconnected categories had connections to the category system, whereas the remaining 57% were not present in Wikipedia. A likely explanation is that the disconnected categories were deleted from Wikipedia after the creation of DBpedia version 3.8 or the Wikipedia dump. As shown in Table 1 (first row), we ended up with 4.6M articles and 884,838 categories with 2.074M links among them.

Wikipedia administration<ref>http://en.wikipedia.org/wiki/Category:Wikipedia_administration</ref> includes a set of categories in the category structure for maintaining Wikipedia. However, these categories are not necessary for our purpose and are hence excluded from the category structure. Categories are filtered based on the following strings in their labels<ref>S. P. Ponzetto and M. Strube. Deriving a large scale taxonomy from wikipedia. In Proceedings of the 22nd national conference on Artificial intelligence - Volume 2, AAAI’07, pages 1440–1445. AAAI Press, 2007.</ref>: wikipedia, wikiprojects, lists, mediawiki, template, user, portal, categories, articles, pages. In total, around 64,362 categories with 151,732 category-subcategory relationships were removed from the category graph of Wikipedia. The category structure does not contain any of the hidden categories assigned by Wikipedia.
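The label-based filter above can be sketched as follows. This is an illustrative sketch in Java (the implementation language), not the authors' code; the class and method names are ours, and we assume a simple case-insensitive substring match against the marker strings listed above.

```java
import java.util.*;

public class CategoryFilter {
    // Marker substrings (from Ponzetto & Strube) that identify
    // Wikipedia-administration categories to be excluded.
    private static final List<String> ADMIN_MARKERS = Arrays.asList(
        "wikipedia", "wikiprojects", "lists", "mediawiki",
        "template", "user", "portal", "categories", "articles", "pages");

    // Returns true if the category label contains any administrative marker.
    public static boolean isAdminCategory(String label) {
        String lower = label.toLowerCase();
        for (String marker : ADMIN_MARKERS) {
            if (lower.contains(marker)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAdminCategory("Wikipedia administration")); // true
        System.out.println(isAdminCategory("Major League Baseball"));    // false
    }
}
```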

Hierarchy Assignment

Transforming to a hierarchy

The Wikipedia Category Graph contains cycles, and hence determining the abstractness of each category in the graph is non-trivial. For example, automatically determining whether Category:Baseball is more abstract than Category:Major League Baseball is not easy in the Wikipedia Category Graph. Therefore, we first use a rooted spanning tree algorithm based on the hypothesis that the shortest distance from the root to a category is the hierarchical level (abstractness) of that category. The root node from Wikipedia is Category:Main Topic Classifications. As shown in the Figure, starting from the root node (node A) we assign the hierarchical level of each node based on its distance to the root. Then we remove the non-hierarchical links: those that point from a specific node to a more abstract node. For example, node E with hierarchical level 4 in the figure has a link to node B with hierarchical level 2, which would make B a subcategory of E. This contradicts our hypothesis, hence we delete such links. The outcome of this process is a hierarchy with a height of 15, containing 802,194 categories with 1,177,558 links among them.
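The two-phase transformation above (level assignment by shortest distance from the root, then pruning of upward links) can be sketched as follows. This is a minimal illustration, not the actual implementation; it assumes the category graph is held as an adjacency map from a category to its subcategories, and the names `assignLevels` and `removeNonHierarchicalLinks` are ours.

```java
import java.util.*;

public class HierarchyBuilder {
    // Assigns each category its hierarchical level: the shortest distance
    // from the root (Category:Main Topic Classifications), found by BFS.
    public static Map<String, Integer> assignLevels(
            Map<String, List<String>> subcats, String root) {
        Map<String, Integer> level = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        level.put(root, 1);
        queue.add(root);
        while (!queue.isEmpty()) {
            String cat = queue.poll();
            for (String sub : subcats.getOrDefault(cat, Collections.emptyList())) {
                if (!level.containsKey(sub)) {      // first visit = shortest path
                    level.put(sub, level.get(cat) + 1);
                    queue.add(sub);
                }
            }
        }
        return level;
    }

    // Removes non-hierarchical links: edges pointing from a deeper (more
    // specific) category to one at the same or a shallower (more abstract) level.
    public static void removeNonHierarchicalLinks(
            Map<String, List<String>> subcats, Map<String, Integer> level) {
        for (Map.Entry<String, List<String>> e : subcats.entrySet()) {
            int parentLevel = level.getOrDefault(e.getKey(), Integer.MAX_VALUE);
            e.getValue().removeIf(sub ->
                level.getOrDefault(sub, Integer.MAX_VALUE) <= parentLevel);
        }
    }
}
```

On the toy graph from the figure, node E at level 4 linking back to node B at level 2 would be dropped by the second pass, while ordinary parent-to-child links survive.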

User Interest Generator

In this module we generate the primitive interests of the user and score them to reflect the extent of the user's interest. We take the user's tweets and perform entity resolution on them. The entities (Wikipedia articles) extracted from the tweets are considered to be the interests of the user. Since this is not the main task of our work, we use an external web service. Although our system is modular and can use any external web service, we use Zemanta because of its precision (58%) and the rate limit the web service provides (10,000 requests per day). Details of the evaluation of entity resolution web services can be found in the work by Derczynski et al.<ref>L. Derczynski, D. Maynard, N. Aswani, and K. Bontcheva. Microblog-genre noise and impact on semantic annotation accuracy. In Proceedings of the 24th ACM Conference on Hypertext and Social Media, HT ’13, pages 21–30, New York, NY, USA, 2013. ACM.</ref> We have also implemented a dictionary-based Trie (Ternary Interval Search Tree) extractor that spots Wikipedia entities. The Trie extractor was implemented to overcome the rate-limit problem and enable large-scale analysis. However, due to its low precision (31%), we decided to use Zemanta.

To score the entities, we compute a simple normalized frequency of the entities mentioned by the user. The normalization is done using the most frequently occurring entity in the user's tweets; therefore each entity has an interest score between 0 and 1. We term the interests generated by this module Primitive Interests.
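The normalized-frequency scoring can be sketched as below. This is an illustrative sketch (class and method names are ours), assuming the input is the flat list of entity mentions resolved from a user's tweets.

```java
import java.util.*;

public class InterestScorer {
    // Scores each entity by its mention count in the user's tweets,
    // normalized by the count of the most frequent entity, so every
    // Primitive Interest gets a score in (0, 1].
    public static Map<String, Double> score(List<String> mentionedEntities) {
        Map<String, Integer> freq = new HashMap<>();
        for (String e : mentionedEntities) freq.merge(e, 1, Integer::sum);
        int max = Collections.max(freq.values());
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet())
            scores.put(e.getKey(), e.getValue() / (double) max);
        return scores;
    }
}
```

For example, a user mentioning "Baseball" twice and "Java" once would get scores 1.0 and 0.5 respectively.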

Interest Hierarchy Generation

Given the primitive interests (User Interest Generator) and the Wikipedia Hierarchy (Hierarchy Preprocessor), in this module we generate the subgraph (hierarchy) of user interests. The hierarchy comprises categories with scores that represent the interest level of the user. First, the primitive user interests are linked to the appropriate categories in the Wikipedia Hierarchy. From there, it is not difficult to infer the categories of the primitive interests to obtain the hierarchy. However, propagating the scores from the primitive interests so as to select the most appropriate nodes in the hierarchy is a non-trivial problem. In order to solve this problem, we implemented an adaptation of Spreading Activation Theory<ref>M. R. Quilian. Semantic Memory. In: M. Minski (ed.). Semantic Information Processing. MIT Press, Cambridge, MA, 1968.</ref>. We use a breadth-based spreading activation where the interest score propagates from nodes at higher hierarchical levels (more specific) to nodes at lower hierarchical levels (more abstract), constrained by an activation function. We constrained the activation based on different parameters, including:

  • No Constraint: propagating the scores of primitive interests up the hierarchy without restriction.
  • Node distribution in the hierarchy: a few nodes with many categories at their child hierarchical level received disproportionately high scores, which were then propagated further up. We experimented with both the raw distribution of nodes and a log-scale distribution.
  • Preferential Path: since each category has many parent categories, we use the Wikipedia category order (left to right) to prioritize the most suitable category.
  • Intersect Booster: boosting categories that lie at the intersection of the user's primitive interests, with the intuition that such categories form hubs of interest, improved performance.

Combinations of these parameters, realized as activation functions, are experimented with and evaluated as explained in the Evaluation section.

Evaluation

The evaluation was done with three variations of the spreading activation used in our approach. The three activation functions are as follows:

  • Bell: normalizes the propagated value based on the number of nodes at the next level of specificity.
  • Bell Log: builds upon Bell by softening the reduction with a log-based scaling.
  • Priority Intersect: builds upon Bell Log by prioritizing the most suitable categories and boosting entity-intersecting nodes.
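The Bell and Bell Log decay steps can be sketched as below. The exact formulas are not given in this page, so this is only our reading of the descriptions above: Bell divides the propagated score by the number of nodes at the next level, and Bell Log divides by the logarithm of that count instead (the +1 smoothing is our addition, to avoid dividing by log(1) = 0).

```java
public class Activation {
    // Bell: the score a node passes to the next (more abstract) level is
    // normalized by how many nodes populate that level.
    public static double bell(double score, int nodesAtNextLevel) {
        return score / nodesAtNextLevel;
    }

    // Bell Log: same normalization, but on a log scale, so heavily
    // populated levels are penalized less harshly than under Bell.
    public static double bellLog(double score, int nodesAtNextLevel) {
        return score / Math.log(nodesAtNextLevel + 1); // +1 avoids log(1) = 0
    }
}
```

Under this sketch, Bell Log always retains more of the score than Bell for the same level size, which matches its description as "reducing the impact" of the normalization.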

In order to evaluate our system we conducted a user study. Attracting participants was difficult: although 40 people agreed to participate, 37 of them responded. We created a set of the top 50 interest categories produced by each of the above-mentioned variations of our approach. The users marked yes/no/maybe based on their level of interest in each category.

User Study Analysis

The user study statistics are shown in the table below. Following the statistics, we discuss some analysis of the user study data.

User Study Stats

          Users   Tweets   Entities   Distinct Entities   Tweets with Entities   Interest Categories
Total     37      31,927   29,146     13,150              16,464                 111,535
Average   -       864      787        355                 445                    3,014

Distribution of Resolved Entities

Distribution of Entities from User Study

As shown in the table above, we were able to resolve around 30K entities using Zemanta<ref>http://developer.zemanta.com/</ref>. The distribution of these extracted entities is shown in the figure on the left (Media:Entity-distribution.png). The graph is on a log-log scale, with the x-axis representing the rank of an entity based on its occurrence and the y-axis representing its frequency of occurrence among the 37 users. Not surprisingly, the distribution follows a power law<ref>Adamic, Lada A. "Zipf, power-laws, and pareto - a ranking tutorial." Xerox Palo Alto Research Center, Palo Alto, CA.</ref>.

Users-Tweets-Entities-Categories Distribution

(Figure: Users-Tweets-Entities-Categories distribution from the user study)

In the figure on the right, we can see the distribution based on users. The x-axis represents users ordered by number of tweets (ascending). The secondary y-axis is for Categories in the Hierarchy, whereas the primary y-axis is for the rest of the series in the graph. The graph comprises the distributions of the following:

  1. Number of tweets of each user
  2. Tweets comprising at least one entity, per user
  3. Entities per user
  4. Distinct entities per user
  5. Categories of interest generated by our best variation (Priority Intersect)

We can derive some insights from the figure related to our approach:

  • The total number of categories generated by the system does not increase proportionally with the number of tweets of a user. However, we can see a correlation between the number of distinct entities of each user and the number of categories generated for that user. This shows that the number of interest categories inferred from the Wikipedia Hierarchy (a bigger sub-graph) depends on the diversity of topics the user discusses in his/her tweets.

Results and Analysis

Evaluation of the three activation functions using Graded Precision, Mean Average Precision, Mean Reciprocal Rank

Comparison of Hierarchical Level Distribution

Average Frequency of Categories based on Hierarchical Levels at Top-50

In this analysis, we examine the hierarchical levels of the top-50 categories generated by each of our approaches. As shown in the figure to the left, Bell Log and Priority Intersect exhibit a bell curve, with the maximum number of spotted categories falling between hierarchical levels 4-7. In contrast, Twopics<ref>Michelson, Matthew, and Sofus A. Macskassy. "Discovering users' topics of interest on twitter: a first look." Proceedings of the fourth workshop on Analytics for noisy unstructured text data. ACM, 2010.</ref>, an early implementation for generating user interest categories from Wikipedia, ranks the most abstract categories higher. As we can see, Twopics ranks categories of levels 2-3 most frequently in its top-50. This is similar to our no-weight, no-decay approach.

Another aspect to note is that Wikipedia has around 36 categories at Level 2 (the most abstract categories of interest). Twopics on average ranks 50% (18/36) of these most abstract Wikipedia categories in its top-50 as user interests, which is intuitively unlikely.

Datasets

Wikipedia Category System

Dataset used in our approach: In order to generate the Wikipedia Category Graph, we first started with the DBpedia Category System, which can be downloaded from the following links: Dbpedia-Category-System and Dbpedia-Instance-Categories, with details on the Dbpedia 3.9 downloads page. The Dbpedia-Category-System comprises links between categories, and the Dbpedia-Instance-Categories comprises links between Wikipedia articles and categories. Once we found that the Dbpedia-Category-System was incomplete, we used the Wikipedia dump to re-confirm the disconnected categories. The Wikipedia dump can be downloaded from this link, with details on the Wikipedia database page. Finally, we used the Wikipedia API to complete the Wikipedia Category System. This Wikipedia Category Graph was then used to generate the Wikipedia Hierarchy that forms the base of each Hierarchical Interest Graph generated by our approach.

Dataset generated by the approach: The list of Wikipedia admin categories that are irrelevant in the Wikipedia Category System can be downloaded here. The admin categories amounted to around 67K categories. Further, the Wikipedia Hierarchy generated by our approach, with the priority of each link, can be downloaded as a tab-separated file. The format is <subcategory><tab><category><tab><priority>.
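A line of the published tab-separated hierarchy file can be read as sketched below. This is an illustrative parser, not part of the released dataset or code; it assumes each line follows the <subcategory><tab><category><tab><priority> format exactly, with priority as an integer.

```java
public class HierarchyFileParser {
    // One edge of the hierarchy file: subcategory -> category, with the
    // left-to-right Wikipedia category order encoded as a priority.
    public static class Edge {
        public final String subcategory;
        public final String category;
        public final int priority;

        Edge(String subcategory, String category, int priority) {
            this.subcategory = subcategory;
            this.category = category;
            this.priority = priority;
        }
    }

    // Parses a single tab-separated line of the hierarchy file.
    public static Edge parseLine(String line) {
        String[] parts = line.split("\t");
        return new Edge(parts[0], parts[1], Integer.parseInt(parts[2]));
    }
}
```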

Entity Resolution Evaluation Dataset

The evaluation of entity resolution, specifically for the Trie extractor was done using the dataset created by Meij et al<ref> Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. 2012. Adding semantics to microblog posts. In Proceedings of the fifth ACM international conference on Web search and data mining (WSDM '12). ACM</ref>.

References

<references />

People