Domain Specific Graph Selection From Linked Open Data


Abstract

With the Linked Open Data (LOD) initiative, there has been a great deal of work in developing a variety of open knowledge graphs freely accessible on the Web, such as DBpedia, YAGO and Freebase. Recently, the focus has shifted towards leveraging knowledge graphs as a background knowledge source for various applications. Knowledge graphs can provide a valuable source of information to improve tasks such as recommendation, (semantic) similarity calculation and named entity disambiguation. Knowledge graphs encode semantics with explicit relationships between entities, which these applications exploit. For example, a recommendation system recommends items to a user based on the relatedness of items the user has already accessed to those the user has not. For this purpose, existing approaches in the literature consider the neighborhood graph of the items within a predefined number of hops. However, for a given domain, not all entities in this neighborhood are relevant. For instance, in the movie domain, the military unit and death place of a movie's director are significantly less important than other movies directed, or awards won, by the same director. In this work, we propose a schema-driven probabilistic approach that selects a domain-specific subgraph by scoring and ranking the properties and classes of the neighborhood graph given the context of the domain (e.g., movie). We evaluate our approach on the MovieLens dataset with the DBpedia knowledge graph to show its effectiveness.

Introduction

Contributions

TODO (Sarasi): Write the contributions better.

  • Propose a method to create a domain-specific subgraph given the DBpedia type that represents the domain
  • Evaluate the domain-specific subgraph in the movie domain to assess its effectiveness for movie recommendation

Approach

TODO: define the subgraph properly here. The proposed approach primarily exploits the schema of the knowledge graph to identify the candidate hop types and properties for the subgraph, and then leverages instance statistics to rank and filter the domain-specific hop types and properties. We use pointwise mutual information (PMI) as the measure to rank properties in the context of the domain of interest, based on the co-occurrence of properties with the domain type (e.g., movie).

Rank Properties

  1. Rank first-hop properties

$$\mathrm{PMI}(p_{1i}, t_1) = \log \frac{P(p_{1i}, t_1)}{P(p_{1i}) \, P(t_1)}$$

$$P(t_1) = \frac{\#\{\text{instances of type } t_1\}}{\sum_{t} \#\{\text{instances of type } t\}}$$

$$P(p_{1i}) = \frac{\#\{\text{triples with property } p_{1i}\}}{\#\{\text{all triples}\}}$$

$$P(p_{1i}, t_1) = \frac{\#\{\text{triples with property } p_{1i} \text{ whose subject or object is of type } t_1\}}{\#\{\text{all triples}\}}$$
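As a concrete illustration, the following is a minimal Python sketch of the first-hop PMI computation, assuming the raw counts have already been collected from the knowledge graph (e.g., via SPARQL COUNT queries against a DBpedia endpoint). The function name and the example counts are hypothetical, not part of our approach.

```python
import math

def first_hop_pmi(n_triples_p_and_t, n_triples_p, n_total_triples,
                  n_instances_t, n_total_instances):
    """PMI of a first-hop property p1i with the domain type t1.

    All arguments are raw counts assumed to be pre-computed from the
    knowledge graph (e.g., with SPARQL COUNT queries).
    """
    p_t = n_instances_t / n_total_instances        # P(t1)
    p_p = n_triples_p / n_total_triples            # P(p1i)
    p_joint = n_triples_p_and_t / n_total_triples  # P(p1i, t1)
    return math.log(p_joint / (p_p * p_t))

# Hypothetical counts, e.g., for dbo:director in the context of dbo:Film:
# first_hop_pmi(n_triples_p_and_t=90000, n_triples_p=110000,
#               n_total_triples=400000000,
#               n_instances_t=90000, n_total_instances=4000000)
```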

  2. Rank second-hop properties

$$\mathrm{PMI}(p_{2ik},\, t_1\, p_{1i}\, t_{2i}) = \log \frac{P(p_{2ik} \mid t_1\, p_{1i}\, t_{2i})}{P(p_{2ik})} = \log \frac{P(p_{2ik} \cap (t_1\, p_{1i}\, t_{2i}))}{P(p_{2ik}) \, P(t_1\, p_{1i}\, t_{2i})}$$

$$P(p_{2ik} \cap (t_1\, p_{1i}\, t_{2i})) = \frac{\#\{\text{triples with } p_{2ik} \text{ appearing with } t_1\, p_{1i}\, t_{2i}\}}{\#\{\text{triples with any property } p_i \text{ appearing with } t_1\, p_{1i}\, t_{2i}\}}$$

Alternative way to calculate $P(p_{2ik} \cap (t_1\, p_{1i}\, t_{2i}))$:

$$P(p_{2ik} \cap (t_1\, p_{1i}\, t_{2i})) = \frac{\#\{\text{triples where } p_{2ik} \text{ appears with at least one } t_1\, p_{1i}\, t_{2i}\}}{\#\{\text{all } p_{2ik} \text{ triples}\}}$$

$$P(p_{2ik}) = \frac{\#\{\text{triples with } p_{2ik}\}}{\#\{\text{all triples}\}}$$

$$P(t_1\, p_{1i}\, t_{2i}) = \frac{\#\{\text{triples matching the schema path } t_1\, p_{1i}\, t_{2i}\}}{\#\{\text{all triples}\}}$$
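Similarly, a sketch for the second-hop case under the same assumption of pre-computed counts. It is shown with the first variant of P(p2ik ∩ t1 p1i t2i); the alternative variant only changes which counts are passed in. All names are illustrative.

```python
import math

def second_hop_pmi(n_p2_with_path, n_any_p_with_path,
                   n_triples_p2, n_path_triples, n_total_triples):
    """PMI of a second-hop property p2ik with the schema path t1 p1i t2i.

    Uses the first variant: P(p2ik ∩ t1 p1i t2i) is normalized by the
    number of triples with any property appearing with the path.
    """
    p_joint = n_p2_with_path / n_any_p_with_path  # P(p2ik ∩ t1 p1i t2i)
    p_p2 = n_triples_p2 / n_total_triples         # P(p2ik)
    p_path = n_path_triples / n_total_triples     # P(t1 p1i t2i)
    return math.log(p_joint / (p_p2 * p_path))
```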

Related Work

The related work can be divided into the following tasks:

  1. Semantic Similarity using Linked Open Data
  2. Linked Open Data based recommendations

Evaluation

In order to evaluate the effectiveness of the domain-specific subgraph resulting from our approach, we chose the task of recommendation from among the applications that can benefit from it, such as (semantic) similarity calculation and named entity disambiguation. We use the MovieLens dataset, which is widely used for evaluating recommender algorithms. The MovieLens dataset consists of 1,000,209 ratings of 3,883 movies by 6,040 users. As the subgraph is extracted from DBpedia, each entity in the dataset must be present in DBpedia. For this reason, we preprocessed the MovieLens dataset and identified that only 3,148 of the 3,883 movies have such a presence. In addition, as the recommendation task requires sufficient content for a user, we eliminated users who rated fewer than 20 movies. After these filtering steps, our dataset contains 3,148 movies rated by 5,886 users.
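A minimal sketch of these filtering steps, assuming the MovieLens ml-1m ratings file format (user::movie::rating::timestamp) and a movie_to_dbpedia mapping from MovieLens movie ids to DBpedia URIs built beforehand; the names are placeholders.

```python
from collections import defaultdict

def load_filtered_ratings(ratings_path, movie_to_dbpedia, min_ratings=20):
    """Keep movies that have a DBpedia entity and users with >= min_ratings."""
    by_user = defaultdict(list)
    with open(ratings_path) as f:
        for line in f:
            user, movie, rating, _ = line.strip().split("::")
            if movie in movie_to_dbpedia:  # keep only movies present in DBpedia
                by_user[user].append((movie, int(rating)))
    # drop users with fewer than min_ratings over the retained movies
    return {u: r for u, r in by_user.items() if len(r) >= min_ratings}
```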

TODO (Siva): Add a paragraph describing the problems in the existing approach.

Considering the above problems in existing top-N recommendation evaluation, Cremonesi et al. presented an approach for evaluating recommender algorithm performance. We adopted a similar approach for our evaluation. After applying the filtering steps to the original MovieLens dataset, we randomly selected 1.4% of the whole dataset as the test set T (|T| = 9,805); the remaining 98.6% of the items are used as the training set M. For each test item i with a 5-star rating in T, we perform the following.

  • We randomly select 1,000 items unrated by the user, under the assumption that these are uninteresting to the user.
  • A ranked list of 1,001 items (the 1,000 unrated items plus the test item i) is formed using the predictions of our algorithm.
  • We then pick the top-N items. If the test item i appears in the top-N list, we have a hit; otherwise, a miss.

The probability of a hit increases as N increases. We define the overall recall as the ratio of the total number of hits to the number of test items, i.e., |T|.
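A sketch of this protocol in Python, with rank_items standing in for the recommendation algorithm under evaluation; all names are placeholders, not the actual implementation.

```python
import random

def recall_at_n(test_set, unrated_by_user, rank_items, n=10, sample_size=1000):
    """Cremonesi-style top-N evaluation: overall recall = hits / |T|.

    test_set: list of (user, item) pairs rated 5 stars.
    unrated_by_user: dict user -> list of items the user never rated.
    rank_items: function (user, candidates) -> candidates sorted by
                predicted relevance, best first.
    """
    hits = 0
    for user, test_item in test_set:
        candidates = random.sample(unrated_by_user[user], sample_size)
        ranked = rank_items(user, candidates + [test_item])  # 1,001 items
        if test_item in ranked[:n]:  # hit if the test item is in the top N
            hits += 1
    return hits / len(test_set)
```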

Current Plan for the Evaluation

We plan to use PageRank as our recommendation algorithm, as proposed by various existing works such as [1].

Baseline Approach

  • Consider all paths connecting two movies via up to four hops
  • Calculate the semantic similarity of the two movies based on these paths
  • Use the semantic similarity values to generate the transition probability matrix (m×m) for the PageRank algorithm
  • Use the PageRank algorithm to rank the movies for each user (see the sketch below)
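A sketch of this baseline, assuming the m×m similarity matrix has already been computed from the paths; the row normalization and the restart vector are our assumptions about how the similarities become PageRank transition probabilities, not the definitive formulation of [1].

```python
import numpy as np

def personalized_pagerank(sim, user_pref, alpha=0.85, iters=50):
    """Rank movies for one user via PageRank over a similarity graph.

    sim: (m x m) matrix of path-based semantic similarities between movies.
    user_pref: length-m vector, e.g., 1.0 for movies the user rated
               highly and 0.0 elsewhere (the restart distribution).
    """
    trans = sim / sim.sum(axis=1, keepdims=True)  # row-stochastic transitions
    restart = user_pref / user_pref.sum()
    rank = np.full(sim.shape[0], 1.0 / sim.shape[0])
    for _ in range(iters):  # power iteration with restarts
        rank = alpha * trans.T @ rank + (1 - alpha) * restart
    return np.argsort(-rank)  # movie indices, most recommended first
```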

Our Approach 1

  • Reduce the number of paths considered by using the domain-specific subgraph from our approach; this also reduces the number of movies considered.

  • Follow the same set of steps outlined in the baseline to rank the movies
  • According to our hypothesis, our recommendations should improve upon, or be on par with, the baseline results

Our Approach 2

  • Change the way we calculate the semantic similarity based on the paths
  • Follow the same set of steps to recommend the movies

Concerns

  • PMI is based on the co-occurrence of properties with the given type. One disadvantage of using PMI is that a high score can result from just a single occurrence of the type with that property. PMI captures the specificity of a property given the type; what justification do we provide for considering only specificity? What about properties that are generic but still domain-specific? We did not consider conditional independence since it only captures popularity.
  • As of now, we consider P(p_{2i}, t_1 p_{1i} t_{2i}) and need to analyze whether we should consider P(p_{2i}, t_1) instead. In the second case, ranking two-hop properties gets only the hop-1 type t_1 as the context.
  • We need to justify our approach versus semantic similarity.
  • Possibility of also using the materials science domain for the evaluation.
  • Comparison can be done using:
    • PageRank with equal probabilities as the transition probabilities for four hops
    • PageRank with PMI-based probabilities as the transition probabilities for four hops
    • PageRank with conditional probabilities as the transition probabilities for four hops

To Do

Sarasi

  • Complete the second-hop property ranking
  • Calculate the conditional independence
  • Make sure there is no intersection between the first-hop property set and the second-hop property set

Pavan

Siva

  • Take all the movies in DBpedia. Build a graph showing the number of entities connected to each movie in one hop, two hops, three hops and four hops. I assume this should cover most of the entities in the Wikipedia dataset.


References

[1] Musto, Cataldo, et al. "Linked Open Data-enabled Strategies for Top-N Recommendations." CBRecSys 2014 (2014): 49.