# Domain Specific Graph Selection From Linked Open Data

## Contents

## Abstract

With the Linked Open Data (LOD) initiative, there has been a great deal of work in developing a variety of open knowledge graphs freely accessible on the Web, such as DBpedia, Yago, and Freebase. Lately, the focus has been moving towards leveraging knowledge graphs as a background knowledge source for various applications. Knowledge graphs can provide a valuable source of information to improve tasks such as recommendation, (semantic) similarity calculation, and named entity disambiguation. A knowledge graph induces semantics through explicit relationships between entities, which these applications exploit. For example, a recommendation system recommends items to a user based on the relatedness of items the user has already accessed to those they have not. Existing approaches in the literature consider the neighborhood graph of the items within a predefined number of hops for this purpose. However, given a domain, not all entities are relevant. For instance, in the movie domain, the military unit or death place of a movie's director is significantly less important than other movies directed, or awards won, by the same director. In this work, we propose a schema-driven probabilistic approach to select a domain-specific subgraph, which scores and ranks the properties and classes of the neighborhood graph given the context of the domain (e.g., movie). We evaluate our approach using the MovieLens dataset with the DBpedia knowledge graph to show its effectiveness.

## Introduction

## Contribution

**Sarasi Write the contributions better**

- Propose a method to create a domain-specific subgraph given the DBpedia type that represents the domain
- Evaluate the domain-specific subgraph in the movie domain to assess its effectiveness for movie recommendation

## Approach

*todo: need to define the subgraph properly here.* The proposed approach primarily exploits the schema of the knowledge graph to identify the candidate hop types and properties for the subgraph, and then leverages instance statistics to rank and filter the domain-specific hop types and properties. We use pointwise mutual information (PMI) as a measure of the co-occurrence of properties with the domain type of interest (e.g., movie) to rank the properties in the context of that domain.

#### Rank Properties

- Rank first hop properties

$$\mathrm{PMI}(p_{1i}, t_1) = \log \frac{P(p_{1i}, t_1)}{P(p_{1i}) \cdot P(t_1)}$$

$$P(t_1) = \frac{\text{number of instances of type } t_1}{\text{total number of instances over all types}}$$

$$P(p_{1i}) = \frac{\text{number of triples with } p_{1i} \text{ as the property}}{\text{total number of triples}}$$

$$P(p_{1i}, t_1) = \frac{\text{number of triples with property } p_{1i} \text{ whose subject or object is of type } t_1}{\text{total number of triples}}$$
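As a concrete sketch, the first-hop ranking can be computed directly from raw triple counts. The function below is a minimal illustration only, assuming triples as `(subject, property, object)` tuples and a single type per entity; the name `pmi_first_hop` and the data layout are our assumptions, not part of an existing implementation.

```python
import math
from collections import Counter

def pmi_first_hop(triples, types, target_type):
    """Rank first-hop properties by PMI with a target type.

    triples: list of (subject, property, object) tuples.
    types:   dict mapping an entity to its type (one type per entity,
             a simplifying assumption).
    target_type: the domain type t1, e.g. "dbo:Film".
    """
    n_triples = len(triples)

    # P(t1): fraction of typed instances carrying the target type.
    type_counts = Counter(types.values())
    p_t1 = type_counts[target_type] / sum(type_counts.values())

    prop_counts = Counter(p for _, p, _ in triples)   # for P(p)
    joint_counts = Counter(                           # for P(p, t1)
        p for s, p, o in triples
        if types.get(s) == target_type or types.get(o) == target_type
    )

    scores = {}
    for prop, joint in joint_counts.items():
        p_p = prop_counts[prop] / n_triples
        p_joint = joint / n_triples
        scores[prop] = math.log(p_joint / (p_p * p_t1))
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Properties that never co-occur with the target type receive no score and are filtered out, which matches the intent of pruning irrelevant neighborhood properties.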

- Rank second hop properties

$$\mathrm{PMI}(p_{2ik}, t_1\, p_{1i}\, t_{2i}) = \log \frac{P(p_{2ik} \mid t_1\, p_{1i}\, t_{2i})}{P(p_{2ik})} = \log \frac{P(p_{2ik} \cap t_1\, p_{1i}\, t_{2i})}{P(p_{2ik}) \cdot P(t_1\, p_{1i}\, t_{2i})}$$

$$P(p_{2ik} \cap t_1\, p_{1i}\, t_{2i}) = \frac{\text{number of } p_{2ik} \text{ triples extending an instance of } t_1\, p_{1i}\, t_{2i}}{\text{number of triples with any property extending } t_1\, p_{1i}\, t_{2i}}$$

Alternative way to calculate $P(p_{2ik} \cap t_1\, p_{1i}\, t_{2i})$:

$$P(p_{2ik} \cap t_1\, p_{1i}\, t_{2i}) = \frac{\text{number of } p_{2ik} \text{ triples that appear with at least one instance of } t_1\, p_{1i}\, t_{2i}}{\text{total number of } p_{2ik} \text{ triples}}$$

$$P(p_{2ik}) = \frac{\text{number of triples with } p_{2ik}}{\text{total number of triples}}$$

$$P(t_1\, p_{1i}\, t_{2i}) = \frac{\text{number of triples matching } t_1\, p_{1i}\, t_{2i}}{\text{total number of triples}}$$
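The second-hop score in its conditional form, PMI = log(P(p2 | t1 p1 t2) / P(p2)), can be sketched in the same style. The counting scheme below (a second-hop triple "extends" the pattern when its subject is a t2 entity reached from a t1 entity via p1) and the helper name are our simplifying assumptions.

```python
import math
from collections import Counter

def pmi_second_hop(triples, types, t1, p1, t2):
    """Score second-hop properties p2 for the path pattern (t1, p1, t2),
    using PMI = log( P(p2 | t1 p1 t2) / P(p2) )."""
    n_triples = len(triples)

    # Entities typed t2 that are reached from a t1 instance via p1.
    hop2_entities = {
        o for s, p, o in triples
        if p == p1 and types.get(s) == t1 and types.get(o) == t2
    }
    # Triples extending an instance of the pattern (candidate p2 triples).
    extending = [t for t in triples if t[0] in hop2_entities]

    joint = Counter(p for _, p, _ in extending)
    prop_counts = Counter(p for _, p, _ in triples)

    scores = {}
    for p2, count in joint.items():
        p_cond = count / len(extending)      # P(p2 | t1 p1 t2)
        p_p2 = prop_counts[p2] / n_triples   # P(p2)
        scores[p2] = math.log(p_cond / p_p2)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

A property that occurs mostly in the context of the pattern (e.g., awards of movie directors) scores higher than an equally frequent but generic property (e.g., birth places of arbitrary people).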

## Related Work

Linked Open Data has been a popular background source for a variety of tasks. The related work can be divided into the following tasks:

- Semantic Similarity using Linked Open Data
- Linked Open Data based recommendations

## Evaluation

In order to evaluate the effectiveness of the domain-specific subgraph resulting from our approach, we chose the task of recommendation from among the applications that can benefit from it, such as (semantic) similarity calculation and named entity disambiguation. We used the MovieLens dataset, which is widely exploited for recommender algorithm evaluation. The MovieLens dataset consists of 1,000,209 ratings for 3,883 movies by 6,040 users. As the subgraph is extracted from DBpedia, each entity in the dataset must be present in DBpedia. For this reason, we preprocessed the MovieLens dataset and identified that only 3,148 of the 3,883 movies have such a presence. In addition, as the recommendation task requires sufficient content per user, we eliminated users who rated fewer than 20 movies. After these filtering steps, our dataset contains 3,148 movies rated by 5,886 users.

<Paragraph Describing Problems in existing approach - I will do this : Siva>

Considering the above problems in existing top-N recommendation evaluation, Cremonesi et al. presented an approach for evaluating recommender algorithm performance, which we adopt here. After the filtering steps applied to the original MovieLens dataset, we randomly selected 1.4% of the whole dataset as the test set T (|T| = 9,805) and used the remaining 98.6% as the training set M. For each test item i with a 5-star rating in T, we performed the following.

- We randomly select 1,000 items unrated by the user, under the assumption that these were uninteresting to the user.
- A ranked list of 1,001 items (i.e., the 1,000 unrated items plus the test item i) is formed using our algorithm's predictions.
- We then evaluate by picking the top N items. If the test item i appears in the top-N list, we count a hit; otherwise, a miss.

The probability of a hit increases as N increases. We report the overall recall as the ratio of total hits to the number of test items, i.e., |T|.
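The hit-counting protocol above can be sketched as follows. Here `score_fn` stands in for whichever recommender is being evaluated; all names and parameters are illustrative, not part of the evaluation framework itself.

```python
import random

def recall_at_n(test_items, user_unrated, score_fn,
                n=10, sample_size=1000, seed=0):
    """Cremonesi-style recall@N: for each 5-star test item, rank it
    against `sample_size` randomly sampled unrated items and count a
    hit when it lands in the top N.

    test_items:   list of (user, item) pairs from the test set T.
    user_unrated: dict user -> items the user has not rated.
    score_fn:     score_fn(user, item) -> float, higher is better
                  (a stand-in for the recommender under test).
    """
    rng = random.Random(seed)
    hits = 0
    for user, item in test_items:
        sampled = rng.sample(user_unrated[user], sample_size)
        ranked = sorted(sampled + [item],
                        key=lambda it: score_fn(user, it), reverse=True)
        if item in ranked[:n]:
            hits += 1
    return hits / len(test_items)
```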

#### Current Plan for the evaluation

We plan to use PageRank as our recommendation algorithm, as proposed by various existing works such as [1].

**Baseline Approach**

- Consider all the paths connecting two movies via up to four hops
- Calculate the semantic similarity of the two movies based on these paths
- Use the semantic similarity values to generate the transition probability matrix (m×m) for the PageRank algorithm
- Use the PageRank algorithm to rank the movies for each user
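A minimal sketch of the PageRank step, assuming the item-item semantic similarities have already been computed by the path-based step. The personalization vector as a stand-in for the user's rated movies, and the function name, are our simplifications.

```python
import numpy as np

def pagerank_recommend(sim, personalization, alpha=0.85, iters=100):
    """Personalized PageRank over an item-item similarity matrix.

    sim:             (m, m) nonnegative semantic-similarity matrix
                     between movies.
    personalization: (m,) vector, nonzero at the movies the user
                     has rated.
    Returns a ranking score per movie.
    """
    m = sim.shape[0]
    # Row-normalize similarities into a transition probability matrix.
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero
    P = sim / row_sums

    v = personalization / personalization.sum()
    r = np.full(m, 1.0 / m)
    # Power iteration with restarts to the user's rated movies.
    for _ in range(iters):
        r = alpha * (r @ P) + (1 - alpha) * v
    return r
```

Movies strongly connected (via similar paths) to the user's rated movies accumulate more rank mass and are recommended first.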

**Our Approach 1**

- Reduce the number of paths we consider using the domain-specific graph from our approach; this also reduces the number of movies under consideration.
- Follow the same steps outlined in the baseline to rank the movies.
- Our hypothesis is that our recommendations will improve on, or be on par with, the baseline results.

**Our Approach 2**

- Change the way we calculate the semantic similarity based on the paths
- Follow the same set of steps to recommend the movies

## Concerns

- PMI is based on the co-occurrence of properties with the given type. One disadvantage of PMI is that a high score could come from just a single occurrence of the type with that property. PMI captures the specificity of a property given the type; what justification do we provide for considering only specificity, and what about properties that are generic but still domain specific? We did not rely on conditional probability alone since it only captures popularity.
- As of now, we consider P(p_{2i}, t_{1} p_{1i} t_{2i}) and need to analyze whether we should consider P(p_{2i}, t_{1}) instead. In the second case, ranking two-hop properties takes only the hop-1 type t_{1} as the context.
- We need to justify our approach versus semantic similarity.
- Possibility of using material science domain also for the evaluation
- Comparison can be done using:
- Page rank with equal probability as the transitional probability for four hops
- Page rank with PMI probability as the transitional probability for four hops
- Page rank with conditional probability as the transitional probability for four hops

## To Do

#### Sarasi

- Complete the second hop property ranking
- Calculate the conditional independence
- Need to make sure there is no intersection between first hop property set and second hop property set

#### Pavan

#### Siva

- Take all the movies in DBpedia. Build a graph with the number of entities connected to each movie within one, two, three, and four hops. We assume this should cover most of the entities in the Wikipedia dataset.

## References

[1] Musto, Cataldo, et al. "Linked Open Data-enabled Strategies for Top-N Recommendations." CBRecSys 2014 (2014): 49.