Property Alignment

Property Alignment on Linked Datasets

Property alignment in Linked Open Data (LOD) and other linked datasets is a non-trivial task because of their complex data representations. Concept (class) and instance level alignment have been investigated in the recent past, but property alignment has received much less attention. We therefore propose an approach that can handle complex data representations and achieve a higher ratio of correct matches. The approach builds on a fundamental building block of interlinked datasets such as LOD: Entity Co-Reference (ECR) links. We match property extensions to derive a measure that approximates owl:equivalentProperty. ECR links are used to find equivalent instances for a particular property extension, and the number of matching extensions is then accumulated to decide whether a property pair between two datasets matches.


Approach

In this initial experiment, we explored property extension matching using the owl:sameAs and skos:exactMatch interlinking relationships as ECR links. Later, we will explore less restrictive links such as skos:closeMatch, as well as links such as rdfs:seeAlso that certain datasets use for their own requirements, and check the performance.


Fig. 1. Matching mechanism of the extension based approach

Figure 1 shows how the matching process works in our extension based algorithm. Each property pair is matched independently of the others by analyzing each instance associated with that property (in the extension slots). The algorithm processes subject instances from the starting dataset: it extracts the triples of each subject instance and finds the corresponding subject instance in the second dataset by traversing an ECR link. The object values for the property pair are then matched, again using ECR links. The result of this matching process is illustrated by the example in Figure 2. To decide the final matching pairs, we keep track of two statistical measures, MatchCount and Co-appearanceCount, which are shown in Figure 2 and described in the paper (to appear in I-SEMANTICS 2013). These measures help to reduce incorrect mappings such as "birth_place" and "place_of_birth".

Fig. 2. Matching example for the extension based approach
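
The matching loop described above can be sketched in Python as follows. This is only an illustrative reading of the description, not the implementation from the paper: the function names, the "a:"/"b:" identifiers and the sample triples are hypothetical, and the counters assume that Co-appearanceCount counts how often two properties occur on a co-referent subject pair and that MatchCount counts how often their object values also agree (directly or through an ECR link). The precise definitions are given in the paper.

```python
from collections import defaultdict
from itertools import product


def expand(node, ecr):
    """Return the node together with every node it is linked to by an ECR link."""
    return {node} | ecr.get(node, set())


def index_by_subject(triples):
    """Build subject -> property -> set of objects from (s, p, o) tuples."""
    index = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        index[s][p].add(o)
    return index


def match_properties(triples_a, triples_b, ecr):
    """Count matches and co-appearances for every candidate property pair.

    ecr maps a node of dataset A to the set of co-referent nodes in dataset B
    (for example, nodes connected by owl:sameAs or skos:exactMatch).
    """
    index_a = index_by_subject(triples_a)
    index_b = index_by_subject(triples_b)
    match_count = defaultdict(int)
    co_appearance = defaultdict(int)

    for s_a, props_a in index_a.items():
        # Follow ECR links from the subject in A to its counterparts in B.
        for s_b in ecr.get(s_a, set()):
            props_b = index_b.get(s_b)
            if not props_b:
                continue
            for p_a, p_b in product(props_a, props_b):
                co_appearance[(p_a, p_b)] += 1
                # Objects match if they are equal (literals) or ECR-linked.
                objs_a = set().union(*(expand(o, ecr) for o in props_a[p_a]))
                if objs_a & props_b[p_b]:
                    match_count[(p_a, p_b)] += 1
    return dict(match_count), dict(co_appearance)


# Hypothetical sample data: two films described in two datasets.
triples_a = [
    ("a:film1", "a:director", "a:person1"),
    ("a:film1", "a:title", "Film One"),
    ("a:film2", "a:director", "a:person2"),
    ("a:film2", "a:title", "Film Two"),
]
triples_b = [
    ("b:m1", "b:directed_by", "b:p1"),
    ("b:m1", "b:name", "Film One"),
    ("b:m2", "b:directed_by", "b:p2"),
    ("b:m2", "b:name", "Film Two"),
]
ecr = {
    "a:film1": {"b:m1"},
    "a:film2": {"b:m2"},
    "a:person1": {"b:p1"},
    "a:person2": {"b:p2"},
}

match, co = match_properties(triples_a, triples_b, ecr)
for pair in sorted(co):
    print(pair, "matched", match.get(pair, 0), "of", co[pair])
```

In this toy input, the pair (a:director, b:directed_by) matches on both co-referent films, while (a:director, b:name) co-appears twice but never matches. A pair would then be accepted only if its counts pass whatever thresholds the final decision step uses; those thresholds are not specified on this page.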

Experiment and Datasets

To evaluate our approach on linked datasets, we used sample datasets from Linked Open Data (LOD): DBpedia, Freebase, LinkedMDB, DBLP L3S and DBLP RKB_Explorer. DBpedia and Freebase are major multi-domain hubs in LOD that connect other datasets together. LinkedMDB is a specialized dataset in the movie domain, and DBLP L3S and DBLP RKB_Explorer are specialized datasets for scientific publications. Our evaluation therefore covers both multi-domain and domain-specific datasets and the alignments between them. The DBpedia and Freebase alignment represents multi-domain to multi-domain alignment, whereas the DBpedia and LinkedMDB alignment represents multi-domain to specific-domain alignment. The alignment of the two DBLP datasets represents a specific-domain to specific-domain alignment task.

For this experiment, we selected the person, film and software domains between DBpedia and Freebase because these domains have more complex data representations and variations. Films in LinkedMDB are aligned with films in DBpedia, and scientific articles in the two DBLP datasets are aligned with each other. In each of the five experiments, 5000 instances were analyzed to reach the alignment decisions.

The results are presented below in terms of precision, recall and F measure.

Approach            Measure     DBpedia-Freebase  DBpedia-Freebase  DBpedia-Freebase  DBpedia-LinkedMDB  DBLP_RKB-DBLP_L3S  Average
                                (Person)          (Film)            (Software)        (Film)             (Articles)
Our Algorithm       Precision   0.8758            0.9737            0.6478            0.7560             1.0000             0.8427
                    Recall      0.8089            0.5138            0.4339            0.8157             1.0000             0.7145
                    F measure   0.8410            0.6727            0.5197            0.7848             1.0000             0.7656
Dice Similarity     Precision   0.8064            0.9666            0.7659            1.0000             0.0000             0.7078
                    Recall      0.4777            0.4027            0.3396            0.3421             0.0000             0.3124
                    F measure   0.6000            0.5686            0.4705            0.5098             0.0000             0.4298
Jaro Similarity     Precision   0.6774            0.8809            0.7755            0.9411             0.0000             0.6550
                    Recall      0.5350            0.5138            0.3584            0.4210             0.0000             0.3625
                    F measure   0.5978            0.6491            0.4903            0.5818             0.0000             0.4638
WordNet Similarity  Precision   0.5200            0.8620            0.7619            0.8823             1.0000             0.8052
                    Recall      0.4140            0.3472            0.3018            0.3947             0.3333             0.3582
                    F measure   0.4609            0.4950            0.4324            0.5454             0.5000             0.4867
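
As a reading aid for the table, the short Python sketch below recomputes the F measure from precision and recall using the standard harmonic-mean definition, F = 2PR / (P + R). The page does not state the formula explicitly, so this is an assumption; it is consistent with the two "Our Algorithm" rows checked here. The function name is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of "Our Algorithm", copied from the table above.
rows = {
    "DBpedia-Freebase (Person)": (0.8758, 0.8089),
    "DBpedia-Freebase (Film)": (0.9737, 0.5138),
}

for name, (p, r) in rows.items():
    # Prints 0.8410 for Person and 0.6727 for Film, as in the table.
    print(f"{name}: F = {f_measure(p, r):.4f}")
```

The zero case matters for this table: the string similarity baselines have precision and recall of 0.0000 on the DBLP experiment, and the reported F measure there is also 0.0000.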

Analysis

In this section we analyze the different types of properties that can be identified in linked datasets, based on the sample datasets used in the alignment experiment.

Types of properties

Property pairs between linked datasets can be categorized along two orthogonal dimensions:

1. Their semantics.
   The inherent semantics of a property pair determines this category, which can be further divided into two sub categories:
   1.1 Equivalent properties
       The two properties in the pair have the same meaning or are intended to mean the same.
   1.2 Property - sub property relationships
       One property in the pair is a sub property of the other.
2. Techniques and/or tools required to determine the inter-relationships or alignment.
   This categorization is based on the process of identification: the level of techniques, tools or expertise required decides which category a property pair belongs to. Based on this criterion, we can divide properties into two groups:
   2.1 Simple properties
       These properties have similar or very common word phrases in their property names and hence high syntactic similarity. Their names may share a common prefix or suffix, differ by an adjective, or contain the same or similar words in a different order. They can therefore be identified using various string manipulation techniques.
   2.2 Opaque properties
       These properties have the same meaning, or are intended to have the same meaning, but use different words in their property names. They can be further divided into two sub categories:
       (a) Synonymous properties
           The similarity of these property pairs can be decided by analyzing the meaning of words or word phrases using an external dictionary.
       (b) Complex properties
           These properties require more than simple or synonym based techniques to determine their similarity. The additional information can come from analyzing extensions, domain and range, etc.

We present an analysis of property type 2 described above in the following section.
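
To make the distinction between simple and opaque properties concrete, the following Python sketch scores property names with a token-level Dice coefficient. It is an illustration only: the page does not say how the Dice baseline in the experiment tokenizes names, so word-level tokens are an assumption, and the "author"/"writer" pair is a hypothetical example of names that share no words.

```python
import re


def tokens(name):
    """Split a property name into lowercase words (underscores, spaces, camelCase)."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return {t.lower() for t in re.split(r"[\s_]+", spaced) if t}


def dice(name_a, name_b):
    """Dice coefficient over word sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = tokens(name_a), tokens(name_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Same words in a different order: high syntactic similarity.
print(dice("birth_place", "place_of_birth"))  # 0.8

# Hypothetical opaque pair: no shared words, so the score is zero.
print(dice("author", "writer"))  # 0.0
```

A purely syntactic score like this can only separate names by the words they share. It says nothing about whether two names with disjoint vocabularies mean the same thing, which is why opaque pairs need a dictionary or extension level evidence.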

Analysis & Observations

The analysis is based on property pairs from the above experiment that were manually identified and categorized. The matching percentage for each property type is the share of the manually identified pairs of that type that each approach identified (that is, recall per property type).

Fig. 3. Property type breakdown

Figure 3 shows the breakdown of the property types (based on the techniques used for alignment) found in the selected sample of linked datasets. Our analysis is based on the selected linked datasets from the LOD cloud, and we assume that this sample represents the nature of properties in LOD and in linked datasets in general. The observations based on Figure 3 are as follows:

  • Many of the property pairs are simple. That is, they should be identifiable using string manipulation techniques.
  • One might expect many synonymous pairs, since similar linked datasets are expected to have similar representations, but complex property pairs outnumber synonymous pairs.
  • According to Figure 3(b), many property pairs do not have exactly the same names in the two datasets. This reflects the nature of linked datasets, where dataset publishers are independent of each other and may use different names.

Fig. 4. Property alignment performances for each technique

Figure 4 shows the performance of each technique we experimented with for each property pair type (simple, complex and synonymous). Based on the figure, the following observations were made.

  • The extension based algorithm outperformed the other techniques in matching all property types.
  • The WordNet based approach performed second best but failed to identify many synonymous property pairs, which it would be expected to identify.
  • The extension and WordNet approaches performed equally on simple property pairs in experiment 5 (DBLP articles), but the WordNet approach failed to identify complex pairs.
  • Both string similarity based approaches failed to identify any of the complex or synonymous property pairs. This is expected, since the names in such pairs share little or no syntactic similarity.

Conclusion

Based on the graphs and observations above, we conclude that the extension based approach identifies many property pairs that cannot be identified by existing approaches. Most current approaches to aligning property pairs rely on string manipulation techniques, but we showed above that basic string manipulation alone does not suffice to handle the varying and complex data representations found in linked datasets. Our extension based approach addresses this problem by using the additional information present at the property extension level as heuristics to decide on a match. However, there are still some property pairs that syntactic or synonym based approaches can identify and that the extension based approach misses because the datasets lack the data needed to make the correct decision. The extension based approach can therefore be further improved by combining it with syntactic or synonym based techniques.


Project Details

Members and Contributors

Current Members
Kalpa Gunaratna
Dr. T.K. Prasad
Dr. Amit Sheth

Past Members
Prateek Jain
Sanjaya Wijeratne

Acknowledgements

This project is supported by the National Science Foundation under grant # IIS-1143717, "EAGER: Expressive Scalable Queries over Linked Open Data".


Publications

  • Kalpa Gunaratna, Krishnaprasad Thirunarayan, Prateek Jain, Amit Sheth and Sanjaya Wijeratne, A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data. In: Proc. 9th International Conference on Semantic Systems (ACM 2013), Messe Graz, Austria, 2013.