FACES

FACES stands for FACeted Entity Summarization. It is an entity summarization system that combines three dimensions: diversity, popularity, and uniqueness. The initial implementation of the FACES approach was presented at AAAI 2015 [4]. We extend FACES by computing types for datatype property values and then considering both object and datatype properties for faceted entity summary creation. The major contribution of this extension is the computation of types for datatype property values, so that datatype properties can be incorporated into the FACES entity summarization process. The datasets for both the FACES evaluation and the typing evaluation can be downloaded by following the links in the Datasets section. This work has been accepted and will be presented at the ESWC 2016 research track [5].

Preliminaries

Problem Statement - An entity is usually described using a set of conceptually different facts to improve coverage. A good summary selects a 'representative' subset of this set that uniquely identifies the entity.

Definitions 1-4 define basic notions related to entity summaries. They are as stated in [1].

Definition 1: A data graph is a digraph G = ⟨V, A, LblV, LblA⟩ where (i) V is a finite set of nodes, (ii) A is a finite set of directed edges, where each a ∈ A has a source node Src(a) ∈ V and a target node Tgt(a) ∈ V, and (iii) LblV : V → E ∪ L and (iv) LblA : A → P are labeling functions that map nodes to entities (E) or literals (L) and edges to properties (P).

Definition 2: A feature f is a property-value pair where Prop(f) ∈ P and Val(f) ∈ E ∪ L denote the property and the value, respectively. An entity e has a feature f in a data graph G = ⟨V, A, LblV, LblA⟩ if there exists a ∈ A such that LblA(a) = Prop(f), LblV(Src(a)) = e, and LblV(Tgt(a)) = Val(f).

Definition 3: Given a data graph G, the feature set of an entity e, denoted by FS(e), is the set of all features of e that can be found in G.

Definition 4: Given FS(e) and a positive integer k < |FS(e)|, a summary of entity e is a set Summ(e) ⊂ FS(e) such that |Summ(e)| = k.

Faceted entity summaries

An entity is described by a feature set. A feature f is characterized by its property Prop(f) and its value Val(f); a property binds a specific meaning to an entity through a value. In general, different properties represent different aspects of an entity. For example, the properties 'profession' and 'spouse' of an entity of type person represent two different aspects: the first takes an intangible value and the second a human, one describing the entity's professional life and the other its social life. Based on this observation, we can formalize facets for a feature set.

Facets: The feature set FS(e) of an entity e can be partitioned into a collection of facets F(e). The notion of a facet of a feature set can be defined using partitions as follows.

Definition 5: A set F(e), a collection of facets of e, is a partition of FS(e). That is, the following conditions hold for F(e): (1) ∅ ∉ F(e); (2) ⋃_{X ∈ F(e)} X = FS(e); (3) if X, Y ∈ F(e) and X ≠ Y, then X ∩ Y = ∅.

Definition 6: Given a feature set FS(e), a collection of facets F(e), and a positive integer k < |FS(e)|, a faceted entity summary of e, FSumm(e), is a collection of features such that FSumm(e) ⊂ FS(e), |FSumm(e)| = k, and if k > |F(e)| then ∀X ∈ F(e): X ∩ FSumm(e) ≠ ∅, else ∀X ∈ F(e): |X ∩ FSumm(e)| ≤ 1.

Note that if the number of facets is n and the size of the summary is k, at least one feature from each facet is included in the summary if k > n. If k ≤ n, then at most one feature from each facet is included in the summary.


Es summary.png

The above picture illustrates the idea of faceted entity summaries. The entity shown is Marie Curie, a famous scientist; her details are extracted from DBpedia (only some of the facts are shown for clarity). The different colors mark facts belonging to different aspects; for example, orange marks places related to her personal life. Facts are then selected for the summary considering these different groupings, which we call facets. If a summary of length 3 is generated, it could be {f1, f2, f6}.


Clustering

The FACES approach generates faceted entity summaries that are both concise and comprehensive. Conciseness is about selecting a small number of facts; comprehensiveness is about selecting facts that represent all aspects of an entity, which improves coverage. Diversity is about selecting facts that are orthogonal to each other, so that the few selected facts enrich coverage; hence, diversity improves comprehensiveness when the number of features that can be included in a summary is limited. Conciseness may be achieved through various ranking and filtering techniques, but creating summaries that satisfy both the conciseness and comprehensiveness constraints simultaneously is not trivial. It requires recognizing the facets of an entity that the features represent, so that the summary can cover as many facets as possible (diversity and comprehensiveness) without redundancy (conciseness). The number and nature of the clusters (corresponding to abstract concepts) in a feature set are not known a priori for an entity and are hard to guess without human intervention or explicit knowledge. Therefore, neither a supervised clustering algorithm nor an unsupervised clustering algorithm with a prescribed number of clusters can be used in this context. To achieve this objective, we have adapted a flexible unsupervised clustering algorithm based on Cobweb [2][3] and have designed a ranking algorithm for feature selection.


Hierarchical conceptual clustering

We use the Cobweb [2] algorithm as the hierarchical conceptual clustering algorithm for our problem. "Cobweb is an incremental system for hierarchical conceptual clustering. The system carries out a hill-climbing search through a space of hierarchical classification schemes using operators that enable bidirectional travel through this space." Cobweb uses a heuristic measure called category utility to guide the search. Category utility is a tradeoff between intra-class similarity and inter-class dissimilarity of objects, described as attribute-value pairs. Intra-class similarity is the conditional probability of the form P(Ai = Vij | Ck), where Ai = Vij is an attribute-value pair and Ck is a class; when this probability is larger, more members of the class share more values. Inter-class dissimilarity is the conditional probability P(Ck | Ai = Vij); when this probability is larger, fewer objects in other classes share those values. Category utility (CU) combines these intra-class and inter-class probabilities. For a partition {C1, C2, ..., Cn}, CU is defined as follows.

Es eq1.png

Finally, they define CU as the following gain function.

Es eq3.png
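
For reference, the standard form of Fisher's category utility gain function [2], to which the image above should correspond, can be written in LaTeX as:

 \[ CU(\{C_1, \dots, C_n\}) = \frac{\sum_{k=1}^{n} P(C_k)\Big[\sum_{i}\sum_{j} P(A_i = V_{ij} \mid C_k)^2 - \sum_{i}\sum_{j} P(A_i = V_{ij})^2\Big]}{n} \]

The bracketed term measures the increase in the expected number of attribute values that can be correctly guessed given the partition, over the expectation without it; dividing by n allows partitions of different sizes to be compared.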


The Cobweb algorithm is shown in Figure 1, and its four operators are explained in Figure 2 (taken from [3]).

Figure 1. Cobweb algorithm
Figure 2. Auxiliary Cobweb operators

Modifying Cobweb for entities

We modified the category utility (CU) of the original Cobweb algorithm to suit our problem as follows.

Es eq4.png

We could not use the CLASSIT [3] implementation of Cobweb because we do not have a distribution of terms; we only have a word set. We build the word set as follows.

1. Tokenize the property and value of a feature.
We take the property name and value of a feature, tokenize the property name, and obtain typing information for the value. These tokenized words, original terms, and typing terms are added to a set WS.
2. Expand the terms.
We expand the tokenized and typing terms using WordNet by retrieving their hypernyms, and add the retrieved terms to the word set WS. A sketch of this word-set construction is given below.
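
The following is a minimal sketch of this word-set construction in Python, assuming NLTK's WordNet corpus is available (nltk.download('wordnet')); the function names and the two-level hypernym depth are illustrative choices, not taken from the FACES code base.

 import re
 from nltk.corpus import wordnet as wn
 
 def tokenize(text):
     # Split camelCase property names (e.g., "birthPlace") and punctuation.
     spaced = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)
     return [t.lower() for t in re.split(r'\W+', spaced) if t]
 
 def hypernym_terms(word, depth=2):
     # Collect hypernym lemma names up to a small depth for term expansion.
     terms = set()
     for synset in wn.synsets(word):
         frontier = [synset]
         for _ in range(depth):
             frontier = [h for s in frontier for h in s.hypernyms()]
             terms.update(l.name().lower() for s in frontier for l in s.lemmas())
     return terms
 
 def build_word_set(prop_name, value_terms, type_terms):
     # Step 1: tokenized property words, original terms, and typing terms.
     ws = set(tokenize(prop_name)) | set(value_terms) | set(type_terms)
     # Step 2: expand every term with its WordNet hypernyms.
     for term in list(ws):
         ws |= hypernym_terms(term)
     return ws
 
 # Example: a feature (birthPlace, "Warsaw") with computed type "city".
 print(sorted(build_word_set("birthPlace", ["warsaw"], ["city"]))[:10])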


Typing for datatype property values

Object properties have type information assigned to their property values, but datatype property values lack such semantic information. We propose to compute types for datatype property values. Once types can be computed, we can cluster both object and datatype properties using Cobweb, as explained in the previous section, and then adapt the ranking algorithms to pick features for the faceted summary. The following figure illustrates the idea of computing a type for datatype property values. The figure shows 3 triples, numbered 1 to 3. Triple 1 has an object property, whereas triples 2 and 3 have datatype properties. Before computing types, we could not group these three triples based on the notion of politician expressed by their values, but after computing the types for triples 2 and 3, we can group all three triples under the type Politician.


Es string type.jpg

We compute types for datatype property values by utilizing a head word identifier and entity spotting using n-grams. Collins' head word detection rules provide the capability to identify the head word, which we treat as the focus term in our implementation. The focus term suggests the main focus of the text phrase, which leads to the identification of the phrase's type. Once we identify the focus term, we check whether it directly matches a class label in the ontology, which then serves as the type. If a candidate type is not found, we further analyze the focus term together with n-grams extracted from the text phrase to identify possible entities in the dataset, and take their types, or a direct match against a class label, as the type (Algorithm 2). If the algorithm is still unable to find a type, we compute the semantic similarity between the focus term and all the classes in the dataset and pick the class with the highest non-zero score as the type. Algorithm 1 below explains the main steps in type computation, and Algorithm 2 explains type computation with n-grams and the focus term.

Type algo1.jpg
Type algo2.jpg
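
As a rough illustration of the n-gram fallback, the sketch below tries the focus term against ontology class labels and then progressively shorter n-grams of the phrase; the data structures (a dict from lowercase class labels to class URIs) and all names are illustrative assumptions, not the published algorithm.

 def ngrams(tokens, n):
     # All contiguous n-grams of the token list.
     return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
 
 def type_for_phrase(phrase, focus_term, class_labels):
     tokens = phrase.lower().split()
     # 1. Direct match of the focus term against ontology class labels.
     if focus_term in class_labels:
         return class_labels[focus_term]
     # 2. Longest-first n-grams containing the focus term, matched to labels.
     for n in range(len(tokens), 0, -1):
         for gram in ngrams(tokens, n):
             if focus_term in gram and gram in class_labels:
                 return class_labels[gram]
     # 3. Otherwise, fall back to semantic similarity (not sketched here).
     return None
 
 labels = {'politician': 'dbo:Politician', 'city': 'dbo:City'}
 print(type_for_phrase('Polish politician', 'politician', labels))  # dbo:Politician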


Some example types generated for a sample of datatype property values taken from the DBpedia dataset are shown below.

Es typing examples.jpg

Ranking (features & facets)

We rank features within each facet using the informativeness of the feature and the popularity of the feature's value. D is the set of triples (i.e., the dataset) associated with the data graph G defined above, and N is the total number of entities. The ranking measures are as follows.

First, the ranking for object property features within each facet is as follows.

Es rank.jpg
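
The exact formulas are given in the image above; the sketch below shows one plausible instantiation, assuming an IDF-style informativeness and a log-frequency popularity combined multiplicatively. All names are ours, not the FACES implementation.

 import math
 
 def informativeness(feature, feature_sets, N):
     # Features shared by fewer of the N entities are more informative
     # (an IDF-style assumption).
     df = sum(1 for fs in feature_sets.values() if feature in fs)
     return math.log(N / df) if df else 0.0
 
 def popularity(value, triples):
     # Values occurring in more triples of the dataset D are more popular.
     freq = sum(1 for (_, _, o) in triples if o == value)
     return math.log(1 + freq)
 
 def rank_facet(facet, feature_sets, triples, N):
     # Order the facet's features by informativeness x popularity, descending.
     def score(f):
         return informativeness(f, feature_sets, N) * popularity(f[1], triples)
     return sorted(facet, key=score, reverse=True)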


The ranking of datatype property features within each facet is as follows; this is part of the extension of the original FACES approach. ES is a function that returns all the entities spotted for a datatype property value, and max is a function that returns the most popular entity from a given set of entities.

Type es rank.jpg
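
Under the same assumptions as the previous sketch, the datatype variant might substitute the popularity of the most popular spotted entity for the literal's own popularity; the entity spotter itself is assumed given, and the names below are illustrative.

 def popularity_literal(literal, spot_entities, entity_popularity):
     # ES(literal): entities spotted in the literal value.
     entities = spot_entities(literal)
     if not entities:
         return 0.0
     # max: take the popularity of the most popular spotted entity.
     return max(entity_popularity(e) for e in entities)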


We extend the original FACES by also ranking facets, and we pick the summary based on the facet ranking as well. Facet ranking is as follows.

Facet rank.jpg


Faceted entity summary creation

The faceted entity summary for an entity e is generated as follows.

  1. The feature set FS(e) is partitioned into facets. Cobweb creates a dendrogram, which is cut at an empirically determined level to obtain the set of facets F(e).
  2. The features within each of these facets are ranked.
  3. The ranking scores of the features in each facet are aggregated to compute a facet ranking score for each facet, and the facets are ranked accordingly.
  4. The top ranked features are selected from each facet, from the highest to the lowest ranked facet, according to Definition 6, to form the faceted entity summary of length k. A sketch of this assembly is given below.
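
A minimal sketch of steps 2-4, assuming each feature already has a precomputed score (e.g., from the ranking formulas in the previous section); the sum-aggregation of facet scores and all names are illustrative assumptions.

 def faceted_summary(facets, scores, k):
     # Step 2: rank features inside each facet, best first.
     ranked = [sorted(f, key=lambda x: scores[x], reverse=True) for f in facets]
     # Step 3: rank facets by their aggregated feature scores.
     ranked.sort(key=lambda f: sum(scores[x] for x in f), reverse=True)
     # Step 4: round-robin from highest- to lowest-ranked facet, taking the
     # best remaining feature of each, until k features are selected. This
     # satisfies Definition 6: at most one per facet while k <= |F(e)|, and
     # at least one per facet when k > |F(e)|.
     summary, depth = [], 0
     while len(summary) < k:
         took = False
         for facet in ranked:
             if depth < len(facet) and len(summary) < k:
                 summary.append(facet[depth])
                 took = True
         if not took:  # every facet exhausted
             break
         depth += 1
     return summary
 
 # The Marie Curie example: three facets, summary length 3.
 scores = {'f1': 0.9, 'f2': 0.8, 'f4': 0.3, 'f5': 0.2, 'f6': 0.7}
 facets = [{'f1', 'f4'}, {'f2', 'f5'}, {'f6'}]
 print(faceted_summary(facets, scores, 3))  # ['f1', 'f2', 'f6']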

Datasets

FACES evaluation dataset

Evaluation data for FACES is available for download


Typing evaluation with FACES (FACES extension)

Evaluation data for Type Generation is available for download


References

[1] Cheng, G., Tran, T., Qu, Y.: RELIN: Relatedness and informativeness-based centrality for entity summarization. In: The Semantic Web - ISWC 2011, pp. 114-129. Springer Berlin Heidelberg (2011)
[2] Fisher, D.H.: Knowledge acquisition via incremental conceptual clustering. Machine Learning 2(2), 139-172 (1987)
[3] Gennari, J.H., Langley, P., Fisher, D.: Models of incremental concept formation. Artificial Intelligence 40(1), 11-61 (1989)
[4] Gunaratna, K., Thirunarayan, K., Sheth, A.: FACES: Diversity-aware entity summarization using incremental hierarchical conceptual clustering. In: 29th AAAI Conference on Artificial Intelligence (AAAI 2015). AAAI (2015)
[5] Gunaratna, K., Thirunarayan, K., Sheth, A., Cheng, G.: Gleaning types for literals in RDF triples with application to entity summarization. In: 13th Extended Semantic Web Conference (ESWC 2016). Springer (2016)