EKPE

Semantics Empowered Event-specific Key Phrases Extraction

Lu Chen

Introduction

The emergence of micro-blogging has led to explosive growth in citizen sensor data, which is becoming a valuable source for analyzing and understanding what citizens observe and perceive about an event. However, millions of micro-blogs are published per day, and it is impossible for people to digest such a huge amount of data. It is therefore appealing to have a system that extracts the key phrases of what is being said about an event from the data. We call this task event-specific key phrases extraction. We define event-specific key phrases as the words or phrases that are crucial in explaining or interpreting an event.

Automatic key phrases extraction is fundamental to fighting “information overload”, but key phrases alone are not enough for people to understand what is being said about an event. For example, we get the following key phrases from the data about “Osama’s death”: Osama Bin Laden, Osama’s death, watching OBL attack, human shield, wife, Navy seals, white house team, situation room, last night, hideout in Abbottabad, White House, Pakistan. It is difficult to get the story out of those unconnected phrases, and even understanding some of the phrases themselves is not easy. To overcome this limitation and convey the information so that people can understand it with less effort, the key is to associate semantics with those key phrases by connecting them with rich relationships.

Here we are concerned with two kinds of relationships (semantics). The first is about “what it is”, and it aims to associate meaning with a single key phrase. For example, Osama Bin Laden is a person, last night is a time, and White House is a place. This task is similar to Named Entity Recognition, and the recognized entity is linked to the corresponding instance in a public knowledge base (e.g., Wikipedia or Freebase). The second kind is about “how it is related to the event”, and it aims to connect the key phrases through their relationships in the context of the event. For example, last night is the time when the event “Osama’s death” happened, hideout in Abbottabad is the place where it happened, and the event “human shield” is a subevent of “Osama’s death”. This task is closer to Relationship Identification, which we believe is even more challenging than the first.
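The two kinds of semantics can be pictured as two optional labels attached to each key phrase. The following is only an illustrative sketch: the `Annotation` class, the `to_triples` helper, and the relation names (`is_a`, `time_of`, `place_of`, `subevent_of`) are hypothetical and are not taken from KnoEO or from the implemented system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """A key phrase with the two kinds of semantics (hypothetical structure)."""
    phrase: str
    entity_type: Optional[str]        # "what it is"
    relation_to_event: Optional[str]  # "how it is related to the event"


EVENT = "Osama's death"

# Labels follow the examples in the text; relation names are made up.
annotations = [
    Annotation("Osama Bin Laden", "Person", None),
    Annotation("last night", "Time", "time_of"),
    Annotation("hideout in Abbottabad", "Place", "place_of"),
    Annotation("human shield", "Event", "subevent_of"),
]


def to_triples(annotation, event):
    """Turn one annotation into (subject, predicate, object) triples."""
    triples = []
    if annotation.entity_type is not None:
        triples.append((annotation.phrase, "is_a", annotation.entity_type))
    if annotation.relation_to_event is not None:
        triples.append((annotation.phrase, annotation.relation_to_event, event))
    return triples


for a in annotations:
    for triple in to_triples(a, EVENT):
        print(triple)
```

Keeping both labels optional reflects that a phrase may be recognized as an entity without its relation to the event being known yet, and vice versa.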

The general idea for realizing these two tasks is to leverage the power of knowledge, including both explicit knowledge from public knowledge bases (e.g., Wikipedia and Freebase) and implicit knowledge from the event-specific corpus (e.g., co-occurrence and frequency).


Approaches

As the first step, we follow the traditional approach to key phrases extraction: extracting the frequent noun phrases from the event-specific corpus. We employ the Stanford Parser [1] to parse each micro-blog document in the corpus and obtain the part-of-speech tag of each word. Candidate noun phrases are then identified using predefined patterns. Next, we stem each extracted candidate using the WordnetStemmer [2] and group together all candidates that share the same stem. We select the top N most frequent stems and, for each selected stem, choose the most frequent noun phrase from its group of candidates. In this manner, we get the top N frequent noun phrases. However, not all frequent noun phrases are key phrases, so as the last step we remove noise from the result using some heuristics.
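The steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the project's implementation: the input is already part-of-speech tagged (standing in for the Stanford Parser output), the single pattern "adjectives and nouns ending in a noun" stands in for the predefined patterns, `toy_stem` is a crude stand-in for the WordnetStemmer, and the final noise-removal heuristics are omitted. All function names are hypothetical.

```python
from collections import Counter, defaultdict

# Assumed Penn Treebank-style tags; the real patterns are not given in the text.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ"} | NOUN_TAGS


def candidate_noun_phrases(tagged_tokens):
    """Yield maximal runs of adjective/noun tokens that end in a noun."""
    run = []
    # A sentinel token flushes the last run at the end of the document.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in MODIFIER_TAGS:
            run.append((word, tag))
            continue
        # Trim trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            yield " ".join(w for w, _ in run)
        run = []


def toy_stem(phrase):
    """Stand-in stemmer: lowercase and strip a plural 's' from each word."""
    words = []
    for w in phrase.lower().split():
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        words.append(w)
    return " ".join(words)


def top_noun_phrases(tagged_docs, n):
    """Return the most frequent surface form for each of the top-n stems."""
    groups = defaultdict(Counter)  # stem -> counts of surface forms
    for doc in tagged_docs:
        for phrase in candidate_noun_phrases(doc):
            groups[toy_stem(phrase)][phrase] += 1
    stem_freq = Counter({s: sum(c.values()) for s, c in groups.items()})
    return [groups[s].most_common(1)[0][0] for s, _ in stem_freq.most_common(n)]


# Toy, hand-tagged micro-blogs (made up for illustration).
docs = [
    [("Navy", "NNP"), ("seals", "NNS"), ("raided", "VBD"),
     ("the", "DT"), ("hideout", "NN")],
    [("Navy", "NNP"), ("seals", "NNS"), ("found", "VBD"),
     ("a", "DT"), ("human", "JJ"), ("shield", "NN")],
    [("Navy", "NNP"), ("seal", "NN"), ("left", "VBD"),
     ("the", "DT"), ("hideout", "NN")],
]
print(top_noun_phrases(docs, 2))  # ['Navy seals', 'hideout']
```

Grouping by stem lets "Navy seals" and "Navy seal" pool their counts, while the most frequent surface form is the one reported.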

We define an event ontology, KnoEO, of the concepts and relationships concerned with an event. The ontology must be general enough to be used for various events. Here is an initial version of KnoEO [3].

The core of the approach is entity recognition and relationship extraction, which are crucial for associating semantics with the extracted key phrases. For entity recognition, on the one hand, we leverage public knowledge bases, which provide explicit knowledge about an entity (e.g., a person’s name, gender, and places lived); on the other hand, we extract the context (implicit knowledge) of the key phrase (which might be the entity) and determine whether the key phrase represents the entity by matching the explicit knowledge against the implicit knowledge. For relationship extraction, on the one hand, we mine a public knowledge base (e.g., WordNet) to construct a lexicon of words expressing the relationships defined in the event ontology; on the other hand, we extract the relationship indicators of the key phrases from the corpus (e.g., the verbs that frequently co-occur with the key phrase in the event context) and estimate the relationship between the key phrases and the event using the relationship lexicon.
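One simple way to picture the two matching steps is with term overlap and lexicon lookups. The sketch below is an assumption-laden illustration rather than the method used in the project: the overlap score, the function names, the relation names, and all toy data (knowledge-base attributes, lexicon entries, verb counts) are hypothetical.

```python
from collections import Counter


def entity_match_score(context_words, kb_attributes):
    """Fraction of knowledge-base attribute terms that also occur in the context."""
    kb_terms = {t.lower() for value in kb_attributes.values() for t in value.split()}
    context = {w.lower() for w in context_words}
    return len(kb_terms & context) / len(kb_terms) if kb_terms else 0.0


def estimate_relation(indicator_counts, relation_lexicon):
    """Pick the relation whose lexicon words best cover the observed indicators."""
    scores = Counter()
    for relation, words in relation_lexicon.items():
        scores[relation] = sum(indicator_counts.get(w, 0) for w in words)
    if not scores:
        return None
    relation, score = scores.most_common(1)[0]
    return relation if score > 0 else None


# Explicit knowledge: toy attributes for a candidate entity (made up).
kb_entry = {
    "name": "Osama Bin Laden",
    "gender": "male",
    "places_lived": "Abbottabad Pakistan",
}
# Implicit knowledge: words seen around the key phrase in the corpus (made up).
context = ["killed", "in", "Abbottabad", "Pakistan", "by", "Navy", "seals"]
print(entity_match_score(context, kb_entry))  # 2 of 6 attribute terms matched

# Hypothetical relationship lexicon and verb co-occurrence counts.
lexicon = {
    "place_of": {"happen", "occur", "locate"},
    "participant_of": {"kill", "shoot", "raid"},
}
verbs_near_phrase = Counter({"raid": 5, "kill": 3, "occur": 1})
print(estimate_relation(verbs_near_phrase, lexicon))  # participant_of
```

In practice a threshold on the match score would decide whether the key phrase is accepted as the entity, and the indicator words would need the same stemming as the lexicon; both are left out here.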

Figure 1 illustrates the architecture of the system.


Architecture

Figure 1: Architecture illustration of the approach.

Milestones

  • Implement Key Phrases Extraction from the event-specific corpus.
  • Define Event Ontology, in which the concepts and relationships concerned with events are formulated.
  • Implement Entity Recognition, and associate meaning with each single key phrase.
  • Implement Relationship Extraction, and connect key phrases with their relationships in the context of the event.


Demonstration

Extracted Key Phrases for Several Events

The key phrases extraction work has been integrated into Twitris [4].

Event Ontology

Figure 2: An event ontology which defines concepts and relationships concerned with an event.

Accomplishments

  • I got the idea for this work while taking the Semantic Web course. I had been focusing on text mining and natural language processing, which can be seen as learning implicit knowledge from data and using it to derive target information from text, and I had not worked or thought much about formal or explicit knowledge (e.g., data represented as RDF or OWL). Now I realize that my research can benefit from both, and this project is a starting point.

  • I have always been interested in research on relationships, including their representation and extraction. In this project, I am starting to work on the relationships involved in the context of an event. Based on this work, I might look deeper into this area in the future.


References