Domain Specific Document Retrieval Framework for Near Real-time Social Health Data

Introduction

With the advent of web search and microblogging, the share of Online Health Information Seekers (OHIS) who use these services to share and seek real-time health information has increased sharply. When OHIS turn to search engines or microblogging search services for real-time information, the results are often disappointing. Most web search engines and microblogging services rely on keyword-based techniques to retrieve information for a given query. The top results are often dominated by breaking news, and in both the microblogging and web search realms they frequently lack real-time information. It is difficult for users to retrieve relevant results from a query alone, and they may be overwhelmed by the volume of information. In our approach, we search Twitter documents using three distinguishing features: triple-pattern-based mining, near real-time retrieval, and search over the content of URLs contained in tweets. First, a triple-pattern (subject, predicate, object) mining technique extracts triple patterns from microblog messages related to chronic health conditions. The triple pattern is defined in the initial question. Second, to make the system near real-time, the search results are divided into six-hour intervals. Third, in addition to the tweets themselves, we use the content of the URLs mentioned in the tweets as a data source. Finally, the results are ranked by relevance and popularity, so that at any given time the most relevant information for a question is displayed, rather than results ordered by temporal relevance alone.

Architecture

Our Social Health Signal platform is based on (a) large-scale real-time Twitter data processing, (b) semantic web techniques and domain knowledge, and (c) triple-pattern-based text mining. The system is divided into three major components.

Processing Pipeline: collects tweets and extracts their meta-data.
Pattern Extractor: extracts documents relevant to a given query.
Rank Calculator: calculates the rank of the results.

Processing Pipeline

We use tweets (messages shared on Twitter) and the content of URLs mentioned in tweets as the data sources from which to extract information relevant to a user's query. To extract relevant and recent information from real-time data, the first challenge is to build infrastructure that collects tweets in real time and extracts their meta-data. In Social Health Signal, we use Apache Storm to collect tweets through the public Twitter Streaming API and to perform meta-data extraction at the same time. Apache Storm is free, open-source software for real-time distributed computation. It plays a role for stream processing similar to the one Hadoop plays for batch processing.

Spout:

A spout uses the Streaming API to collect tweets in real time.

Bolt:

The bolts contain the computation logic that performs feature extraction in real time. The first bolt is a filter bolt that identifies the language of a tweet and passes only English tweets. Once all computing bolts have finished, the final bolt saves the data to the database.
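The spout-to-bolt flow can be illustrated with a minimal sketch in plain Python. It does not use the Apache Storm API; the function names, the `lang` field on each tweet, and the in-memory list standing in for the database are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Tweet = Dict[str, object]
Bolt = Callable[[Tweet], Optional[Tweet]]


def spout(source: Iterable[Tweet]) -> Iterator[Tweet]:
    """Emit tweets one at a time; stands in for the Streaming API spout."""
    for tweet in source:
        yield tweet


def english_filter_bolt(tweet: Tweet) -> Optional[Tweet]:
    """Pass English tweets through and drop all others.

    Assumes each tweet carries a "lang" field (hypothetical layout).
    """
    return tweet if tweet.get("lang") == "en" else None


def run_topology(source: Iterable[Tweet], bolts: List[Bolt],
                 sink: List[Tweet]) -> List[Tweet]:
    """Run every tweet through the bolts in order, then store survivors."""
    for tweet in spout(source):
        current: Optional[Tweet] = tweet
        for bolt in bolts:
            current = bolt(current)
            if current is None:
                break  # a bolt dropped the tweet
        if current is not None:
            sink.append(current)  # plays the role of the final "save" bolt
    return sink
```

For example, `run_topology(tweets, [english_filter_bolt], [])` returns only the tweets whose `lang` is `"en"`. In a real Storm topology the bolts would run distributed and in parallel rather than in a single loop.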

Meta-data:

A tweet has many features, such as text, short_url, latitude and longitude, retweet_count, etc. Any of these features can help in finding useful information, so all of them are extracted from tweets in real time. This process is also known as a pre-processing analytic pipeline, because the extracted features and data feed the pattern extraction module.
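As a rough sketch of this meta-data step, the function below pulls the features named above out of a tweet represented as a dictionary. The dictionary layout (in particular the `coordinates` pair) and the `TweetFeatures` record are hypothetical; the actual tweet schema and storage format are not described here.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Simple URL matcher for illustration; it may keep trailing punctuation.
URL_RE = re.compile(r"https?://\S+")


@dataclass
class TweetFeatures:
    text: str
    urls: List[str]
    latitude: Optional[float]
    longitude: Optional[float]
    retweet_count: int


def extract_features(tweet: Dict[str, Any]) -> TweetFeatures:
    """Collect the features used by later modules from one tweet."""
    text = str(tweet.get("text", ""))
    # Assumed layout: "coordinates" is a (latitude, longitude) pair or absent.
    coords = tweet.get("coordinates")
    latitude, longitude = coords if coords else (None, None)
    return TweetFeatures(
        text=text,
        urls=URL_RE.findall(text),
        latitude=latitude,
        longitude=longitude,
        retweet_count=int(tweet.get("retweet_count", 0)),
    )
```

Keeping the extracted features in one record makes it straightforward for a final bolt to write a single row per tweet.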

Pattern Extractor

Every six hours, once the features have been extracted, the information extractor module collects all the stored tweets and their features and applies a triple-pattern derived from the user's query. To extract relevant information or documents, we use IBM's text analytics language, Annotation Query Language (AQL). AQL helps developers build queries that extract structured information from unstructured or semi-structured text. We use AQL to construct a triple-pattern. The triple-pattern (subject, predicate, object) is defined in the initial question. We divide user questions into two categories: static and dynamic. Static questions are the most frequently asked questions, collected from different sources. Dynamic questions are typed by the user on the fly.
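To make the idea concrete, here is a minimal sketch that uses Python regular expressions as a stand-in for AQL; it is not AQL and does not reflect how the actual patterns are written. It assumes one reading of "defined in the initial question": for a question such as "How to control diabetes?", the predicate and object are fixed by the question and the subject is the unknown to be extracted. All function names are hypothetical. The sketch also shows one way to assign a timestamp to its six-hour interval.

```python
import re
from datetime import datetime
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def window_start(ts: datetime, hours: int = 6) -> datetime:
    """Floor a timestamp to the start of its six-hour interval."""
    return ts.replace(hour=ts.hour - ts.hour % hours,
                      minute=0, second=0, microsecond=0)


def build_pattern(predicates: List[str], obj: str) -> re.Pattern:
    """Build a regex with a free subject and a fixed predicate and object."""
    alternatives = "|".join(re.escape(p) for p in predicates)
    return re.compile(
        rf"(?P<subject>[\w\s-]+?)\s+(?P<predicate>{alternatives})\s+"
        rf"(?P<object>{re.escape(obj)})\b",
        re.IGNORECASE,
    )


def extract_triple(text: str, pattern: re.Pattern) -> Optional[Triple]:
    """Return the first (subject, predicate, object) match, if any."""
    match = pattern.search(text)
    if match is None:
        return None
    return Triple(match.group("subject").strip(),
                  match.group("predicate").lower(),
                  match.group("object").lower())
```

With `build_pattern(["controls", "helps control"], "diabetes")`, the text "Regular exercise controls diabetes" yields the subject "Regular exercise". A production extractor would need dictionaries of predicate variants and linguistic annotations, which is the kind of work AQL is designed for.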

Triple:
Pattern
Results

Rank Calculator

Not all content on social media is of good quality, and the amount of poor-quality content is quite high. The presence of both kinds of content has led users to rely on search engines to retrieve useful information for their queries. Beyond simply receiving answers, users want the results to be of good quality and well ordered. Search engines therefore rely on ranking algorithms to order their results.

These are machine learning algorithms that rank the results based on signals such as popularity and relevancy.

We evaluated several machine learning algorithms and selected one of them based on an evaluation metric. The algorithm we chose is Random Forest.

A preliminary step in our application is to create a new set of features to facilitate learning. There are two sets of features: popularity and relevancy. The popularity set consists of the share and like counts of a web URL on various online social media platforms. Similarly, to measure how relevant the extracted patterns are to the user's question, we use a string similarity algorithm.
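The two feature sets can be sketched as follows. The specific string similarity algorithm is not named here, so this example substitutes the ratio from Python's standard `difflib.SequenceMatcher` purely as an illustration; the platform keys in the counts dictionary and the function names are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def popularity(counts: Dict[str, int]) -> int:
    """Sum share and like counts for one URL across platforms."""
    return sum(counts.values())


def relevancy(question: str, extracted: str) -> float:
    """String similarity in [0, 1] between the question and extracted text.

    Uses difflib's matching ratio as a stand-in similarity measure.
    """
    return SequenceMatcher(None, question.lower(), extracted.lower()).ratio()


def feature_vector(question: str, extracted: str,
                   counts: Dict[str, int]) -> List[float]:
    """Combine both feature sets into one row for a learned ranker."""
    return [float(popularity(counts)), relevancy(question, extracted)]
```

A row such as `feature_vector("How to control diabetes?", "exercise controls diabetes", {"twitter_shares": 3, "facebook_likes": 4})` could then be passed, together with a relevance label, to a ranking model such as a random forest.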

Result and Evaluation

Because our research focuses on extracting near real-time health information based on users' search queries, we evaluated our system's results against an existing real-time search engine, Twitter search, and also compared them with Google's time-bound search (custom date range).

As part of this evaluation, we conducted a survey covering three questions about the chronic disease diabetes.

1) How to control diabetes?
2) What are the causes of diabetes?
3) What are the symptoms of diabetes?

Evaluation Metrics

nDCG@K (Normalized Discounted Cumulative Gain)

nDCG@K can handle multiple levels of relevance.

It gives more weight to a document at a higher-ranked position than to one at a lower-ranked position.
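A minimal sketch of the metric, assuming the common linear-gain formulation DCG@K = sum over positions i = 1..K of rel_i / log2(i + 1), normalized by the DCG of the same judgments sorted in the ideal order. Which gain variant was used in the evaluation is not stated, so this choice is an assumption.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results.

    Position i (0-based) is discounted by log2(i + 2), so results
    ranked higher contribute more than those ranked lower.
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the graded judgments of the returned results in ranked order. A perfectly ordered list scores 1.0, and placing highly relevant documents lower in the list reduces the score.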
