Swapnil Soni Thesis
Domain Specific Document Retrieval Framework for Near Real-time Social Health Data.
With the advent of web search and microblogging, the percentage of Online Health Information Seekers (OHIS) using these online services to share and seek real-time health information has increased sharply. OHIS with chronic diseases are especially interested in the latest and most relevant information when searching for health information online <ref> Munmun De Choudhury, Meredith Ringel Morris, and Ryen W White. Seeking and sharing health information online: Comparing search engines and social media. In Proceedings of the 32nd annual ACM conference on Human factors in computing systems, pages 1365–1376. ACM, 2014</ref> <ref>Jaime Teevan, Daniel Ramage, and Meredith Ringel Morris. #TwitterSearch: a comparison of microblog search and web search. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 35–44. ACM, 2011.</ref>. When OHIS turn to search engines or microblogging search services for real-time information, however, the results are not promising. It is extremely difficult for users to retrieve relevant results from a query alone, and they may be overwhelmed by the volume of information. Current systems face the following challenges:
- Results are limited to keyword-based techniques for retrieving useful health information for a given query <ref>Alexander Pretschner and Susan Gauch. Ontology based personalized search. In Tools with Artificial Intelligence, 1999. Proceedings. 11th IEEE International Conference on, pages 391–398. IEEE, 1999</ref>
- Results sometimes do not contain real-time information
- Microblogging search services use only posts or messages to find information (e.g., the Twitter search engine uses tweets to get information)
- Ranking of results is based on relevancy
- Sources of information are not always reliable
In our approach, we use Twitter as the source for document search and rely on four distinguishing features: triple-pattern based mining, near real-time retrieval, search over the content of URLs contained in tweets, and ranking by both popularity and relevancy. First, a triple-pattern (subject, predicate, object) mining technique extracts triple patterns from microblog messages related to chronic health conditions; the triple pattern is defined by the initial question. Second, to make the system near real-time, the search results are divided into intervals of six hours. Third, in addition to tweets, we use the content of the URLs mentioned in tweets as a data source. Finally, the results are ranked according to relevance and popularity, so that at any given time the most relevant information for a question is displayed, rather than results ordered by temporal relevance alone.
We divide users' questions into two categories: static and dynamic. Static questions are the most frequently asked questions, collected from sites such as WebMD and Mayo Clinic. Dynamic questions are typed by the user on the fly. The process of extracting relevant documents differs between the two: for static questions, we extract documents every six hours, while for dynamic questions, we extract documents from the six hours of data preceding the moment the question is asked.
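The difference between the two extraction schedules can be sketched in a few lines of Python. This is only an illustration, not the thesis implementation: the function names are hypothetical, and the alignment of static intervals to 00:00, 06:00, 12:00, and 18:00 UTC is an assumption, since the text only states that the interval is six hours long.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=6)  # interval length stated in the text


def static_window(now):
    """Most recently completed six-hour interval.

    Assumption: intervals are aligned to 00:00, 06:00, 12:00, 18:00 UTC.
    """
    boundary_hour = (now.hour // 6) * 6
    end = now.replace(hour=boundary_hour, minute=0, second=0, microsecond=0)
    return end - WINDOW, end


def dynamic_window(now):
    """Trailing six hours ending at the moment the question is asked."""
    return now - WINDOW, now


if __name__ == "__main__":
    asked_at = datetime(2015, 2, 22, 14, 30, tzinfo=timezone.utc)
    print("static: ", static_window(asked_at))
    print("dynamic:", dynamic_window(asked_at))
```

A static question is therefore answered from a precomputed batch, while a dynamic question triggers extraction over a window that slides with the time of the request.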
Our platform (Social Health Signal) is based on a) large-scale real-time Twitter data processing, b) semantic web techniques and domain knowledge, and c) triple-pattern based text mining. The system is divided into three major components:
- Processing Pipeline: collects tweets and extracts their meta-data.
- Pattern Extractor: extracts documents relevant to a given query.
- Rank Calculator: calculates the rank of the results.
We use tweets (messages shared on Twitter) and the content of URLs mentioned in the tweets as the data sources from which to extract information relevant to a user's query. To extract relevant and recent information from real-time data, the first challenge is to create infrastructure for collecting tweets in real time and extracting their meta-data. In Social Health Signal, we use an Apache Storm component to collect tweets from the public Twitter Streaming API while also performing meta-data extraction. Apache Storm is free <ref>Apache. Storm, distributed and fault-tolerant realtime computation, 2015. [Online; accessed 22-February-2015].</ref>, open source software for real-time, distributed computing. Spouts and bolts are the basic components in Storm for real-time data processing. A network of spouts and bolts is packaged into a ''topology'', which is submitted to a Storm cluster. A tweet has many features, such as text, short URL, latitude and longitude, and retweet count, all of which can be useful for finding relevant information. The topology extracts these features from tweets in real time. This stage is also known as the pre-processing analytic pipeline, because the extracted features and data feed the pattern extraction module.
A spout uses the Streaming API to crawl real-time tweets.
The bolts contain the computation logic for feature extraction in real time. The first bolt is a filter bolt that identifies the language of a tweet and passes only English tweets. Once all computing bolts have finished, the final bolt saves the data into the database.
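The flow from spout through bolts to storage can be illustrated with plain Python generators. This sketch does not use the Apache Storm API and is not the thesis code: every function name is hypothetical, the in-memory list stands in for the database, and the tweet fields (`lang`, `text`, `entities.urls[].expanded_url`, `retweet_count`, `coordinates`) are assumed to follow the Twitter v1.1 tweet JSON layout.

```python
def tweet_spout(raw_tweets):
    """Stand-in for the spout: emits tweets one at a time."""
    for tweet in raw_tweets:
        yield tweet


def english_filter_bolt(tweets):
    """First bolt: pass only tweets whose language tag is English."""
    for tweet in tweets:
        if tweet.get("lang") == "en":
            yield tweet


def feature_bolt(tweets):
    """Computing bolt: pull out the features named in the text."""
    for tweet in tweets:
        urls = tweet.get("entities", {}).get("urls", [])
        yield {
            "text": tweet["text"],
            "urls": [u["expanded_url"] for u in urls],
            "retweet_count": tweet.get("retweet_count", 0),
            "coordinates": tweet.get("coordinates"),
        }


def save_bolt(records, store):
    """Final bolt: persist each record (a list stands in for the database)."""
    for record in records:
        store.append(record)
    return store


def run_topology(raw_tweets):
    """Wire spout and bolts together in the order described above."""
    stream = tweet_spout(raw_tweets)
    stream = english_filter_bolt(stream)
    stream = feature_bolt(stream)
    return save_bolt(stream, [])


if __name__ == "__main__":
    sample = [
        {"lang": "en", "text": "5 easy natural remedies to control diabetes",
         "entities": {"urls": [{"expanded_url": "http://example.com/a"}]},
         "retweet_count": 3},
        {"lang": "es", "text": "hola"},
    ]
    print(run_topology(sample))
```

In a real Storm topology each stage would run as a distributed component with its own parallelism; the generator chain only mirrors the order of the stages.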
Every six hours, once the features have been extracted from the tweets, the information extractor module collects all the stored tweets and their features and applies a triple pattern derived from the user query. We implemented this module on Apache Hadoop; it contains three sub-modules: the URL content extractor, the social media share and like counts extractor, and the pattern extractor.
- URL content extractor:
This is the first module to execute; it extracts the content of the URLs mentioned in the tweets.
- Social media share and like counts extractor:
People share URLs on social media to point to more detailed information, and click like buttons to show approval of shared links. These share and like counts indicate the popularity of a URL on social media (Facebook share count, Facebook like count, Twitter share count, and Google domain PageRank).
- Pattern extractor:
This module extracts relevant documents based on an AQL (Annotation Query Language) query. A document is considered relevant when a triple pattern is found in its URL content or tweet text.
To extract relevant information or documents, we used AQL, the query language of IBM's text analytics tooling. AQL helps developers build queries that extract structured information from unstructured or semi-structured text. We used AQL to construct triple patterns. A triple pattern consists of three parts: subject, predicate, and object. The subject and object are each a noun or noun phrase; the predicate is a verb, verb phrase, noun, or noun phrase.
Question: How to control diabetes?
Triple patterns:
X → control → diabetes
X → control → blood sugar
X → handle → blood sugar
X → handle → diabetes
Tweet: 5 easy natural remedies to control diabetes : If you are a diabetic or know someone who is a diabeti...
Retrieved document: 5 easy natural remedies to control diabetes
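The example above can be approximated with a regular expression. This is a deliberately simplified stand-in, not AQL and not the thesis implementation: the predicate and object lists are copied from the four patterns shown, the subject X is taken to be whatever text precedes the predicate, and the names `TRIPLE_RE` and `match_triple` are hypothetical.

```python
import re

# Predicates and objects taken from the four example patterns above.
PREDICATES = ["control", "handle"]
OBJECTS = ["diabetes", "blood sugar"]

# subject (lazy) + optional "to" + predicate + object
TRIPLE_RE = re.compile(
    r"(?P<subject>.+?)\s+(?:to\s+)?(?P<predicate>%s)\s+(?P<object>%s)\b"
    % ("|".join(PREDICATES), "|".join(re.escape(o) for o in OBJECTS)),
    re.IGNORECASE,
)


def match_triple(text):
    """Return (subject, predicate, object) if the text fits a pattern, else None."""
    m = TRIPLE_RE.search(text)
    if m is None:
        return None
    return (
        m.group("subject").strip(),
        m.group("predicate").lower(),
        m.group("object").lower(),
    )


if __name__ == "__main__":
    tweet = ("5 easy natural remedies to control diabetes : "
             "If you are a diabetic or know someone who is a diabeti...")
    print(match_triple(tweet))
    print(match_triple("Diabetes cases are on the rise"))
```

A rule language such as AQL works on part-of-speech and dictionary annotations rather than raw strings, so it can enforce that the subject and object are noun phrases; the regular expression cannot.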
Beyond simply receiving results, users want the results to be of good quality and well ordered. Existing search engines focus on ranking algorithms that order results by relevancy. We use machine learning to rank the results based on popularity, relevancy, and reliability. We evaluated several machine learning algorithms and selected one of them based on evaluation metrics. The algorithm we chose is ''Random Forest''.
A preliminary step in our application is to create a set of features to facilitate learning. There are three sets of features: popularity, relevancy, and reliability. The popularity set consists of the share and like counts of a URL on various online social media platforms. To measure how relevant the extracted patterns are to the user's question, we use a string similarity algorithm. Finally, we use the Google domain PageRank of a URL as the reliability feature. The accompanying figure shows the precision and recall of the classifiers.
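A minimal sketch of how one document could be turned into a feature vector for the ranker is given below. Several details are assumptions: the text does not name the string similarity algorithm, so `difflib.SequenceMatcher` from the Python standard library is used as a stand-in, and the dictionary keys and function names are hypothetical.

```python
from difflib import SequenceMatcher


def relevancy(question, matched_text):
    """String similarity in [0, 1]; SequenceMatcher is a stand-in algorithm."""
    return SequenceMatcher(None, question.lower(), matched_text.lower()).ratio()


def feature_vector(question, doc):
    """Popularity, relevancy, and reliability features for one document."""
    return [
        doc["facebook_shares"],                     # popularity
        doc["facebook_likes"],                      # popularity
        doc["twitter_shares"],                      # popularity
        relevancy(question, doc["matched_text"]),   # relevancy
        doc["domain_pagerank"],                     # reliability
    ]


if __name__ == "__main__":
    doc = {
        "facebook_shares": 120,
        "facebook_likes": 340,
        "twitter_shares": 55,
        "matched_text": "5 easy natural remedies to control diabetes",
        "domain_pagerank": 6,
    }
    print(feature_vector("How to control diabetes?", doc))
```

Vectors of this shape, paired with relevance labels, are the kind of input a Random Forest classifier would be trained on.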
Results and Evaluation
As our research focuses on extracting near real-time health information based on users' search queries, we evaluated our system's results against an existing real-time search engine, the Twitter search engine. We also compared our results with Google's time-bound search (custom date range) as a second real-time baseline.
We evaluated our system on three criteria: reliability, relevancy, and timeliness. For relevancy, we conducted a survey built around three questions about the chronic disease diabetes. For reliability, we checked the Google domain PageRank of each URL (extracted news article); our filtering criterion is that a URL's Google domain PageRank must be greater than 4. For timeliness, we considered only the most recent six hours of data when answering a user's query. The three survey questions were:
- Query1: How to control diabetes?
- Query2: What are the causes of diabetes?
- Query3: What are the symptoms of diabetes?
We measure the objective performance of our system using nDCG@K. nDCG (normalized discounted cumulative gain) is a ranking metric commonly used to measure the performance of search engines. It scores a system's ranked list of documents against the ideal ordering of those documents by relevance. Its values range from 0.0 to 1.0, where 1.0 represents the ideal ranking. Under nDCG, highly relevant documents contribute more to the score when they appear near the top of the result list.
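For reference, nDCG@K can be computed as in the sketch below. It assumes the linear-gain form of DCG, in which the graded relevance of the document at rank i is divided by log2(i + 1); the text does not say whether this form or the exponential-gain variant (2^rel − 1) was used in the evaluation.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Graded relevance judgments in the order the system returned the documents.
    print(ndcg_at_k([3, 2, 1], k=3))  # already in ideal order
    print(ndcg_at_k([1, 2, 3], k=3))  # most relevant document ranked last
```

The logarithmic discount is what rewards placing highly relevant documents at the top: the same document contributes less to the score the further down the list it appears.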