PREDOSE

Revision as of 02:46, 25 May 2013

PREDOSE (PREscription Drug abuse Online Surveillance and Epidemiology project) is an NIH-funded (July 2011 - July 2013) inter-disciplinary project between the Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis) and the Center for Interventions, Treatment and Addictions Research (CITAR) at Wright State University. The overall aim of PREDOSE is to develop techniques to facilitate prescription drug abuse epidemiology related to the illicit use of pharmaceutical opioids. PREDOSE is designed to capture the knowledge, attitudes and behaviors of prescription drug abusers through the automatic extraction of semantic information (including entities, relationships, triples, sentiments and template pattern expressions) from social media.


PREDOSE in the Media

Semantic App Helps Researchers Understand Prescription Drug Abuse (article on Semanticweb.com, June 11, 2012)

Overview

The non-medical use of pharmaceutical opioids has been identified as one of the fastest growing forms of drug abuse in the U.S. The White House Office of National Drug Control Policy (ONDCP) has recently launched the Epidemic: Responding to America’s Prescription Drug Abuse Crisis initiative to curb the prescription drug abuse problem, mainly through education and drug monitoring programs. The White House initiative has been prompted by recent research that associates the rise in prescription drug abuse with two important phenomena: 1) expanded pathways to heroin addiction and 2) escalating rates of accidental overdose deaths. To combat these trends, public health professionals require timely and reliable information on new and emerging trends in prescription drug abuse.

Although existing epidemiological data systems provide critically important information about drug abuse trends, they are often time-lagged. Hence, there is a critical need for epidemiological sources that could complement existing drug trend monitoring systems and enhance their capacity for early identification of new and emerging trends. The World Wide Web (Web) has been identified as one of the leading data sources for detecting patterns and changes in the non-medical use of pharmaceutical and other illicit drugs. Many Web 2.0 empowered social platforms, including web forums, provide venues for individuals to freely share their experiences, post questions, and offer comments about different drugs. The PREDOSE project aims to leverage web forum data to provide such timely emerging information on the non-medical use of pharmaceutical opioids.


The PREDOSE project therefore has two (2) specific aims:

Goals
  1. To determine user knowledge, attitudes and behavior related to the non-medical use of pharmaceutical opioids (namely buprenorphine) as discussed on Web-based forums.
  2. To determine spatio-temporal-thematic trends in pharmaceutical opioid abuse as discussed on Web-based forums.
PREDOSE Team

Principal Investigators: Raminta Daniulaityte, Amit P. Sheth
Co-Investigators: Robert Carlson, Russel Falck
Graduate Students: Delroy Cameron, Lu Chen, Gary A. Smith, Gaurish Anand, Revathy Krishnamurthy, Nishita Jaykumar, Swapnil Soni
External Collaborators: Pablo N. Mendes
Post Doctoral Researchers: Kera Z. Watkins
Visiting Researchers: Drashti Dave
Past Members: Matthan Sink, Michael Cooney, Sujan Perera, Mandeep Singh, Pratik Desai, Mary Oberer, Kaustav Saha

PREDOSE Project Overview

Problem: Historically, qualitative research in drug abuse intervention programs has relied on manual data collection practices. Data have been gathered from interactive interview sessions with single users or groups of users. Interviews are typically transcribed into text, then manually annotated by researchers to identify themes from the interview sessions. This process is called qualitative coding. Qualitative research tools such as NVivo have been used to facilitate such content analysis. However, the intensive manual effort required to perform the annotations is not scalable and hence is not practical for Web-based data. Instead, to effectively process the large volume of heterogeneous Web-based data, the field requires a highly automated way of collecting, processing and analyzing semantic metadata from the web.

Proposed Solution: To automate the extraction of semantic metadata, researchers from the Kno.e.sis Center at Wright State University endeavor to build on prior research to address this complex problem. In particular, researchers at Kno.e.sis have successfully applied Semantic Web techniques to account for shortcomings in Machine Learning and Natural Language Processing techniques, in order to automatically extract knowledge from structured biomedical text and social media (specifically tweets). Substantial progress in understanding content and identifying social perceptions in informal text from sources including MySpace, Facebook, and Twitter has been made through metadata extraction and spatio-temporal-thematic analysis (i.e., semantic analysis). These cutting-edge information processing techniques, with appropriate adaptations, can now be exploited to fit the needs of public health and drug abuse research on conversational and informal text, such as that occurring in web forums.

Research Plan

The overall research plan has three (3) distinct stages:

  1. Data Collection: In this stage, Kno.e.sis researchers intend to develop scalable data collection alternatives to manual interviews. Web-based data collection operates under the assumption that web forum data is more representative of prescription drug abuse practices than data from manually conducted interviews. Therefore, we developed a suite of web crawling software that collects data from web forums.
  2. Automatic Qualitative Coding: In this stage, the research team endeavors to automatically extract semantic information from text, deemed semantically equivalent to human-generated qualitative codes, mainly through entity identification, relationship identification and complete triple extraction. This process aims at capturing the semantics of information expressed in the web forums with sufficient accuracy to enable subsequent analysis. The complete range of techniques for triple extraction includes rule-based, pattern-based, statistical/probabilistic and semantics-based analysis, all of which will play a critical role in this stage.
  3. Data Analysis & Interpretation: This is the final stage of the project. The resulting RDF data (i.e. Drug Abuse Ontology - DAO) collected from stage 2 will be analyzed using existing semantic web tools at Kno.e.sis, or new tools to be developed where appropriate. Tasks such as search, automatic summarization, reasoning and discovery are anticipated outcomes from this stage.
Fig1: Research Plan

The PREDOSE platform consists of various sub-components. We discuss each of these in further detail below:

Stage 1: Data Collection

  • Web Site Selection: Web forums selected for the study are chosen based on the following criteria: 1) they allow free discussion of psychoactive drug use; 2) they contain information on illicit pharmaceutical drug use; and 3) they are publicly accessible. Additionally, since it is important that this study collects relevant and timely information, such forums are also expected to be very active, both in terms of number of users and topics of discussion.
  • Web Crawling: Various popular tools (e.g. the Nutch crawler, the Jericho HTML Parser etc.) exist for crawling and parsing web data. Periodic crawling is necessary to update our databases with the most recent data published by the selected sources. Standardized web forum software somewhat alleviates the traditional problems involved with mining web data. The use of such software enables our custom crawlers to exploit the structure of web forum sites.
  • Data Cleaning: One of the most challenging problems in dealing with web data is decoding special HTML characters to obtain ASCII text and separating special characters from standard text.
  • Location Resolution: Collecting location data is important for spatio-temporal-thematic analysis. It would not be surprising if drug abuse practices for some specific drugs (e.g. heroin) vary widely across continents. The most anticipated variations are likely in drug mixtures. For example, it may be popular culture to use heroin+cocaine in one region, while this practice is entirely uncommon in another.
  • Informal Text Database: It is necessary to collect and store a wide selection of data for this study. Some database tables include users, posts, source and location (city, state, country, continent, zip).
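
The data cleaning and storage steps above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's actual crawler code: the table columns beyond those named above (for example handle and posted_at), the function names, and the choice of SQLite are all assumptions made for the example.

```python
import html
import re
import sqlite3
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper, adequate for a sketch


def clean_post(raw_html: str) -> str:
    """Strip tags, decode HTML entities, and reduce the text to plain ASCII."""
    text = TAG_RE.sub(" ", raw_html)            # drop markup first
    text = html.unescape(text)                  # &amp; -> &, &eacute; -> e-acute
    text = unicodedata.normalize("NFKD", text)  # split accents from base letters
    text = text.encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())               # collapse whitespace


# Hypothetical schema mirroring the tables named in the text.
SCHEMA = """
CREATE TABLE source   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE location (id INTEGER PRIMARY KEY, city TEXT, state TEXT,
                       country TEXT, continent TEXT, zip TEXT);
CREATE TABLE users    (id INTEGER PRIMARY KEY, handle TEXT NOT NULL,
                       location_id INTEGER REFERENCES location(id));
CREATE TABLE posts    (id INTEGER PRIMARY KEY,
                       user_id INTEGER REFERENCES users(id),
                       source_id INTEGER REFERENCES source(id),
                       posted_at TEXT, body TEXT NOT NULL);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the informal text database (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_post(conn, user_id, source_id, posted_at, raw_html) -> int:
    """Clean a crawled post and insert it; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO posts (user_id, source_id, posted_at, body) "
        "VALUES (?, ?, ?, ?)",
        (user_id, source_id, posted_at, clean_post(raw_html)),
    )
    conn.commit()
    return cur.lastrowid
```

Tags are stripped before entities are decoded so that escaped text such as &lt;b&gt; in a post body is kept as literal characters and not mistaken for markup.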

Stage 2: Automatic Qualitative Coding

This is the most challenging aspect of this project. The aim is to use various information extraction techniques to extract triples from web forum data. Such extraction is to be undertaken in three steps:

  • Entity Identification: The most challenging aspect of entity identification from web forum data is the informal nature of the text. Web forum data is characterized by a proliferation of slang terms instead of standard references to known drugs. Fortunately, mappings from slang terms to known drugs are available online through various sources (e.g. NIDA, NDCP, Erowid, Urban Dictionary etc.). We exploit these sources as a starting point for recognizing slang terms that reference known drugs. However, these mappings create the unfortunate side effect of ambiguity. "Oxy" can refer to Oxycontin, Generic Oxycontin, Oxycontin OP or Oxycontin OC. Hence, some techniques for slang term disambiguation become necessary. We have so far taken a probabilistic approach to entity disambiguation, since the terms surrounding an ambiguous slang term are also slang and therefore do not help semantics-based approaches that leverage the ontology schema.
  • Relationship Extraction: We anticipate that the success of our entity extraction, along with the Drug Abuse Ontology schema, will directly impact the relationship extraction. However, alternative relationship extraction techniques have been covered elsewhere and will be adapted where appropriate.
  • Triple Extraction: Previous work in the lab has successfully implemented rule-based triple extraction (Ramakrishnan C, Mendes P. N. etc) on structured biomedical literature. In other work, (Thomas C, Mehra P, etc) have implemented a statistical/probabilistic approach to triple extraction, also on structured text. Such techniques are not likely to apply to informal web forum text. Hence, one approach is to translate our informal text into structured text once entities and relationships have been identified. Alternatively, stand-alone pattern-based, probabilistic and semantics-based techniques can be used to complete triple extraction, depending on the effectiveness of the entity and relationship extraction.
  • Drug Abuse Ontology (DAO): The final output of the triple extraction is the population of the Drug Abuse Ontology instance base. We intend to maintain this, together with the DAO schema, as a dynamic ontology created from user-generated content (UGC). The current DAO is available online.
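
To make the slang disambiguation step concrete, here is a minimal Python sketch. The text above only says that a probabilistic approach was taken, so the specific method shown (a naive Bayes style score over surrounding words, with add-one smoothing and a uniform prior) is an assumption, and the slang dictionary and context counts are invented purely for illustration.

```python
import math
from collections import Counter

# Hypothetical slang-to-drug mappings; real ones would be compiled from
# the online sources mentioned in the text.
SLANG = {
    "oxy": ["Oxycontin", "Generic Oxycontin", "Oxycontin OP", "Oxycontin OC"],
    "bupe": ["Buprenorphine"],
}

# Hypothetical counts of words seen near unambiguous mentions of each drug.
# Candidates without counts fall back on the smoothing term alone.
CONTEXT_COUNTS = {
    "Oxycontin OP": Counter({"gel": 4, "new": 3, "formula": 3}),
    "Oxycontin OC": Counter({"old": 4, "crush": 3, "formula": 1}),
}


def disambiguate(term, context, alpha=1.0):
    """Return the most likely drug for a slang term given nearby words.

    Scores each candidate by the smoothed log-likelihood of the context
    words (uniform prior over candidates). Returns None for unknown terms.
    """
    candidates = SLANG.get(term.lower(), [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    # Vocabulary: every word with a count for any candidate, plus the context.
    vocab = set(context)
    for cand in candidates:
        vocab |= set(CONTEXT_COUNTS.get(cand, {}))

    def score(cand):
        counts = CONTEXT_COUNTS.get(cand, Counter())
        total = sum(counts.values())
        denom = total + alpha * len(vocab)
        return sum(math.log((counts[w] + alpha) / denom) for w in context)

    return max(candidates, key=score)
```

For example, disambiguate("oxy", ["new", "formula", "gel"]) picks "Oxycontin OP" with these toy counts, while a context of ["old", "crush"] picks "Oxycontin OC". In practice the counts would have to be estimated from posts that mention each drug unambiguously.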

Stage 3: Data Analysis & Interpretation

  • Semantic Web Tools: Many tools for data analysis exist at Kno.e.sis. These include 1) Twitris for spatio-temporal-thematic analysis, 2) Cuebee for automatic complex query creation over RDF data and 3) Scooner for guided navigation of documents annotated with semantic metadata (entities or triples). Once the DAO has been created, the data can be easily loaded into any of these tools to support analysis. Alternatively, new tools can be created on demand.
  • Spatio-Temporal-Thematic Analysis: Discussion on the integration of web forum data into Twitris has already begun. Owing to the use of the slang term dictionary, qualitative researchers will be able to observe posts that contain easily identifiable and non-ambiguous references to known drugs in various locations.
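
As a rough illustration of what a spatio-temporal-thematic view involves, the Python sketch below counts drug mentions per region and month. It is not how Twitris works internally; the post structure (region, date and drugs fields) and the function name are hypothetical, and the drug names are assumed to have been resolved already by the entity identification step.

```python
from collections import Counter
from datetime import date


def mentions_by_region_month(posts):
    """Count posts mentioning each drug, keyed by (region, 'YYYY-MM', drug).

    Each post is a dict with hypothetical keys: 'region' (str),
    'date' (datetime.date) and 'drugs' (list of resolved drug names).
    A drug is counted at most once per post.
    """
    counts = Counter()
    for post in posts:
        month = post["date"].strftime("%Y-%m")
        for drug in set(post["drugs"]):
            counts[(post["region"], month, drug)] += 1
    return counts


# Usage with made-up posts:
posts = [
    {"region": "Ohio", "date": date(2012, 6, 3),
     "drugs": ["Buprenorphine", "Buprenorphine"]},
    {"region": "Ohio", "date": date(2012, 6, 20), "drugs": ["Buprenorphine"]},
    {"region": "Texas", "date": date(2012, 7, 1), "drugs": ["Loperamide"]},
]
trend = mentions_by_region_month(posts)
```

Keying on the (region, month, drug) triple keeps the spatial, temporal and thematic dimensions separable, so the same counts can be sliced along any one of them.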


Live Web Application

http://knoesis-hpco.cs.wright.edu/predose/

Publications

  1. D. Cameron, N. Jaykumar, G. Anand, K. Thirunarayan, G. A. Smith, A. P. Sheth, S. Soni, K. Z. Watkins. Knowledge-Aware Search (under review)
  2. D. Cameron, G. A. Smith, R. Daniulaityte, A. P. Sheth, L. Chen, G. Anand, R. Carlson, K. Z. Watkins, R. Falck. PREDOSE: A Semantic Web Platform for Drug Abuse Epidemiology using Social Media (under review)
  3. Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang and Amit P. Sheth. Extracting Diverse Sentiment Expressions with Target-dependent Polarity from Twitter. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), 2012.
  4. R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, A. P. Sheth. "I Just Wanted to Tell You That Loperamide WILL WORK": A Web-Based Study of Extra-Medical Use of Loperamide. Journal of Drug and Alcohol Dependence. (2012). (in press).
  5. R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, A. P. Sheth. A Web-Based Study of Self-Treatment of Opioid Withdrawal Symptoms with Loperamide. The College on Problems of Drug Dependence CPDD 2012, Palm Springs, CA USA, June 9-14, 2012.

Related

  1. Researchers use social web forum data to understand nonmedical use of painkillers
  2. Semantic App Helps Researchers Understand Prescription Drug Abuse (news article on semanticweb.com)
  3. PREDOSE @CITAR
  4. U.S. Targeting Prescription Drug Abuse
  5. Twitter Helps Determine "Morning People" and "Night Owls"

Funding

This project is sponsored by the National Institutes of Health (NIH) Grant No. R21 DA030571-01A1 awarded to the Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis) and the Center for Interventions, Treatment and Addictions Research (CITAR) titled “A Study of Social Web Data on Buprenorphine Abuse using Semantic Web Technology.” Any opinions, findings, conclusions or recommendations expressed in this material are those of the investigator(s) and do not necessarily reflect the views of the National Institutes of Health.

Contact: Delroy Cameron