EMPWR: Knowledge Graph Development Platform

From Knoesis wiki

Background and Motivation

A Knowledge Graph (KG) is an encapsulation of structured knowledge in a graphical representation and is used for a variety of information processing and management tasks, such as

  • Data & knowledge integration from diverse sources
  • Improving automation
  • Enabling a new generation of applications
  • Empowering machine learning (ML) & NLP techniques with domain knowledge

and applications such as question answering, summarization, text simplification, and Named Entity Recognition (NER).

Most existing KG platforms & tools are limited in

  • Provenance
  • Dynamicity (i.e., static schema vs. schema generation)
  • Temporality
  • Domain specificity
  • Modularity

The AIISC Knowledge Graph (KG) EMPWR effort involves the development of a comprehensive tool and platform for KG development with the following aims

1. Develop a KG development platform capable of instantiating KGs in any domain from structured, semi-structured, and unstructured data:

    • Biomedical & pharmaceutical domain with Percuro

2. Improve & address the limitations of existing KG platforms

3. Construct a Knowledge Graph based on a combination of:

    • Enrich an existing Knowledge Graph (Top-down declarative)
    • Construct a Knowledge Graph out of given entities (Bottom-up data driven)

Goals & Use-Cases

The goals of KGs are to provide

Collaborations

Percuro is a collaborative research project involving WIPRO, The AI Institute at University of South Carolina (AIISC), and IIT-Patna (IIT-P). It involves the development of a semantic (i.e., knowledge-graph-enhanced) approach to natural language processing (NLP), natural language generation (NLG), and natural language understanding (NLU) targeted at the pharmaceutical domain. It will involve techniques for NLP/NLG/NLU on biomedical and clinical documents relevant to pharmaceutical markets.

Percuro aims to solve tasks such as (a) text simplification, (b) summarization, and (c) question answering. These tasks are not straightforward and require more information than the text itself provides.

Overview

Project: Multimodal Knowledge Graphs (MMKGs) Construction and Analysis to Accelerate Pharmaceutical Drug Discovery with EMPWR

One of the main challenges in the pharmaceutical industry is navigating and integrating vast, heterogeneous datasets, from scientific literature and clinical trial results to genomic and molecular data. Knowledge Graphs (KGs) have emerged as a promising technology for structuring this information, representing real-world entities like drugs, diseases, and genes as nodes, and their relationships as edges. However, building and maintaining these KGs is a complex, continuous process. The EMPWR platform is a comprehensive solution designed to manage the entire KG lifecycle. It was developed to bridge the gap between traditional symbolic AI systems, which are explainable but difficult to scale, and modern data-driven systems like deep learning and transformer-based models, which are powerful but often function as "black boxes".
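
As a minimal sketch of the node-and-edge representation described above (not EMPWR's internal data model, which this page does not specify), a KG can be held as a list of (subject, predicate, object) triples and indexed as an adjacency map. The relation names `targets` and `treats` and the helper `build_graph` are illustrative assumptions; the facts themselves are the ones used later on this page.

```python
from collections import defaultdict

# Illustrative triples (subject, predicate, object).
# Relation names are assumptions made for this sketch.
triples = [
    ("Donepezil", "targets", "Acetylcholinesterase"),
    ("Donepezil", "treats", "Alzheimer's disease"),
    ("GSK-3β", "phosphorylates", "Tau protein"),
]


def build_graph(triples):
    """Index triples as an adjacency map: node -> list of (edge label, neighbor)."""
    graph = defaultdict(list)
    for subj, pred, obj in triples:
        graph[subj].append((pred, obj))
    return graph


graph = build_graph(triples)

# Outgoing edges of the "Donepezil" node.
for pred, obj in graph["Donepezil"]:
    print(f"Donepezil --[{pred}]--> {obj}")
```

A production KG would normally live in a triple store or graph database; the in-memory map only shows the shape of the data.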


Figure: The EMPWR platform architecture integrated with the Common Metadata Framework (CMF), and the Intelligent Data Store (IDS) for the KG’s lifecycle.

EMPWR is a platform that supports the creation, enrichment, management, and maintenance of large-scale KGs. It utilizes a hybrid approach, combining symbolic and data-driven techniques to automate and scale the KG development process. The platform is designed to be domain-agnostic and handle data from unstructured, semi-structured, and structured sources. The platform addresses key challenges in KG development:

  • Scalability: It automates knowledge extraction and enrichment, reducing the manual labor and time required to build a KG from the ground up.
  • Explainability and Trust: By integrating with the Common Metadata Framework (CMF), EMPWR captures detailed metadata, provenance, and lineage for the entire workflow. This allows researchers to trace information back to its source, ensuring reproducibility and building trust in the KG's outputs.
  • Performance: EMPWR can be integrated with the Intelligent Data Store (IDS), a high-performance, in-memory triple datastore that enables rapid querying and analysis of massive graphs.
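
The provenance idea in the list above can be sketched as follows. This is not the CMF API (which is not shown on this page); the `Provenance` and `Fact` classes, the `trace` helper, and the extractor name `toy-text-extractor` are all hypothetical, and only illustrate attaching a source and an extraction tool to each fact so that it can be traced back later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    source: str     # where the fact came from (URL or dataset name)
    extractor: str  # hypothetical name of the model or tool that produced it
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    provenance: Provenance


def trace(fact):
    """Render a fact together with the source and tool it came from."""
    p = fact.provenance
    return f"({fact.subject}, {fact.predicate}, {fact.obj}) <- {p.source} via {p.extractor}"


fact = Fact(
    "GSK-3β",
    "phosphorylates",
    "Tau protein",
    Provenance(
        source="https://pmc.ncbi.nlm.nih.gov/articles/PMC8063930/",
        extractor="toy-text-extractor",  # assumed name, for illustration only
    ),
)
print(trace(fact))
```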

EMPWR's architecture is modular and built around the standard KG lifecycle, which includes design, data ingestion, enrichment, storage, consumption, and maintenance. Its core components include:

  • A Front End: A user interface for data uploading and querying the KG.
  • Knowledge Extraction: A module equipped with a suite of NLP toolkits and language models for knowledge extraction.
  • Knowledge Enrichment: A component that augments the extracted knowledge by linking it to external, authoritative sources like DBpedia, DrugBank, and the Unified Medical Language System (UMLS).

Extending EMPWR for Multimodal Knowledge Graphs (MMKGs)

Drug discovery relies on more than just text. It involves analyzing 2D/3D molecular structures, protein models, medical imaging, and time-series data from experiments. A Multimodal Knowledge Graph (MMKG) extends the traditional KG by incorporating these diverse data modalities, linking an entity not just to textual facts but also to relevant images, videos, or other non-textual data. The EMPWR platform's modular design allows it to be extended to build and manage MMKGs by incorporating modality-specific processing models into its Knowledge Extraction and Enrichment modules. For instance, vision-language models like CLIP can be used to process and understand images of chemical structures, linking them to their corresponding textual descriptions within the graph.

Use Case: Building a Pharma MMKG (Percuro-built Pharma-KG) for Drug Discovery

This section illustrates how EMPWR can be used to construct an MMKG to accelerate drug discovery, for example, by identifying new targets or repurposing existing drugs for Alzheimer's disease.

Step 1: Participatory Design & Data Ingestion

The process begins by identifying and ingesting data from multiple siloed sources:

  • Structured/Semi-structured Data:
    • DrugBank: Information on existing drugs, their targets, and mechanisms of action.
    • UMLS: A metathesaurus of medical vocabularies and terminologies.
    • Hetionet: An integrative network of biomedical knowledge.
  • Unstructured Text:
    • PubMed: Scientific literature and research articles detailing experimental findings.
  • Multimodal Data:
    • PubChem: A database of chemical molecules with 2D structures.
    • Protein Data Bank (PDB): A repository of 3D protein structures.

Step 2: Knowledge Extraction

Once ingested, EMPWR’s Knowledge Extraction module processes the data.

  • From Text: EMPWR extracts triples from PubMed articles. For example, from a paper abstract, it might extract (GSK-3β, phosphorylates, Tau protein) (source: https://pmc.ncbi.nlm.nih.gov/articles/PMC8063930/).

  • From Images (MMKG Extension): The framework uses vision-language models to process images. For example, it identifies the 2D image of a molecule from PubChem as Donepezil and extracts its structural features (source: https://www.drugs.com/donepezil.html).
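
To make the text-extraction step concrete, here is a deliberately simple, pattern-based sketch. It is a stand-in for the NLP toolkits and language models EMPWR uses, which are not specified here; the verb list `RELATION_VERBS` and the function `extract_triples` are assumptions for illustration, and a real extractor would rely on entity recognition and relation classification rather than a regular expression.

```python
import re

# Hypothetical list of relation verbs that this toy extractor recognizes.
RELATION_VERBS = ("phosphorylates", "inhibits", "involves")

# "<subject> <verb> <object>", where subject and object are runs of
# word characters, hyphens, and spaces.
PATTERN = re.compile(
    r"(?P<subj>[\w\- ]+?)\s+(?P<pred>" + "|".join(RELATION_VERBS) + r")\s+(?P<obj>[\w\- ]+)"
)


def extract_triples(text):
    """Return (subject, predicate, object) triples found sentence by sentence."""
    triples = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        match = PATTERN.search(sentence)
        if match:
            triples.append(
                (match["subj"].strip(), match["pred"], match["obj"].strip())
            )
    return triples


abstract = "GSK-3β phosphorylates Tau protein. Nothing to extract here."
print(extract_triples(abstract))
```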

Step 3: Knowledge Enrichment

The extracted entities and relations are then enriched by linking them to external knowledge stores.

  • Example: The entity Tau protein is linked to its UMLS concept ID for standardization. The extracted triple (GSK-3β, phosphorylates, Tau protein) is enriched with information from DBpedia and other sources, confirming the biological context. The Donepezil entity is linked to its DrugBank entry to add information about its use in treating Alzheimer's.
  • Multimodal Linkage: The 2D structure image of Donepezil is now explicitly linked in the graph to the Donepezil entity node, which is also linked to its target protein, Acetylcholinesterase.
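
The linking step can be sketched as a lookup against external identifier tables. Everything below is a hedged illustration: `EXTERNAL_IDS` stands in for calls to UMLS or DrugBank, the identifiers are explicit placeholders rather than real concept or drug IDs, and `normalize` and `enrich` are hypothetical helper names.

```python
# Hypothetical local lookup table standing in for external sources.
# The identifiers are placeholders, NOT real UMLS or DrugBank IDs.
EXTERNAL_IDS = {
    "tau protein": {"UMLS": "UMLS:PLACEHOLDER-TAU"},
    "donepezil": {"DrugBank": "DRUGBANK:PLACEHOLDER-DONEPEZIL"},
}


def normalize(name):
    """Lowercase and collapse whitespace so surface variants match one key."""
    return " ".join(name.lower().split())


def enrich(entity):
    """Attach external identifiers to an entity label when a match exists."""
    links = EXTERNAL_IDS.get(normalize(entity), {})
    return {"label": entity, "links": dict(links), "linked": bool(links)}


print(enrich("Tau  Protein"))
print(enrich("Unknown entity"))
```

Entities that cannot be linked are kept with `linked` set to `False`, so they can be routed to manual review instead of being silently dropped.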

Step 4: Schema Generation, Storage, and Refinement

EMPWR automatically infers a schema from the extracted triples, which can be validated by a user. Example schema: (Gene, interactsWith, Protein), (Drug, hasChemicalStructure, Image), (Protein, has3DModel, 3D_Model).
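
One simple way to think about schema inference is to replace each entity in a triple with its type and count the resulting type-level patterns. The sketch below assumes entity types are already known (`entity_types` is a hand-written, hypothetical mapping, and the image file name is invented); it is not a description of EMPWR's actual inference procedure.

```python
from collections import Counter

# Hypothetical entity typing; in practice this would come from the
# extraction and enrichment steps.
entity_types = {
    "Donepezil": "Drug",
    "Acetylcholinesterase": "Protein",
    "donepezil_2d.png": "Image",  # invented file name for illustration
}

triples = [
    ("Donepezil", "hasTarget", "Acetylcholinesterase"),
    ("Donepezil", "hasChemicalStructure", "donepezil_2d.png"),
]


def infer_schema(triples, entity_types):
    """Count (subject type, predicate, object type) patterns over instance triples."""
    counts = Counter()
    for subj, pred, obj in triples:
        pattern = (
            entity_types.get(subj, "Unknown"),
            pred,
            entity_types.get(obj, "Unknown"),
        )
        counts[pattern] += 1
    return counts


schema = infer_schema(triples, entity_types)
for pattern, count in schema.items():
    print(pattern, count)
```

The counts give a user validating the schema a rough sense of how well supported each pattern is.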

The resulting MMKG, containing millions of interconnected nodes and relationships, is stored in a high-performance graph database like the IDS for scalable querying. The entire process, from the models used for extraction to the sources used for enrichment, is logged by the Common Metadata Framework (CMF) for full provenance and traceability.

Step 5: Consumption & Maintenance (Querying the Percuro-built Pharma-MMKG)

Here is how the Percuro-MMKG (with the capabilities of EMPWR to continuously evolve and ingest new data sources and knowledge) can be used to support the following discoveries:

Drug Repurposing for COVID-19 with Baricitinib

This illustration shows how EMPWR could have been used to identify the rheumatoid arthritis drug Baricitinib as a treatment for COVID-19 (source: https://pmc.ncbi.nlm.nih.gov/articles/PMC8250677/).

  • Data Ingestion:
    • Unstructured Text: A corpus of the latest COVID-19 research papers from PubMed mentioning symptoms and biological pathways.
    • Structured Data: Datasets from public sources like DrugBank, Wikidata, and Hetionet are integrated into the platform.
  • Knowledge Extraction & Enrichment: From a sentence like, "Severe COVID-19 is associated with a cytokine storm, a process involving the Janus kinase (JAK) signaling pathway", the system extracts the relation (Cytokine Storm, involves, JAK Pathway).
    • The entity "JAK Pathway" is then enriched and linked to specific proteins in that pathway, such as JAK1 and JAK2.
  • Querying the Graph with Actual Triples: The system now searches the integrated knowledge graph to find approved drugs that interact with the newly implicated "JAK Pathway". The query traverses the following actual triples from established databases:


  • From Hetionet (a public biomedical knowledge graph):
    • Compound::DB09079 -[inhibits]-> Gene::3716
    • Translation: The compound with DrugBank ID DB09079 (Baricitinib) inhibits the gene with Entrez ID 3716 (JAK1).
  • From Wikidata (a public, collaborative knowledge base):
    • (wd:Q27276182) --[wdt:P129]--> (wd:Q13550805)
    • Translation: The Wikidata item for Baricitinib (wd:Q27276182) has the property "has target" (wdt:P129) pointing to the item for Janus Kinase 1 (wd:Q13550805).
  • From DrugBank (a comprehensive drug database):
    • (drugbank:DB09079) --[hasTarget]--> (drugbank_target:BE0005089)
    • Translation: The drug Baricitinib (drugbank:DB09079) has the target with ID BE0005089, which corresponds to the Janus Kinase 1 protein.
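
The query described above can be sketched as a small traversal: start from the literature-derived relation, expand the pathway to its member proteins, then collect drugs that target any member and record which sources support each one. The triples and identifiers below are copied from the text above and have not been independently verified; the alignment table `SAME_AS`, the `PATHWAY_MEMBERS` map, and the function `candidate_drugs` are assumptions that follow the "Translation" lines.

```python
from collections import defaultdict

# Relation extracted from the literature (see the extraction step above).
LITERATURE = [("Cytokine Storm", "involves", "JAK Pathway")]

# Enrichment result: pathway -> member proteins, as stated in the text.
PATHWAY_MEMBERS = {"JAK Pathway": {"JAK1", "JAK2"}}

# (source, subject, predicate, object), copied as-is from the text above.
EVIDENCE = [
    ("Hetionet", "Compound::DB09079", "inhibits", "Gene::3716"),
    ("Wikidata", "wd:Q27276182", "wdt:P129", "wd:Q13550805"),
    ("DrugBank", "drugbank:DB09079", "hasTarget", "drugbank_target:BE0005089"),
]

# Hypothetical alignment of source-specific IDs to shared labels.
SAME_AS = {
    "Compound::DB09079": "Baricitinib",
    "wd:Q27276182": "Baricitinib",
    "drugbank:DB09079": "Baricitinib",
    "Gene::3716": "JAK1",
    "wd:Q13550805": "JAK1",
    "drugbank_target:BE0005089": "JAK1",
}


def candidate_drugs(condition):
    """Map each drug targeting a pathway member to the sources that support it."""
    pathways = {o for s, p, o in LITERATURE if s == condition and p == "involves"}
    targets = set().union(*(PATHWAY_MEMBERS.get(pw, set()) for pw in pathways))
    support = defaultdict(set)
    for source, subj, _pred, obj in EVIDENCE:
        if SAME_AS.get(obj) in targets:
            support[SAME_AS.get(subj, subj)].add(source)
    return {drug: sorted(sources) for drug, sources in support.items()}


print(candidate_drugs("Cytokine Storm"))
```

Counting independent sources per candidate is one plausible way to rank hypotheses; the identifier alignment step is what makes agreement across databases visible at all.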

Outcome: By connecting the relationship extracted from the literature (Cytokine Storm, involves, JAK Pathway) with the validated triples from three different sources, EMPWR generates an evidence-based hypothesis. It proposes Baricitinib as a strong candidate for repurposing against COVID-19 because it directly inhibits a key protein in the inflammatory process. This exact connection was later validated, and Baricitinib received FDA approval for treating COVID-19 (source: https://pmc.ncbi.nlm.nih.gov/articles/PMC8308612/).

Use Case: Linking Traditional Medicine to Modern Targets with Turmeric and Enabling Precision Medicine Initiatives

This illustration demonstrates how EMPWR's multimodal capabilities could validate an Ayurvedic remedy by linking it to a known molecular target for inflammation (source: https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2023.1233652/full).

1. Data Ingestion to build an MMKG

  • Unstructured Text: Digitized Ayurvedic texts describing the use of Turmeric for inflammatory conditions, along with modern research papers on inflammation.
  • Structured Data: Chemical data from PubChem and biological pathway information from Wikidata.
  • Multimodal Data: Images of the Turmeric plant and 2D/3D structures of its active compounds.

2. Knowledge Extraction & Enrichment: EMPWR processes the varied sources.

  • From Ayurvedic texts, it extracts (Turmeric, treats, Inflammation).
  • From scientific papers, it identifies the key active compound in Turmeric as Curcumin and links the general concept of "inflammation" to the specific biological pathway NF-κB (source: https://pmc.ncbi.nlm.nih.gov/articles/PMC7522354/).

3. Querying the Graph with Actual Triples: The MMKG is queried to see if Curcumin directly interacts with the NF-κB pathway. The query finds these actual triples:


  • From PubChem (a database of chemical molecules):
    • (pubchem:969516) --[interactsWith]--> (chebi:"NF-kappaB pathway")
    • Translation: The compound with PubChem ID 969516 (Curcumin) has the role of an NF-κB inhibitor (as defined by the ChEBI ontology).
  • From Wikidata:
    • (wd:Q42226) --[wdt:P129]--> (wd:Q695333)
    • Translation: The Wikidata item for Curcumin (wd:Q42226) "has target" (wdt:P129) pointing to the item for NF-κB (wd:Q695333).
  • The MMKG would also contain this multimodal triple:
    • (compound:Curcumin) --[hasImage]--> (Image:Curcumin)
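
The chain from remedy to molecular target can be sketched as a breadth-first path search in which every edge carries the source it came from, so the returned path doubles as a provenance trail. The edge list restates the triples above with human-readable labels; the relation name `hasActiveCompound`, the source labels, and the function `find_path` are assumptions for illustration.

```python
from collections import deque

# (subject, predicate, object, source); labels restate the triples above.
EDGES = [
    ("Turmeric", "treats", "Inflammation", "Ayurvedic texts"),
    ("Turmeric", "hasActiveCompound", "Curcumin", "scientific literature"),
    ("Curcumin", "interactsWith", "NF-κB pathway", "PubChem"),
    ("Curcumin", "hasImage", "Image:Curcumin", "MMKG"),
]


def find_path(start, goal):
    """Breadth-first search; returns the list of edges on a shortest path, or None."""
    adjacency = {}
    for subj, pred, obj, source in EDGES:
        adjacency.setdefault(subj, []).append((pred, obj, source))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for pred, obj, source in adjacency.get(node, []):
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(node, pred, obj, source)]))
    return None


for subj, pred, obj, source in find_path("Turmeric", "NF-κB pathway"):
    print(f"{subj} --[{pred}]--> {obj}   (from {source})")
```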

Outcome: The EMPWR-built Percuro-MMKG bridges the gap between traditional and modern medicine. It validates the traditional use of Turmeric with a link to the active compound Curcumin, which is a known inhibitor of NF-κB, a key pathway in inflammation. The provenance of this finding is clear, tracing from Ayurvedic texts to chemical databases and modern biological ontologies, providing a strong foundation for further drug discovery.

By leveraging the EMPWR platform to build and maintain MMKGs, pharmaceutical organizations can break down critical data silos. The combination of EMPWR's automated lifecycle management, the CMF's provenance tracking, and the IDS's scalable performance creates a powerful, trustworthy, and transparent environment for data-driven research. This neurosymbolic approach enables the integration of diverse data types, from text to images, and facilitates complex reasoning, ultimately accelerating the path to discovering novel therapies and medicines.

Toolkit

The Knowledge Graph Toolkit (EMPWR) V1.0 currently supports knowledge sources from:

  • PharmKG (base)
  • Open-domain: DBpedia & Wikidata
  • Biomedical & pharmaceutical domain: DrugBank & UMLS

and has extensive coverage of drug, chemical, and disease entities and their associated relations:

  • Chemical: Drug interactions, diseases cured, etc.
  • Physical: Lethal dosage, boiling point, pressure, solubility, etc.
  • Disease: Symptoms, treatments, differential diagnosis, etc.
  • Aliases: Common names, chemical names and external identifiers.

GitHub

Demo

[NEW]

Tutorial

Paper

People

Prior Work on Knowledge Graphs