Biomedical Sciences

Provenance in Biomedical Sciences

Research in the life sciences, particularly research involving infectious agents such as T. cruzi and related kinetoplastids, continues to see an enormous increase in data and related information from genome sequencing, expression profiling, and proteomic analyses. A tremendous amount of data already exists, and more is still being generated in parasite research through industrial-scale experiment protocols, easy access to distributed data resources, and computational tools. However, validating and verifying these data requires not only the raw experimental data but also supporting data (metadata), also called provenance. For example, verifying microarray data requires information about the nucleotide sequences present on the array, the samples, the sample treatments, and so on. Provenance information enables validation of data quality, verification of the experimental procedures and other parameters that generated the data, and computation of trust values associated with the data. Provenance information is therefore as important as the raw data in the biomedical domain.

Need for Parasite Experiment Ontology

Trypanosoma cruzi is a protozoan parasite that causes Chagas disease, or American trypanosomiasis, a major cause of illness and death in Latin America. Parasite researchers devote significant effort to identifying gene knockout and vaccination targets for controlling infection by T. cruzi and related parasites. To do so, they often need to formulate queries over several external data sources, such as PubMed, UniProtDB, and TriTrypDB, together with their internal lab data. However, both the internal lab data and the external databases come in heterogeneous formats and rely on different methods of data generation and curation, which makes query processing very difficult. Since existing approaches, which integrate these datasets through tedious manual techniques, are not effective, we proposed using Semantic Web technologies to create an ontology-driven integrated environment that facilitates the identification of vaccine, diagnostic, and chemotherapeutic targets in the human pathogen T. cruzi. For this purpose, a domain- and application-specific Parasite Experiment Ontology (PEO) was developed that captures not only experiment details but also provenance information that can later be queried. Provenir was used as the upper-level ontology for developing PEO.

Development of PEO using Provenir

The Parasite Experiment Ontology (PEO) was developed to model the provenance information of Gene Knockout (GKO), Strain Project (SP), microarray, and proteome experiment data. It also serves as the schema against which queries are formulated and run. All classes and properties were added to PEO while ensuring that no new construct contradicts a construct in the Provenir ontology. PEO is modeled in the OWL-DL language and contains 118 classes and 27 properties (23 object and 4 datatype properties), with a description logic (DL) expressivity of ALCHQ(D).

Below we describe how Provenir was extended to develop PEO using GKO and SP experiment protocols as examples:

Modeling of Experimental Processes and Protocols: Two classes, gene_knockout_process and strain_creation_process, are created as subclasses of the provenir:process class to model the generic gene knockout and strain creation processes. The knockout_project_protocol and strain_creation_protocol classes represent the particular protocols used in the lab. The GKO and SP protocols consist of multiple sub-processes, which are also modeled in PEO, for example sequence_extraction, plasmid_construction, transfection, drug_selection, and cell_cloning (Figure 1).
Modeling of Datasets and Parameters Used in Protocols: A novel feature of the Provenir ontology is that it models provenir:data_collection (entities that undergo processing in an experiment) separately from provenir:parameter (entities that influence the behavior of a process or agent). In PEO, for example, the transfection process takes the input value Tcruzi_sample, modeled as a subclass of the provenir:data_collection class, and the parameter value transfection_buffer, modeled as a subclass of the provenir:parameter class. Parameter values are further categorized along the space, time, and theme (domain-specific) dimensions; for example, the date on which an experiment is conducted is modeled using the Time:DateTimeDescription class (Figure 1).
Modeling of Experiment Materials and Other Details: PEO uses the provenir:agent class to model the researchers, instruments, and biological agents involved in an experiment. For example, transfection_machine and microarray_plate_reader are instruments modeled as subclasses of provenir:agent; researcher is an example of a human agent; and knockout_plasmid is an example of a biological agent.
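
The modeling choices above can be illustrated with a minimal Python sketch. It is not the OWL-DL encoding of PEO: the class names come from the text, but the flat dictionary, the helper functions, and the assumption that researcher and knockout_plasmid sit directly under provenir:agent (PEO may define intermediate classes) are simplifications made only for illustration.

```python
# Simplified, hypothetical view of part of the PEO class hierarchy.
# Each key is a class named in the text; each value is its superclass.
# Assumption: every class is shown as a direct subclass of a Provenir class.
SUBCLASS_OF = {
    "gene_knockout_process": "provenir:process",
    "strain_creation_process": "provenir:process",
    "Tcruzi_sample": "provenir:data_collection",
    "transfection_buffer": "provenir:parameter",
    "transfection_machine": "provenir:agent",
    "microarray_plate_reader": "provenir:agent",
    "researcher": "provenir:agent",
    "knockout_plasmid": "provenir:agent",
}


def ancestors(cls):
    """Return the chain of superclasses recorded for cls, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls, ancestor):
    """True if cls is the given class or one of its recorded subclasses."""
    return cls == ancestor or ancestor in ancestors(cls)


def provenir_root(cls):
    """Return the top-level class above cls, or None if none is recorded."""
    chain = ancestors(cls)
    return chain[-1] if chain else None


# The input sample and the buffer fall under different Provenir classes,
# which is the data_collection / parameter distinction described above.
print(provenir_root("Tcruzi_sample"))
print(provenir_root("transfection_buffer"))
print(is_a("transfection_machine", "provenir:agent"))
```

Keeping inputs and parameters under separate roots means a query can ask for "everything that influenced this process" without also retrieving the material that was processed.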

In addition to the eleven relationships defined in the Provenir ontology, new object and datatype properties specific to the experiment protocols have been created. For example, four new object properties model the similarity relationships between two genomic regions: is_paralogous_to, is_orthologous_to, is_homologous_to, and is_identical_to.

Application

PEO provides the schema for querying semantically integrated parasite experimental datasets and databases with Cuebee, an intuitive query processing tool. The following are examples of provenance queries that can be formulated and successfully executed using PEO and Cuebee:

List all groups that are using “target_region_plasmid_Tc00.1047053504033.170_1”.
Find the name of the researcher who created the knockout plasmid “plasmid66”.
“cloned_sample66” is not episomal. How many transfection attempts are associated with this sample?
Which gene was used to create the cloned sample “cloned_sample66”?
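
To make the shape of such provenance queries concrete, the following Python sketch answers three of them over a small in-memory list of triples. It does not use Cuebee or the actual PEO schema: the predicate names (created_by, attempt_for, derived_from, targets_gene) are placeholders, and every identifier other than plasmid66 and cloned_sample66 is invented sample data.

```python
# Hypothetical provenance triples (subject, predicate, object).
# Predicate names are placeholders, not actual PEO or Provenir properties;
# researcher_A, gene_X, and the transfection_* identifiers are invented.
TRIPLES = [
    ("plasmid66", "created_by", "researcher_A"),
    ("plasmid66", "targets_gene", "gene_X"),
    ("transfection_1", "attempt_for", "cloned_sample66"),
    ("transfection_2", "attempt_for", "cloned_sample66"),
    ("cloned_sample66", "derived_from", "plasmid66"),
]


def match(s=None, p=None, o=None):
    """Return all triples matching the given pattern; None is a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def creators_of(plasmid):
    """Who created the given knockout plasmid?"""
    return [o for _, _, o in match(s=plasmid, p="created_by")]


def transfection_attempts(sample):
    """How many transfection attempts are associated with the sample?"""
    return len(match(p="attempt_for", o=sample))


def genes_for_sample(sample):
    """Which genes were used to create the sample (via its source plasmids)?"""
    genes = []
    for _, _, plasmid in match(s=sample, p="derived_from"):
        genes.extend(o for _, _, o in match(s=plasmid, p="targets_gene"))
    return genes


print(creators_of("plasmid66"))
print(transfection_attempts("cloned_sample66"))
print(genes_for_sample("cloned_sample66"))
```

The last query needs two hops (sample to plasmid, plasmid to gene), which is typical of provenance questions: the answer is rarely stored on the entity being asked about and has to be reached by following its derivation history.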

Acknowledgement

This work is funded by NIH R01 grant 1R01HL087795-01A1.

Reference

S.S. Sahoo, D.B. Weatherly, R. Mutharaju, P. Anantharam, A. Sheth, R.L. Tarleton, “Ontology-driven Provenance Management in eScience: an Application in Parasite Research”, The 8th International Conference on Ontologies, DataBases, and Applications of Semantics, (ODBASE 2009), Vilamoura, Algarve-Portugal, Nov 02 - 04, 2009.
