CyberInfrastructure Proposal For EarthCube Community

From Knoesis wiki
Latest revision as of 16:21, 21 August 2012

This proposal presents a cyberinfrastructure for sharing and discovering long tail of science data.

Objective

The proposed system will enable scientists to upload their data and annotate it using community-developed vocabularies such as GCMD and AGI. It will also let them refine or adapt automatically suggested annotations. Once published, the data can be discovered using keywords, topic terms, and attribute-value pairs in a faceted search. The system will provide tools for flexible search and for harmonizing structural, content, and semantic heterogeneity.

We have identified three workflows to meet the different requirements for publishing long tail of science data. To accommodate variations in the nature of the data and the expertise of the user, the three workflows offer different trade-offs between the convenience of data sharing and the breadth of data analysis that can be performed.

Publishing Data in Original Form/Legacy Data

This workflow will allow long tail of science data to be published in its original form, which can be legacy data or unstructured text, images, and tables, whether found in technical papers or available separately. We will provide tools to read the data in its original form, annotate it, and index it to make it searchable. Initially, the focus will be on processing and indexing only the captions of images and tables.


Publishing Data in Digital Format

This workflow will serve long tail of science data that is already in a digital format, such as Excel, CSV, and relational data files. Because this data is structured, we will provide additional capabilities to data publishers, such as suggesting annotations for the columns and cell values in their data.
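As a rough illustration of how annotation suggestions for column headers might work, the sketch below matches the headers of a CSV file against a tiny vocabulary using fuzzy string matching. The vocabulary terms, the function names, and the matching strategy are all assumptions made for this example; the proposal does not specify how suggestions are computed, and real GCMD or AGI vocabularies are far larger.

```python
import csv
import difflib
import io

# Hypothetical vocabulary terms chosen for illustration only.
VOCABULARY = ["Precipitation", "Sea Surface Temperature", "Soil Moisture", "Wind Speed"]


def normalize(label):
    """Lower-case a label and treat underscores and hyphens as spaces."""
    return " ".join(label.replace("_", " ").replace("-", " ").lower().split())


def suggest_annotations(header, vocabulary=VOCABULARY, cutoff=0.6):
    """Return up to three vocabulary terms that closely resemble the header."""
    lookup = {normalize(term): term for term in vocabulary}
    matches = difflib.get_close_matches(
        normalize(header), list(lookup), n=3, cutoff=cutoff
    )
    return [lookup[m] for m in matches]


def annotate_columns(csv_text):
    """Map each column header of a CSV document to its suggested terms."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    return {header: suggest_annotations(header) for header in headers}


sample = "station_id,sea_surface_temperature,wind-speed\nA1,18.2,5.1\n"
suggestions = annotate_columns(sample)
for header, terms in suggestions.items():
    print(header, "->", terms)
```

In a system like the one proposed, the publisher would then accept, reject, or replace each suggested term before the annotations are stored.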


Publishing Data in Linked Data

The Linked Data initiative (http://linkeddata.org/) emerged from Semantic Web technologies and provides a new paradigm for publishing and querying structured data on the Web. Linked Open Data (LOD) has changed the way data is shared, linked, and reused on the Web. Currently, LOD consists of 295 data sets with 31 billion RDF triples and covers a broad range of domains such as Life Sciences, Geography, Government, Media, Education, and Publications.

Scientists will be able to convert their data into the RDF format and publish it as Linked Open Data. This will enable standardization and rich interlinking with existing datasets, and it makes their data available to more advanced intelligent applications, such as those using integrated or federated querying. Although tools for converting data into RDF and publishing it already exist, they are still not easy for domain experts to use.
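To make the conversion step concrete, here is a minimal sketch that turns the rows of a CSV file into RDF triples serialized as N-Triples, using only the Python standard library. The `http://example.org/` namespace, the URI layout, and the function names are assumptions for illustration; the proposal does not name the tools or the URI scheme it would use, and every value is emitted as a plain string literal for simplicity.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace for this example


def escape_literal(value):
    """Escape characters that are not allowed raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text, id_column):
    """Emit one triple per non-empty cell, keyed by the row's identifier."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row in reader:
        subject = f"<{BASE}record/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the identifier itself and empty cells
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


sample = "station_id,region,rainfall_mm\nA1,Ohio,12.5\n"
triples = rows_to_ntriples(sample, "station_id")
for triple in triples:
    print(triple)
```

Interlinking with other datasets would additionally require replacing some literals with URIs from existing LOD sources, which this sketch does not attempt.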

Architecture

The following image illustrates the architecture of the proposed system.

Archi7.2.png


  • Data Registry

Data publishers will register their data through the data registry, providing provenance information such as author, affiliation, and location. Samples of the forms of data that can be registered are given later.

  • Annotator

Registered data will be annotated using standard vocabularies such as GCMD and the AGI index, which are stored in a vocabulary registry. Annotation tools will suggest annotations to users and enable them to modify the suggestions if desired. Annotations will be stored in the Meta Data Store.

Kino http://wiki.knoesis.org/index.php/Kino is an integrated suite of tools that enables scientists to annotate Web documents. We plan to extend and customize it for the geosciences domain.

  • Indexer

Collected data and its associated metadata will be indexed for search.

  • Simple Search

Simple Search supports keyword-based queries, to which the system will respond with a ranked list of results.

  • Faceted Search

In addition to the Simple Search functionality, the system will provide a Faceted Search capability in which users supply attribute-value pairs to enable more expressive search and discovery. Users can also query using provenance information.

  • Mapping to RDF

As defined in the third workflow, data publishers can transform their data into RDF using existing tools, thereby converting it into a standard form.

  • Data Publisher

This component will publish the RDF-converted data as Linked Open Data, where it can be accessed and queried globally.

  • Semantic Browsing

Semantic Browsing will allow navigation through the RDF datasets. iExplore http://knoesis.wright.edu/iExplore/ is a tool we have developed for Semantic Browsing.
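The Indexer, Simple Search, and Faceted Search components described above can be sketched together as a small in-memory index. This is only an illustration under stated assumptions: the class name, the term-frequency ranking, and the sample records are hypothetical, and the proposal does not say which search engine or ranking function the system would use.

```python
from collections import Counter, defaultdict


class DatasetIndex:
    """Toy inverted index over dataset descriptions plus facet metadata."""

    def __init__(self):
        self.records = {}                     # dataset id -> facet dict
        self.postings = defaultdict(Counter)  # term -> {dataset id: count}

    def add(self, dataset_id, text, facets):
        """Index the description text and store the facets (e.g. provenance)."""
        self.records[dataset_id] = dict(facets)
        for term in text.lower().split():
            self.postings[term][dataset_id] += 1

    def simple_search(self, query):
        """Rank datasets by how often the query keywords occur in them."""
        scores = Counter()
        for term in query.lower().split():
            for dataset_id, count in self.postings.get(term, {}).items():
                scores[dataset_id] += count
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [dataset_id for dataset_id, _ in ranked]

    def faceted_search(self, query=None, **facets):
        """Filter by attribute-value pairs, optionally after a keyword query."""
        candidates = self.simple_search(query) if query else sorted(self.records)
        return [
            d for d in candidates
            if all(self.records[d].get(k) == v for k, v in facets.items())
        ]


index = DatasetIndex()
index.add("ds1", "rainfall measurements rainfall gauge",
          {"author": "Smith", "location": "Ohio"})
index.add("ds2", "rainfall relief map", {"author": "Lee", "location": "Ohio"})
index.add("ds3", "soil survey", {"author": "Smith", "location": "Utah"})

print(index.simple_search("rainfall"))
print(index.faceted_search("rainfall", author="Lee"))
print(index.faceted_search(location="Ohio"))
```

Here the facets double as provenance attributes, so a query such as `author="Smith"` corresponds to searching by provenance information.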

Form of Data


Table

Geographic-impacts-table.png

Image

Gis relief 600.jpg

Unstructured Data

Unstructuredtostructured.jpg