CyberInfrastructure Proposal For EarthCube Community

From Knoesis wiki
Revision as of 22:32, 30 July 2012 by Kevin (Talk | contribs)

This proposal describes a cyberinfrastructure that lets long-tail scientists share and discover their data.

Objective

The proposed system will let scientists upload their data and standardize it using well-known vocabularies. The system will annotate a scientist's data with terms from community-developed vocabularies such as GCMD and AGI. Scientists will be able to refine the automatically suggested annotations according to their preferences. Once the data is published, users can discover and locate it using keywords and topic terms. Faceted search will also be provided, based on the collected annotations. The system will provide flexible search and tools for harmonizing structural, content, and semantic heterogeneity.

We identified three workflows to address when publishing long-tail scientists' data. Depending on their needs, their expertise, and the availability of the data, scientists will be able to choose one or more workflows.

Publishing Data in Original Form/Legacy Data

This workflow will allow long-tail scientists to publish data in its original form, which can include unstructured data, technical papers, images, and so on. It will provide tools to read the data in its original form, annotate it, and make it searchable. At this step, captions (such as image and table captions) will be considered for annotation. Tables and images can then be searched via their captions, using keywords and topic terms that can be standardized for semantics.


Publishing Data in Digital Format

This workflow will serve long-tail scientists who have their data in a digital format, essentially Excel, CSV, and relational data. At this stage the system deals with more structured data, so it can provide better annotation suggestions, and data publishers can even annotate individual columns and cell values in their data.
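
As a rough illustration of column-level annotation, the sketch below suggests vocabulary terms for the headers of a CSV file by word overlap. The vocabulary entries, the function names, and the Jaccard scoring are all assumptions made for this example; the proposal does not specify how suggestions are computed, and real GCMD or AGI terms would come from the vocabulary registry.

```python
import csv
import io

# Hypothetical vocabulary entries, used only for illustration.
VOCABULARY = [
    "Sea Surface Temperature",
    "Air Temperature",
    "Precipitation Amount",
    "Soil Moisture",
]


def tokens(text):
    """Lower-case a label and split it into a set of word tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return set(cleaned.split())


def suggest_terms(header, vocabulary=VOCABULARY, limit=3):
    """Rank vocabulary terms by Jaccard overlap with a column header."""
    header_tokens = tokens(header)
    scored = []
    for term in vocabulary:
        term_tokens = tokens(term)
        union = header_tokens | term_tokens
        score = len(header_tokens & term_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, term))
    # Best score first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:limit]]


def annotate_columns(csv_text):
    """Return {column header: suggested terms} for the header row of a CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    return {header: suggest_terms(header) for header in headers}


# Made-up sample data.
sample = "station_id,sea_surface_temperature,soil moisture\nA1,14.2,0.31\n"
suggestions = annotate_columns(sample)
```

In this sketch a header with no overlapping words gets an empty suggestion list, which the publisher could then fill in by hand.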


Publishing Data in Linked Data

The Linked Data initiative (http://linkeddata.org/) emerged recently from Semantic Web technologies and provides a publishing and querying paradigm for structured raw data on the web. The arrival of Linked Open Data (LOD) changed the way data is shared on the web, primarily in how data sets are interconnected. It supports the growing desire for direct access to raw data, in contrast to a web explored only through documents. Currently LOD consists of 295 data sets with 31 billion RDF triples, and it covers a broad range of domains such as Life Sciences, Geography, Government, Media, Education, Publication, and so on.

At this stage, long-tail scientists can convert their data into RDF and publish it as Linked Open Data. This standardizes the data itself and allows scientists to interlink their data with other data sets that exist in Linked Open Data. It also makes their data available to more advanced applications, such as federated querying over a number of data sets. Although tools for converting data into RDF and publishing it already exist, using them still requires help from computer scientists. Our system will provide the relevant tools for that case.
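
To make the conversion step concrete, here is a minimal sketch that turns one tabular row into RDF triples in N-Triples syntax and adds an `owl:sameAs` link to another data set. The `example.org` namespaces, the function names, and the sample row are hypothetical, and the sketch assumes column names are already safe to use inside an IRI; it is not the tooling the proposal refers to.

```python
BASE = "http://example.org/dataset/"      # hypothetical namespace for rows
PROP = "http://example.org/property/"     # hypothetical namespace for columns
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape_literal(value):
    """Escape the characters N-Triples requires in string literals."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_triples(row_id, row, links=None):
    """Emit one triple per column, plus owl:sameAs links to other data sets."""
    subject = f"<{BASE}{row_id}>"
    triples = [
        f'{subject} <{PROP}{column}> "{escape_literal(value)}" .'
        for column, value in row.items()
    ]
    for target in links or []:
        triples.append(f"{subject} <{OWL_SAME_AS}> <{target}> .")
    return triples


# Made-up sample row and link target.
triples = row_to_triples(
    "row1",
    {"site": "Station A", "depth_m": 12},
    links=["http://example.org/other/station-a"],
)
print("\n".join(triples))
```

Every value is written as a plain string literal here; a fuller mapping would likely attach datatypes and reuse properties from existing vocabularies rather than minting new ones.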


Architecture

The following image illustrates the architecture of the proposed system.

Architecture3.png


  • Data Registry

Data publishers will register their data through the data registry, which collects important provenance information such as author and location. A sample of the forms of data that can be registered is given in a later section.

  • Annotator

Registered data will be annotated using standard vocabularies (such as GCMD and the AGI index), which are stored in a vocabulary registry. Annotation tools will suggest possible matches, and the user will be able to refine the suggestions given by the system. Annotations will be stored in the Meta Data Store.

  • Indexer

Collected data and its associated metadata will be indexed to facilitate searching.

  • Simple Search

Simple Search supports keyword-based queries: the user specifies some keywords and the system returns a ranked list of results.

  • Faceted Search

In addition to Simple Search, the system will provide Faceted Search, where the user supplies key-value pairs to search for and discover data. Users can also incorporate provenance into the search.

  • Mapping to RDF

As defined in the more advanced workflow, the given data can be transformed to RDF using existing tools, which allows data publishers to publish the data in a standard form.

  • Data Publisher

This component will upload the RDF-converted data to Linked Open Data, where it can be accessed and queried from anywhere in the world.

  • Semantic Browsing

Semantic Browsing will allow users to navigate the RDF data sets by following their triples.
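
The registry, indexer, and search components listed above could fit together roughly as in the following sketch. It is a simplified in-memory model under stated assumptions: the `Record` and `Index` classes, their field names, the match-count ranking, and the sample records are all hypothetical, and the proposal does not prescribe an index structure or ranking function.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Record:
    """A registered data set with provenance and annotations."""
    record_id: str
    title: str
    provenance: dict                       # e.g. {"author": ..., "location": ...}
    annotations: list = field(default_factory=list)


class Index:
    def __init__(self):
        self.records = {}
        self.postings = defaultdict(set)   # keyword -> ids of matching records

    def add(self, record):
        """Register a record and index the words of its title and annotations."""
        self.records[record.record_id] = record
        text = " ".join([record.title, *record.annotations])
        for word in text.lower().split():
            self.postings[word].add(record.record_id)

    def simple_search(self, query):
        """Rank record ids by how many query keywords they match."""
        hits = defaultdict(int)
        for word in query.lower().split():
            for record_id in self.postings.get(word, ()):
                hits[record_id] += 1
        return sorted(hits, key=lambda rid: (-hits[rid], rid))

    def faceted_search(self, **facets):
        """Return ids of records whose provenance matches every key-value pair."""
        return sorted(
            rid for rid, rec in self.records.items()
            if all(rec.provenance.get(k) == v for k, v in facets.items())
        )


# Made-up sample records.
index = Index()
index.add(Record("d1", "Coastal erosion survey",
                 {"author": "A. Author", "location": "Ohio"},
                 ["Erosion", "Coastal Processes"]))
index.add(Record("d2", "River sediment samples",
                 {"author": "B. Author", "location": "Ohio"},
                 ["Sediment Transport"]))
index.add(Record("d3", "Coastal sediment cores",
                 {"author": "A. Author", "location": "Texas"},
                 ["Sediment Transport"]))

keyword_hits = index.simple_search("coastal sediment")
ohio_hits = index.faceted_search(location="Ohio")
```

Keeping provenance as key-value pairs is what lets the same record serve both search modes: keywords go through the postings, while facets filter directly on the provenance fields.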

Form of Data

Table

Geographic-impacts-table.png

Image

Gis relief 600.jpg

Unstructured Data

Unstructuredtostructured.jpg

Links

Annotator - Kino http://wiki.knoesis.org/index.php/Kino

Semantic Browsing - iExplore http://knoesis.wright.edu/iExplore/