SemBigData

Characteristics of the Big Data Problem

We discuss the primary characteristics of the Big Data problem in terms of the Five Vs. (The first three Vs were originally introduced by Doug Laney of Gartner.)

Abstract

We discuss the nature of Big Data and address the role of semantics in analyzing and processing Big Data that arises in the context of Physical-Cyber-Social Systems. We organize our research around the Five Vs of Big Data, where four of the Vs are harnessed to produce the fifth V: Value. To handle the challenge of Volume, we advocate semantic perception, which can convert low-level observational data into higher-level abstractions more suitable for decision-making. To handle the challenge of Variety, we use semantic models and annotations of data so that much of the intelligent processing can be done at a level independent of the heterogeneity of data formats and media. To handle the challenge of Velocity, we seek to use continuous semantics to dynamically create event- or situation-specific models and to recognize relevant new concepts, entities and facts. To handle Veracity, we explore the formalization of trust models and approaches to glean trustworthiness. These four Vs of Big Data are harnessed by semantics-empowered analytics to derive Value, supporting practical applications that transcend the physical-cyber-social continuum.

Introduction

Physical-Cyber-Social Systems (PCSS) (Sheth et al, 2013) represent a revolution in sensing, computing and communication that brings together a variety of resources. The resources can range from networked embedded computers and mobile devices to multimodal data sources such as sensors and social media. The applications can span multiple domains such as medical, geographical, environmental, traffic, behavioral, disaster response, and system health monitoring. The modeling and computing challenges arising in the context of PCSS can be organized around the Five Vs of Big Data (Volume, Variety, Velocity, Veracity and Value), which align well with our research efforts that exploit semantics-, network- and statistics-empowered Web 3.0.

Volume

The sheer number of sensors and the amount of data reported by them are enormous and growing rapidly. For example, more than 25 billion sensors have been deployed, and about 250 TB of sensor data are generated for a NY-LA flight on a Boeing 737. A Parkinson’s disease dataset that tracked 16 people (9 patients and 7 controls) with a mobile phone containing 7 sensors over 8 weeks is 12 GB in size. However, the availability of fine-grained raw data is not sufficient unless we can analyze, summarize or abstract it in actionable ways. For example, from a pilot’s perspective, processing the sensor data should yield insights about whether the jet engine and the flight control surfaces are behaving normally or whether there is cause for concern. Similarly, we should be able to measure the symptoms of Parkinson’s disease using sensors on a smartphone, monitor the disease’s progression, and synthesize actionable suggestions to improve the patient’s quality of life. Cloud computing infrastructure can be deployed for raw processing of massive social and sensor data. However, we still need to investigate how to effectively translate large amounts of machine-sensed data into a few human-comprehensible nuggets of information necessary for decision-making. Furthermore, privacy and locality considerations require moving computations closer to the data source, leading to powerful applications on resource-constrained devices. In the latter situation, even though the amount of data is not large by normal standards, the resource constraints negate the use of conventional data formats and algorithms, and instead necessitate the development of novel encoding, indexing, and reasoning techniques (Henson et al, 2012a).

The volume of data challenges our ability to process it. First, it is difficult to abstract fine-grained machine-accessible data into a coarse-grained, human-comprehensible form that summarizes the situation and is actionable. Second, it is difficult to scale computations to take advantage of distributed processing infrastructure and, where appropriate, to exploit reasoning on mobile devices.
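To make the abstraction step concrete, the following is a minimal Python sketch of how low-level readings might be lifted into human-comprehensible abstractions using declarative background knowledge. The vocabulary, thresholds, and the simple covering rule are illustrative assumptions for this page; they are not the actual encoding or reasoning techniques of (Henson et al, 2012a).

```python
# A minimal sketch of semantic perception: abstracting low-level sensor
# observations into human-comprehensible features using background knowledge.
# The vocabulary and thresholds below are illustrative assumptions only.

# Background knowledge: each high-level abstraction is "explained by" a set
# of observable properties (a deliberately simplified covering model).
KNOWLEDGE = {
    "hypertension": {"elevated systolic BP", "elevated diastolic BP"},
    "hyperthermia": {"elevated body temperature"},
    "flu": {"elevated body temperature", "elevated heart rate"},
}

def observed_properties(readings):
    """Map raw numeric readings to observable properties (illustrative thresholds)."""
    props = set()
    if readings.get("systolic_bp", 0) > 140:
        props.add("elevated systolic BP")
    if readings.get("diastolic_bp", 0) > 90:
        props.add("elevated diastolic BP")
    if readings.get("body_temp_c", 0) > 38.0:
        props.add("elevated body temperature")
    if readings.get("heart_rate", 0) > 100:
        props.add("elevated heart rate")
    return props

def explain(props):
    """Return abstractions whose expected properties are all observed
    (a crude stand-in for abductive / parsimonious-covering explanation)."""
    return [name for name, expected in KNOWLEDGE.items() if expected <= props]

if __name__ == "__main__":
    raw = {"systolic_bp": 152, "diastolic_bp": 95, "body_temp_c": 36.9, "heart_rate": 72}
    print(explain(observed_properties(raw)))   # -> ['hypertension']
```

In practice the background knowledge would come from curated ontologies and the explanation step would reason abductively over many more observable properties; the sketch only illustrates the direction of the mapping from voluminous raw data to a few actionable abstractions.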

Variety

PCSS generate and process a variety of multimodal data, using heterogeneous background knowledge to interpret the data. For example, traffic data (such as from 511.org) contain numeric information about vehicular traffic on roads (e.g., speed, volume, and travel times), as well as textual information about active events (e.g., accidents, vehicle breakdowns) and scheduled events (e.g., sporting events, music events) (Anantharam et al, 2013). Weather datasets (such as from Mesowest) provide numeric information about primitive phenomena (e.g., temperature, precipitation, wind speed) that must be combined and abstracted into human-comprehensible weather features in textual form. In medical domains (e.g., cardiology, asthma, and Parkinson’s disease), various physiological, physical and chemical measurements (obtained through on-body sensors, blood tests, and environmental sensors) and patients’ feedback and self-appraisal (obtained by interviewing them) can be combined and abstracted to determine their health and well-being. The available knowledge captures both qualitative and quantitative aspects. Such diverse knowledge, when integrated, can provide complementary and corroborative information (Sheth and Thirunarayan, 2012). Geoscience datasets, and the materials and process specifications used for realizing Integrated Computational Materials Engineering (ICME) and the Materials Genome Initiative (MGI), exhibit a great deal of syntactic and semantic variety (Thirunarayan et al, 2005). The variety in data formats and in the nature of available knowledge challenges our ability to integrate and interoperate with heterogeneous data.
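As an illustration of how semantic annotation can hide this heterogeneity, the following Python sketch lifts a numeric traffic-sensor reading and a free-text incident report into a common triple-like representation that downstream analytics can query without caring about the original formats. The toy vocabulary (traffic:SpeedSensor, traffic:observedSpeedMph, etc.) and the keyword-spotting step are assumptions made for this example; they are not the 511.org schema or the models used in (Anantharam et al, 2013).

```python
# A minimal sketch of semantic annotation over heterogeneous sources: numeric
# sensor readings and textual event reports are both lifted into a common
# (subject, property, value) representation. Vocabulary terms are illustrative.

def annotate_speed_reading(sensor_id, mph):
    """Lift a numeric reading from a traffic feed into triples."""
    s = f"sensor/{sensor_id}"
    return [
        (s, "rdf:type", "traffic:SpeedSensor"),
        (s, "traffic:observedSpeedMph", mph),
    ]

def annotate_event_report(event_id, text):
    """Lift a free-text incident report into triples using naive keyword spotting."""
    s = f"event/{event_id}"
    triples = [(s, "rdf:type", "traffic:Incident"), (s, "traffic:description", text)]
    if "accident" in text.lower():
        triples.append((s, "traffic:category", "traffic:Accident"))
    return triples

def slow_segments(graph, threshold_mph=20):
    """Format-independent query: sensors reporting speeds below a threshold."""
    return [s for (s, p, v) in graph
            if p == "traffic:observedSpeedMph" and v < threshold_mph]

if __name__ == "__main__":
    graph = []
    graph += annotate_speed_reading("I675-12", 14)
    graph += annotate_event_report("e42", "Multi-vehicle accident blocking left lane")
    print(slow_segments(graph))   # -> ['sensor/I675-12']
```

A fuller treatment would use an RDF store and domain ontologies, but the point of the sketch is that once both modalities are expressed against a shared vocabulary, queries such as slow_segments become independent of whether the evidence arrived as numbers or as text.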