Difference between revisions of "Social Signals"

From Knoesis wiki

Revision as of 18:19, 18 January 2010

A Playground for Mobile Sensors, Human Computing, and Semantic Analytics

“A computer which can calculate the Question to the Ultimate Answer, a computer of such infinite and subtle complexity that organic life itself shall form part of its operational matrix. And you yourselves shall take on new forms and go down into the computer to navigate its ten-million-year program! Yes! I shall design this computer for you. And I shall name it also unto you. And it shall be called ... The Earth.” (Douglas Adams, The Hitchhiker’s Guide to the Galaxy)

Douglas Adams’s stupendous vision — the Earth transformed into a supercomputer powered by human intelligence — was fictional but reflects the potential of the most recent advances in science and technology to transform our planet into a powerful computing platform. With 6 billion human inhabitants acting as processing nodes, Earth could indeed become the computer that provides the best answers to life’s most complex and difficult questions (http://icsc.eecs.uci.edu/abstract_wed1.html). Such a platform’s computing power could even exceed the exponential growth predicted by the famous Moore’s law.

It might seem like science fiction at first blush, but with the Internet serving as the communication backbone that connects us all, we could reach this point sooner than we think. When Time magazine named “you” as its person of the year in 2006, it captured the infinite possibilities brought forth by connecting humans and providing a platform to harness their collective intellect, knowledge, and experiences. As much as we can’t question the role technology has played in fostering this new era of computing, central to its success has been the participation of people from all walks of life. Through each of our small but significant and sustained contributions, we’ve created and maintained vast repositories such as Wikipedia. We’re also helping machines organize the world’s online resources by tagging and sharing various bits of information. New tools are extracting and using the knowledge we’ve embedded into what we’ve created to improve searching, browsing, and decision-making, substantially improving on software that didn’t previously use such collective intelligence. In this article, I introduce the exciting paradigm of citizen sensing enabled by mobile sensors and human computing: humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views using mobile devices and Web 2.0 services.

Citizen-Sensor Networking

By contributing so much online content, many people have become “citizens” of an Internet- or Web-enabled social community; the use of Internet- or Web-enabled mobile devices to upload this data gives these devices the ability to act as sensors. Thus, the term citizen-sensor network refers to an interconnected network of people who actively observe, report, collect, analyze, and disseminate information via text, audio, or video messages.

This combination of human-in-the-loop sensing, Web 2.0, and mobile computing has led to the emergence of several citizen-sensor networks. In particular, Web 2.0 fostered the open environment and the applications for tagging, blogging, wikis, and social networking sites that have made information consumption, production, and sharing so incredibly easy. However, two significant developments in mobile computing helped enable citizen-sensor networks as we know them today: enhanced features such as GPS capability and cameras became a standard part of most mobile devices, and large companies created open mobile operating systems, such as Apple’s OS X for the iPhone and Google’s Android.

Microblogging, in which users share short messages (roughly 140 characters) and pictures, typically over the Web, is of particular interest to citizen-sensors. This relatively new medium emerged on the Web in 2006 and achieved widespread adoption extremely quickly. Twitter, the most popular microblogging application, has nearly 6 million members who post almost 2 million messages per day (http://twitterfacts.blogspot.com/2007/06/twitter-number-of-tweets-per-day.html). Applications such as Twitterrific and Tweetie enable microblogging on mobile platforms, letting users post photos and other digital captures of the events they observe directly from their mobile devices onto the Web or social networking sites.

Such applications have virtually eliminated the barriers of entry to participation and seem to have actively encouraged the emergence of citizen journalism and science. Other examples of citizen journalism include Wikinews and a growing number of sites and services such as CNN’s iReport, Demotix, and Merinews. More recently, organizations such as the Boston police department have embraced citizen-sensors to assist in crime prevention (www.cityofboston.gov/police/cristop.asp). Several citizen science projects involve participants with mobile devices capturing observations and reports for environmental data collection, bird and animal counts, and more. One of the most visible uses of citizen-sensors occurred during the Mumbai terrorist attacks of November 2008, when tweets (Twitter updates) and Flickr feeds by citizens armed with mobile phones reported observations of events in real time, often well before traditional media reports could do so (www.informationweek.com/blog/main/archives/2008/11/twitter_in_cont.html).

The interesting twist that citizen-sensor networks bring to reporting a news story or scientific discovery is that they can record and report an event from multiple angles and perspectives. The messages that citizen-sensors send or upload come with a host of additional information, such as the spatiotemporal metadata provided by the devices used to capture them (www.cnet.com.au/tag/camera-data-iphone-location.htm). Generally, an event has a time, location, and multiple thematic elements, which in turn become the basis of its semantic description. A collection of thematically (conceptually), spatially, and temporally related events defines a situation; situational awareness, which represents “perception of the environmental elements within a volume of time and space, and the comprehension of their meaning” (http://en.wikipedia.org/wiki/Situational_awareness), then leads to insight and actionable information.

The human-in-the-loop aspect of citizen or participatory sensing offers several advantages over traditional (machine) sensing. Machines are good at symbolic processing but poor at perception, which is the act of converting sensory information into symbols or words that are meaningful to humans. Placing humans in the sensing loop greatly alleviates this deficiency: sensors or devices can perform continuous, long-term sensing of a limited or fixed set of modalities, but humans are much better at contextualizing data, discriminating (deciding what’s interesting or important), filtering (reporting on things of interest and importance) across multiple modalities, and capturing the resulting observations for future symbolic processing by machines or collectively with other humans. Humans are also better at using sensing and perception to adapt their subsequent activities, which in turn affect what they observe and report. What gives humans this distinct advantage is their ability to deal with semantics and leverage extensive background knowledge, experience, common sense, and complex reasoning, even with fuzzy data or inconsistent information. Although traditional sensors merely report encoded observations, humans process observations from a citizen-sensor network via their intellect and available contextual knowledge.

Situational awareness is the perception of an environment along with its temporal, spatial, and thematic components, and is critical for making decisions in complex situations. A first step in a systematic approach to situational awareness is to model sensing as a cycle of operations involving observation, perception, and communication. Figure 1 shows both citizen and machine sensing in this general framework. Within the perception and communication operations, citizen and machine sensors can share information that might provide enhanced situational awareness that neither sensing system could offer alone. Two recent advances are noteworthy in this context: the ability to treat sensors as services on the Web (via standards such as Sensor Web Enablement) and the emergence of mobile sensing with humans in the loop (because humans are much better at reacting to observations).

Moreover, researchers have made several computational advances in terms of the Semantic Web and its derivatives [1] and in the corresponding ability to develop domain models (ontologies) and knowledge bases, semantically annotate all types of data (specifically, to extract spatial, temporal, and thematic metadata), and computationally exploit data along these three dimensions [2]. As Semantic Web proponents know, annotation is the key to making data more meaningful, both for human consumption and for machine computation. Semantically annotated sensor data is more easily integrated, interpreted, and combined with databases, knowledge bases, and advanced computing capabilities. Although I’ve discussed semantic annotation of (machine) sensor data as part of this column before, let’s shift the focus here to semantic annotation of, and metadata extraction from, messages submitted by citizen-sensors. Both of these capabilities share characteristics with the semantic annotation of casual text, such as that used in social networking content [3].

Semantic Annotation of Citizen-Sensor Data

The high level of citizen participation in disseminating information during the November 2008 terrorist attacks in Mumbai, India, demonstrated the growing power of citizen journalism. Using Flickr and Twitter, ordinary people such as Vinu Ranganathan shared their views of the events as they unfolded (http://www3.flickr.com/photos/vinu/sets/72157610144709049). Although user contributions played an invaluable role in disseminating news, we can realize significant additional value through their integration with semantic analysis, which leads to situational awareness (http://knoesis.wright.edu/library/resource.php?id=00702).

The example depicted in Figure 2 shows metadata gathered from Twitter updates and Flickr images posted during the Mumbai attacks. We can use it to extract spatial information about a resource (such as geo-coordinates for where a picture was taken or from where a message was posted) to determine the closest street address. From the image information in Figure 2, for example, we can identify the closest street address as 5, Hormusji Street, Colaba, Mumbai. When given to an “address to location” service, this information yields prominent locations near this address, including the Nariman House, Vasant Vihar, and the Income Tax Office. Next, by using temporal information from the image, we can get Twitter messages posted around the time it was taken; spatial information helps restrict the geography to just where these messages originated. The location information, in conjunction with semantic models that describe a particular domain of interest (terrorism, in this context), lets us connect tweets that describe the event to images found in Flickr. Such integration provides a richer description of the event and lets us create trails of various events.
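
The spatial and temporal filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system behind Figure 2: the Observation type, the related_messages function, the 1 km radius, and the 30-minute window are all hypothetical choices, and the coordinates in the usage note are made up rather than taken from real metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass
class Observation:
    """A citizen-sensor message or image with spatiotemporal metadata (hypothetical type)."""
    text: str
    lat: float
    lon: float
    time: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def related_messages(image, messages, max_km=1.0, window=timedelta(minutes=30)):
    """Keep messages posted near the image, both in time and in space.

    max_km and window are assumed thresholds chosen for illustration only.
    """
    return [
        m for m in messages
        if abs(m.time - image.time) <= window
        and haversine_km(image.lat, image.lon, m.lat, m.lon) <= max_km
    ]
```

For instance, given an image Observation taken at some point in Colaba, a message posted ten minutes later from a few hundred metres away would be kept, whereas a message from another city, or one posted several hours later, would be dropped. The surviving messages are the candidates that a domain model could then link thematically to the image.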

The bursty and high-throughput nature of citizen-sensor data, the thematic differences between messages, and the text’s unmediated and casual nature pose several interesting research challenges, such as determining the trustworthiness of information sources (http://news.yahoo.com/s/ap/20090511/ap_on_re_eu/eu_ireland_wikipedia_hoaxer), creating semantic models for general-purpose domains, and integrating application-specific semantic metadata across information sources.

Thematic Analysis and Casual Text

The problem of semantically integrating citizen-sensor data is nontrivial. On one hand, the social context surrounding the production of such data offers exciting opportunities, but on the other, this same social context gives rise to the content’s informal nature. Off-topic discussions are common, making it difficult to automatically identify context. Moreover, the content is often fragmented, doesn’t always follow grammar rules, and relies heavily on domain- or demographic-specific slang, abbreviations, and entity variations (using skik3 for SideKick 3, for example). Content from microblogging sites is rather terse by nature, so all these factors combined make the process of automatically identifying what a message is actually about that much harder.

We can define the semantic metadata extracted from citizen-sensor content as thematic information, which tells us more about the topic or theme underlying the content. In addition to the metadata encoded in citizen-sensor messages, we can extract semantic metadata from the messages themselves. In light of various reported events, integrating potentially multimodal data from different citizen-sensor sources using spatiotemporal and thematic information can significantly enhance situational understanding and awareness, which in turn plays a vital role in our response to such events.

Semantic annotation of content refers to the process of making data more meaningful through labels (via marking up, tagging, or annotating) that conform to an agreed-upon reference model, be it a common nomenclature, dictionary, taxonomy, folksonomy, or ontology that models a specific domain. Annotations with these vocabularies make Web-based documents and data understandable to machines as well as easier to integrate and analyze. When applications use ontology rules, whether simple or complex, explicitly stated or inferred from the ontology’s class properties and relationships, they can realize powerful reasoning over annotated data.
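
As a rough illustration of how annotations tied to a reference model enable reasoning, consider the following Python sketch. The tiny class hierarchy, the annotations, and the function names are all hypothetical and stand in for a real ontology and reasoner; the only "rule" applied is subclass transitivity.

```python
# Hypothetical toy ontology: each class maps to its direct superclass.
SUBCLASS_OF = {
    "Hotel": "Landmark",
    "Landmark": "Place",
    "City": "Place",
}

# Hypothetical annotations: text span -> ontology class it was labeled with.
ANNOTATIONS = {
    "Taj Hotel": "Hotel",
    "Mumbai": "City",
}


def ancestors(cls):
    """Return cls followed by all of its superclasses, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def instances_of(target, annotations=ANNOTATIONS):
    """Find annotated spans whose class is target or any subclass of target."""
    return sorted(
        span for span, cls in annotations.items() if target in ancestors(cls)
    )
```

Although nothing was annotated directly as a Place, a query such as instances_of("Place") returns both spans because the class relationships imply it. Real ontology languages and reasoners support far richer rules; this sketch only shows why conforming to a shared model pays off.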

User-generated content (UGC) and other observations from citizen-sensor networks have unique characteristics that set them apart from the traditional content found in news or scientific articles. Coupled with the issues associated with social media content mentioned earlier (such as textual informality), the task of annotation becomes even more challenging when entities named with English-language words (Stephen King’s novel It, Madonna’s album Music, or Why, Arizona, one of the state’s smaller cities) must be identified within informal text. This is an important challenge that Web 3.0 applications will consistently face: automatically creating accurate markups or annotations from UGC to common reference models.

The key to semantically annotating content is the process of identifying and disambiguating named entities. In short, semantic annotation transforms unstructured data into a structured representation that lets applications search, analyze, and aggregate information. When looking for information about General Motors, for example, semantically annotated content can return analyses on all its variations, such as GM, GenlMotors, and so on. Clearly, the roles of ontologies and knowledge bases in creating markups will be even more important than they were before the social Web’s explosive growth: not only can they act as common reference models, but they’ll also play a crucial role in inferring semantics behind UGC while supplementing well-known statistical and natural language processing (NLP) techniques. Consider this tweet from the Mumbai terror attacks: “mumbai taj 4th floor left wing fire, live on desitv.” Although natural language understanding is hard in itself, the noncapitalization of key entities such as Mumbai and the Taj Hotel makes for inaccurate natural language parse structures (compare Figures 3a and 3b, generated using the Berkeley Natural Language Parser at http://nlp.cs.berkeley.edu/). In such scenarios, knowing from a domain model that the Taj Hotel is a landmark in the city of Mumbai can offer meaningful support to the statistical strength of a corpus’s entities.
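
To make the idea concrete, here is a minimal Python sketch of dictionary-based entity spotting backed by a small domain model. It is an assumption-laden illustration rather than the approach used for Figure 3: the ALIASES table, the LOCATED_IN relation, and both function names are hypothetical, and a real system would combine such knowledge with statistical and NLP evidence instead of relying on exact lookups.

```python
import re

# Hypothetical alias table mapping lowercase surface forms to canonical entities.
ALIASES = {
    "general motors": "General Motors",
    "gm": "General Motors",
    "genlmotors": "General Motors",
    "mumbai": "Mumbai",
    "taj": "Taj Hotel",
    "taj hotel": "Taj Hotel",
}

# Hypothetical domain-model facts: landmark -> city that contains it.
LOCATED_IN = {"Taj Hotel": "Mumbai"}


def spot_entities(text):
    """Return canonical entities found in text, ignoring capitalization.

    Two-word aliases are tried before one-word aliases at each position.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        for n in (2, 1):
            window = tokens[i:i + n]
            phrase = " ".join(window)
            if len(window) == n and phrase in ALIASES:
                found.append(ALIASES[phrase])
                i += n
                break
        else:
            i += 1  # no alias starts at this token
    return found


def supported_by_domain_model(entities):
    """Keep entities whose containing city, per LOCATED_IN, was also spotted."""
    present = set(entities)
    return [e for e in entities if LOCATED_IN.get(e) in present]
```

Applied to the tweet quoted above, spot_entities finds both Mumbai and the Taj Hotel despite the missing capitalization, and supported_by_domain_model keeps the Taj Hotel reading because the city that the domain model associates with it appears in the same message. On its own, "taj" is ambiguous, which is exactly where this kind of contextual support helps.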