Difference between revisions of "Twarql"

From Knoesis wiki
Jump to: navigation, search
(Architecture)
(Architecture)
 
(33 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
Twarql: Twitter Feeds through SPARQL
 
Twarql: Twitter Feeds through SPARQL
 +
[[Image:FirstFrame.png|thumb||right|top||link=http://bit.ly/twarql|alt=Twarql Demonstration|Twarql Demonstration http://bit.ly/twarql]]
  
<code>
+
<!--
 
'''''NOTICE: Twarql is in pre-alpha stage.'''''  
 
'''''NOTICE: Twarql is in pre-alpha stage.'''''  
We are currently refactoring some parts of the code and stabilizing our REST API. We plan to release a first version of the source code and API by June 7th, 2010. <!--If you would like more information about the project, please visit our project page [[Linked Open Social Signals]].-->
+
We are currently refactoring some parts of the code and stabilizing our REST API. We plan to release a first version of the source code and API by June 28th, 2010. <!--If you would like more information about the project, please visit our project page [[Linked Open Social Signals]].
</code>
+
-->
  
 
== Introduction ==
 
== Introduction ==
 +
 +
Every day, Web users are using Twitter to simultaneously publish millions of microblog posts (microposts or ``tweets'') with opinions, observations and suggestions that may represent invaluable information for businesses and researchers around the world <ref>See recent statistics from Twitter at http://blog.twitter.com/2010/02/measuring-tweets.html</ref>
 +
Taking advantage from this "wisdom of the crowd" --- which refers to "''the process of taking into account the collective opinion of a group of individuals rather than a single expert to answer a question''"<ref>From Wikipedia, see http://en.wikipedia.org/wiki/Wisdom_of_the_crowd</ref> ---, Twitter data has been successfully used, for example, to forecast box-office revenues for movies <ref name=corr10asur>S. Asur and B. A. Huberman. Predicting the future with social media. CoRR, abs/1003.5699, 2010.</ref> or to manage earthquakes detection <ref name=www-earthquake>www-earthquake</ref>.
 +
However, analyzing the vast amount of microblog data published each second can be extremely challenging, especially in situations where it has to be done in real-time.
 +
 +
We developed Twarql to enable annotations and management of streaming tweets in order to alleviate information overload.
 +
Twarql encodes information from microblog posts as Linked Open Data in order to enable flexibility for those interested in collectively analyzing microblog data for sensemaking. Instead of requiring the use of keywords or custom software for filtering information, Twarql leverages a full fledged query language (SPARQL) that is much more expressive than keywords.
  
 
Our approach encompasses the following steps:
 
Our approach encompasses the following steps:
Line 14: Line 22:
 
* enable subscription to a stream of microposts that match a given query (Concept Feeds);
 
* enable subscription to a stream of microposts that match a given query (Concept Feeds);
 
* enable scalable real-time delivery of streaming data ([http://code.google.com/p/sparqlpush SparqlPuSH]).
 
* enable scalable real-time delivery of streaming data ([http://code.google.com/p/sparqlpush SparqlPuSH]).
 +
 +
Twarql is available at http://twarql.sf.net as open source and can be easily extended and deployed to enable Twitter monitoring systems that can be used in various contexts: brand tracking, disaster relief management, stock exchange monitoring, etc., as it flexible architecture makes easy to write such components. A screen capture demonstration is available [http://bit.ly/twarql here].
 +
 +
== Demonstration ==
 +
We have two demonstration videos. The [http://bit.ly/twarql first video] demonstrates the user perspective, interacting with the system to formulate a query and obtain microblog posts that match that query. The [http://bit.ly/sparqlpush second video] focuses on the server side and demonstrates the modules of our architecture at work, distributing the microposts via pubsubhubbub.
 +
 +
=== Deepwater Horizon Oil Spill Scenario ===
 +
You can try out our [http://knoesis1.wright.edu/twarql/ live demo] that is currently streaming tweets about the [http://en.wikipedia.org/wiki/Deepwater_Horizon_oil_spill oil spill]. The [http://knoesis1.wright.edu/twarql/featured.html featured streams] will illustrate concept feeds in action, and the [http://knoesis1.wright.edu/twarql/query.html query page] will allow you to define your own concept feed through our query formulation interface.
 +
 +
=== Brand Tracking Scenario: IPad ===
 +
For the Triplify Challenge 2010 we have collected tweets mentioning iPad from June 3rd until Jun 8th to demonstrate our system in a brand tracking scenario. <br>
 +
<!--
 +
The dataset is available via a SPARQL Endpoint http://.... (Powered by Openlink Virtuoso).
 +
Here are some example questions you can ask:
 +
- each of the use case queries that return results go here
 +
-->
 +
You can download datasets:
 +
* [http://knoesis1.wright.edu/library/tools/twarql/TwarqlIPadTweets.nt.gz Full Dataset]<br>
 +
Total Number of tweets : 511,147<br>
 +
Total Number of Triples : 4,479,631<br>
 +
Sentiment: 53,237 positive; 6,739 negative; 451,171 neutral.
 +
 +
*[http://knoesis1.wright.edu/library/tools/twarql/TwarqlIPadTweets-sample.nt.gz Sample Dataset]<br>
 +
Total Tweets: 584<br>
 +
Total Triples: 3983<br>
 +
 +
<!--
 +
* Location: XXX tweets with geolocation; YYY without geolocation. Most common location: ZZZ.
 +
* Competitors: XXX tweets with competitors; Most co-occuring competitor: ZZZ.
 +
-->
  
 
== Architecture ==
 
== Architecture ==
  
See the workflow between the components of the architecture
+
See the workflow between the components of the architecture:
 +
 
 
{{#widget:SlideShare
 
{{#widget:SlideShare
|doc=streamingannotatedtweets-100507114104-phpapp02
+
|id=4008063
 
|width=425
 
|width=425
 
|height=355
 
|height=355
 
}}
 
}}
 +
  
 
=== Tweet Annotation ===
 
=== Tweet Annotation ===
  
* extract content (entity mentions, hashtags and URLs) from microposts;
+
* extract content from microposts;
 +
** entity mentions (e.g. from DBpedia)
 +
** hashtags  
 +
** URLs
 +
** user mentions
 
* encode content in a structured format (RDF) using shared vocabularies (FOAF, SIOC, MOAT, etc.);
 
* encode content in a structured format (RDF) using shared vocabularies (FOAF, SIOC, MOAT, etc.);
 +
 +
We offer Twarql Annotation both as REST and Java APIs. You can download our source code and easily extend the annotation pipeline with your own extractors.
  
 
=== Concept Feeds ===
 
=== Concept Feeds ===
Line 50: Line 96:
 
Parameters:
 
Parameters:
 
* http://knoesis1.wright.edu/twarql'''/search'''?keyword=''k1,...,kn''&output=<output type>
 
* http://knoesis1.wright.edu/twarql'''/search'''?keyword=''k1,...,kn''&output=<output type>
** '''input''': keywords, output type (twitter-json, entities, sparql-json)
+
** '''input''': keywords, output type (tweets, entities, sparql)
 
** '''output''': tweets, entities, triples
 
** '''output''': tweets, entities, triples
  
Line 59: Line 105:
 
** #id
 
** #id
  
* http://knoesis1.wright.edu/twarql'''/stream'''?keyword=''k1,...,kn''&output=<output type>
+
* http://knoesis1.wright.edu/twarql'''/stream'''?keyword=''k1,...,kn''&id=<registered concept feed id>&output=<output type>
  
 
=== Output Formats ===
 
=== Output Formats ===
Line 124: Line 170:
 
== People ==
 
== People ==
  
You may contact us if you have any questions about the implementation or API. We try list our major contributions below our names so that you know to whom you should direct your question.
+
You may contact us if you have any questions about the implementation or API. We have listed our major contributions below our names so that you know to whom you should direct your question.
  
* Pablo Mendes
+
* [http://knoesis.wright.edu/students/pablo Pablo Mendes] (@pablomendes)
 
** Architecture, SPARQL client ([[Cuebee]]), Social Sensor (Extraction, Annotation), API, Documentation
 
** Architecture, SPARQL client ([[Cuebee]]), Social Sensor (Extraction, Annotation), API, Documentation
* Pavan Kapanipathi
+
* [http://knoesis.wright.edu/students/pavan Pavan Kapanipathi] (@pavankaps)
 
** Application Server, Semantic Publisher, Streaming SPARQL, API Content Negotiation
 
** Application Server, Semantic Publisher, Streaming SPARQL, API Content Negotiation
* Alex Passant
+
* [http://apassant.net Alex Passant] (@terraces)
 
** SparqlPuSH, Annotation Vocabularies
 
** SparqlPuSH, Annotation Vocabularies
 +
 +
[[Category:Microblogging]] [[Category:Information Extraction]] [[Category:Information Exploration]] [[Category:RDF]] [[Category:SPARQL]] [[Category:Linked Data]] [[Category:Annotation]] [[Category:Open Source]]
 +
 +
== References ==
 +
<references/>

Latest revision as of 17:17, 22 January 2016

Twarql: Twitter Feeds through SPARQL

Twarql Demonstration
Twarql Demonstration http://bit.ly/twarql


Introduction

Every day, Web users are using Twitter to simultaneously publish millions of microblog posts (microposts or ``tweets) with opinions, observations and suggestions that may represent invaluable information for businesses and researchers around the world <ref>See recent statistics from Twitter at http://blog.twitter.com/2010/02/measuring-tweets.html</ref> Taking advantage from this "wisdom of the crowd" --- which refers to "the process of taking into account the collective opinion of a group of individuals rather than a single expert to answer a question"<ref>From Wikipedia, see http://en.wikipedia.org/wiki/Wisdom_of_the_crowd</ref> ---, Twitter data has been successfully used, for example, to forecast box-office revenues for movies <ref name=corr10asur>S. Asur and B. A. Huberman. Predicting the future with social media. CoRR, abs/1003.5699, 2010.</ref> or to manage earthquakes detection <ref name=www-earthquake>www-earthquake</ref>. However, analyzing the vast amount of microblog data published each second can be extremely challenging, especially in situations where it has to be done in real-time.

We developed Twarql to enable annotations and management of streaming tweets in order to alleviate information overload. Twarql encodes information from microblog posts as Linked Open Data in order to enable flexibility for those interested in collectively analyzing microblog data for sensemaking. Instead of requiring the use of keywords or custom software for filtering information, Twarql leverages a full fledged query language (SPARQL) that is much more expressive than keywords.

Our approach encompasses the following steps:

  • extract content (entity mentions, hashtags and URLs) from microposts;
  • encode content in a structured format (RDF) using shared vocabularies (FOAF, SIOC, MOAT, etc.);
  • enable structured querying of microposts (SPARQL);
  • enable subscription to a stream of microposts that match a given query (Concept Feeds);
  • enable scalable real-time delivery of streaming data (SparqlPuSH).

Twarql is available at http://twarql.sf.net as open source and can be easily extended and deployed to enable Twitter monitoring systems that can be used in various contexts: brand tracking, disaster relief management, stock exchange monitoring, etc., as it flexible architecture makes easy to write such components. A screen capture demonstration is available here.

Demonstration

We have two demonstration videos. The first video demonstrates the user perspective, interacting with the system to formulate a query and obtain microblog posts that match that query. The second video focuses on the server side and demonstrates the modules of our architecture at work, distributing the microposts via pubsubhubbub.

Deepwater Horizon Oil Spill Scenario

You can try out our live demo that is currently streaming tweets about the oil spill. The featured streams will illustrate concept feeds in action, and the query page will allow you to define your own concept feed through our query formulation interface.

Brand Tracking Scenario: IPad

For the Triplify Challenge 2010 we have collected tweets mentioning iPad from June 3rd until Jun 8th to demonstrate our system in a brand tracking scenario.
You can download datasets:

Total Number of tweets : 511,147
Total Number of Triples : 4,479,631
Sentiment: 53,237 positive; 6,739 negative; 451,171 neutral.

Total Tweets: 584
Total Triples: 3983


Architecture

See the workflow between the components of the architecture:


Tweet Annotation

  • extract content from microposts;
    • entity mentions (e.g. from DBpedia)
    • hashtags
    • URLs
    • user mentions
  • encode content in a structured format (RDF) using shared vocabularies (FOAF, SIOC, MOAT, etc.);

We offer Twarql Annotation both as REST and Java APIs. You can download our source code and easily extend the annotation pipeline with your own extractors.

Concept Feeds

  • enable structured querying of microposts (SPARQL);
  • enable subscription to a stream of microposts that match a given query;

SPARQLPuSH

Twarql API

REST Endpoints

Summary:

  • URL scheme: http://<base-url>/<operation>/?<parameter>=<value>&...&output=<output format>
  • Base URL: http://knoesis1.wright.edu/twarql
  • Operations: search, register, stream, query
  • Output formats: twitter-json, sparql-json, entities

Parameters:

  • http://knoesis1.wright.edu/twarql/search?keyword=k1,...,kn&output=<output type>
    • input: keywords, output type (tweets, entities, sparql)
    • output: tweets, entities, triples

Output Formats

We also provide output according to the format presented on the twitter-api-announce message.

{
"text" : "hey @raffi tell @noradio to check out http://dev.twitter.com #hot",
...
"entities" : {
 "user_mentions" : [
 {
   "id" : 8285392,
   "screen_name" : "raffi",
   "indices" : [4, 9]
 },
 {
   "id" : 3191321,
   "screen_name" : "noradio",
   "indices" : [16, 23]
 }
],
"urls" : [
 {      "url" : "http://dev.twitter.com",
   "indices" : [38, 64]
 },
],
"hashtags" : [
 {      "text" : "#hot",
   "indices" : [66, 69]
   "url" : "http://search.twitter.com/search?q=%23hot"
 }
]
}
...
}

Error/Warning Messages

ERROR

  • Unknown Stream: You are requesting a stream id that was not registered.
  • Invalid Query: You are trying to register an invalid SPARQL query.
  • Unsupported Content-type: The requested content type is not supported.

WARNING

  • No Results: There are no results for the query.

Supported Clients

  • SPARQL Protocol-compliant Clients
    • Cuebee is a SPARQL query formulation and results exploration engine. We provide a TweetExplorer that can be directly plugged into Cuebee.


  • RSS/Atom clients
    • View SparqlPuSH

People

You may contact us if you have any questions about the implementation or API. We have listed our major contributions below our names so that you know to whom you should direct your question.

  • Pablo Mendes (@pablomendes)
    • Architecture, SPARQL client (Cuebee), Social Sensor (Extraction, Annotation), API, Documentation
  • Pavan Kapanipathi (@pavankaps)
    • Application Server, Semantic Publisher, Streaming SPARQL, API Content Negotiation
  • Alex Passant (@terraces)
    • SparqlPuSH, Annotation Vocabularies

References

<references/>