RT Events On LOD


Real Time Social Events on LOD

Introduction

Linked Open Data (LOD) describes a method of publishing structured data so that it can be interlinked and become more useful (Wikipedia). Transforming unstructured social data (tweets) into structured data and publishing it on LOD enriches its value. In this project we published social data related to on-going events on LOD in real time. We also developed a visualization tool for event-centric social data that visualizes trending entities and their relations from DBpedia (see Graph Visualization below).

In this project, we extended Twarql to collect most of the metadata that Twitter provides, and we extracted additional metadata by analysis, in order to transform each unstructured tweet into a structured form.

Architecture

The architecture extends that of Twarql to include the extraction of metadata from each tweet using Twitter Storm. Once the metadata is extracted, the tweet is transformed to RDF using a lightweight vocabulary (an extension of the vocabulary used for SMOB).

Figure 1. Real Time Social Events On Linked Open Data

Metadata Extraction

Different Phases of work

  1. Finding the difference between the metadata Twitter provides now and the metadata currently used by Twarql
Metadata provided by the Twitter Streaming API:
  • text - The content of the tweet.
  • favorited - Whether the status has been favorited.
  • created_at - UTC timestamp of tweet creation.
  • in_reply_to_screen_name - Screen name of the user this tweet replies to.
  • in_reply_to_status_id - Status id of the tweet this tweet replies to.
  • entities - Twitter now provides parsed entities. Instead of parsing the text yourself to extract them, you can use the entities attribute, which contains this parsed and structured data.
  • user_mentions - An array of Twitter screen names extracted from the tweet text.
  • urls - An array of URLs extracted from the tweet text.
  • hashtags - An array of hashtags extracted from the tweet text.
  • geo - Whether the user has enabled geo-location.
  • place - The place from which the user tweeted.
  • coordinates - The coordinates of the origin of the tweet.
  • retweeted - Whether this tweet is a retweet.
  • truncated - Whether the tweet is truncated.
  • user - The author of the tweet, with its various attributes.
  • in_reply_to_user_id - User id of the user this tweet replies to.
  • id - Unique id of the tweet.

Metadata used by Twarql as of now:
  • id
  • user
  • text
  • geo
  • place
  • coordinates
  • created_at

Note: There is a clear gap between the metadata Twarql currently uses and what the Twitter Streaming API provides. It is also more reliable to use the entities provided by the Twitter API than to rely on our own extraction algorithms.
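For illustration, the sketch below shows how these extra fields can be read from a single streamed status. It assumes the Twitter4J client library; Twarql's actual collection code may use a different client, and the callback that receives statuses from the Streaming API is omitted.

import twitter4j.HashtagEntity;
import twitter4j.Status;
import twitter4j.URLEntity;
import twitter4j.UserMentionEntity;

// Minimal sketch: pulling the richer metadata out of one streamed status.
// Assumes Twitter4J; Twarql's real collector may differ.
public class StatusMetadataSketch {

    public static void inspect(Status status) {
        // Fields Twarql already uses
        long id = status.getId();
        String text = status.getText();
        String screenName = status.getUser().getScreenName();
        java.util.Date createdAt = status.getCreatedAt();

        // Additional metadata available from the Streaming API
        boolean favorited = status.isFavorited();
        boolean truncated = status.isTruncated();
        String inReplyTo = status.getInReplyToScreenName();
        long inReplyToStatusId = status.getInReplyToStatusId();

        System.out.println(id + " by @" + screenName + " at " + createdAt + ": " + text);
        System.out.println("favorited=" + favorited + " truncated=" + truncated
                + " in_reply_to=" + inReplyTo + "/" + inReplyToStatusId);

        // Pre-parsed entities, so we do not have to extract them ourselves
        for (HashtagEntity h : status.getHashtagEntities()) {
            System.out.println("hashtag: " + h.getText());
        }
        for (UserMentionEntity m : status.getUserMentionEntities()) {
            System.out.println("mention: " + m.getScreenName());
        }
        for (URLEntity u : status.getURLEntities()) {
            System.out.println("url: " + u.getURL());
        }
    }
}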

  2. Coming up with a schema for the remaining metadata (fields not currently used by Twarql)
  3. Extracting entities from the tweets and finding the corresponding DBpedia URL for each entity
  4. Converting the data into RDF triples (see the sketch after this list)
  5. Storing them in a triple store -- using Virtuoso
  6. Publishing them on the web: http://twarql.org/resource/page/post/126824738495545344
  7. Accessing them using SPARQL queries: http://twarql.org:8890/sparql
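A minimal sketch of step 4 (converting a tweet into RDF triples) using Apache Jena is shown below. The moat:taggedWith property matches the one used in the SPARQL queries further down this page; the sioc:content property, the post URI pattern, and the example DBpedia entity are assumptions made only for illustration.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

// Minimal sketch of step 4: turning one extracted tweet into RDF triples.
// moat:taggedWith follows the property used in the SPARQL queries on this page;
// sioc:content and the post URI pattern are assumptions for illustration.
public class TweetToRdfSketch {

    static final String SIOC = "http://rdfs.org/sioc/ns#";
    static final String MOAT = "http://moat-project.org/ns#";

    public static Model toRdf(long tweetId, String text, String dbpediaEntityUri) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("sioc", SIOC);
        model.setNsPrefix("moat", MOAT);

        Property content = model.createProperty(SIOC, "content");
        Property taggedWith = model.createProperty(MOAT, "taggedWith");

        // Assumed URI pattern for the published post resources
        Resource post = model.createResource("http://twarql.org/resource/post/" + tweetId);
        post.addProperty(content, text);
        post.addProperty(taggedWith, model.createResource(dbpediaEntityUri));

        return model;
    }

    public static void main(String[] args) {
        Model m = toRdf(126824738495545344L, "India Against Corruption ...",
                "http://dbpedia.org/resource/Anna_Hazare");
        m.write(System.out, "TURTLE");  // the resulting triples can then be loaded into Virtuoso
    }
}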

What have I learned

  • What is Linked Open Data (LOD)?
  • What is DBpedia?
  • Different ontologies.
  • What is RDF? How to create a triple? And how to store triples in a triple store?
  • What is the SPARQL query language? How to write SPARQL queries?

Graph Visualization

Work that has been done:

  • Find a good triple store visualization library: I searched the internet extensively for well-written, clean, and good-looking graph visualization libraries that could display named entities, lines representing the relationships between those entities, and entity sizes that vary with frequency. The best one I found is the JavaScript InfoVis Toolkit [1]. It is a very well-rounded visualization library with many options for all kinds of graphs; the style that best fits this project is the "force directed" graph. I partially implemented the project, stubbing out the pull method that gets the data from the RDF database. The graph uses JSON for its data and jQuery to display it.
  • Put together a demo HTML page that links the jQuery library, the "pull.php" file, and the graph visualization library.
  • After showing the demo to Pavan, it was decided that the graph looked too plain, so more styling was added to the test page, not only in CSS but also in the JavaScript and graph code. For instance, the lines between the entities were made thicker, a loading bar was added to show the graph loading, and the code was changed to make the entities appear closer together on the graph.
  • Implement a function that gets JSON from pull.php into JavaScript form so that graph.js has data to graph. I used this online tutorial [2] to learn how to pass JSON from PHP to JavaScript.
  • Implement a pull.php file that accesses DBpedia directly to send queries and retrieve JSON using GET and POST; I modified and extended code from this blog [3] to do this (a sketch of the equivalent request appears after this list).
  • The pull.php file also converts the JSON returned for the DBpedia ontologies into a format the graph can understand, with fields such as "entity" and "relationship".
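pull.php itself is written in PHP, but the request it sends is an ordinary HTTP GET to the DBpedia SPARQL endpoint asking for JSON results. For illustration only, here is roughly the same request as a Java sketch; the query string is just an example, not the one pull.php actually sends.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Illustration only: the same kind of GET request pull.php sends to DBpedia,
// asking the SPARQL endpoint for JSON results. The query is just an example.
public class DbpediaPullSketch {

    public static void main(String[] args) throws Exception {
        String query = "SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Anna_Hazare> ?p ?o } LIMIT 10";
        String url = "http://dbpedia.org/sparql?query=" + URLEncoder.encode(query, "UTF-8")
                + "&format=" + URLEncoder.encode("application/sparql-results+json", "UTF-8");

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");

        // Read the JSON result set; pull.php reshapes this into the
        // {entity, relationship} structure the graph expects.
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            json.append(line).append('\n');
        }
        in.close();
        System.out.println(json);
    }
}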

What I Learned

  • How to write and edit JavaScript, and how to include external JavaScript files in a website.
  • How to write and modify jQuery, and how to include it in a web project.
  • What JSON is, how it works, and how it gets transferred across multiple languages such as JavaScript and PHP.
  • Gained experience with RDF, SPARQL, triple-based databases, and databases in general: how to query them, etc.
  • How arrays work in PHP, and how they can be transferred to and from JSON.
  • How to work with, communicate with, and collaborate with a team of developers and programmers to complete a project.

Twitter Storm

Kurtis -- Storm
Storm is an open-source computing platform that provides a set of language-agnostic primitives to perform distributed computation on real-time data. Storm performs transformations on streams, or "unbounded sequence[s] of tuples", using the spout and bolt primitives. Spouts are sources of streams. Bolts are single-step transformations on a stream. Spouts deliver streams to bolts. Bolts may manipulate those streams and deliver them as tuples to other bolts. Bolts can be grouped, which allows data to be pushed to a matching task. The complete set of stream transformations is called a topology.
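A sketch of how such a topology could be wired together with the (2011-era) backtype.storm API is given below. The spout and bolt bodies are placeholders, not the project's actual implementation; they only show where the Streaming API input, the metadata extraction, and the grouping would sit.

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Sketch of a tweet-processing topology; the spout/bolt bodies are placeholders.
public class TwarqlTopologySketch {

    // Spout: source of the stream. A real spout would read the Twitter Streaming API.
    public static class TweetSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            // Placeholder tweet; a real implementation emits live statuses.
            collector.emit(new Values(126824738495545344L, "India Against Corruption ..."));
            Utils.sleep(1000);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("tweetId", "text"));
        }
    }

    // Bolt: one-step transformation, e.g. extracting entities/hashtags/URLs.
    public static class MetadataExtractionBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            long tweetId = input.getLongByField("tweetId");
            String text = input.getStringByField("text");
            // A real bolt would emit the extracted metadata instead of the raw text.
            collector.emit(new Values(tweetId, text));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("tweetId", "text"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TweetSpout(), 1);
        // Fields grouping pushes all tuples with the same tweetId to the same task.
        builder.setBolt("extract", new MetadataExtractionBolt(), 2)
               .fieldsGrouping("tweets", new Fields("tweetId"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("twarql-events", new Config(), builder.createTopology());
    }
}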

Use Case

One of the main use cases for this application is real-time querying of a semantic social stream.

Use Case -- INDIA AGAINST CORRUPTION

Data that we worked on

We worked on the tweets collected for India Against Corruption; we used the Twitris database for this.

Here are some statistics

  • Total number of tweets (microposts) -- 116,001
  • Number of entities in tweets -- 85,834

Data that we created

  • Number of tweets that have at least one entity -- 64,691
  • Number of tweets that have more than one entity -- 21,143
  • Number of distinct persons mentioned in these tweets -- 363 (out of 16,197 person mentions in total)
  • Number of distinct places mentioned in these tweets -- 479 (out of 6,030 place mentions in total)

The Big Thing

  • We have created 1,262,627 (about 1.26 million) triples.

Published Data

We have successfully published all the data on the web: http://twarql.org/resource/page/post/126824738495545344

SOME SPARQL QUERIES

Here is the link to the demo of our project, where SPARQL queries are used to fetch the desired information from the LOD data we have published.

http://twitris.knoesis.org/iac/frontend/twitrisMainPage/index.php (Search&Explore tab).

SPARQL queries we have used.

  • Most Spoken About Places
SELECT ?place (COUNT(?place) AS ?placecount) WHERE {
?tweet <http://moat-project.org/ns#taggedWith> ?place .
?place a <http://dbpedia.org/ontology/Place> .
} GROUP BY ?place ORDER BY DESC(?placecount)
  • Names of the politicians mentioned in tweets with positive sentiment
SELECT ?person (COUNT(?person) AS ?personcount) WHERE {
?tweet <http://moat-project.org/ns#taggedWith> ?person .
?person a <http://dbpedia.org/ontology/Person> .
?tweet <http://twarql.org/resource/property/sentiment> <http://twarql.org/resource/property/Positive> .
} GROUP BY ?person ORDER BY DESC(?personcount)
  • Persons mentioned in this event who are both politicians and engineers
SELECT ?person (COUNT(?person) AS ?personcount) WHERE {
?tweet <http://moat-project.org/ns#taggedWith> ?person .
?person a <http://dbpedia.org/ontology/Person> .
?person <http://dbpedia.org/property/profession> <http://dbpedia.org/resource/Politician> .
?person <http://dbpedia.org/property/profession> <http://dbpedia.org/resource/Engineer> .
} GROUP BY ?person ORDER BY DESC(?personcount)
  • Each person and their profession
SELECT DISTINCT ?person ?profession WHERE {
?tweet <http://moat-project.org/ns#taggedWith> ?person .
?person a <http://dbpedia.org/ontology/Person> .
?person <http://dbpedia.org/property/profession> ?profession .
}
  • Politicians spoken about in a Place
SELECT ?place ?person (COUNT(?person) AS ?personcount) WHERE {
?tweet <http://moat-project.org/ns#taggedWith> ?place .
?tweet <http://moat-project.org/ns#taggedWith> ?person .
?place a <http://dbpedia.org/ontology/Place> .
?person a <http://dbpedia.org/ontology/Person> .
?person <http://dbpedia.org/property/profession> <http://dbpedia.org/resource/Politician> .
} GROUP BY ?place ?person ORDER BY DESC(?personcount)
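Any of the queries above can also be run programmatically against the public endpoint listed earlier (http://twarql.org:8890/sparql). Below is a minimal sketch using Jena ARQ's remote query support, with the "Most Spoken About Places" query; it is only an illustration of how a client might consume the published data.

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;

// Minimal sketch: running the "most spoken about places" query against the
// Virtuoso SPARQL endpoint where the triples were published.
public class EndpointQuerySketch {

    public static void main(String[] args) {
        String query =
            "SELECT ?place (COUNT(?place) AS ?placecount) WHERE { " +
            "  ?tweet <http://moat-project.org/ns#taggedWith> ?place . " +
            "  ?place a <http://dbpedia.org/ontology/Place> . " +
            "} GROUP BY ?place ORDER BY DESC(?placecount)";

        QueryExecution qe = QueryExecutionFactory.sparqlService("http://twarql.org:8890/sparql", query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("place") + "\t" + row.get("placecount"));
            }
        } finally {
            qe.close();
        }
    }
}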

PROJECT DEMO LINK

http://www.kiddiescissors.com/twitvis

References

Summary: In this paper, the authors discuss the collection, semantic annotation, analysis, and distribution of real-time social signals (mainly Twitter micro feeds). For the semantic annotation part, they enrich each microblog post (tweet) using Semantic Web technologies such as common representation languages, domain models (ontologies), and shared knowledge models on the web. They propose a software architecture for semantic annotation. They also discuss how they use RDF(S)/OWL data formats (FOAF, SIOC, OPO, MOAT, etc.) for this modeling in order to provide easy reuse across Semantic Web based applications, notably by using SPARQL for querying.
They argue that background knowledge changes the way you look at information because it puts information into context, which is essential for microposts (tweets): they are short and therefore individually lack the volume of information needed to provide an informative context.

Work Schedule

Pramod

  1. Difference between the metadata currently collected and the newly identified metadata
  2. Schema for the tweets
  3. Event-based schema for the tweets
  4. Code to count the most frequent DBpedia entities
  5. Real-time modification of the counts in RDF at the triple store

Dylan

  1. Find a visualization library
  2. Integrate it
  3. Modify it to the project's specific needs
  4. Implement pull function to fill it with data

Kurtis

  1. Storm (see the Twitter Storm section above)

Team

  • Kurtis -- (email required)
  • Pramod Koneru -- koneru@knoesis.org
  • Dylan Williams -- dylan@kiddiescissors.com
  • Pavan Kapanipathi -- pavan@knoesis.org