RT Events On LOD

Real Time Social Events on LOD

Introduction

Linked Open Data (LOD) describes a method of publishing structured data so that it can be interlinked and become more useful (Wikipedia). Transforming unstructured social data (tweets) into structured data and publishing it on LOD enriches its value. In this project we published social data related to ongoing events on LOD in real time. We also developed a visualization tool for event-centric social data that visualizes trending entities, with relations drawn from DBpedia (see Graph Visualization below).

In this project, we extended Twarql to collect most of the metadata that Twitter provides, and also extracted additional metadata by analysis, in order to transform each unstructured tweet into a structured form.

Architecture

The architecture extends that of Twarql to include the extraction of metadata from tweets using Twitter Storm. Once the metadata is extracted, the tweet is transformed to RDF using a lightweight vocabulary (an extension of the vocabulary used for SMOB). We used DBpedia for background knowledge.

Figure 1. Real Time Social Events On Linked Open Data
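To make the extraction step concrete, here is a minimal sketch of a Storm bolt along these lines. The class and field names are hypothetical, and it assumes an upstream spout emits the raw Streaming API payload as a field named "json"; the bolt pulls the hashtag entities out of each tweet and emits one (tweetId, hashtag) pair for a downstream RDF-generation bolt.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import org.json.JSONArray;
import org.json.JSONObject;

// Hypothetical bolt: extracts hashtag entities from the raw tweet JSON.
public class HashtagExtractionBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // The spout is assumed to emit the raw Streaming API payload as "json".
        JSONObject tweet = new JSONObject(input.getStringByField("json"));
        String tweetId = tweet.getString("id_str");
        JSONArray hashtags = tweet.getJSONObject("entities").getJSONArray("hashtags");
        for (int i = 0; i < hashtags.length(); i++) {
            String tag = hashtags.getJSONObject(i).getString("text");
            // One output tuple per hashtag found in the tweet.
            collector.emit(new Values(tweetId, tag));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweetId", "hashtag"));
    }
}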

Metadata Extraction

Different Phases of Work

  1. Finding the difference between the metadata that Twitter now provides and the metadata currently used by Twarql

Metadata provided by the Twitter Streaming API:

  • text - The content of the tweet.
  • favorited - Whether the status has been favorited.
  • created_at - UTC timestamp of tweet creation.
  • in_reply_to_screen_name - The screen name of the user the tweet replies to.
  • in_reply_to_status_id - The id of the status the tweet replies to.
  • entities - Twitter now provides parsed entities. Instead of parsing the text yourself to try to extract them, you can use the entities attribute, which contains this parsed and structured data.
  • user_mentions - An array of Twitter screen names extracted from the tweet text.
  • urls - An array of URLs extracted from the tweet text.
  • hashtags - An array of hashtags extracted from the tweet text.
  • geo - Whether the user has enabled geo-location.
  • place - The place from which the user tweeted.
  • coordinates - The coordinates of the origin of the tweet.
  • retweeted - Whether this tweet is a retweet.
  • truncated - Whether the tweet is truncated.
  • user - The user object, with various attributes.
  • in_reply_to_user_id - The id of the user the tweet replies to.
  • id - Unique id of the tweet.

Metadata used by Twarql as of now:

  • id
  • user
  • text
  • geo
  • place
  • coordinates
  • created_at

Note: there is thus a clear gap between the metadata Twarql currently uses and what the Twitter Streaming API provides. It is also more reliable to use the entities provided by the Twitter API than to rely on our own extraction algorithms.
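For illustration, here is a trimmed, entirely made-up Streaming API payload showing the entities attribute arriving pre-parsed (only the fields relevant here are shown):

{
  "id_str": "121212121212121212",
  "text": "Support #IAC http://t.co/example @someuser",
  "entities": {
    "hashtags": [ { "text": "IAC", "indices": [8, 12] } ],
    "urls": [ { "url": "http://t.co/example", "expanded_url": "http://example.org/", "indices": [13, 32] } ],
    "user_mentions": [ { "screen_name": "someuser", "id_str": "12345", "indices": [33, 42] } ]
  }
}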

  2. Coming up with a schema for the remaining metadata (the fields not yet used in Twarql)
  3. Extracting entities from the tweets and finding the corresponding DBpedia URL for each entity
  4. Converting the data into RDF triples (see the sketch after this list)
  5. Storing them in a triple store -- using Virtuoso
  6. Publishing them on the web: http://twarql.org/resource/page/post/126824738495545344
  7. Accessing them using SPARQL queries: http://twarql.org:8890/sparql
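As a sketch of steps 4 and 5, the snippet below uses Apache Jena (one possible library choice, not necessarily the one used inside Twarql) to build a triple linking a tweet to a DBpedia entity via the moat taggedWith property that also appears in the queries further down. The tweet and entity URIs are illustrative; the resulting N-Triples output can then be loaded into Virtuoso.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class TweetToRdf {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Illustrative URIs: a tweet resource and a DBpedia entity found in it.
        Resource tweet = model.createResource("http://twarql.org/resource/post/126824738495545344");
        Property taggedWith = model.createProperty("http://moat-project.org/ns#taggedWith");
        Resource entity = model.createResource("http://dbpedia.org/resource/Anna_Hazare");

        // One triple: the tweet is tagged with the DBpedia entity.
        model.add(tweet, taggedWith, entity);

        // Serialize as N-Triples, a format Virtuoso's loader accepts.
        model.write(System.out, "N-TRIPLE");
    }
}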

What have I learned

  • What is Linked Open Data (LOD)?
  • What is DBpedia?
  • Different ontologies.
  • What is RDF? How to create a triple? And how to store triples in a triple store?
  • What is the SPARQL query language? How to write SPARQL queries?

Graph Visualization

Work that has been done:

  • Find a good triple-store visualization library: I searched the internet for a well-written, clean, good-looking graph visualization library that could display named entities, lines representing the relationships between those entities, and entity sizes that vary with frequency. The best one I found is the JavaScript InfoVis Toolkit [1], a well-rounded visualization library with many options for all kinds of graphs. The style that best fit this project was the "force directed" graph. I partially implemented the project, stubbing out the pull method that gets the data from the RDF database. The graph uses JSON for its data and jQuery to display it.
  • Put together a demo HTML page that links in the jQuery library, the "pull.php" file, and the graph visualization library.
  • After showing the demo to Pavan, it was decided that the graph looked too plain, so more styling was done to the test page, not only in the CSS but also in the JavaScript and graph code: for instance, the lines between entities were made thicker, a loading bar was added to show the graph loading, and the code was changed to make the entities appear closer together on the graph.
  • Implement a function that gets the JSON from pull.php into JavaScript form so that graph.js has data to graph. I used this online tutorial [2] to learn how to pass JSON from PHP to JavaScript.
  • Implement a pull.php file that accesses DBpedia directly to send queries and retrieve JSON using GET and POST. I modified and extended code from this blog [3] to do this.
  • pull.php also converts the JSON returned for the DBpedia ontologies into a format the graph can understand, with fields such as "entity" and "relationship" (see the example after this list).
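The target format is the node/adjacency JSON that the InfoVis Toolkit's force-directed graph consumes. A minimal hand-written example (entity names and numbers are made up) looks like the following, where $dim sizes a node by entity frequency and $lineWidth controls the thickness of an edge:

[
  {
    "id": "http://dbpedia.org/resource/Anna_Hazare",
    "name": "Anna Hazare",
    "data": { "$dim": 20 },
    "adjacencies": [
      { "nodeTo": "http://dbpedia.org/resource/India", "data": { "$lineWidth": 3 } }
    ]
  },
  {
    "id": "http://dbpedia.org/resource/India",
    "name": "India",
    "data": { "$dim": 12 },
    "adjacencies": []
  }
]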

What I Learned

  • How to write and edit JavaScript, and how to include external JavaScript files in a website.
  • How to write and modify jQuery, and how to include it in a web project.
  • What JSON is, how it works, and how it is transferred between languages such as JavaScript and PHP.
  • Experience with RDF, SPARQL, and triple stores (and databases in general): how to query them, etc.
  • How arrays work in PHP, and how they can be converted to and from JSON.
  • How to work, communicate, and collaborate with a team of developers and programmers to complete a project.

Use Case

The main use case for this application is real-time querying of a semantic social stream.

Use Case -- Twitris/IAC (India Against Corruption)

Data that we worked on

We worked on the tweets collected for India Against Corruption, drawing on the Twitris database.

Here are some statistics

  • Total number of tweets (microposts) -- 116001
  • Number of entities in tweets -- 85834

Data that we created

  • Number of tweets that have at least one entity -- 64691
  • Number of tweets that have more than one entity -- 21143
  • Number of distinct persons mentioned in these tweets -- 363 (16197 person mentions in total)
  • Number of distinct places mentioned in these tweets -- 479 (6030 place mentions in total)

The Big Thing

  • We have created 1262627 (about 1.26 million) triples.

Published Data

We have successfully published all the data on the web: http://twarql.org/resource/page/post/126824738495545344

Some SPARQL Queries

SPARQL queries we have used.

  • Most Spoken About Places

SELECT ?place (COUNT(?place) AS ?placecount) WHERE {
   ?tweet <http://moat-project.org/ns#taggedWith> ?place .
   ?place a <http://dbpedia.org/ontology/Place> .
} GROUP BY ?place ORDER BY DESC(?placecount)

  • Give the names of the persons mentioned in tweets with positive sentiment

SELECT ?person (COUNT(?person) AS ?personcount) WHERE {
   ?tweet <http://moat-project.org/ns#taggedWith> ?person .
   ?person a <http://dbpedia.org/ontology/Person> .
   ?tweet <http://twarql.org/resource/property/sentiment> <http://twarql.org/resource/property/Positive> .
} GROUP BY ?person ORDER BY DESC(?personcount)

  • Find the persons mentioned in this event who are both politicians and engineers

SELECT ?person (COUNT(?person) AS ?personcount) WHERE {
   ?tweet <http://moat-project.org/ns#taggedWith> ?person .
   ?person a <http://dbpedia.org/ontology/Person> .
   ?person <http://dbpedia.org/property/profession> <http://dbpedia.org/resource/Politician> .
   ?person <http://dbpedia.org/property/profession> <http://dbpedia.org/resource/Engineer> .
} GROUP BY ?person ORDER BY DESC(?personcount)

  • Persons and their professions

SELECT DISTINCT ?person ?profession WHERE {
   ?tweet <http://moat-project.org/ns#taggedWith> ?person .
   ?person a <http://dbpedia.org/ontology/Person> .
   ?person <http://dbpedia.org/property/profession> ?profession .
}

  • Politicians spoken about in a Place

SELECT ?place ?person (COUNT(?person) AS ?personcount) WHERE {
   ?tweet <http://moat-project.org/ns#taggedWith> ?place .
   ?tweet <http://moat-project.org/ns#taggedWith> ?person .
   ?place a <http://dbpedia.org/ontology/Place> .
   ?person a <http://dbpedia.org/ontology/Person> .
   ?person <http://dbpedia.org/property/profession> <http://dbpedia.org/resource/Politician> .
} GROUP BY ?place ?person ORDER BY DESC(?personcount)

Project Demo Links

Here we use SPARQL queries to fetch the desired information from the LOD we have published (back end only).
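For instance, a small Java client (again using Apache Jena as an illustrative choice) can run the "most spoken about places" query above against the public endpoint:

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;

public class EndpointDemo {
    public static void main(String[] args) {
        String query =
            "SELECT ?place (COUNT(?place) AS ?placecount) WHERE { " +
            "  ?tweet <http://moat-project.org/ns#taggedWith> ?place . " +
            "  ?place a <http://dbpedia.org/ontology/Place> . " +
            "} GROUP BY ?place ORDER BY DESC(?placecount)";

        // Send the query to the public Virtuoso endpoint over HTTP.
        QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://twarql.org:8890/sparql", query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.getResource("place") + " : " + row.getLiteral("placecount"));
            }
        } finally {
            qe.close();
        }
    }
}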

References

Work Schedule

Pramod

  1. Identify the difference between the metadata currently being collected and the metadata found to be available
  2. Schema for the tweets
  3. Event-based schema for the tweets
  4. Code to count the most frequent DBpedia entities
  5. Real-time modification of the count in RDF at the triple store

Dylan

  1. Find a visualization library
  2. Integrate it
  3. Modify it to the project's specific needs
  4. Implement pull function to fill it with data

Presentation

Team

  • Pavan Kapanipathi -- pavan@knoesis.org
  • Pramod Koneru -- koneru@knoesis.org
  • Dylan Williams -- dylan@kiddiescissors.com