Similarity rank

From Knoesis wiki
Jump to: navigation, search

Similarity rank is designed to rank similar items in a graph based on Wikipedia. First the wikipedia graph is transformed into its matrix transformation, and cosine similarity is used to calculate the value.


Wikipedia Graph

A DirectedSparseMultigraph is used to create the graph, from the jung package. There are four types of edges: RegularEdge, SeeAlsoEdge, RegularBackEdge, SeeAlsoBackEdge. This is transformed into a matrix. Where each row and column represent a vertix in the graph and the value of the cell is the wieght of the edge. Since a matrix cannot have more than one edge, a single weight must be chosen. First it selects the highest weight and uses that, but if it is doubly-linked, then a different(higher) value is used, chosen from the constants DoubleRegularWeight and DoubleSeeAlsoWeight. The matrix is sparse and uses a SortedArrayMap for the rows.


The ranking

Cosine similarity is used to calculate the ranking:


 \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|}.

With this formula, dissimilarity can be measured as well as similarity, So if negative weights can be given(to show that two nodes have a degree of oppositeness) we can measure if two concepts are related, similar, or dissimilar.

It takes advantage of the sorted rows to calculate the intersection of two rows.


Running SimRank

SimRank jar can be run and it will output all the similarity ranks of wikipedia.The first parameter is the location of the wikipedia xml. It also has a second optional parameter that can be used to limit how many articles will be read from wikipedia. If the second parameter is used the process needs to be killed manually because the wikipedia iterator thread will block, and java provides no effective way to kill threads.