Annotating entries in the browser

From Knoesis wiki
Jump to: navigation, search

Entries are annotated by first, transforming the text to its stem and removing stop words and non-alpha characters, and then searching through a optimized prefix tree.


Prefix Tree

A prefix tree is a a tree where each node in the the tree represents a character in the string. Each node in the tree can be costly since most of time each node has only one child. So instead, all the consecutive nodes with one child are combined into one node. As it steps through the tree, each node is marked to signal that the end of an entry is found. Since the tree is only going to store alpha characters, 7 bits is used to store the character and 1 bit is used to mark the node.


Text Transformations

To transform the text, a non-alpha character is used as a delimiter to split the text into words. The first words are checked to see if they are stop words, if they are, they are skipped until a non-stop word is found. When this happens, the words are transformed to their stem using the Porter stemming algorithm. Then each stem word is added to the tree without spaces.


Searching

When searching a similar process is followed to transform the text, but when a word is found in the tree, it tries to keep searching for a longer entry. For example, if “brain” is found in the tree, it will keep searching to see if it can find a longer entry such as “brain activity” or “brain activity monitor”. If it cannot find another longer entry, it will go back to the word after the last entry found, and will search from the top of the tree again. For example if it had the text “brain memory processes”, and “brain” was in the tree and “memory processes” was in the tree also, it would first find “brain” then try to continue to find more, however “brain memory” is not in the the tree. So it goes back to the word “memory” and begins to find “memory” and then “memory processes”.

To get the ids associated with an entry, a hash table is used to map an entry to a set of ids.


Details

In the tree, two types of nodes are used a string node, which used to represent consecutive nodes that have only one child, and a branch node, which used for all nodes with multiple children. Adding entries to the tree can be complex because of the need to transform some nodes from a string node to a branch node. To insert an entry, we start at the root node which is a branch node. First, it checks if there is a child node for the first character. If the first character doesn't exist it adds a child string node with the rest of the characters. If the child node exist then it checks if it is a branch node. If is a branch node then it repeats this process for the rest of the string. If it is a string node then it check the child of the string node to see it needs to convert the child node to a branch node. If it doesn't then it tries to insert to the string node. If it doesn't need to be inserted at the end of the string node then it converts the rest of the string node to a branch node.