Prefix Tree

From Knoesis wiki
Jump to: navigation, search

A prefix tree is an associative array that matches a string to some value or values. Each node in the prefix tree stores a part of the key determined by the position in the tree. Here are the advantages of using a prefix tree over a hash map:

  • A partial lookup can be done with a prefix tree, whereas a hash map can only perform an exact fit. This includes finding all entries that start with a certain characters and using a wildcard character for certain types of ambiguous look up.
  • A prefix tree can be more memory efficient than a hash map for a large number of short strings since the characters of each string is shared among the nodes
  • When certain entities with vary lengths need to be annotated in free text, a prefix tree can be very efficient.

Using the prefix tree

The easiest way to use a prefix tree is use the default constructor like this:

PrefixTree<String> tree = new PrefixTree<String>();

This is a tree with String values. However, the prefix tree can be customized. For example, a tree can have each node represent a character or it some cases it could be more memory efficient for each node to have a word. This can be accomplish like this:

//Each node is character
PrefixTree<String> tree = new PrefixTree<String>(new PrefixTree.ByCharacters());

//Each node is word
PrefixTree<String> tree = new PrefixTree<String>(new PrefixTree.ByWords());

This can be further customized with user-defined representations. There are two interfaces to accomplish this. First is the Reader interface:

public interface Reader
{
    boolean next();

    String current();

    int startIndex();

    int endIndex();

    boolean setIndex(int i);

    List<String> toList();
}

This interface is used to feed character(s) to the each node in the tree. Then to transform a String to the Reader interface, the Transformer<String, Reader> interface must be implemented, and passed to the constructor of the PrefixTree.

There are some classes to help provide some extended support for the tree. WordTransformer class reads word by word(alphanumeric words, that is) and then allows another Transformer to be passed in so a transformation can be applied for each word.(Such that each word could be stemmed using the StemmerTransformer)

Wildcards

Wildcards can be used in the tree. A wildcard is character that could represent any character. For example:

PrefixTree<String> tree = new PrefixTree<String>(new PrefixTree.ByCharacters());
tree.setWildcard("*");
tree.add("t*m", "true");
tree.get("tim"); //returns "true"
tree.get("tom"); //returns "true"

It is important to note that the wildcard is a node and not just a character, so if ByWords is used then it would be a wildcard word rather than a wildcard character.