Quinary Search Tree

update (2016-04-05):
    Ternary Search Tree (improved)的实现有误,将稍后更正。

There is always a topic that how to store a set of strings. Throughout the years, many kinds of data structure had been proposed and applied, for instance, hash tables, binary search tree, digital search tree, ternary search tree and so on and so forth.

如何存储一组字符串总是一个经久不衰的话题。在这些年间,有许多种不同的数据结构不断被提出和应用,例如哈希表、二分搜索树、数字查找树、三元搜索树等等。

Let's begin with ternary search tree which is proposed by Jon and Robert at Pricenton[1]. There is one more pointer in the node compared with binary search tree, and it's the exactly minor change that not only makes it combine the time efficiency of digital tries with the space efficiency of binary search trees, but also provides more features like prefix search and fuzz search.

让我们从三元搜索树开始,它是由Jon和Robert在Pricenton[1]提出的。和二分搜索树相比起来,它的每个节点中多了一个指针,也正是这么一个小小的改动,不仅使它结合了数字查找树的时间效率和二分搜索树的空间效率,还使得它提供了诸如前缀搜索和模糊搜索这样的特性。

But can we make it more efficient in time with a tolerable space increment? We need to analyse the algorithm and the data structure of ternary search tree.

但是我们能在可容忍的空间使用增加上,进一步提高它的时间效率吗?我们需要分析三元搜素树的算法与结构。

1) The infomation/structure ratio 信息/结构比

A typical node of ternary search tree contains one char and three pointers, and the infomation/structure ratio is approximate 7.7% on 32bit machine and 4% on 64bit machine. The solution to improve the infomation/structure ratio is to use more bytes to store infomation.

一个典型的三元搜索树结点包含了1个char与3个指针,它的信息/结构比在32位、64位机器上分别约为7.7%、4%。要提高信息/结构比,我们可以在结点上用更多的字节来存放信息。

If we use 4 bytes or 8 bytes to store infomation in one node, the infomation/structure ratio is going to be

如果我们在一个结点内用4字节或者8字节来存放信息,那么信息/结构比将会是

4 bytes / 32bit 4 bytes / 64bit 8 bytes / 32bit 8 bytes / 64bit
30.7% 16.0% 62.5% 32.0%

It's obviously that the infomation/structure ratio increases a lot.

显然信息/结构比将会提高许多。

2) Two-level partitions 两次partitions

In spite of the fact that the node of ternary search tree has three pointers, there is not much difference from binary search tree. The algorithm of ternary search tree compares the current character in the search string with the character at the node. If the search character is less, the search goes to the left child; if the search character is greater, the search goes to the right child.[2]

尽管三元搜索树的结点拥有三个指针,但它和二分搜索树并没有太大的差别。三元搜索树的算法把需要搜索的字符串的当前字符与当前结点中存储的字符做比较。如果搜索的字符小,则转向左子树搜索;如果搜索的字符大,则转向右子树搜索。[2]

To make it more fast, we can make it bipartite one more time. This concept is very simple, just check the last bit of the search character. If last bit is 0, the search goes to the first pointer of corresponding child, otherwise goes to the second pointer of corresponding child.

为了让它变得更快,我们可以再做一次二分。这个想法非常简单,就只是检查了被搜索的字符的最后一个位。如果最后的位是0,搜索就转向对应子树的第一个指针,否则转向对应子树的第二个指针。

For instance, given a set of strings, {"on","as","in","it","of","at","or","is","be","he","by"}. The quinary search tree we built is shown in figure 1.

举个例子,给定一组字符串,{"on","as","in","it","of","at","or","is","be","he","by"}。我们构建的五元搜索树将如下图1所示。

QST-example

The C structure of this concept is

这个想法的C结构体如下

typedef struct quinary_node_t {
    char c;
    struct quinary_node_t * le[2], * eq, * gt[2];
} quinary_node_t, * quinary_tree;

The insertion and search procedures have a some slightly changes. Let's take the search procedure for example. The original search algorithm of a ternary search tree is[2],

插入和搜索的过程都有一些轻微的变化。这里以搜索过程为例子。原始的三元搜素树的搜索算法如下[2]

int search(char *s)
{   Tptr p;
    p = root;
    while (p) {
        if (*s < p->splitchar)
            p = p->lokid;
        else if (*s == p->splitchar) {
            if (*s++ == 0)
                return 1;
            p = p->eqkid;
       } else
           p = p->hikid;
    }
    return 0;
}

After applying this concept, the new search algorithm is (the major changes display in bold font)

在应用了这个想法之后,新的搜索算法如下(主要的变化以粗体字显示)

quinary_tree search(const quinary_tree tree, char * key) {
    quinary_tree root;
    root = tree;
    while (root) {
        if (*key < root->c) {
            root = root->le[(*key & 0x1)];
        } else if (*key > root->c) {
            root = root->gt[(*key & 0x1)];
        } else {
            if (*key == 0) {
                return root;
            }
            key++;
            root = root->eq;
        }
    }
    return NULL;
}

Combine the two improvement above, we get a quinary search tree. Using test set within 1 million strings (we've applied solution one to ternary search tree for comparing), quinary search tree gives a significant decrement of time when insert and an observable decrement of time when search. And the memory usage drops a lot.

结合以上两种改进,我们将得到一颗五元搜索树。当使用包含100万条字符串的测试集测试时(我们也将solution 1应用在了三元搜索树上作为比较),五元搜索树在插入时有明显的时间优势、在搜索时也有可见的优势,并且内存使用降低了很多。

The results of some different tests are shown below

以下是一些不同的测试结果

Test 1, 测试一


  • 1 million strings, 100万条数据
  • random strings, 随机字符串
  • fixed length, 固定长度
  • 4/8/16/24/32/48/64/128/192/256 bytes, 长度为4/8/16/24/32/48/64/128/192/256字节

PUT:
1M-PUT-QST

GET:
1M-GET-QST

Usage:
1M-usage-QST

Test 2, 测试二


  • 10 million strings, 1000万条数据
  • random strings, 随机字符串
  • random length, upperbound are 8/16/24/32 bytes, 随机长度,8/16/24/32字节
  • random length, upperbound are 4/8/16/24 bytes, 固定长度,4/8/16/24字节

PUT:
10M-PUT-QST

GET:
10M-GET-QST

Usage:
10M-usage-QST

Basic C++ implementation of quinary search tree


BlueCocoa/QuinarySearchTree

References


  1. https://www.cs.princeton.edu/~rs/strings/ - Sorting and Searching Strings
  2. https://www.cs.upc.edu/~ps/downloads/tst/tst.html - Ternary Search Trees By Jon Bentley and Bob Sedgewick
声明: 本文为0xBBC原创, 转载注明出处喵~

发表评论

电子邮件地址不会被公开。 必填项已用*标注