Quinary Search Tree

update (2016-04-05):
    Ternary Search Tree (improved)的实现有误,将稍后更正。

There is always a topic that how to store a set of strings. Throughout the years, many kinds of data structure had been proposed and applied, for instance, hash tables, binary search tree, digital search tree, ternary search tree and so on and so forth.


Let's begin with ternary search tree which is proposed by Jon and Robert at Pricenton[1]. There is one more pointer in the node compared with binary search tree, and it's the exactly minor change that not only makes it combine the time efficiency of digital tries with the space efficiency of binary search trees, but also provides more features like prefix search and fuzz search.


But can we make it more efficient in time with a tolerable space increment? We need to analyse the algorithm and the data structure of ternary search tree.


1) The infomation/structure ratio 信息/结构比

A typical node of ternary search tree contains one char and three pointers, and the infomation/structure ratio is approximate 7.7% on 32bit machine and 4% on 64bit machine. The solution to improve the infomation/structure ratio is to use more bytes to store infomation.


If we use 4 bytes or 8 bytes to store infomation in one node, the infomation/structure ratio is going to be


4 bytes / 32bit 4 bytes / 64bit 8 bytes / 32bit 8 bytes / 64bit
30.7% 16.0% 62.5% 32.0%

It's obviously that the infomation/structure ratio increases a lot.


2) Two-level partitions 两次partitions

In spite of the fact that the node of ternary search tree has three pointers, there is not much difference from binary search tree. The algorithm of ternary search tree compares the current character in the search string with the character at the node. If the search character is less, the search goes to the left child; if the search character is greater, the search goes to the right child.[2]


To make it more fast, we can make it bipartite one more time. This concept is very simple, just check the last bit of the search character. If last bit is 0, the search goes to the first pointer of corresponding child, otherwise goes to the second pointer of corresponding child.


For instance, given a set of strings, {"on","as","in","it","of","at","or","is","be","he","by"}. The quinary search tree we built is shown in figure 1.



The C structure of this concept is


typedef struct quinary_node_t {
    char c;
    struct quinary_node_t * le[2], * eq, * gt[2];
} quinary_node_t, * quinary_tree;

The insertion and search procedures have a some slightly changes. Let's take the search procedure for example. The original search algorithm of a ternary search tree is[2],


int search(char *s)
{   Tptr p;
    p = root;
    while (p) {
        if (*s < p->splitchar)
            p = p->lokid;
        else if (*s == p->splitchar) {
            if (*s++ == 0)
                return 1;
            p = p->eqkid;
       } else
           p = p->hikid;
    return 0;

After applying this concept, the new search algorithm is (the major changes display in bold font)


quinary_tree search(const quinary_tree tree, char * key) {
    quinary_tree root;
    root = tree;
    while (root) {
        if (*key < root->c) {
            root = root->le[(*key & 0x1)];
        } else if (*key > root->c) {
            root = root->gt[(*key & 0x1)];
        } else {
            if (*key == 0) {
                return root;
            root = root->eq;
    return NULL;

Combine the two improvement above, we get a quinary search tree. Using test set within 1 million strings (we've applied solution one to ternary search tree for comparing), quinary search tree gives a significant decrement of time when insert and an observable decrement of time when search. And the memory usage drops a lot.

结合以上两种改进,我们将得到一颗五元搜索树。当使用包含100万条字符串的测试集测试时(我们也将solution 1应用在了三元搜索树上作为比较),五元搜索树在插入时有明显的时间优势、在搜索时也有可见的优势,并且内存使用降低了很多。

The results of some different tests are shown below


Test 1, 测试一

  • 1 million strings, 100万条数据
  • random strings, 随机字符串
  • fixed length, 固定长度
  • 4/8/16/24/32/48/64/128/192/256 bytes, 长度为4/8/16/24/32/48/64/128/192/256字节




Test 2, 测试二

  • 10 million strings, 1000万条数据
  • random strings, 随机字符串
  • random length, upperbound are 8/16/24/32 bytes, 随机长度,8/16/24/32字节
  • random length, upperbound are 4/8/16/24 bytes, 固定长度,4/8/16/24字节




Basic C++ implementation of quinary search tree



  1. https://www.cs.princeton.edu/~rs/strings/ - Sorting and Searching Strings
  2. https://www.cs.upc.edu/~ps/downloads/tst/tst.html - Ternary Search Trees By Jon Bentley and Bob Sedgewick
声明: 本文为0xBBC原创, 转载注明出处喵~


电子邮件地址不会被公开。 必填项已用*标注