
Data Structures (16): Autocomplete Implementation (1): Trie

2014-04-25 22:47

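1. Overview

A Trie is an efficient data structure for string lookup. It can be used for word-frequency counting in search engines, autocomplete, and similar tasks.

Inserting or looking up a word in a Trie takes O(len) time, where len is the length of the word.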

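If the words are stored in a balanced binary search tree instead, a lookup needs O(lg N) string comparisons, where N is the total number of words in the tree, and each comparison can itself cost up to O(len).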

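A Trie is also particularly good at prefix search. Take the autocomplete found in today's input methods: type the prefix of a word, say abs, and words such as abstract pop up immediately.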

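A Trie's excellent lookup performance comes at the cost of space.

This article gives a simple Trie implementation and uses it to build an English dictionary of 7000+ words, in order to analyze how much space the Trie occupies.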

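2. Definition

A typical Trie is shown in the figure below:

[Figure: a typical Trie]

Each node holds an array of 26 pointers, one for each of the 26 letters of the English alphabet.

Each node also carries a flag (marked in red in the figure) that indicates whether the path from the root to that node spells a complete word.

Example: the leftmost path in the figure represents the words abc and abcd.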
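The node layout described above can be sketched in a few lines. This is a minimal C++11 illustration and not the article's trie.h (the full source is in section 4): Node, insert and contains are hypothetical names, and the input is assumed to consist of lowercase a-z only.

```cpp
#include <cassert>
#include <string>

// Hypothetical node type for illustration.
struct Node {
    bool isWord = false;      // the "red mark": root-to-here spells a word
    Node* child[26] = {};     // one slot per letter 'a'..'z', all null at first
    ~Node() { for (Node* c : child) delete c; }   // frees the whole subtree
};

// Walks down from the root, creating missing nodes, then marks the last node.
// Assumption: word contains only lowercase 'a'..'z'.
void insert(Node& root, const std::string& word) {
    if (word.empty()) return;
    Node* cur = &root;
    for (char c : word) {
        assert(c >= 'a' && c <= 'z');
        Node*& slot = cur->child[c - 'a'];
        if (slot == nullptr) slot = new Node();
        cur = slot;
    }
    cur->isWord = true;
}

// A word is present only if its path exists AND the final node is marked.
bool contains(const Node& root, const std::string& word) {
    const Node* cur = &root;
    for (char c : word) {
        if (c < 'a' || c > 'z') return false;
        cur = cur->child[c - 'a'];
        if (cur == nullptr) return false;
    }
    return cur->isWord;
}
```

After inserting abc and abcd into an empty root, contains(root, "abc") and contains(root, "abcd") are both true, while contains(root, "ab") is false: the node for b exists on the path, but its flag is not set. Both operations visit one node per character, which is where the O(len) cost comes from.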

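3. Performance

I ran a small test: building a dictionary of 7000+ words made the Trie allocate 22383 nodes, each taking 27 * 4 bytes (presumably 26 four-byte pointers plus the flag padded to 4 bytes, i.e. a 32-bit build),

so the Trie consumed about 22383 * 27 * 4 bytes = 2,417,364 bytes ≈ 2.4 MB.

Assuming an average length of 8 letters, the 7000 words themselves take only 7000 * 8 bytes = 56 KB.

That is a difference of roughly 43 times!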
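The figures above are easy to recompute. The sketch below only redoes the arithmetic with the numbers quoted in the article (22383 nodes, 27 * 4 bytes per node, 7000 words, an assumed average of 8 letters); NodeLayout, trieBytes, rawBytes and report are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>

// Same member layout as the article's TrieNode, used only for sizeof.
struct NodeLayout {
    bool isStr;
    NodeLayout* next[26];
};

// Figures quoted in the article.
const std::size_t kNodes = 22383;       // nodes allocated for the dictionary
const std::size_t kNodeBytes = 27 * 4;  // per-node size reported by the author
const std::size_t kWords = 7000;
const std::size_t kAvgLen = 8;          // assumed average word length

std::size_t trieBytes() { return kNodes * kNodeBytes; }  // 2,417,364 bytes
std::size_t rawBytes()  { return kWords * kAvgLen; }     // 56,000 bytes

// Prints both totals, their ratio, and the node size on the current platform.
void report() {
    assert(rawBytes() > 0);
    std::cout << "trie nodes: " << trieBytes() << " bytes\n"
              << "raw words:  " << rawBytes() << " bytes\n"
              << "ratio:      " << double(trieBytes()) / rawBytes() << "\n"
              << "sizeof(NodeLayout) here: " << sizeof(NodeLayout) << "\n";
}
```

The raw words come to 56,000 bytes (56 KB), and 2,417,364 / 56,000 is a little over 43. On a typical 64-bit build the pointers are 8 bytes wide, so sizeof(NodeLayout) would usually be 216 rather than 108 and the gap would roughly double.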

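As this small test shows, a Trie needs a lot of space, and each node needs even more pointers if the Trie has to distinguish upper and lower case or store Chinese characters.

In fact, it is easy to see that every node in a Trie contains a large number of null pointers, and that is where most of the space goes.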
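To make the waste concrete, one can count how many of the 26 slots per node are actually in use. The following is a rough C++11 sketch under the same assumptions as the article (lowercase a-z, 26 pointers per node); Node, insert, SlotStats and countSlots are hypothetical names, not part of the author's code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct Node {
    bool isWord = false;
    Node* child[26] = {};
    ~Node() { for (Node* c : child) delete c; }
};

// Assumption: word contains only lowercase 'a'..'z'.
void insert(Node& root, const std::string& word) {
    Node* cur = &root;
    for (char c : word) {
        assert(c >= 'a' && c <= 'z');
        Node*& slot = cur->child[c - 'a'];
        if (slot == nullptr) slot = new Node();
        cur = slot;
    }
    if (!word.empty()) cur->isWord = true;
}

struct SlotStats {
    std::size_t nodes = 0;  // nodes visited, root included
    std::size_t used = 0;   // non-null child pointers
};

// Depth-first walk that tallies nodes and occupied pointer slots.
void countSlots(const Node& node, SlotStats& stats) {
    ++stats.nodes;
    for (const Node* c : node.child) {
        if (c != nullptr) {
            ++stats.used;
            countSlots(*c, stats);
        }
    }
}
```

Inserting abc, abcd and abe yields 6 nodes (root, a, b, c, d, e) and 5 used slots out of 6 * 26 = 156. The pattern is general: every node except the root is pointed to by exactly one slot, so a Trie with n nodes uses n - 1 of its 26n slots, and roughly 25 out of every 26 pointers (about 96%) are null whatever the word list is.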

A ternary search tree (Ternary Search Tree) can be used to improve on the Trie; it will be discussed in the next article.

4. Source code

// Last Update:2014-04-16 23:24:47
/**
* @file trie.h
* @brief Trie
* @author shoulinjun@126.com
* @version 0.1.00
* @date 2014-04-16
*/

#ifndef TRIE_H
#define TRIE_H

#include <cctype>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
using std::string;
using std::cout;
using std::endl;

const int branchNum = 26;

struct TrieNode
{
    TrieNode(): isStr(false)
    {
        memset(next, 0, sizeof(next));
    }
    bool isStr;                  // true if the path from root to this node is a word
    TrieNode* next[branchNum];   // one child pointer per letter 'a'..'z'
};

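// Returns a lowercase copy of s.
string ToLower(const string &s)
{
    string str;
    str.reserve(s.size());
    string::const_iterator it = s.begin();
    while(it != s.end())
    {
        // tolower() requires a value representable as unsigned char
        str += (char)tolower((unsigned char)*it);
        ++ it;
    }
    return str;
}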

/**
 * a simple data structure
 * useful for AutoComplete
 */
class Trie
{
public:
    Trie(): root(new TrieNode()) {}
    ~Trie()
    {
        cout << "# of nodes allocated: " << count << endl;
        destroy(root);
    }

    void Insert(const string &str);
    bool Search(const string &str) const;
    void AutoComplete(const string &str);
    void Input(const string &file);

private:
    // not copyable: a copy would delete the same nodes twice
    Trie(const Trie &);
    Trie& operator=(const Trie &);

    TrieNode* find(const string &str) const;
    void dfs(TrieNode *root, string &path);
    void destroy(TrieNode * &root);

    TrieNode *root;
    static size_t count;   // nodes created by Insert, shared by all Trie objects (root not counted)
};

size_t Trie::count = 0;

void Trie::destroy(TrieNode * &root)
{
    for(int i=0; i<branchNum; ++i)
    {
        if(root->next[i])
            destroy(root->next[i]);
    }
    delete root;
    root = NULL;
}

void Trie::Insert(const string &s)
{
    if(s.empty()) return;

    /* only lowercase letters are supported for now */
    string str = ToLower(s);
    string::const_iterator it = str.begin();

    // skip words containing anything but a-z: *it - 'a' would index outside next[]
    for(; it != str.end(); ++it)
    {
        if(*it < 'a' || *it > 'z') return;
    }
    it = str.begin();
    TrieNode *location(root);

    // bypassing existing nodes
    while(it != str.end() && location->next[*it - 'a'] != NULL)
    {
        location = location->next[*it - 'a'];
        ++ it;
    }

    // Insert: create one new node per remaining character
    while(it != str.end() && location->next[*it - 'a'] == NULL)
    {
        location->next[*it - 'a'] = new TrieNode();
        ++ count;
        location = location->next[*it - 'a'];
        ++ it;
    }
    location->isStr = true;
}

// Reads whitespace-separated words from the file named str and inserts each one.
void Trie::Input(const string &str)
{
    std::ifstream ifile(str.c_str());

    string word;

    while(ifile >> word)
    {
        Insert(word);
    }

    ifile.close();
}

bool Trie::Search(const string &s) const
{
    string str = ToLower(s);
    TrieNode *location = find(str);
    return (location) && location->isStr;
}

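// Returns the node reached by following str from the root,
// or NULL if that path does not exist.
TrieNode* Trie::find(const string &str) const
{
    TrieNode *location = root;
    string::const_iterator it = str.begin();
    while(it != str.end())
    {
        // characters outside a-z have no slot in next[]
        if(*it < 'a' || *it > 'z') return NULL;
        if(location->next[*it - 'a'] == NULL) return NULL;
        location = location->next[*it - 'a'];
        ++ it;
    }
    return location;
}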

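// Prints every word in the subtree under root; path holds the letters from
// the Trie root down to this node.
void Trie::dfs(TrieNode *root, string &path)
{
    if(root == NULL) return;

    if(root->isStr)
        cout << path << endl;
    for(char x='a'; x<='z'; ++x)
    {
        if(root->next[x-'a'] != NULL)
        {
            path += x;
            dfs(root->next[x-'a'], path);
            path.resize(path.size()-1);   // backtrack
        }
    }
}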

void Trie::AutoComplete(const string &s)
{
    // lowercase the prefix, as Insert and Search do
    string path = ToLower(s);
    TrieNode *location = find(path);
    dfs(location, path);
}

#endif  /*TRIE_H*/
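AutoComplete above prints its matches to cout, which is awkward to check in a test. As a sketch only, and not the author's code, the self-contained variant below returns the completions in a std::vector instead. The names Node, insert, findPrefix, collect and complete are made up for this example; it assumes C++11 and input made of lowercase a-z only.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Node {
    bool isWord = false;
    Node* child[26] = {};
    ~Node() { for (Node* c : child) delete c; }
};

// Assumption: word contains only lowercase 'a'..'z'.
void insert(Node& root, const std::string& word) {
    Node* cur = &root;
    for (char c : word) {
        assert(c >= 'a' && c <= 'z');
        Node*& slot = cur->child[c - 'a'];
        if (slot == nullptr) slot = new Node();
        cur = slot;
    }
    if (!word.empty()) cur->isWord = true;
}

// Follows prefix from the root; returns nullptr if the path does not exist.
const Node* findPrefix(const Node& root, const std::string& prefix) {
    const Node* cur = &root;
    for (char c : prefix) {
        if (c < 'a' || c > 'z') return nullptr;
        cur = cur->child[c - 'a'];
        if (cur == nullptr) return nullptr;
    }
    return cur;
}

// Depth-first walk in alphabetical order, appending every marked word to out.
void collect(const Node& node, std::string& path, std::vector<std::string>& out) {
    if (node.isWord) out.push_back(path);
    for (int i = 0; i < 26; ++i) {
        if (node.child[i] != nullptr) {
            path.push_back(static_cast<char>('a' + i));
            collect(*node.child[i], path, out);
            path.pop_back();   // backtrack
        }
    }
}

// Returns all stored words that start with prefix, in alphabetical order.
std::vector<std::string> complete(const Node& root, const std::string& prefix) {
    std::vector<std::string> out;
    const Node* start = findPrefix(root, prefix);
    if (start != nullptr) {
        std::string path = prefix;
        collect(*start, path, out);
    }
    return out;
}
```

With abstract, absent, able and zebra inserted, complete(root, "abs") yields absent followed by abstract, and a prefix with no matches yields an empty vector. Returning a vector keeps the traversal separate from the output, so the caller decides how to display or rank the suggestions.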
