From Wikipedia, the free encyclopedia
A Trie, also known as a word lookup tree or key tree, is a tree-shaped structure and a variant of the hash tree. Its typical application is counting and sorting large numbers of strings (though not only strings), so it is often used by search-engine systems for text word-frequency statistics. Its advantages: it minimizes unnecessary string comparisons, and its query efficiency is higher than that of a hash table.
It has three basic properties:
1. The root node contains no character; every node other than the root contains exactly one character.
2. Concatenating the characters along the path from the root to a given node yields the string corresponding to that node.
3. The children of any one node all contain different characters.
Here is an example of a Trie structure:
In this Trie, the 8 strings A, to, tea, ted, ten, i, in, and inn are stored; because common prefixes share nodes, only a small number of nodes are needed (not counting the space occupied by pointers).
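As an illustration (not from the original article), the same eight keys can be stored in a minimal dictionary-based trie in Python; the sentinel key '$' is an arbitrary choice for marking where a complete word ends:

```python
def make_trie(words):
    """Build a nested-dict trie; the key '$' is a sentinel marking a word end."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # a complete key ends here
    return root

def contains(trie, word):
    """Return True iff `word` was inserted as a complete key."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node

trie = make_trie(["A", "to", "tea", "ted", "ten", "i", "in", "inn"])
```

Note how "in" and "inn" share the nodes for the prefix "in", which is exactly the space saving the text describes.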
Below is an example program for word-frequency counting. Because it uses getline(3), glibc is needed for it to link successfully; without glibc, you can rewrite that part yourself. The code was contributed by a reader and is released under the GPL.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TREE_WIDTH 256
#define WORDLENMAX 128

struct trie_node_st {
        int count;                              /* occurrences of the word ending here */
        struct trie_node_st *next[TREE_WIDTH];
};

static struct trie_node_st root = {0, {NULL}};

static char *spaces = " \t\n/.\"\'()";

static int
insert(const char *word)
{
        int i;
        struct trie_node_st *curr, *newnode;

        if (word[0] == '\0') {
                return 0;
        }
        curr = &root;
        for (i = 0; ; ++i) {
                /* cast to unsigned char so high-bit characters index safely */
                if (curr->next[(unsigned char)word[i]] == NULL) {
                        newnode = (struct trie_node_st *)malloc(sizeof(struct trie_node_st));
                        memset(newnode, 0, sizeof(struct trie_node_st));
                        curr->next[(unsigned char)word[i]] = newnode;
                }
                if (word[i] == '\0') {
                        break;
                }
                curr = curr->next[(unsigned char)word[i]];
        }
        curr->count++;
        return 0;
}

static void
printword(const char *str, int n)
{
        printf("%s\t%d\n", str, n);
}

static int
do_travel(struct trie_node_st *rootp)
{
        static char worddump[WORDLENMAX+1];
        static int pos = 0;
        int i;

        if (rootp == NULL) {
                return 0;
        }
        if (rootp->count) {
                worddump[pos] = '\0';
                printword(worddump, rootp->count);
        }
        for (i = 0; i < TREE_WIDTH; ++i) {
                worddump[pos++] = i;
                do_travel(rootp->next[i]);
                pos--;
        }
        return 0;
}

int
main(void)
{
        char *linebuf = NULL, *line, *word;
        size_t bufsize = 0;
        ssize_t ret;

        while (1) {
                ret = getline(&linebuf, &bufsize, stdin);
                if (ret == -1) {
                        break;
                }
                line = linebuf;
                while (1) {
                        word = strsep(&line, spaces);
                        if (word == NULL) {
                                break;
                        }
                        if (word[0] == '\0') {
                                continue;
                        }
                        insert(word);
                }
        }
        /* free(linebuf); */
        do_travel(&root);
        return 0;
}
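For readers more comfortable with a higher-level language, here is a hedged Python rewrite of the same word-frequency idea (an illustrative sketch, not the contributed C code): each node keeps a count and a map of children, and a pre-order walk yields words in lexicographic order, just like do_travel() above.

```python
class TrieNode:
    def __init__(self):
        self.count = 0        # how many times a word ending here was seen
        self.children = {}    # char -> TrieNode

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def dump(node, prefix=""):
    """Pre-order walk; yields (word, count) pairs in lexicographic order."""
    if node.count:
        yield prefix, node.count
    for ch in sorted(node.children):
        yield from dump(node.children[ch], prefix + ch)

root = TrieNode()
for w in "the quick the lazy the".split():
    insert(root, w)
```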
A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn".
In computer science, a trie, also called a digital tree or prefix tree, is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
The term trie comes from retrieval. Following this etymology, its inventor, Edward Fredkin, pronounces it /ˈtriː/ ("tree"). However, it is pronounced /ˈtraɪ/ ("try") by other authors.
In the example shown, keys are listed in the nodes and values below them. Each complete English word has an arbitrary integer value associated with it. A trie can be seen as a tree-shaped deterministic finite automaton, although the symbol on each edge is often implicit in the order of the branches. It is not necessary for keys to be explicitly stored in nodes. (In the figure, words are shown only to illustrate how the trie works.)
Though keying by character strings is most common, tries need not be keyed that way. The same algorithms can easily be adapted to serve similar functions on ordered lists of any construct, e.g., permutations on a list of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up a short, fixed-size piece of binary data, such as an integer or a pointer to memory.
[Figure: a series of graphs showing how different algorithms scale with the number of items — the behavior of Fredkin-style tries (here nedtries, an in-place implementation, which therefore has a much steeper curve than a dynamic-memory-based trie implementation); the behavior of red-black trees (here the BSD rbtree.h, which shows classic O(log N) behaviour); and the behavior of hash tables (here uthash, which when averaged shows classic O(1) behaviour).]
Unlike most other algorithms, tries have the peculiar feature that the code path, and hence the time required, is almost identical for insert, delete, and find operations. As a result, for situations where code is inserting, deleting and finding in equal measure, tries can handily beat binary search trees, as well as provide a better basis for the CPU's instruction and branch caches.
The following are the main advantages of tries over binary search trees (BSTs):

Looking up keys is faster. Looking up a key of length m takes worst-case O(m) time. A BST performs O(log(n)) comparisons of keys, where n is the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(m log n) time. Moreover, in the worst case log(n) will approach m. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines.

Tries are more space-efficient when they contain a large number of short keys, since nodes are shared between keys with common initial subsequences.

Tries facilitate longest-prefix matching.

The number of internal nodes from root to leaf equals the length of the key. Balancing the tree is therefore of no concern.
The following are the main advantages of tries over hash tables:

Tries support ordered iteration, whereas iteration over a hash table will result in a pseudorandom order given by the hash function (and further affected by the order of hash collisions, which is determined by the implementation).

Tries facilitate longest-prefix matching, but hashing does not, as a consequence of the above. Performing such a "closest fit" find can, depending on implementation, be as quick as an exact find.

Tries tend to be faster on average at insertion than hash tables because hash tables must rebuild their index when it becomes full, a very expensive operation. Tries therefore have much better bounded worst-case time costs, which is important for latency-sensitive programs.

Since no hash function is used, tries are generally faster than hash tables for small keys.
As mentioned, a trie has a number of advantages over binary search trees. A trie can also be used to replace a hash table, over which it has the following advantages:
Looking up data in a trie is faster in the worst case, O(m) time (where m is the key length), compared to an imperfect hash table. An imperfect hash table can have key collisions, where the hash function maps different keys to the same position in the table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.

There are no collisions of different keys in a trie.

Buckets in a trie, which are analogous to the hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.

There is no need to provide a hash function or to change hash functions as more keys are added to a trie.

A trie can provide an alphabetical ordering of the entries by key.
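One "closest fit" operation of the kind mentioned above, longest-prefix matching, is easy on a trie but awkward on a hash table. A small self-contained sketch (dictionary-based nodes, with '$' as an assumed end-of-word sentinel, not from the source):

```python
def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node['$'] = True  # sentinel: a stored key ends here

def longest_prefix(root, query):
    """Return the longest stored key that is a prefix of `query`, or None."""
    node, best, cur = root, None, ""
    for ch in query:
        if '$' in node:
            best = cur          # remember the last complete key seen
        if ch not in node:
            return best
        node = node[ch]
        cur += ch
    return cur if '$' in node else best

root = {}
for w in ["i", "in", "inn", "tea"]:
    insert(root, w)
```

A hash table would need one probe per candidate prefix length; the trie finds the answer in a single walk of the query.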
Tries do have some drawbacks as well:

Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.

Some keys, such as floating-point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless, a bitwise trie can handle standard IEEE single- and double-format floating-point numbers.
A common application of a trie is storing a predictive text or autocomplete dictionary, such as one found on a mobile telephone. Such applications take advantage of a trie's ability to quickly search for, insert, and delete entries; however, if storing dictionary words is all that is required (i.e. storage of information auxiliary to each word is not required), a minimal deterministic acyclic finite state automaton would use less space than a trie. This is because an acyclic deterministic finite automaton can compress identical branches from the trie which correspond to the same suffixes (or parts) of different words being stored.
Tries are also well suited for implementing approximate matching algorithms, including those used in spell checking and hyphenation software.
We can describe trie lookup (and membership) easily. Given a recursive trie type, storing an optional value at each node, and a list of children tries, indexed by the next character (here represented as an association list):

data Trie a =
    Trie { value    :: Maybe a
         , children :: [(Char, Trie a)] }
We can look up a value in the trie as follows:

find :: String -> Trie a -> Maybe a
find []     t = value t
find (k:ks) t = case lookup k (children t) of
                  Nothing -> Nothing
                  Just t' -> find ks t'
In an imperative style, and assuming an appropriate data type is in place, we can describe the same algorithm in Python (here returning the value stored under a key, or None if the key is absent). Note that children is a map from characters to a node's child nodes, and we say that a "terminal" node is one which contains a valid word.
def find(node, key):
for char in key:
if char not in node.children:
return None
node = node.children[char]
return node.value
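The article leaves the node type implicit; one plausible minimal shape (the Node class and add() helper below are assumptions, not from the source) makes the function above runnable:

```python
class Node:
    def __init__(self):
        self.value = None   # payload stored at terminal nodes
        self.children = {}  # char -> Node

def add(root, key, value):
    node = root
    for char in key:
        node = node.children.setdefault(char, Node())
    node.value = value

def find(node, key):        # same algorithm as in the article
    for char in key:
        if char not in node.children:
            return None
        node = node.children[char]
    return node.value

root = Node()
add(root, "ten", 12)
add(root, "to", 7)
```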
A simple Ruby version:

class Trie
  def initialize
    @root = Hash.new
  end

  def build(str)
    node = @root
    str.each_char do |ch|
      node[ch] ||= Hash.new
      node = node[ch]
    end
    node[:end] = true
  end

  def find(str)
    node = @root
    str.each_char do |ch|
      return nil unless node = node[ch]
    end
    node[:end] && true
  end
end
A compiling Java version:

public class MinimalExample {
    private interface Node {
        // Fields in an interface are implicitly public static final.
        Node EMPTY_NODE = new Node() {
            @Override public String getValue() { return ""; }
            @Override public boolean containsChildValue(char c) { return false; }
            @Override public Node getChild(char c) { return this; }
        };

        String getValue();
        boolean containsChildValue(char c);
        Node getChild(char c);
    }

    public Node findValue(Node startNode, String value) {
        Node current = startNode;
        for (char c : value.toCharArray()) {
            if (current.containsChildValue(c)) {
                current = current.getChild(c);
            } else {
                // EMPTY_NODE absorbs all further characters.
                current = Node.EMPTY_NODE;
            }
        }
        return current;
    }
}
Lexicographic sorting of a set of keys can be accomplished with a simple trie-based algorithm as follows:

Insert all keys in a trie.
Output all keys in the trie by means of pre-order traversal, which results in output that is in lexicographically increasing order.
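The two steps above can be sketched directly (an illustrative Python version; '$' is an arbitrary sentinel used here to store duplicate counts):

```python
def trie_sort(keys):
    """Insert all keys into a trie, then emit them via pre-order traversal."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node['$'] = node.get('$', 0) + 1  # count duplicates at the end node

    out = []
    def preorder(node, prefix):
        for _ in range(node.get('$', 0)):
            out.append(prefix)                  # shorter keys come first
        for ch in sorted(k for k in node if k != '$'):
            preorder(node[ch], prefix + ch)     # children in character order
    preorder(root, "")
    return out
```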
Pre-order traversal is a kind of depth-first traversal. In-order traversal is another kind of depth-first traversal that is more appropriate for outputting the values that are in a binary search tree rather than a trie. This algorithm is a form of radix sort.
A trie forms the fundamental data structure of Burstsort, currently (2007) the fastest known memory/cache-based string sorting algorithm. A parallel algorithm for sorting N keys based on tries is O(1) if there are N processors and the lengths of the keys have a constant upper bound. There is the potential that the keys might collide by having common prefixes or by being identical to one another, reducing or eliminating the speed advantage of having multiple processors operating in parallel.
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full-text searches.
Bitwise tries are much the same as a normal character based trie except that individual bits are used to traverse what effectively becomes a form of binary tree. Generally, implementations use a special CPU instruction to very quickly find the first set
bit in a fixed length key (e.g. GCC's __builtin_clz() intrinsic). This value is then used to index a 32 or 64 entry table which points to the first item in the bitwise trie with that number of leading zero bits. The search then proceeds by testing each subsequent
bit in the key and choosing child[0] or child[1] appropriately until the item is found.
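A toy version of this bit-by-bit descent can be sketched as follows (fixed 8-bit keys, dictionary nodes, and no leading-zero-count table — all simplifications relative to the optimized implementations described above):

```python
KEY_BITS = 8  # assumed fixed key width for this toy example

def bt_insert(root, key, value):
    node = root
    for i in range(KEY_BITS - 1, -1, -1):  # walk bits from most significant down
        bit = (key >> i) & 1
        node = node.setdefault(bit, {})    # child[0] or child[1]
    node['value'] = value

def bt_find(root, key):
    node = root
    for i in range(KEY_BITS - 1, -1, -1):
        bit = (key >> i) & 1
        if bit not in node:
            return None
        node = node[bit]
    return node.get('value')

root = {}
bt_insert(root, 42, "answer")
bt_insert(root, 7, "seven")
```

A real implementation would use fixed-layout nodes and a count-leading-zeros instruction to skip the common zero prefix, but the choice between child[0] and child[1] at each step is the same.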
Although this process might sound slow, it is very cache-local and highly parallelizable due to the lack of register dependencies, and therefore in fact has excellent performance on modern out-of-order execution CPUs. A red-black tree, for example, performs much better on paper, but is highly cache-unfriendly and causes multiple pipeline and TLB stalls on modern CPUs, which makes that algorithm bound by memory latency rather than CPU speed. In comparison, a bitwise trie rarely accesses memory, and when it does, it does so only to read, thus avoiding SMP cache-coherency overhead. Hence it is increasingly becoming the algorithm of choice for code that does a lot of insertions and deletions, such as memory allocators.
A reference implementation of bitwise tries in C and C++, useful for further study, is available online.
When the trie is mostly static, i.e. all insertions or deletions of keys from a prefilled trie are disabled and only lookups are needed, and when the trie nodes are not keyed by node specific data (or if the node's data is common) it is possible to compress
the trie representation by merging the common branches. This application is typically used for compressing lookup tables when
the total set of stored keys is very sparse within their representation space.
For example, it may be used to represent sparse bitsets (i.e., subsets of a much larger, fixed enumerable set) using a trie keyed by the bit-element position within the full set, with the key created from the string of bits needed to encode the integral position of each element. The trie will then have a very degenerate form with many missing branches, and compression becomes possible by storing the leaf nodes (set segments with fixed length) and combining them after detecting the repetition of common patterns or by filling the unused gaps.
Such compression is also typically used in the implementation of the various fast lookup tables needed to retrieve character properties (for example, to represent case-mapping tables, or lookup tables containing the combination of base and combining characters needed to support Unicode normalization). For such applications, the representation is similar to transforming a very large unidimensional sparse table into a multidimensional matrix, and then using the coordinates in the hyper-matrix as the string key of an uncompressed trie. The compression then consists of detecting and merging the common columns within the hyper-matrix to compress the last dimension in the key. Each dimension of the hypermatrix stores the start position within a storage vector of the next dimension for each coordinate value, and the resulting vector is itself compressible when it is also sparse, so each dimension (associated with a layer level in the trie) is compressed separately.
Some implementations do support such data compression within dynamic sparse tries and allow insertions and deletions in compressed tries, but generally this has a significant cost when compressed segments need to be split or merged, and some tradeoff has
to be made between the smallest size of the compressed trie and the speed of updates, by limiting the range of global lookups for comparing the common branches in the sparse trie.
The result of such compression may look similar to trying to transform the trie into a directed acyclic graph (DAG), because the reverse transform from a DAG to a trie is obvious and always possible; however, it is constrained by the form of the key chosen to index the nodes.
Another compression approach is to "unravel" the data structure into a single byte array. This approach eliminates the need for node pointers, which reduces the memory requirements substantially and makes memory mapping possible, allowing the virtual memory manager to load the data into memory very efficiently.
Another compression approach is to "pack" the trie. Liang describes a space-efficient implementation of a sparse packed trie applied to automatic hyphenation, in which the descendants of each node may be interleaved in memory.
References:

Black, Paul E. "trie". Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology.

Franklin Mark Liang (1983). Word Hy-phen-a-tion by Com-put-er (Doctor of Philosophy thesis). Stanford University.

Knuth, Donald (1997). "6.3: Digital Searching". The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley. p. 492.

Bentley, Jon; Sedgewick, Robert. "Ternary Search Trees". Dr. Dobb's Journal.

Fredkin, Edward (1960). "Trie Memory". Communications of the ACM 3 (9): 490–499.

Jan Daciuk, Stoyan Mihov, Bruce W. Watson, Richard E. Watson (2000). "Incremental Construction of Minimal Acyclic Finite-State Automata". Computational Linguistics (Association for Computational Linguistics) 26: 3. "This paper presents a method for direct building of minimal acyclic finite states automaton which recognizes a given finite list of words in lexicographical order. Our approach is to construct a minimal automaton in a single phase by adding new strings one by one and minimizing the resulting automaton on-the-fly."

Ulrich Germann, Eric Joanis, Samuel Larkin (2009). "Tightly Packed Tries" (PDF). ACL Workshops: Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing. Association for Computational Linguistics. pp. 31–39. "We present Tightly Packed Tries (TPTs), a compact implementation of read-only, compressed trie structures with fast on-demand paging and short load times. We demonstrate the benefits of TPTs for storing n-gram back-off language models and phrase tables for statistical machine translation. Encoded as TPTs, these databases require less space than flat text file representations of the same data compressed with the gzip utility. At the same time, they can be mapped into memory quickly and be searched directly in time linear in the length of the key, without the need to decompress the entire file. The overhead for local decompression during search is marginal."

de la Briandais, R. (1959). "File Searching Using Variable Length Keys". Proceedings of the Western Joint Computer Conference: 295–298.
Attached document: 基于变步长Trie树的数据包分类技术的研究与实现.pdf (Research and Implementation of a Packet Classification Technique Based on Varied-Step Tries). Document description:
Classified Index: TP393
U.D.C.: 621.38
Dissertation for the Master Degree in Engineering
RESEARCH AND IMPLEMENTATION OF
PACKET CLASSIFICATION TECHNIQUE
BASED ON VARIED STEP TRIE
Candidate: Yin Qing
Supervisor: Professor Chi Lejun
Academic Degree Applied for: Master of Engineering
Specialty: Computer Science and Technology
Affiliation: School of Computer Science and Technology
Date of Defence: June, 2008
Degree-Conferring-Institution: Harbin Institute of Technology
Harbin Institute of Technology, Dissertation for the Master Degree in Engineering

Packet classification is a fundamental technique of network management; it plays a crucial role especially in network access control and in QoS control for network services.

Existing packet classification algorithms are oriented toward static rules, and their research has concentrated mainly on speeding up packet matching. Network security systems aimed at unified threat management, however, are mostly built from multiple cooperating components: access control rules are largely distributed in an automatic, coordinated mode, so the volume and frequency of dynamic rule updates have grown enormously, and update speed has a significant impact on the performance of the network security system.

Starting from an analysis of access control lists, this thesis studies the characteristics of the rule base and, based on the distribution of access control rules, proposes a packet classification algorithm built on a varied-step-length trie. The algorithm uses a variable-stride trie structure: according to the source-address field of the rule set, it chooses suitable, gradually varying stride widths and builds the lookup tree. Because of the variable-stride structure, memory accesses during lookup are greatly reduced. On this basis, by combining linear linked lists, hash tables and binary search trees, a composite lookup structure based on the variable-stride trie can be built quickly. As a result, the algorithm maintains a high packet classification speed while allowing access control rules to be added and deleted quickly, achieving dynamic updating of the rule base. Finally, the thesis describes in detail a packet classification system that uses this algorithm on a dedicated Linux platform; the system runs stably and has performed well in practice.

Quantitative benchmarks against comparable algorithms show a clear performance improvement, especially in average memory accesses and rule-construction speed. With memory usage within an acceptable range, the algorithm spends 60% less preprocessing time than EGT-PC, the algorithm with the best preprocessing time; its average memory access count matches the Hicuts and RFC algorithms, the best on that metric; and its worst-case memory access count is 50% lower than EGT-PC on rule sets of fewer than 5,000 rules and close to Hicuts on rule sets of more than 10,000 rules. In addition, the packet classification system built on this algorithm has performed well in practice as a coordinated component of a network security defense architecture.

Keywords: packet classification; access control list; varied-step-length trie; dynamic update
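The core variable-stride idea — consuming several key bits per node instead of one, so that a lookup touches far fewer nodes — can be sketched as follows (the stride plan and 16-bit keys here are arbitrary illustrations; the thesis derives its strides adaptively from the rule set's source-address distribution):

```python
STRIDES = [8, 4, 4]  # arbitrary stride plan covering a 16-bit key

def vs_insert(root, key, value, width=16):
    node, shift = root, width
    for stride in STRIDES:
        shift -= stride
        chunk = (key >> shift) & ((1 << stride) - 1)  # next `stride` bits
        node = node.setdefault(chunk, {})
    node['value'] = value

def vs_find(root, key, width=16):
    node, shift = root, width
    for stride in STRIDES:                  # only len(STRIDES) node visits,
        shift -= stride                     # vs. `width` visits in a 1-bit trie
        chunk = (key >> shift) & ((1 << stride) - 1)
        if chunk not in node:
            return None
        node = node[chunk]
    return node.get('value')

root = {}
vs_insert(root, 0xABCD, "rule-1")
```

Here a lookup performs 3 node accesses instead of 16, which mirrors the thesis's claim that variable strides greatly reduce memory accesses during search.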
Packet classification is a basic technique in the management of networks, which plays an important role especially in the fields of network access control and network-services-oriented QoS control.

Previous research on packet classification algorithms focused on static rule sets, mainly on speeding up the matching process. But now, network security systems oriented toward unified threat management consist of multiple interacting parts. Access Control Lists' dispensing and updating has been automated, so the quantity and frequency of dynamic updating have increased greatly, while the speed of updating influences, and even determines, the performance of the network security system.

This paper begins by analyzing ACLs and researching the characteristics of the rule library; according to the characteristics of the rule distribution, it introduces a packet classification algorithm based on the varied-step trie. The algorithm uses the varied-step trie structure: according to the source addresses of the rule set, it picks appropriate, gradually varying step widths and builds a trie. Because of this architecture, memory accesses during searching are greatly reduced. Furthermore, by combining linear tables, hash tables and binary-tree structures, we can quickly build a compound search tree structure based on the varied-step trie. All of this not only keeps packet classification fast, but also speeds up ACL delete and add operations, in order to dynamically update the rule base.