Url

Suffix tree

What is this site? It is generaly simplier version of wikipedia. You will find there selected articles. Enjoy!

Suffix tree for the string BANANA. Each substring is terminated with special character $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the boxes give the start position of the corresponding suffix. Suffix links drawn dashed.

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.

The suffix tree for a string S is a tree whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree (more specifically, a Patricia trie) for the suffixes of S.

Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.

Contents

History

The concept was first introduced as a position tree by Weiner in 1973 in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976 , and also by Ukkonen in 1995. Ukkonen provided the first linear-time online-construction of suffix trees, now known as Ukkonen's algorithm.

Definition

The suffix tree for the string S of length n is defined as a tree such that ( page 90):

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n −  1 such nodes, and n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal nodes, 1 root).

Suffix links are a key feature for linear-time construction of the tree. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

Functionality

A suffix tree for a string S of length n can be built in Θ(n) time, if the alphabet is constant or integer . Otherwise, the construction time depends on the implementation. The costs below are given under the assumption that the alphabet is constant.

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D=\{S_1,S_2,\dots,S_K\} of total length n=|n_1|+|n_2|+\cdots+|n_K|. You can:

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time ( chapter 8). You can then also:

Applications

Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology and other application areas. Primary applications include:

Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines (first introduced in ).

Implementation

If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of all the strings on all of the edges in the tree is O(n2), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a fibonacci word, giving the full 2n nodes.

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has pointer to its first child, and to the next node in the child list it is a part of. Hash maps, sorted/unsorted arrays (with array doubling), and balanced search trees may also be used, giving different running time properties. We are interested in:

Let σ be the size of the alphabet. Then you have the following costs:

Lookup Insertion Traversal
Sibling lists / unsorted arrays O(σ) Θ(1) Θ(1)
Hash maps Θ(1) Θ(1) O(σ)
Balanced search tree O(logσ) O(logσ) O(1)
Sorted arrays O(logσ) O(σ) O(1)
Hash maps + sibling lists O(1) O(1) O(1)

Note that the insertion cost is amortised, and that the costs for hashing are given perfect hashing.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

External construction

Suffix trees quickly outgrow the main memory on standard machines for sequence collections in the order of gigabytes. As such, their construction calls for external memory approaches.

There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach et al. is theoretically optimal, with an I/Os complexity equal to that of sorting. However, as discussed for example in , the overall intricacy of this algorithm has prevented, so far, its practical implementation.

On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (few) GB/hours. The state of the art methods are TDD , TRELLIS , DiGeST , and B^2ST .

TDD and TRELLIS scale up to the entire human genome – approximately 3GB – resulting in a disk-based suffix tree of a size in the tens of gigabytes ,. However, these methods cannot handle efficiently collections of sequences exceeding 3GB . DiGeST performs significantly better and is able to handle collections of sequences in the order of 6GB in about 6 hours . The source code and documentation for the latter is available from . All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B2ST , scales to handle inputs that do not fit in main memory.

See also

References

  1. ^ P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory. pp. 1–11. 
  2. ^ Edward M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262–272. doi:10.1145/321941.321946. http://doi.acm.org/10.1145/321941.321946. 
  3. ^ E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249–260. doi:10.1007/BF01206331. http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf. 
  4. ^ R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331–353. doi:10.1007/PL00009177. http://www.zbh.uni-hamburg.de/staff/kurtz/papers/GieKur1997.pdf. 
  5. ^ a b c d e f g h i j k l m n Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8. 
  6. ^ Martin Farach (1997). "Optimal suffix tree construction with large alphabets". Foundations of Computer Science, 38th Annual Symposium on. pp. 137–143. ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1996/96-48.ps.gz. 
  7. ^ Ricardo A. Baeza-Yates and Gaston H. Gonnet (1996). "Fast text searching for regular expressions or automaton searching on tries". Journal of the ACM (ACM Press) 43 (6): 915–936. doi:10.1145/235809.235810. 
  8. ^ a b Allison, L.. "Suffix Trees". http://www.allisons.org/ll/AlgDS/Tree/Suffix/. Retrieved 2008-10-14. 
  9. ^ Oren Zamir and Oren Etzioni (1998). "Web document clustering: a feasibility demonstration". SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM. pp. 46–54. 
  10. ^ Martin Farach-Colton, Paolo Ferragina, S. Muthukrishnan (2000). "On the sorting-complexity of suffix tree construction.". J. Acm 47(6) 47 (6): 987–1011. doi:10.1145/355541.355547. 
  11. ^ Smyth, William (2003). Computing Patterns in Strings. Addison-Wesley. 
  12. ^ a b Sandeep Tata, Richard A. Hankins, and Jignesh M. Patel (2003). "Practical Suffix Tree Construction". VLDB '03: Proceedings of the 30th International Conference on Very Large Data Bases. Morgan Kaufmann. pp. 36–47. 
  13. ^ a b Benjarath Phoophakdee and Mohammed J. Zaki (2007). "Genome-scale disk-based suffix tree indexing". SIGMOD '07: Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM. pp. 833–844. 
  14. ^ a b c Marina Barsky, Ulrike Stege, Alex Thomo, and Chris Upton (2008). "A new method for indexing genomes using on-disk suffix trees". CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM. pp. 649–658. 
  15. ^ a b Marina Barsky, Ulrike Stege, Alex Thomo, and Chris Upton (2009). "Suffix trees for very large genomic sequences". CIKM '09: Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM. 
  16. ^ "The disk-based suffix tree for pattern search in sequenced genomes". http://webhome.cs.uvic.ca/~mgbarsky/DIGEST_SEARCH. Retrieved 2009-10-15. 

External links

This article's use of external links may not follow Wikipedia's policies or guidelines. Please improve this article by removing excessive and inappropriate external links or by converting links into footnote references. (August 2010)
v  d  e
Trees in computer science
Binary trees
Self-balancing binary search trees
B-trees
Tries
Suffix tree · Radix tree · Ternary search tree
Binary space partitioning (BSP) trees
Non-binary trees
Spatial data partitioning trees
Other trees
Retrieved from "http://en.wikipedia.org/wiki/Suffix_tree"

All text are available under the terms of the GNU Free Documentation License. Hope this site help you/
Dogs - Zapraszamy na filmy przez internet Jest to najlepszy serwis w sieci - charmere - www.faktury-online.info - Biuro rachunkowe bydgoszcz - Busy Kraków - paznokcie akrylowe - nysa - narzędzia, kompresor , klucze - fajne gry logiczne - nowe gry - fajne gry sportowe - fajne gry dla dziewczyn - wózki paletowe - giętarki do rur