Huwebes, Hunyo 28, 2012

Suffix Trees and Pattern Matching: An Introduction

    With the advent of the Human Genome Project, the need for efficient and fast algorithms for pattern matching has become more apparent. The problem of pattern matching is a simple one, given a text t, one must find a string p called the pattern. Applying this to the genes, given a long string of DNA sample, one is to find if a sequence of nucleotides (that perhaps codes for a trait) exists in that sample. Fairly easy, but as one can see, brute force algorithms could take as long as O(nm) to search for a match. Of course, one has to factor the length of the DNA sample and of the nucleotide sequence. Oftentimes, the DNA sample could contain thousands, if not millions, of nucleotides. This could definitely mean that matching patterns via brute force is not something that could be done quickly.

    Pattern matching are of two types. Exact pattern matching searches for the exact pattern. So if the string that is being searched is AGTAAGTC, then that’s the string that it would search, no variations would result into a positive. The other type of pattern matching is the approximate one. As the name implies, it searches for an approximate match of a sequence.

    One type of data structure that is effective for exact matching is the suffix tree. It is basically a tree of suffixes of the whole text. So for example the word is BANANA, then its suffixes are {BANANA,ANANA,NANA,ANA,NA,A,$}. Suffix trees are basically a keyword tree with nonbranching nodes collapsed. It is a good algorithm because it can match pattern in linear time. It applies the concept of threading. The string that one desires to match is fed into the tree, if the threading is complete, that is it ended on a edge or a vertex after it traversed the tree, then a match is found, otherwise, no match.

    One of the problem of suffix trees is that its construction could be in quadratic time. This happens when you do the keyword style of construction. One good thing is that there are algorithms that exists that can solve this problem in linear time. One such algorithm was designed by Weiner.

    In general, suffix trees are extremely efficient if there are several patterns that is being matched on one long text. It has a linear search time, faster than other known algorithms. The only down side to it is that it searches for a string exactly, so when a nucleotide from the text was mutated, it is possible that the suffix tree would fail to find a match.

References
Jones, Neil C. and Pevzner, Pavel A. An Introduction to Bioinformatics Algorithms.

Levitin, Anany. The Design and Analysis of Algorithms. 2nd ed.

Walang komento:

Mag-post ng isang Komento