Improving Proximity Search for Sequences

Z. Meral Ozsoyoglu, PI

Department of EE and Computer Science
Case Western Reserve University

 

S. Cenk Sahinalp, Co-PI

 School of Computer Science
Simon Fraser University

 

Piotr Indyk, Co-PI

Computer Science and AI lab
Massachusetts Institute of Technology

Contact Information

Z. Meral Ozsoyoglu
EECS Department, CS Division
Case Western Reserve University
10900 Euclid Avenue
Cleveland, Ohio 44106
Phone: 216-368-2818
fax:     216-368-1534
email: meral ατ case.edu
URL: http://zmo.cwru.edu

S. Cenk Sahinalp 
School of Computer Science
Simon Fraser University
Vancouver, BC, Canada
Phone: +1-604-291-5415
Fax:      +1-604-291-3045
email: cenk ατ cs.sfu.ca
URL: http://www.cs.sfu.ca/~cenk/

Piotr Indyk 
Computer Science and Artificial Intelligence Lab
Stata Center, Room G642
Cambridge, Massachusetts 02139
Phone: 617-452-3402
Fax: 617-258-8682
email: indyk ατ theory.lcs.mit.edu
URL: http://theory.lcs.mit.edu/~indyk/

Collaborators

List of Supported Faculty and Students

Project Award Information

Project Summary

Finding sequences similar to a given query sequence in a large data set is a fundamental problem in many database applications including computational genomics and proteomics, computational finance, audio, text and multimedia image processing. This proposal considers proximity search problems for strings and sequences in such applications where existing methods typically employ distance functions based on character edits only  and using index structures to improve search efficiency. However, in many applications block edit operations such as translocations, deletions and copies, as well as linear transformations on strings are as common and important as character edits. Since computation of distance measures that allow block edits are generally known to be NP-hard, it is important to work on practical methods for approximating the block edit distances and performing approximate searches under such measures. These measures are many times not symmetric (transforming a long sequence to a short one may be performed through a single deletion whereas transforming the short sequence to the long one may require many copy operations) or do not satisfy the triangular inequality (string space under edit operations is fundamentally different from Euclidean space) and thus do not provide a metric. However many of them can be proven to be almost-metrics, i.e., the symmetry and/or the triangular inequalities may be satisfied within multiplicative constants. For improved sequence proximity search, we utilize this property and distance based indexing methods extended for use in almost-metric spaces.

The main focus of this proposal is developing sequence proximity search methods which are particularly effective for genomic sequences, with improved generality, reliability and speed.  The investigators propose to advance their on-going work at the conceptual level as well as experimentally test the methods developed on both target genomic sequences and generated sequences so as to understand their full potential and limitations.

Publications and Products

Year 1 (Oct. 2002 - Sept. 2003) 

[1] S. C. Sahinalp, M. Tasan,  J. Macker, Z. M. Ozsoyoglu, Improved Proximity Search for Genomic Sequences, IEEE ICDE 2003, Bangalore, India, March 2003.

[2] A. Funda Ergun, S. Muthukrishnan, S. Cenk Sahinalp, "Comparing Sequences with Segment Rearrangements ", Proceedings of Foundations of Software Technology and Theoretical Computer Science, Mumbai, India, vol. , (2003).

[3] A. F. Ergun, S. Muthukrishnan, C. Sahinalp , "Sequence comparison under block rearrangements", Proceedings of LATIN, Latin American Theoretical Informatics Symposium, Buenos Aires, 2004. 

[4] Alexandr Andoni, Michel Deza, Anupam Gupta, Piotr Indyk and Sofya Raskhodnikova, "Lower bounds for Embedding of Edit Distance to Normed Spaces", Proceedings of Symposium on Discrete Algorithms, SODA, vol. 14, (2003).

Year 2 (Oct. 2003 - Sept. 2004)

[5] M. Tasan and Z.M. Ozsoyoglu, “Improvements in Distance-Based Indexing”, International Scientific and Statistical Conference , Greece , June 2004, pp. 161-170.

[6]  E. Karakoc, Z.M.Ozsoyoglu, S.C.Sahinalp, M.Tasan, X.Zhang, Novel approaches to Biomolecular Sequence Indexing, IEEE Data Engineering Bulletin (invited paper) special issue on biological data management, pp. 37-44, September, 2004.

[7]  S.C.Sahinalp, E.Eichler, P.Goldberg, P.Berenbrink, T.Friedetzky, F.Ergun, "Identifying Uniformly Mutated Segments within Repeats", Journal of Bioinformatics and Computational Biology, 2(4) pp 1-12, Dec. 2004.

[8] S.C.Sahinalp, A.Utis, "Hardness of String Similarity Search and Other Indexing Problems", Proceedings of ICALP, pp 1080-1098, Turku, Finland, Jul. 2004.

[9] S.Muthukrishnan, S.C.Sahinalp, An Improved Algorithm for Sequence Comparison with Block Reversals, Theoretical Computer Science (invited paper) 321(1), pp.95-101, Jun 2004.

[10] Piotr Indyk, Embedded Stringology, Invited talk at the Symposium on Combinatorial Pattern Matching (CPM 2004), June 2004,  http://theory.lcs.mit.edu/~indyk/cpm.ps

Project Impact

The results of this project will have an impact on similarity search for sequences in general.  It will target developing novel techniques for embedding sequences into easier to manage spaces such as binary vectors between which simpler measures of similarity such as the Hamming distance will tightly approximate the original sequence similarity measures such as (any weighted variant  of) the Levenshtein edit distance. It will also target exploiting combinatorial and statistical properties of sequence search spaces (in particular pairwise distance distributions) to enable faster search through improved versions of conventional methods. These include existing sequence proximity search techniques, which are designed to perform range queries on metric spaces which need to be updated so that they can perform nearest neighbor search queries on non-metric, non-geometric spaces such as the sequence spaces in consideration. 

Goals, Objectives and Targeted Activities

In the first year of the project, we generalized distance based indexing methods which are designed to perform range queries for metric spaces so that they can perform nearest neighbor search queries for near-metrics. We tested our methods on the complete human proteome and obtained encouraging results We also worked on limitations of existing methods such as embedding edit like distances to easier to manage vector distances. In particular we showed that sequences under Levenshtein edit distance can not be embedded into binary vectors under Hamming distance within an approximation factor of less than 3/2. This result implies that the use of existing (approximate) nearest neighbor search methods  under  Hamming distance (with guaranteed polynomial time query performance) in the context  edit distance (via embeddings) need to tolerate an approximation factor 3/2 or more. 

In the second year of the project, we generalized optimal nearest neighbor search for distance based indexing, and worked on improving the greedy metric tree construction algorithms by utilizing probabilistic performance analysis. We are also developing an improved distance based indexing method by exploiting the array based methods and tree structure based methods, and using probabilistic analysis as a guide for choosing better reference points during the search. We generalized array based methods which are designed to search for nearest neighbor search so that they can perform near neighbor range queries efficiently. 

We are also working on improving the VP tree based indexing methods through deterministic VP selection and flexible neighborhood determination. We have already proven that the problem is NP-hard, however we have also been able to develop a O(log n) approximation to the best VP selection in this general scenario. We are now trying to improve the approximation factor based on a weighted metric set cover approach. We are also trying to obtain a bottom up version of the algorithm with a Dynamic Programming flavor. The existing top down approach potentially repeats some of the iterations which have already been performed.

In this project we aim to establish theoretical foundations of similarity search for  various measures of sequence similarity under generalizations of vantage point trees and other distance based indexing methods including array based methods. We also aim to apply the recent developments in indexing under Hamming metric and other vector similarity measures to sequences through embeddings with small distortions. The main goal of the project is to provide efficient and reliable indexing tools for sequence similarity search, especially protein sequence search under all desirable similarity measures.

Area Background

Efficient nearest neighbor search is crucial to many applications in computational genomics and in general computational molecular biology. Sequence similarity measured in the form of (weighted or unweighted) Levenshtein edit distance (the minimum -weighted- number of character edit operations to transform one sequence to another) or the block edit distance and its variants (the minimum -weighted- number of both character and block/segment edits) such as transformation distance and compression distance provides means for understanding evolutionary relatedness between them. Sequence similarity also provides a measure of structural relatedness (e.g. how amino-acid sequences fold up into a 3-D protein structure) as well as functional relatedness (in which proteome networks a given protein is active and in what manner). Because sequence information is easy to determine, the evolutionary history, structure and function of a newly discovered protein (or any other biomolecule) can be determined through investigating those proteins which are sequence wise similar. In addition to computational genomics and molecular biology, efficient similarity search (near neighbor, nearest neighbor, k-nearest neighbor) for sequences is also very important for several other applications including  image recognition, document comparisons, voiceprint matching. Distance based indexing techniques for metric spaces and distance metrics for sequences show promise in similarity searching for sequences as an alternative to using traditional multi dimensional indexes together with some mapping techniques for sequence to vector spaces. 

Area References

E. Hunt, M.P. Atkinson, R.W. Irving, A database index to large biological sequences, Proc. of VLDB conference, 2001.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ., Basic local alignment search tool, J Mol Biol 1990 Oct 5;215(3):403-10.

T. Bozkaya and M. Ozsoyoglu, Distance-Based Indexing for High-Dimensional Metric Spaces, Proc. ACM SIGMOD Intl. Conf. on Management of Data,  pp. 357-368, 1997.

A.J. Buhler, Efficient Large Scale Sequence Comparison by Locality Sensitive Hashing, Bioinformatics17(5), 2001.

P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of  Dimensionality. Proc. ACM Symp. on Theory of Computing, 1998, 604--613.

T. Kahveci, A.K. Singh, Efficient Index Structures for String Databases, Proc. VLDB, pp. 351-360, 2001.

E. Kushilevitz, R. Ostrovsky and Y. Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, Proc. ACM Symposium on Theory of Computing, 1998, 614--623.

M. Li, J. H. Badger, C. Xin, S. Kwong, P. Kearney, H. Zhang, An information based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17, 2001

S. Muthukrishnan and S. C. Sahinalp, Approximate nearest neighbors and sequence comparison with block operations, Proc. ACM Symposium on Theory of Computing, 2000.

R. Weber, H. Schek, and S. Blott, A quantitative analysis and performance study for similarity search methods in high dimensional spaces, Proc. VLDB, pp. 194-205, 1998.

Peter N. Yianilos, Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces, Proc. ACM-SIAM Symposium on Discrete Algorithms, pp. 311-321, 1993.

IEEE Data Engineering Bulletin, Special Issue on Querying Biological Sequences, Sept. 2004, (ftp://ftp.research.microsoft.com/pub/debull/A04sept/issue1.htm).

Gisli R. Hjaltason, Hanan Samet, Index-driven similarity search in metric spaces, ACM TODS, Vol. 28, No. 4, December 2003, pp. 517-580.

Project Websites

http://zmo.cwru.edu/proximity
This is the main website for our project.

Potential Related Projects

We welcome contacts from other researchers interested in collaborations.