Z. Meral Ozsoyoglu, PI 
S. Cenk Sahinalp, CoPI School of Computer Science

Piotr Indyk, CoPI 
Z. Meral Ozsoyoglu 
S. Cenk Sahinalp 
Piotr Indyk 
Finding sequences similar to a given query sequence in a large data set is a fundamental problem in many database applications including computational genomics and proteomics, computational finance, audio, text and multimedia image processing. This proposal considers proximity search problems for strings and sequences in such applications where existing methods typically employ distance functions based on character edits only and using index structures to improve search efficiency. However, in many applications block edit operations such as translocations, deletions and copies, as well as linear transformations on strings are as common and important as character edits. Since computation of distance measures that allow block edits are generally known to be NPhard, it is important to work on practical methods for approximating the block edit distances and performing approximate searches under such measures. These measures are many times not symmetric (transforming a long sequence to a short one may be performed through a single deletion whereas transforming the short sequence to the long one may require many copy operations) or do not satisfy the triangular inequality (string space under edit operations is fundamentally different from Euclidean space) and thus do not provide a metric. However many of them can be proven to be almostmetrics, i.e., the symmetry and/or the triangular inequalities may be satisfied within multiplicative constants. For improved sequence proximity search, we utilize this property and distance based indexing methods extended for use in almostmetric spaces.
The main focus of this proposal is developing sequence proximity search methods which are particularly effective for genomic sequences, with improved generality, reliability and speed. The investigators propose to advance their ongoing work at the conceptual level as well as experimentally test the methods developed on both target genomic sequences and generated sequences so as to understand their full potential and limitations.
[1] S. C. Sahinalp, M. Tasan, J.
Macker, Z. M. Ozsoyoglu, Improved Proximity Search for Genomic Sequences,
IEEE ICDE 2003, Bangalore, India, March 2003.
[2] A. Funda Ergun, S. Muthukrishnan, S. Cenk Sahinalp, "Comparing Sequences with Segment Rearrangements ", Proceedings of Foundations of Software Technology and Theoretical Computer Science, Mumbai, India, vol. , (2003).
[3] A. F. Ergun, S. Muthukrishnan, C. Sahinalp , "Sequence comparison under block rearrangements", Proceedings of LATIN, Latin American Theoretical Informatics Symposium, Buenos Aires, 2004.
[4] Alexandr Andoni, Michel Deza, Anupam Gupta, Piotr Indyk and Sofya Raskhodnikova, "Lower bounds for Embedding of Edit Distance to Normed Spaces", Proceedings of Symposium on Discrete Algorithms, SODA, vol. 14, (2003).
[5] M. Tasan and Z.M. Ozsoyoglu, Improvements in DistanceBased Indexing,
International Scientific and Statistical
[6] E. Karakoc, Z.M.Ozsoyoglu, S.C.Sahinalp, M.Tasan, X.Zhang, Novel approaches to Biomolecular Sequence Indexing, IEEE Data Engineering Bulletin (invited paper) special issue on biological data management, pp. 3744, September, 2004.
[7] S.C.Sahinalp, E.Eichler, P.Goldberg, P.Berenbrink, T.Friedetzky, F.Ergun, "Identifying Uniformly Mutated Segments within Repeats", Journal of Bioinformatics and Computational Biology, 2(4) pp 112, Dec. 2004.
[8] S.C.Sahinalp, A.Utis, "Hardness of String Similarity Search and Other Indexing Problems", Proceedings of ICALP, pp 10801098, Turku, Finland, Jul. 2004.
[9] S.Muthukrishnan, S.C.Sahinalp, An Improved Algorithm for Sequence Comparison with Block Reversals, Theoretical Computer Science (invited paper) 321(1), pp.95101, Jun 2004.
[10] Piotr Indyk, Embedded Stringology, Invited talk at the Symposium on Combinatorial Pattern Matching (CPM 2004), June 2004, http://theory.lcs.mit.edu/~indyk/cpm.ps
In the second year of the project, we generalized optimal nearest neighbor search for distance based indexing, and worked on improving the greedy metric tree construction algorithms by utilizing probabilistic performance analysis. We are also developing an improved distance based indexing method by exploiting the array based methods and tree structure based methods, and using probabilistic analysis as a guide for choosing better reference points during the search. We generalized array based methods which are designed to search for nearest neighbor search so that they can perform near neighbor range queries efficiently.
We are also working on improving the VP tree based indexing methods through deterministic VP selection and flexible neighborhood determination. We have already proven that the problem is NPhard, however we have also been able to develop a O(log n) approximation to the best VP selection in this general scenario. We are now trying to improve the approximation factor based on a weighted metric set cover approach. We are also trying to obtain a bottom up version of the algorithm with a Dynamic Programming flavor. The existing top down approach potentially repeats some of the iterations which have already been performed.
In this project we aim to establish theoretical foundations of
similarity search for various measures of sequence similarity under
generalizations of vantage point trees and other distance based indexing methods
including array based methods. We also aim to apply the recent developments in
indexing under Hamming metric and other vector similarity measures to sequences
through embeddings with small distortions. The main goal of the project is to
provide efficient and reliable indexing tools for sequence similarity search,
especially protein sequence search under all desirable similarity
measures.
E. Hunt, M.P. Atkinson, R.W. Irving, A database index to large biological sequences, Proc. of VLDB conference, 2001.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ., Basic local alignment search tool, J Mol Biol 1990 Oct 5;215(3):40310.
T. Bozkaya and M. Ozsoyoglu, DistanceBased Indexing for HighDimensional Metric Spaces, Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 357368, 1997.
A.J. Buhler, Efficient Large Scale Sequence Comparison by Locality Sensitive Hashing, Bioinformatics17(5), 2001.
P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Proc. ACM Symp. on Theory of Computing, 1998, 604613.
T. Kahveci, A.K. Singh, Efficient Index Structures for String Databases, Proc. VLDB, pp. 351360, 2001.
E. Kushilevitz, R. Ostrovsky and Y. Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, Proc. ACM Symposium on Theory of Computing, 1998, 614623.
M. Li, J. H. Badger, C. Xin, S. Kwong, P. Kearney, H. Zhang, An information based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17, 2001
S. Muthukrishnan and S. C. Sahinalp, Approximate nearest neighbors and sequence comparison with block operations, Proc. ACM Symposium on Theory of Computing, 2000.
R. Weber, H. Schek, and S. Blott, A quantitative analysis and performance study for similarity search methods in high dimensional spaces, Proc. VLDB, pp. 194205, 1998.
Peter N. Yianilos, Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces, Proc. ACMSIAM Symposium on Discrete Algorithms, pp. 311321, 1993.
IEEE Data Engineering Bulletin, Special Issue on Querying Biological Sequences, Sept. 2004, (ftp://ftp.research.microsoft.com/pub/debull/A04sept/issue1.htm).
Gisli R. Hjaltason, Hanan Samet, Indexdriven similarity search in metric spaces, ACM TODS, Vol. 28, No. 4, December 2003, pp. 517580.
We welcome contacts from other researchers interested in collaborations.