Scalable high-dimensional similarity searches using a query driven dynamic quantization

November 30th, 2022

Categories: Data Mining, Software, Data Science

About

Guadalupe Canahuate, Associate Professor at the University of Iowa, presents an EVL colloquium on Wednesday, November 30, 2022, in Room 2068 ERF (Continuum).

The concept of similarity is the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items or, in terms of distance, the points closest to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, because a few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e., functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the point fall into the same bin. These quantizations are query-agnostic, and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query-dependent dynamic quantization method for high-dimensional similarity searches that not only improves the quality of the distance metric, but also improves query time performance by filtering out irrelevant data. We propose a distributed indexing and query algorithm to efficiently compute this novel quantization and evaluate the scalability of the approach in relation to the number of dimensions and the number of compute nodes.
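To make the idea of a localized similarity function concrete, here is a minimal Python sketch of the query-agnostic baseline described above, not the query-dependent method presented in the talk. It assumes equi-width bins over a known per-dimension range and scores each matching dimension as one minus the normalized gap; the function names (`bin_index`, `localized_similarity`), the bin count, and the scoring rule are all illustrative assumptions rather than details from the work itself.

```python
import math
from typing import Sequence


def bin_index(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Map a value to an equi-width bin over [lo, hi], clamped to a valid index."""
    if hi <= lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def localized_similarity(
    query: Sequence[float],
    point: Sequence[float],
    lows: Sequence[float],
    highs: Sequence[float],
    n_bins: int = 4,
) -> float:
    """Sum per-dimension similarity over dimensions where query and point share a bin.

    Assumption: a matching dimension contributes 1 - |q - p| / bin_width,
    floored at 0. Dimensions in different bins contribute nothing.
    """
    score = 0.0
    for q, p, lo, hi in zip(query, point, lows, highs):
        if bin_index(q, lo, hi, n_bins) != bin_index(p, lo, hi, n_bins):
            continue  # this dimension is not "close" to the query, so skip it
        width = (hi - lo) / n_bins
        score += max(0.0, 1.0 - abs(q - p) / width) if width > 0 else 1.0
    return score


# Toy data (made up for illustration): three dimensions, each ranging over [0, 1].
lows, highs = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
query = [0.10, 0.50, 0.90]
a = [0.12, 0.55, 0.10]  # very close in two dimensions, far off in the third
b = [0.40, 0.95, 0.88]  # moderately off in two dimensions, close in one

score_a = localized_similarity(query, a, lows, highs)
score_b = localized_similarity(query, b, lows, highs)
print("localized:", score_a, score_b)
print("euclidean:", math.dist(query, a), math.dist(query, b))
```

In this toy data, the single dissimilar dimension of `a` dominates its Euclidean distance, so Euclidean distance ranks `b` closer to the query, while the localized score ranks `a` higher because it only counts the dimensions where `a` and the query share a bin. The bin boundaries here are fixed in advance regardless of where the query falls, which is the query-agnostic limitation the abstract points to.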

Bio: Guadalupe Canahuate is an Associate Professor in the Department of Electrical and Computer Engineering at the University of Iowa. Her research interests include machine learning, large-scale data management, data compression and indexing, and big data. Dr. Canahuate obtained her Ph.D. degree from the Computer Science and Engineering department at The Ohio State University in 2009 and her M.Sc. degree from the Computer and Information Science department in 2004. Her work includes precision medicine approaches for outcome risk modeling of radiotherapy oncologic patients, distributed indexing and query optimization, high-dimensional data analysis, and semantic-based searches. Before joining academia, Dr. Canahuate worked as a software developer and a certified database administrator in her home country, the Dominican Republic.