Vector Database

What is a Vector Database?

A vector database is a type of database that stores and manages unstructured data, such as text, images, or audio, as vector embeddings (high-dimensional vectors) so that similar objects can be found and retrieved quickly.

Today’s Machine Learning (ML) algorithms can convert a given object (e.g., a word or a text) into a numerical representation that preserves the information of that object. Imagine you give an ML model a word (e.g., “food”), then that ML model does its magic and returns a long list of numbers. This long list of numbers is the numerical representation of your word and is called a vector embedding.

How do vector databases work?

Vector databases can quickly retrieve objects similar to a query because they index the stored vector embeddings ahead of time instead of comparing the query against every object at search time. The underlying concept is called Approximate Nearest Neighbor (ANN) search, which uses different algorithms for indexing and calculating similarities.

Compared to exact k-nearest neighbor (kNN) search, ANN lets you trade some accuracy for speed and retrieve the approximately most similar objects to a query.
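
To make the comparison concrete, here is a minimal sketch of the exact kNN baseline that ANN search tries to avoid: it measures the distance from the query to every stored vector and keeps the k closest. The function names, the toy 2-dimensional vectors, and the choice of Euclidean distance are assumptions made for illustration, not part of any particular vector database.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_exact(query, vectors, k):
    """Return the ids of the k vectors closest to the query.

    Every stored vector is compared with the query, so the cost grows
    linearly with the size of the collection.
    """
    ranked = sorted(vectors, key=lambda vid: euclidean(query, vectors[vid]))
    return ranked[:k]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
vectors = {
    "apple": [0.9, 0.1],
    "banana": [0.8, 0.2],
    "car": [0.1, 0.9],
    "truck": [0.2, 0.8],
}

nearest = knn_exact([0.88, 0.1], vectors, k=2)
print(nearest)
```

The result is always exact, but the full scan becomes slow for large collections. ANN methods skip most of these comparisons and accept that the returned neighbors may occasionally differ from the exact ones.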

Indexing — For this, a vector database indexes the vector embeddings. This step maps the vectors to a data structure that will enable faster searching.

You can think of indexing as grouping the books in a library into different categories, such as author or genre. But because embeddings can hold more complex information, further categories could be “gender of the main character” or “main location of plot”. Indexing thus lets a search examine only a small portion of all the available vectors, which speeds up retrieval.
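
The library analogy can be sketched as a simple bucket index: each vector is filed under the closest of a few reference points (centroids), and a query only looks inside the bucket it falls into. This is one possible indexing scheme chosen for illustration, not the specific algorithm of any product. The centroids below are hand-picked assumptions (in practice they would typically be learned from the data, for example with k-means), and all function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(vector, centroids):
    """Position of the centroid closest to the given vector."""
    return min(range(len(centroids)), key=lambda i: euclidean(vector, centroids[i]))


def build_index(vectors, centroids):
    """Group vector ids into buckets, one bucket per centroid."""
    buckets = {}
    for vid, vec in vectors.items():
        buckets.setdefault(nearest_centroid(vec, centroids), []).append(vid)
    return buckets


def search(query, vectors, centroids, buckets, k):
    """Rank only the vectors in the query's bucket, not the whole collection."""
    candidates = buckets.get(nearest_centroid(query, centroids), [])
    ranked = sorted(candidates, key=lambda vid: euclidean(query, vectors[vid]))
    return ranked[:k]


# Hypothetical toy embeddings and hand-picked centroids.
vectors = {
    "apple": [0.9, 0.1],
    "banana": [0.8, 0.2],
    "car": [0.1, 0.9],
    "truck": [0.2, 0.8],
}
centroids = [[0.85, 0.15], [0.15, 0.85]]

buckets = build_index(vectors, centroids)
result = search([0.88, 0.1], vectors, centroids, buckets, k=1)
print(buckets, result)
```

Here the query is compared with two vectors instead of four. The saving comes at a price: if the true nearest neighbor happens to sit in a neighboring bucket, this search will miss it, which is exactly the accuracy-for-speed trade-off of ANN.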

Similarity Measures — To find the nearest neighbors to the query from the indexed vectors, a vector database applies a similarity measure. Common similarity measures include cosine similarity, dot product, Euclidean distance, Manhattan distance, and Hamming distance.
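
The measures listed above can each be written in a few lines. The following is a plain-Python sketch for illustration; the function names are assumptions, and production systems generally use optimized, vectorized implementations instead.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the vectors after scaling both to length 1."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming_distance(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot_product(a, b))
print(cosine_similarity(a, b))
print(euclidean_distance(a, b))
print(manhattan_distance(a, b))
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

Note that cosine similarity and dot product are similarities (higher values mean closer objects), whereas Euclidean, Manhattan, and Hamming are distances (lower values mean closer objects). Hamming distance is usually applied to binary or categorical vectors.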