Introduction To Vector Databases
Editor’s note: Steven Pousty is a speaker for both ODSC Europe in September and ODSC West in October! Be sure to check out his session, “Going From Unstructured Data to Vector Similarity Search,” at each event!
Almost everyone in the data science community has heard about vector databases, but not everyone has a good grasp of the technology. You see press releases and blogs on vector capabilities for almost every existing database technology. At the same time, you see a whole bunch of new dedicated vector stores coming to market. Some are open-source, some are not. Some are downloadable, and some only run in SaaS environments.
Let’s have a quick discussion about vector data stores, what they are, and why you should care. This post is also a lightweight encapsulation of the workshops I will be teaching at ODSC Europe and ODSC West this year. You might consider this a side dish to prepare you for the main course.
What is a Vector DB?
A vector database stores, indexes, and queries high-dimensional vectors efficiently. Vectors are essentially arrays of numbers that represent data in a numerical format, often used in machine learning and AI applications. These vectors are typically generated by feeding data through a specific type of neural network called a transformer.
The major benefit of these vectors is that they capture “semantic meaning” for unstructured data, such as images, audio, and large amounts of written information. Semantic meaning is another way of saying the transformers are trained to summarize important features in the data and then encode those features into the output vector.
Unlike traditional databases that focus on structured data (like tables of text or numbers), vector databases are optimized for operations involving vectors, such as similarity search and nearest neighbor search. By transforming this data into vectors and storing them in a database, you enable faster and more efficient similarity (or dissimilarity) querying and analysis compared to traditional databases.
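To make "similarity search" concrete, here is a minimal sketch of the core operation a vector database optimizes: brute-force cosine similarity over a matrix of stored vectors. The embeddings below are made-up toy values, and a real vector database would use an approximate index (like HNSW) instead of scanning every row, but the math is the same.

```python
import numpy as np

# Toy "database": five 4-dimensional embedding vectors (illustrative values).
embeddings = np.array([
    [0.1, 0.9, 0.0, 0.2],
    [0.8, 0.1, 0.3, 0.0],
    [0.0, 0.95, 0.1, 0.15],
    [0.5, 0.5, 0.5, 0.5],
    [0.9, 0.0, 0.4, 0.1],
])

def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query_norm = query / np.linalg.norm(query)
    matrix_norms = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix_norms @ query_norm

query = np.array([0.05, 1.0, 0.0, 0.1])
scores = cosine_similarity(query, embeddings)
top_k = np.argsort(scores)[::-1][:2]  # indices of the 2 most similar vectors
print(top_k, scores[top_k])
```

A dedicated vector database replaces the full scan above with an approximate nearest-neighbor index, which is what makes the same query fast over millions of vectors.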
How They Help You
“Memory” for Your Model
One of the key benefits of vector databases is their ability to serve as an external memory for machine learning models. Models can store embeddings in a vector database and retrieve them later when needed. Without this capability, a similarity search would require keeping all the previous embeddings in memory, and for any decently sized data set this would quickly exhaust the machine's memory. This capability is crucial for tasks like recommendation systems, where the model needs to recall previous user interactions to make accurate predictions.
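The "external memory" idea can be sketched in a few lines: persist embeddings to disk, shut everything down, then reload and query them later. This toy uses a NumPy `.npz` file as a stand-in for a real vector database; the file path, IDs, and vectors are all illustrative.

```python
import os
import tempfile

import numpy as np

# Persist embeddings so the process that created them can shut down
# and a later process can load and query them (a stand-in for a vector DB).
store_path = os.path.join(tempfile.gettempdir(), "embeddings.npz")

ids = np.array(["doc-1", "doc-2", "doc-3"])
vectors = np.random.default_rng(0).normal(size=(3, 8))
np.savez(store_path, ids=ids, vectors=vectors)

# Later, possibly in a different process: reload and run a
# nearest-neighbor lookup without having kept anything in memory.
data = np.load(store_path)
loaded = data["vectors"]
query = loaded[0] + 0.01  # a query vector very close to doc-1 (illustrative)
dists = np.linalg.norm(loaded - query, axis=1)
nearest = data["ids"][np.argmin(dists)]
print(nearest)  # doc-1 is closest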
Huge Energy and Capital Savings
As mentioned above, without a database the embeddings would need to be stored in memory, and the model would have to run continuously. By using a vector database, the model can be spun down while the embeddings are preserved. The ability to query and retrieve embeddings from the database leads to substantial energy and cost savings, as less hardware and power are needed to achieve the same performance and accuracy.
Vital Part of Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a technique in natural language processing that combines retrieval over embedded documents with text generation to give the model more context for the user's query. In RAG, the system retrieves relevant information from a large corpus (stored in a vector database), adds that retrieved information to the original user query, and sends this combined information to the LLM, which then generates a more accurate and contextually relevant response.
Vector databases play a vital role in RAG by storing the embeddings of your accurate, up-to-date information. Developers get a familiar query syntax, and they can use the results to improve the information sent to the LLM. When you want the LLM to base its answer on your specific information (like your knowledge base or documentation), RAG is a much easier (and possibly cheaper) solution than fine-tuning a model.
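The RAG flow described above can be sketched end to end in a few functions. Everything here is a hypothetical stand-in: the bag-of-words `embed` replaces a real transformer embedding model, the brute-force `retrieve` replaces a vector database query, and the final prompt is returned instead of being sent to an LLM API.

```python
# Toy corpus of documentation snippets standing in for your knowledge base.
corpus = [
    "To reset your password, use the admin console.",
    "Invoices are emailed on the first of each month.",
    "The API rate limit is 100 requests per minute.",
]

def embed(text):
    # Stand-in embedding: bag-of-words over a tiny vocabulary.
    # A real system would call a transformer embedding model here.
    vocab = ["password", "invoice", "api", "rate", "admin"]
    words = [w.strip(".,?").lower() for w in text.split()]
    return [sum(w.startswith(v) for w in words) for v in vocab]

def retrieve(query, k=1):
    # Brute-force similarity search; a vector database would index this.
    q = embed(query)
    scored = sorted(
        corpus,
        key=lambda doc: -sum(a * b for a, b in zip(q, embed(doc))),
    )
    return scored[:k]

def answer(query):
    # Combine the retrieved context with the original query.
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    # A real system would now send `prompt` to an LLM; we just return it.
    return prompt

print(answer("How do I reset my password?"))
```

The key design point is that only the retrieval step touches the vector database; the LLM never sees your whole corpus, just the few passages most similar to the query.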
Introductory Workshop — Just for You
This blog post is just to give you enough information to make you hungry to learn more about vector databases. There are plenty of learning materials available on the web. If you are interested in learning more about vector databases, I’ll be giving talks at ODSC events this year. At ODSC Europe in September and ODSC West in October, I’ll present “Going From Unstructured Data to Vector Similarity Search.”
These sessions will go more in-depth on this material AND you will get hands-on experience with working examples. To impress your friends and loved ones, all the examples are in GitHub repos you can run when you get back home.
No matter how you choose to learn more about Vector Databases, if you want to work in this brand-new AI/ML world, it should definitely be a tool you have in your toolbelt. If you do decide to join me, be sure to come up and say hi. See you in AI/ML land!
About the author:
Steve is a dad, partner, son, and founder of Tech Raven Consulting. He can teach you about data analysis, Java, Python, PostgreSQL, microservices, containers, Kubernetes, and some JavaScript. He has deep subject area expertise in GIS/spatial, statistics, and ecology. Before founding his company, Steve was a Developer Advocate for VMware, Crunchy Data, DigitalGlobe, Red Hat, LinkedIn, deCarta, and ESRI. Steve has a Ph.D. in Ecology and can easily be bribed with offers of bird watching or fly fishing.
Originally posted on OpenDataScience.com