Date: 2024-04-15

Machine learning has ushered in an era where vast amounts of data can be processed and analyzed to uncover insights and make predictions that were previously unimaginable. From personalized recommendations to sophisticated natural language processing, machine learning models are at the core of these advancements, and are the basis of AI systems.

For developers, managing the high-dimensional vector data generated by these models presents a new set of challenges. Your application may already have data in PostgreSQL, and moving to a hybrid setup with a separate vector database is time-consuming and expensive. That hybrid approach also adds latency to your application, and it is unnecessary when you can continue to use the same database for your vector data.

Enter pgvector, a powerful PostgreSQL extension that brings vector storage and similarity search capabilities to your database. It's available on Render for use in your projects.

In this post, we explore how Render PostgreSQL with pgvector can supercharge your machine learning applications, making them more efficient and accessible than ever before.

The magic of vectors in machine learning

At the heart of any recommendation system lies the concept of vectors.

Don't worry, this won't be a crash course in linear algebra, but vectors are what these systems use internally for the necessary calculations.

In machine learning, vectors are numerical representations of data points for input (features) and output (labels or predictions). Deeper representations are also possible when the data involves more than the word proximity used in our example. These vectors capture the essence of items in a multi-dimensional space, where each "dimension" represents a feature. For music, these text or numeric features might include tempo, genre, or artist similarity. There is also waveform-based vectorization you can do on the sound itself, which you can read about here. By comparing vectors, we can determine how similar two songs are, enabling personalized recommendations.

It is typically the role of the developer to determine what these dimensions are, and how they relate, in order to properly calculate similarity across them.
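To make "comparing vectors" concrete, here is a minimal sketch that ranks songs by cosine similarity. The three features and their values are made up for illustration (they are not from the dataset used later in this post), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical features per song: (tempo, energy, acousticness), scaled to 0..1.
songs = {
    "fast_rock_song": [0.9, 0.8, 0.1],
    "another_rock_song": [0.8, 0.9, 0.2],
    "quiet_folk_song": [0.2, 0.1, 0.9],
}


def most_similar(name, catalog):
    """Rank every other song in the catalog by similarity to `name`, best first."""
    target = catalog[name]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in catalog.items()
        if other != name
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling most_similar("fast_rock_song", songs) puts the other rock song ahead of the folk song, because its vector points in nearly the same direction.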

What is pgvector?

pgvector is a PostgreSQL extension that brings the power of vector storage and similarity search to your database. It allows you to store dimensional vectors and perform efficient similarity searches directly within PostgreSQL queries.

The project was released in April 2021 by Andrew Kane (Twitter, GitHub) and is regularly updated with new features for efficient and scalable storage and querying of data for data science and machine learning. Key features of pgvector include:

Vector Data Type: Allows storage of fixed-length arrays of floating-point numbers.

Similarity Search Functions: Perform similarity searches using distance metrics like Euclidean distance and cosine similarity.

Indexing: Supports indexing vector columns using PostgreSQL’s indexing mechanisms for efficient search.
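As a rough illustration of those distance metrics, the sketch below shows in pure Python what pgvector's three distance operators compute: <-> (Euclidean distance), <=> (cosine distance, which is 1 minus cosine similarity), and <#> (negative inner product). The Python function names are hypothetical; in a real application the database does this work inside the query.

```python
import math


def l2_distance(a, b):
    """Equivalent of pgvector's `a <-> b` (Euclidean distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Equivalent of pgvector's `a <=> b` (1 - cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):
    """Equivalent of pgvector's `a <#> b` (inner product, negated)."""
    return -sum(x * y for x, y in zip(a, b))
```

All three are distances in the sense that smaller values mean "closer", which is why the inner product is negated: it lets an ascending ORDER BY return the best matches first.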

Hosting on Render: A Seamless Integration

Render announced public support for pgvector in April 2023 for all PostgreSQL databases. We're a great fit for your machine learning systems with our flexible PostgreSQL pricing and scaling, and easy deployment and management.

Let's look at the steps to get started with a text-only recommendation system.

Example App: Building a Music Recommendation System

I put together a sample application for recommending music based on "favorited" data from other users. I used an open dataset from Kaggle and first filtered the 100M rows of data down to about 25M. (This cleanup removed rows of duplicate data, non-printable characters, and inappropriate content.)

I used a Python library called gensim to calculate the vector similarities, and stored all of the data, including all of the vectors, in under 3GB of storage in PostgreSQL. My example code, including the Python scripts and the FastAPI script that you can test, can be found here.

The dataset from Kaggle includes two CSV files:

A list of songs (a song ID, song title, and band name)

A list of user IDs and song IDs that each user "liked"

I used the gensim library in Python to calculate a vector of "similarity" based on the band names of songs liked by users. I was able to calculate this data into vectors in about 15-20 minutes on a MacBook Pro M2 Max, and load it into PostgreSQL in a few more minutes. Querying the vector data using pgvector takes 100ms-250ms.

In comparison, a traditional query-based application would take a band name from the user, search for other users who liked songs by that band in a large database JOIN operation, then go look up all the other songs liked by those users, collate which other bands those users liked, and return a response which calculates some sort of ranking. That's a very large number of queries across millions of rows of data. As the dataset grows, so would your query times.

A 3-Step approach to building text-based recommendations

Step 1: Enable pgvector

First, enable pgvector in your Render PostgreSQL database. Your database's Info page in the Render Dashboard includes a psql command you can run in your terminal to connect to your database. Once connected, run the following command to enable pgvector:

CREATE EXTENSION vector;

Note that the extension is called "vector" and not "pgvector".
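If you would rather run this step from a script than from psql, a minimal sketch could look like the following. It assumes a DB-API connection such as one from psycopg2; the function name and the DATABASE_URL environment variable are hypothetical, and the IF NOT EXISTS clause is added only so the script can be re-run safely.

```python
ENABLE_SQL = "CREATE EXTENSION IF NOT EXISTS vector;"


def enable_pgvector(conn):
    """Enable the pgvector extension using an open DB-API connection."""
    cur = conn.cursor()
    try:
        cur.execute(ENABLE_SQL)
        conn.commit()
    finally:
        cur.close()


# Example usage (assumes psycopg2 is installed and DATABASE_URL is set):
#
#   import os
#   import psycopg2
#   conn = psycopg2.connect(os.environ["DATABASE_URL"])
#   enable_pgvector(conn)
#   conn.close()
```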

Next, create a table to store music vectors. This could be made more efficient with a normalized table of band names and a band ID, but that would require a JOIN operation, so I kept the table structure simple.
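As a hedged sketch of what such a table might look like, the helper below builds a CREATE TABLE statement with a pgvector column. The table name music_vectors, its columns, and the default of 100 dimensions (gensim's default Word2Vec vector size) are assumptions for illustration, not the schema from the example repository.

```python
def create_table_sql(dimensions=100):
    """Return a CREATE TABLE statement for a hypothetical music_vectors table."""
    if not isinstance(dimensions, int) or dimensions <= 0:
        raise ValueError("dimensions must be a positive integer")
    return (
        "CREATE TABLE IF NOT EXISTS music_vectors ("
        "id bigserial PRIMARY KEY, "
        "band text NOT NULL, "
        f"embedding vector({dimensions})"
        ");"
    )


def create_table(conn, dimensions=100):
    """Create the table using an open DB-API connection (e.g. psycopg2)."""
    cur = conn.cursor()
    try:
        cur.execute(create_table_sql(dimensions))
        conn.commit()
    finally:
        cur.close()
```

The number in vector(100) fixes the length of every stored vector, so it has to match the size of the vectors your model produces.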

Step 2: Train Your Model

Using a model like Word2Vec in gensim, which converts text to vector data, you can train on your dataset to generate vectors for each band. The "window" parameter to Word2Vec sets the maximum distance between a given word and a predicted similar word. A value of 2-5 looks at 2 to 5 words to the left and right of a given word, but lower values might be too "narrow" to understand what the user is searching for. A higher value may learn more "context" for Natural Language Processing (NLP) purposes, but it might also increase calculation times and produce less accurate results.
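A minimal sketch of this step is shown below. It assumes gensim 4.x (where the size parameter is called vector_size) and treats each user's list of liked band names as one "sentence"; that grouping, the function names, and the parameter values are assumptions for illustration rather than the exact training script.

```python
from collections import defaultdict


def build_sentences(likes, song_to_band):
    """Group liked songs by user and map each song to its band name.

    `likes` is an iterable of (user_id, song_id) pairs. Each user's list of
    band names acts as one "sentence" for Word2Vec. Users with fewer than
    two known bands are dropped because they carry no co-occurrence signal.
    """
    by_user = defaultdict(list)
    for user_id, song_id in likes:
        band = song_to_band.get(song_id)
        if band is not None:
            by_user[user_id].append(band)
    return [bands for bands in by_user.values() if len(bands) > 1]


def train_band_vectors(sentences, vector_size=100, window=5):
    """Train Word2Vec and return a {band_name: vector_as_list} mapping."""
    # Imported lazily so the data preparation above works without gensim.
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # must match the vector(N) column size
        window=window,            # max distance between current and predicted word
        min_count=1,              # keep bands that appear only once
        workers=4,
    )
    return {band: model.wv[band].tolist() for band in model.wv.index_to_key}
```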

Inserting each vector row into the database one at a time can be very inefficient for bulk quantities of data. In my example code, the load_data.py script expects the data to be written to a CSV file, and then uses multithreading to read a chunk of data 1000 rows at a time to do a single insert. This will drastically speed up the insertion time for your data from many hours to only a few minutes.
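The chunking idea can be sketched as follows. This is not the load_data.py script itself: it leaves out the CSV reading and multithreading, and the music_vectors table and its columns are hypothetical. It assumes a DB-API driver that uses %s placeholders, such as psycopg2, and passes each vector as pgvector's text form ('[1.0,2.0,...]') with an explicit cast.

```python
from itertools import islice


def chunked(rows, size=1000):
    """Yield lists of up to `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_vector_literal(values):
    """Format a sequence of numbers as pgvector's text input, e.g. '[1.0,2.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def build_insert(chunk):
    """Build one multi-row INSERT for a chunk of (band, vector) pairs."""
    placeholders = ",".join(["(%s, %s::vector)"] * len(chunk))
    sql = f"INSERT INTO music_vectors (band, embedding) VALUES {placeholders}"
    params = []
    for band, vector in chunk:
        params.extend([band, to_vector_literal(vector)])
    return sql, params


def insert_all(conn, rows, size=1000):
    """Insert all rows, issuing one INSERT statement per chunk."""
    cur = conn.cursor()
    try:
        for chunk in chunked(rows, size):
            sql, params = build_insert(chunk)
            cur.execute(sql, params)
        conn.commit()
    finally:
        cur.close()
```

Sending 1000 rows per statement cuts the number of round trips to the database by roughly a factor of 1000 compared with row-by-row inserts, which is where most of the time savings come from.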

Step 3: Implement Recommendations

Next, I created a FastAPI application to query the vector data, with a very simplified recommendation endpoint. Since all vector data is in the PostgreSQL database, querying it with pgvector is fast, and results often come back within 200ms.
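A minimal sketch of such an endpoint might look like this. The music_vectors table, the /recommend route, and the function names are hypothetical, and the query uses pgvector's <=> cosine distance operator as one reasonable choice of metric; it is not necessarily what the example repository does. The connection is assumed to be a DB-API connection such as one from psycopg2.

```python
RECOMMEND_SQL = """
SELECT band,
       embedding <=> (
           SELECT embedding FROM music_vectors WHERE band = %s LIMIT 1
       ) AS distance
FROM music_vectors
WHERE band <> %s
ORDER BY distance
LIMIT %s
"""


def recommend(conn, band, limit=10):
    """Return the bands closest to `band`, nearest first."""
    cur = conn.cursor()
    try:
        cur.execute(RECOMMEND_SQL, (band, band, limit))
        # If the band is unknown, the subquery yields NULL and every
        # distance is NULL, so those rows are skipped.
        return [
            {"band": name, "distance": float(distance)}
            for name, distance in cur.fetchall()
            if distance is not None
        ]
    finally:
        cur.close()


def create_app(get_conn):
    """Build a FastAPI app; `get_conn` is a callable returning a new connection."""
    from fastapi import FastAPI  # imported lazily; requires fastapi installed

    app = FastAPI()

    @app.get("/recommend")
    def recommend_endpoint(band: str, limit: int = 10):
        conn = get_conn()
        try:
            return recommend(conn, band, limit)
        finally:
            conn.close()

    return app
```

With this shape, a request such as GET /recommend?band=Some%20Band&limit=5 runs a single query, and the ranking happens entirely inside PostgreSQL rather than in application code.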

Why Render is the Best Choice

Render simplifies the deployment and scaling of applications with integrated PostgreSQL support, including extensions like pgvector. Here’s why our platform stands out:

Ease of Deployment: Deploying a FastAPI application with PostgreSQL on Render is straightforward. Render’s documentation and community support make the process seamless.

Scalability: Render can automatically scale your application, handling increased load without manual intervention, and offers high availability PostgreSQL with point-in-time recovery.

Affordability: Render’s pricing structure is designed to be cost-effective, providing powerful infrastructure at a lower cost compared to traditional cloud providers.

Additionally, if you build a front-end application to consume your API, hosting the front-end on Render will reduce latency and increase your security as well.

By leveraging pgvector and Render, you can build powerful recommendation systems and other machine learning applications efficiently and cost-effectively.

Wrapping Up

In the ever-evolving landscape of machine learning and AI for building recommendation systems, pgvector and Render offer a compelling combination. pgvector brings the power of vector similarity search to PostgreSQL, while Render provides a scalable, cost-effective platform for deployment. Together, they let you create sophisticated, efficient, and highly personalized recommendation systems, transforming how users discover new information in your application.