We often see customers pull data from their Render-managed PostgreSQL database for data science and AI. Sometimes, they’re doing one-off analysis, and sometimes they’re training generative AI models.
For these use cases, you often need or want to take a random sample of data. However, sampling can be slow if a database table has billions of rows, or if two large tables must be joined before you sample.
Recently, we helped one of our customers resolve these exact problems.
In this post, we share our findings with you. You’ll learn how to randomly sample large datasets in PostgreSQL while keeping your queries performant. Instead of telling you “the answer,” we’ll walk through possible solutions and their tradeoffs, and help you build a deeper understanding of databases.
We’ll cover:
- What the top Stack Overflow answers get wrong: The most popular answer is not always the most efficient.
- Benefits and tradeoffs of nondeterministic algorithms: We introduce Bernoulli sampling and TABLESAMPLE, which are fast but come with potential downsides.
- How to use a simulation to build intuition: We walk through a simulation of 10k trials of TABLESAMPLE, which provides you with a concrete example of its tradeoffs.
Why take a random sample?
Before we dive into how to randomly sample from PostgreSQL, let’s consider why we might do it. Using a portion of data can make a large dataset more practical to use. And by taking a random sample, your sample will be statistically representative. Therefore, using a random sample can:
- Enable and accelerate model training: When you’re building a proof-of-concept for a machine learning model, you might not have enough compute resources to train on the entire dataset. Training on a smaller, random sample can speed up your development. Sampling is also a necessary step in training via stochastic gradient descent (regardless of the model and total training data size).
- Avoid overfitting in a model: Training on a random sample can help prevent overfitting by exposing the model to a variety of data points.
- Alleviate resource constraints: Large datasets can exceed the memory capacity of most machines—and we mean personal laptops as well as database servers. By sampling data, you reduce load on your system, so you can perform computations that would otherwise be infeasible.
- Enable exploratory data analysis: Working with a random sample lets you more quickly uncover patterns in data and test a hypothesis. You save time because you aren’t processing the entire dataset.
Intuitive but inefficient: ORDER BY random()
Let’s start out with the most intuitive, but naive solution. Let’s say you’ve picked a table, my_events, and you want to sample 10k rows.
You might tell the database to randomly "shuffle" the data you'd like to select from, then take 10k elements from the head of the shuffled list. This approach ranks at the top of many related Stack Overflow posts:
SELECT * FROM my_events
ORDER BY random()
LIMIT 10000;
Unfortunately, this approach is very inefficient. To understand why, let’s look at the query plan for an example run.
Our example dataset
For the analysis and query plans in this post, we've constructed the sample_values table by creating a one-column table and inserting a billion rows:
CREATE TABLE sample_values (id bigserial primary key);
INSERT INTO sample_values SELECT generate_series(1, 1000000000);
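If you want to sanity-check the table’s size before experimenting, one way (a quick sketch; the exact numbers depend on your system, and you should run ANALYZE first so the statistics are populated) is to look at the planner statistics and on-disk size:
-- Approximate row count, number of heap blocks, and on-disk size
SELECT reltuples::bigint AS approx_rows,
       relpages AS heap_blocks,
       pg_size_pretty(pg_relation_size('sample_values')) AS table_size
FROM pg_class
WHERE relname = 'sample_values';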
Analyzing ORDER BY random()
Using the sample_values table, let’s analyze the naive ORDER BY random() approach:
EXPLAIN ANALYZE
SELECT * FROM sample_values
ORDER BY random()
LIMIT 10000;
Note that 10k rows represents just 0.001% of the total data.
As we expected, the resulting query plan is abysmal:
This plan has two heavyweight steps:
- It first examines every row in the table. To select just 0.001% of our data, this plan scans all billion rows. In our “best case scenario” demo environment—with no table bloat, no updated or deleted tuples we need to skip based on transaction visibility, and no concurrent clients using shared buffer resources—this scan already takes over a minute. That's eons for the computers of today!
- It then performs over 13 billion comparisons for a top-N heapsort. The top-N heapsort maintains a fixed-size heap data structure that holds the (up to) 10k rows with the lowest random() values seen so far. When a new row is pulled from the table scan, it is inserted into the heap, which takes around log_2(10,000) comparisons against existing heap elements. Once the heap is full, each insertion pushes the highest value out of the set of candidates. After seeing all billion rows, the heap contains the sampled rows, ready to be streamed to the user. (See the back-of-envelope check of the comparison count just after this list.)
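To see where the “13 billion comparisons” figure comes from, here is a back-of-envelope calculation you can run directly in PostgreSQL. It simply multiplies the row count by log_2 of the heap size; it’s a rough estimate, not the planner’s exact cost model:
-- ~log2(10,000) comparisons per heap insertion, times 1 billion rows
SELECT (1e9 * log(2, 10000))::bigint AS approx_comparisons;  -- roughly 13.3 billion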
Bernoulli sampling removes costly comparisons
To improve this query plan, let’s eliminate the top-N heapsort, so we look at each row only once. In the naive plan, we first paired every row with a random value, then used those random values to compare and select rows. In the improved plan, let’s look at each row once, independently, and determine using a “dice roll” whether a row should be selected. This technique is called Bernoulli sampling. Its efficiency comes with a trade-off: the number of rows in the output is non-deterministic. In other words, you can expect to get about 10k rows, but not exactly 10k rows, each time you sample. Here’s what it looks like as a database query:
-- random() returns a value in the range [0, 1)
-- Therefore we compare against 0.001% expressed as a fraction (0.00001) to get ~10k rows
SELECT * FROM sample_values WHERE random() < 0.00001;
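Because each row gets its own dice roll, the exact number of returned rows varies from run to run. A quick way to see this for yourself (just a sketch; each run takes about as long as the sampling query itself, and your counts will differ):
-- Run this a few times: each run returns a count near, but rarely exactly, 10,000
SELECT count(*) FROM sample_values WHERE random() < 0.00001;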
Analyzing Bernoulli sampling
If we look at the query plan for the Bernoulli sampling approach, we see the runtime has decreased by 50%.
Notably, this query runs in a single step, and no longer requires maintaining a secondary data structure or comparing elements.
Unfortunately, it still takes over 45 seconds. The table scan is costly because it fetches a massive number of data pages into memory. In PostgreSQL, these fetches are necessary even when using a relatively efficient index scan, because visibility information is stored in the heap. (For the same reason, simply counting rows in a large table can be slow.)
So, how can we further chop down the runtime?
Heap blocks to the rescue
To get even better performance, we need a way to avoid looking at every row. Fortunately, table rows in PostgreSQL are organized into groups called heap blocks. Each block is 8kB by default, and includes some bookkeeping information such as the visibility information mentioned above. Using heap blocks, we can randomly select blocks of rows instead of individual rows. Instead of generating a random value for each row, as we did in the naive ORDER BY random() approach, we can generate a random value for each block. Then we can select or discard all rows within a block.
Here’s a conceptual illustration of what happens under the hood:
The key win is that we only scan rows within blocks we select. When our target sample size is much smaller than the total size of the table, this new approach ends up not scanning the vast majority of the rows.
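If you want to see this layout on your own tables, PostgreSQL exposes it through the hidden ctid system column, and the block size through the block_size setting. For illustration:
-- Each ctid is a (block number, offset within block) pair
SELECT ctid, id FROM sample_values LIMIT 5;
-- The size of a heap block in bytes (8192 by default)
SHOW block_size;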
TABLESAMPLE cuts a minute to milliseconds
To put the heap blocks approach into practice, we can use a method called TABLESAMPLE. TABLESAMPLE helps us get around the fact that PostgreSQL doesn't expose heap blocks directly to the user. Thankfully, the SQL:2003 standard introduced TABLESAMPLE to sample blocks as we described, and PostgreSQL implements it. You can use TABLESAMPLE together with the SYSTEM table sampling method, which chooses each block with equal probability.
Here’s how we might query our sample_values table now:
-- Sample 0.001% of the table ~ 10k rows
SELECT * FROM sample_values TABLESAMPLE SYSTEM(0.001);
Aside: Using BERNOULLI instead of SYSTEM
As an alternative to SYSTEM, there's another built-in table sampling method called BERNOULLI, which performs essentially the same approach that we took in our WHERE random() < $percentage query earlier. As you might expect, this sampling method is slower: it achieves a similar runtime to our WHERE random() < $percentage query.
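For reference, here’s what that looks like; as with SYSTEM, the argument is a percentage of the table:
-- Sample ~0.001% of rows, rolling the dice per row instead of per block
SELECT * FROM sample_values TABLESAMPLE BERNOULLI(0.001);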
Analyzing TABLESAMPLE
The query plan for the TABLESAMPLE query is radically different. The entire query outputs ~10k rows in about 50ms.
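To reproduce this on your own data, you can inspect the plan and timing directly (runtimes will of course vary with your hardware and table):
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sample_values TABLESAMPLE SYSTEM(0.001);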
Notably, this query, like the Bernoulli sampled query, is non-deterministic. You should expect the number of rows returned to vary.
This technique comes with an additional trade-off: individual rows are no longer independently selected. All rows within a block are selected together, so if you happen to select the same block multiple times across several samplings, the same set of rows is selected together each time.
That being said, how much of a problem is this in practice? To find out, we ran a simulation.
Exploring bias in TABLESAMPLE through a simulation
In our simulation, we ran TABLESAMPLE many times on the same dataset. This helped us better understand two things:
- Bias: We looked at how often the same heap blocks were selected. This helped us understand how much bias is introduced by selecting blocks instead of individual rows.
- Non-determinism: We looked at how many rows were returned by each TABLESAMPLE query, and at the total runtime. This helped us build intuition around what to expect from the non-deterministic output.
Specifically, we used TABLESAMPLE to select ~100k rows out of 1 billion rows, and we ran this selection 10k times.
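Our exact test harness isn’t shown here, but here’s a minimal sketch of how you could run similar trials yourself in plain SQL. The trial_results table and the trial count of 100 (rather than 10k, to keep the example quick) are illustrative assumptions:
-- Collect the ids returned by repeated TABLESAMPLE runs
CREATE TABLE trial_results (trial int, id bigint);

DO $$
BEGIN
  FOR t IN 1..100 LOOP
    INSERT INTO trial_results
    SELECT t, id FROM sample_values TABLESAMPLE SYSTEM(0.01);  -- ~100k rows per trial
  END LOOP;
END $$;

-- Fraction of sampled ids that showed up in more than one trial
SELECT (count(*) FILTER (WHERE trials > 1))::numeric / count(*) AS repeated_fraction
FROM (
  SELECT id, count(*) AS trials
  FROM trial_results
  GROUP BY id
) AS per_id;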
Results when selecting ~100k rows
First, let’s talk about bias. In the combined output of our 10k trials, only ~12% of rows in the output appeared more than once. In general, we’d expect that selecting from even larger tables (with a greater number of heap blocks) will bring this percentage down.
Note that the size of the rows in the table also influences the number of rows in each block. The bigger each row is, the fewer rows fit in each block, and thus a smaller number of rows is selected together. In other words, we’d also expect the percentage of rows that are selected more than once to be lower in tables with bigger rows.
Now, let’s look at non-determinism. Our graphs showed that:
- Total runtime followed a normal distribution.
- The number of results followed a normal distribution around the target sample size (100k elements).
More data on non-determinism
To further build intuition around the shape of the non-deterministic output, we also ran the trial with samples of ~1k and ~10k rows.
Selection of ~1k rows
Selection of ~10k rows
Taken together, these graphs show:
- Total runtime (which also includes consuming all rows from a remote database), even for a sample size of ~100k, stays efficient. Most queries finish in under 150ms.
- The number of results follows a normal distribution around the target sample size in every case.
Sampling from JOIN results
Now you know a few ways to randomly sample a table. In the real world, though, we often don’t want to sample a simple table. Sometimes we want to sample the result of a join. Applying our first naive approach to a join, we’d get a query like this:
SELECT x.value, y.value
FROM x
JOIN y ON y.x_id = x.id
ORDER BY random()
LIMIT 10000
As before, the database must examine every joined row before it can return a sample. How do we use the faster TABLESAMPLE technique in this situation? Let’s think about this conceptually. We could first sample from one side of the relation, then simply apply the join as a second step. We’d get a query like this:
WITH candidates AS (
-- First, sample rows from relation `x`
SELECT * FROM x TABLESAMPLE SYSTEM(0.001)
)
-- Then, join on relation `y` without any additional sampling
SELECT x.value, y.value FROM candidates x JOIN y ON y.x_id = x.id
Note, this query uses a WITH clause for readability.
Broadly speaking, this query works. But there are important details to note depending on the join relationship between the tables.
How JOIN relationships affect sampling
There are three general types of join relationships: one-to-one, many-to-one, and many-to-many. Here’s what to note about each type when you’re applying the query we just introduced.
One-to-one: The above query will be sufficient, and you can run the TABLESAMPLE SYSTEM step on either of the two tables. It makes no difference because every row in one table matches exactly one row in the other table.
Many-to-one: In this case, it does matter which table you sample from. You may want to sample from the larger table (the “many” side of the relation). Then, each sampled row joins against a single other row in the join step.
Many-to-many: Many-to-many relations may require additional sampling steps, but it really depends on your data. Often, sampling from one particular side of the relation cuts down the number of candidate rows sufficiently. You may be able to simply join against the other table, then run an ORDER BY random() to cut the joined rows down to your desired sample size, as in the sketch below.
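For example, a sketch for a many-to-many relation that goes through a hypothetical join table x_y(x_id, y_id) might look like this:
WITH candidates AS (
  -- First, sample one side of the relation
  SELECT * FROM x TABLESAMPLE SYSTEM(0.001)
)
SELECT x.value, y.value
FROM candidates x
JOIN x_y ON x_y.x_id = x.id  -- hypothetical join table for the many-to-many relation
JOIN y ON y.id = x_y.y_id
-- Finally, trim the (already much smaller) joined set down to the target sample size
ORDER BY random()
LIMIT 10000;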
Aside: Avoid this pitfall
We want to flag an easy way to go wrong when sampling a join. You might be tempted to run the TABLESAMPLE on both tables, like this:
SELECT x.value, y.value
FROM x TABLESAMPLE SYSTEM(0.001)
JOIN y TABLESAMPLE SYSTEM(0.001) ON y.x_id = x.id
This query is likely not what you want. In fact, it will return zero rows most of the time. This is because you're independently sampling rows from x and y, and the rows you’ve sampled from x likely don’t join against the rows you’ve sampled in y!
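To see why, here is rough arithmetic for a one-to-one join, treating each side’s sample as if rows were chosen independently with probability 0.00001:
-- ~10k rows sampled from x, each with a ~0.001% chance that its matching row in y was also sampled
SELECT (1e9 * 0.00001) * 0.00001 AS expected_joined_rows;  -- ≈ 0.1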