Practical Exercise—Your first partitioning
I am currently using a MacBook Pro with only 20GiB of free space, so I cannot reproduce a 100GiB+ scenario. Instead, I will build a smaller one just to explore the impact of partitioning and how we can approach it. As mentioned before, partitioning might not be a good strategy when we are working with little data; other strategies are worth considering first, including proper indexing.
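As a point of comparison, the "proper indexing" alternative for this exercise would be a plain B-tree index on the column we filter by. This is just a sketch using the table and column names defined in the next step:

-- A sketch of the indexing alternative: a plain B-tree index on the
-- date column our queries will filter by.
CREATE INDEX idx_sales_created_at ON sales (created_at);

At this data size, such an index alone may already satisfy the queries below without any partitioning.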
1. Creating our sales table
The first thing we will do for this exercise is create the table we saw in the examples. I will use Postgres for all the steps here because it supports partitioning natively. We will create our table like this:
CREATE TABLE sales (
sale_id SERIAL PRIMARY KEY,
product_id INT NOT NULL,
price DECIMAL(10, 2) NOT NULL,
created_at DATE NOT NULL
);
This creates a regular table, not a partitioned one. We will use both so we can compare query performance between them.
2. Insert some data into it
Let's write a script to insert 10,000,000 rows into our database. It should result in around 4.5GiB of data.
DO $$
DECLARE
    batch_size INT := 100000; -- Number of rows per batch
    num_batches INT := 100;   -- 100 batches * 100,000 rows = 10,000,000 rows of sales
    i INT;
BEGIN
    FOR i IN 1..num_batches LOOP
        INSERT INTO sales (product_id, price, created_at)
        SELECT
            ((random() * 1000) * i * (random() * 100))::INT, -- Pseudo-random product_id
            (random() * 1000)::NUMERIC(10, 2),
            DATE '2020-01-01' + (random() * 1460)::INT -- Random date between 2020-01-01 and 2023-12-31 (4 years)
        FROM generate_series(1, batch_size);
    END LOOP;
END $$;
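To confirm the table ended up around the expected size, we can ask Postgres directly. pg_total_relation_size counts the table plus its indexes and TOAST data, so the exact figure on your machine may differ a bit:

-- Check how much disk space the sales table occupies,
-- including its indexes and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('sales'));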
3. Query your brand-new table
We will start by summing up all the sales that happened in the year 2020.
SELECT sum(price) FROM sales WHERE sales.created_at BETWEEN '2020-01-01' AND '2021-01-01';
To get more insight into it, we will run an EXPLAIN ANALYZE command. If you are not familiar with this command, I once again recommend checking the first article.
EXPLAIN ANALYZE SELECT sum(price) FROM sales WHERE sales.created_at BETWEEN '2020-01-01' AND '2021-01-01';
This was the return from the explain analyze command:
Finalize Aggregate  (cost=129807.80..129807.81 rows=1 width=32) (actual time=252.765..253.219 rows=1 loops=1)
  ->  Gather  (cost=129807.58..129807.79 rows=2 width=32) (actual time=252.699..253.213 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=128807.58..128807.59 rows=1 width=32) (actual time=238.940..238.941 rows=1 loops=3)
              ->  Parallel Seq Scan on sales  (cost=0.00..126194.19 rows=1045354 width=6) (actual time=0.086..200.745 rows=836168 loops=3)
                    Filter: ((created_at >= '2020-01-01'::date) AND (created_at <= '2021-01-01'::date))
                    Rows Removed by Filter: 2497166
Planning Time: 0.141 ms
Execution Time: 253.257 ms
Pay attention to the execution time: it is currently 253.257ms.
4. Create a partitioned table to hold the yearly reports
Now we can use range partitioning on the created_at column to create partitions from 2020 to 2024. I will divide the data into 4 yearly partitions:
CREATE TABLE sales_partitioned (
sale_id SERIAL,
product_id INT NOT NULL,
price DECIMAL(10, 2) NOT NULL,
created_at DATE NOT NULL,
PRIMARY KEY (sale_id, created_at)
) PARTITION BY RANGE (created_at);
CREATE TABLE sales_2020 PARTITION OF sales_partitioned
FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
CREATE TABLE sales_2021 PARTITION OF sales_partitioned
FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE sales_2022 PARTITION OF sales_partitioned
FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE sales_2023 PARTITION OF sales_partitioned
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
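One caveat worth knowing (an optional addition, not required for this exercise): with only these four ranges, inserting a row whose created_at falls outside 2020-2023 would fail. Since version 11, Postgres supports a DEFAULT partition to catch such rows:

-- Optional safety net: rows outside the declared ranges land here
-- instead of making the INSERT fail.
CREATE TABLE sales_other PARTITION OF sales_partitioned DEFAULT;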
5. Bring the data from the full table to the partitioned table
INSERT INTO sales_partitioned (sale_id, product_id, price, created_at)
SELECT sale_id, product_id, price, created_at FROM sales;
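A quick sanity check that the copy brought everything over: both tables should report the same number of rows.

-- Both counts should be 10,000,000 if the copy succeeded.
SELECT count(*) FROM sales;
SELECT count(*) FROM sales_partitioned;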
6. Query the partitioned table and compare performance
If we run the same query on the partitioned table:
EXPLAIN ANALYZE SELECT sum(price) FROM sales_partitioned WHERE sales_partitioned.created_at BETWEEN '2020-01-01' AND '2021-01-01';
The result from the EXPLAIN ANALYZE is:
Finalize Aggregate  (cost=71983.75..71983.76 rows=1 width=32) (actual time=174.616..175.600 rows=1 loops=1)
  ->  Gather  (cost=71983.53..71983.74 rows=2 width=32) (actual time=174.552..175.593 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=70983.53..70983.54 rows=1 width=32) (actual time=168.801..168.802 rows=1 loops=3)
              ->  Parallel Append  (cost=0.00..68371.27 rows=1044904 width=6) (actual time=0.089..135.930 rows=836168 loops=3)
                    ->  Parallel Seq Scan on sales_2021 sales_partitioned_2  (cost=0.00..31576.31 rows=2742 width=6) (actual time=0.155..138.045 rows=6832 loops=1)
                          Filter: ((created_at >= '2020-01-01'::date) AND (created_at <= '2021-01-01'::date))
                          Rows Removed by Filter: 2495297
                    ->  Parallel Seq Scan on sales_2020 sales_partitioned_1  (cost=0.00..31570.44 rows=1042162 width=6) (actual time=0.064..58.711 rows=833890 loops=3)
                          Filter: ((created_at >= '2020-01-01'::date) AND (created_at <= '2021-01-01'::date))
Planning Time: 0.258 ms
Execution Time: 175.638 ms
We can see that the time dropped from 253.257ms to 175.638ms. Remember that this is a small database, where partitioning does not make a lot of sense, and yet we can already see an improvement. Now imagine the impact on a large-scale database.
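There is one more detail visible in the plan: the sales_2021 partition was scanned as well. BETWEEN is inclusive on both ends, so the upper bound '2021-01-01' forces Postgres to also look into the 2021 partition for that single day. A half-open range keeps the query on a single partition:

-- '<' excludes 2021-01-01, so only the sales_2020 partition needs scanning.
SELECT sum(price)
FROM sales_partitioned
WHERE created_at >= '2020-01-01'
  AND created_at < '2021-01-01';

Half-open ranges also match how the partitions themselves are defined (FROM is inclusive, TO is exclusive).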
7. I do not see any difference between the execution times (edge-case)
If you followed all the steps and still see the same or very similar execution times, you might need to enable pruning. Pruning lets the DBMS examine each partition's definition (in our example, the date ranges) and scan only the partitions that can contain matching rows; without it, the DBMS iterates over every partition. To guarantee you are using it, run:
SET enable_partition_pruning = on;
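You can also confirm pruning is active by checking the setting and re-reading the plan: with pruning on, partitions outside the queried range should not appear in the plan at all (or should show up as "never executed" when pruning happens at run time).

-- Current value of the setting ('on' is the default since Postgres 11).
SHOW enable_partition_pruning;

-- With pruning active, the plan should touch only the relevant partitions.
EXPLAIN SELECT sum(price) FROM sales_partitioned
WHERE created_at >= '2020-01-01' AND created_at < '2021-01-01';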