
Tips to Improve Database Performance (In case you do not even know where to start) — Part 2 (Partitioning)

Luã Enrique Zangrade teaches us more practical techniques to enhance database performance.

JAN. 21, 2025
3 Min Read
This is the second part of a series of articles I am writing about improving database performance. If you have just landed on this article and do not know about indexation, I recommend you take a look at the first part. Indexation and Partitioning are not tightly connected, but you should know that Partitioning might not be the first (maybe not even the second) technique you want to apply to solve your database performance issues, so give indexation a try first. But do not worry, I will cover scenarios where Partitioning can be useful.

Partitioning data using the country as a partition key.

Important Things Before Reading It

  1. There are practical examples in this article to help you better understand what Partitioning is. I am using Postgres for all of them.
  2. As aforementioned, Partitioning might not be the solution for your database performance problems. This is a technique used for large-scale databases.
  3. All the database performance techniques have their Pros and Cons. You need to understand them to check which suits best your scenario.
  4. I am not talking about Sharding here. Sharding is sometimes also referred to as Horizontal Partitioning, but the Partitioning I am talking about divides tables within the same machine; I am not working with distributed compute nodes. So everything explained in this article happens on a single machine, which means we are talking about vertical-scaling concepts. The next article will cover Sharding specifically.

What is Partitioning?

The name itself is pretty intuitive; it is the idea of breaking something into parts. In this case, we are breaking database rows into different tables/partitions (horizontal partitioning — Not Sharding) OR database columns into different tables (vertical partitioning) based on a partition key (a date, country, or any other information that helps us to split data). These partitions have the same data representation; they are equal tables, but they represent different portions of your data. Take a look at the example below:
Table with sales from various years
If we take a look at the table above, we have more than 10,000,000 sales in a 6-year range (from 2019 to 2024). Let's suppose that the system you are working on has a functionality that generates a yearly report of sales. In this case, you would have to deal with 10,000,000 rows to generate a report for any of the years, even if a single year does not reach 10,000,000 rows. If you experience slowness while doing this, partitioning might be helpful in this case. You could take this table and break it into partitions on a year range, in other words, you will divide your sales table into six partitions (2019, 2020, 2021, 2022, 2023, and 2024), and now, when you get to the point of generating this yearly report, you will only query the specific year's partition and not all the 10,000,000 rows anymore.
Important note: there is no hard rule for when to start using partitioning, but you will find some readings suggesting it is worth considering once a table grows beyond 100GiB (which is not the case in our 10,000,000-row, 4-column scenario; this is just for explanation and testing purposes).
So, our partitions would look like this:
Partition 1 (data from 2019)
Partition 2 (data from 2020)
Are date ranges the only way of creating partitions? The answer is no. There are a few different ways of partitioning data:
  1. Range Partitioning: This is the one presented above. One important thing to note is that it is not limited to date columns; we could also use it on ID columns or any other column that makes sense in your context. Check the example below, where I am creating a partitioned table using the created_at field, with a partition covering the range from 2020-01-01 (inclusive) to 2021-01-01 (exclusive).
CREATE TABLE sales (
    sale_id INT,
    product_id INT,
    created_at DATE,
    PRIMARY KEY(sale_id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE sales_2020 PARTITION OF sales
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
Note: Postgres requires any primary key (or unique constraint) on a partitioned table to include all partitioning columns, which is why the key above is the composite (sale_id, created_at). Keep in mind this only guarantees uniqueness of the pair, not of sale_id alone across partitions. If you leave the partition key out of the primary key, you will get this error in Postgres:
Error SQL [0A000]: ERROR: unique constraint on partitioned table must include all partitioning columns
Detail: PRIMARY KEY constraint on table "sales" lacks column "created_at" which is part of the partition key.
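With the partition in place, you insert through the parent table and Postgres routes each row to the matching partition automatically. A quick sketch (the values are just illustrative):

INSERT INTO sales (sale_id, product_id, created_at)
VALUES (1, 100, '2020-06-15');

-- The row is physically stored in sales_2020; tableoid reveals the partition:
SELECT tableoid::regclass AS partition, sale_id FROM sales;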
2. Hash Partitioning: Divides rows across partitions like a hash table structure, based on the hash of a column. We specify a MODULUS and a REMAINDER for each partition: a row goes to the partition where the hash of sale_id, taken modulo the MODULUS, equals that partition's REMAINDER. For intuition, imagine the hash function were the identity: a sale_id of 4 would land in sales_part1 because 4 % 4 = 0, while a sale_id of 5 would land in sales_part2 because 5 % 4 = 1. (In practice, Postgres hashes the value first, so you cannot predict the target partition from the raw ID.)
CREATE TABLE sales (
    sale_id INT,
    product_id INT
) PARTITION BY HASH (sale_id);

CREATE TABLE sales_part1 PARTITION OF sales FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sales_part2 PARTITION OF sales FOR VALUES WITH (MODULUS 4, REMAINDER 1);
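One caveat: with MODULUS 4 you need a partition for every remainder (0 through 3), otherwise inserts whose hash maps to a missing remainder will fail. Completing the set would look like this:

CREATE TABLE sales_part3 PARTITION OF sales FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sales_part4 PARTITION OF sales FOR VALUES WITH (MODULUS 4, REMAINDER 3);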
3. List Partitioning: It creates the partitions based on a list of specified values. It could be, for example, dividing the sales based on a list of countries.
CREATE TABLE sales (
    sale_id INT,
    product_id INT,
    country VARCHAR(50)
) PARTITION BY LIST (country);

CREATE TABLE sales_usa PARTITION OF sales FOR VALUES IN ('USA');
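We could keep adding one partition per country, and, since Postgres 11, a DEFAULT partition can catch rows whose value does not match any list. For example:

CREATE TABLE sales_brazil PARTITION OF sales FOR VALUES IN ('Brazil');
CREATE TABLE sales_other PARTITION OF sales DEFAULT;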
4. Composite Partitioning (sub-partitioning): It creates partitions based on more than one column. We could extend the example above and sub-partition the USA sales by year. Note that the parent table would also need a date column (say, sale_date), and that Postgres partitions on the raw column rather than on a YEAR() expression:
CREATE TABLE sales_usa PARTITION OF sales FOR VALUES IN ('USA')
    PARTITION BY RANGE (sale_date);

CREATE TABLE sales_usa_2023 PARTITION OF sales_usa FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE sales_usa_2024 PARTITION OF sales_usa FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

Pros and Cons

Well, as you can imagine, breaking the data into different tables brings several benefits for scaling and query speed. Let's explore them a little better:
  1. Faster query results: as your data is now spread across different tables, queries targeting a specific range or a specific list (such as the country example above) will be faster because they touch less data. We will see this in practice in the next section.
  2. Scalability: Now that you have broken your tables into different partitions, it is easier to scale them, as you are no longer moving around giant pieces of data but instead reduced versions of it.
  3. Backup is easier: You can choose specific partitions to back up and discard those that will not be useful anymore.
  4. Indexes are faster: Now, the indexes operate in individual partitions, which makes them faster for being updated and queried.
  5. Archiving/Removing old data: You can drop an entire partition if you do not want your sales from 2005 to be stored anymore, and dropping a partition is way faster than running a DELETE against a huge table.
But of course, it is not simple to set up a good partitioning technique, so some cons come with it, such as:
  1. It is not always easy to choose the right partitioning strategy: your data might not split well this way, or you may not have clear ranges to work with.
  2. Managing partitions: One of the advantages of this technique is having good control over partitions of data. However, if you do not manage them, you will keep storing old data forever.
  3. Unbalanced partitions: If you choose the wrong partitioning strategy, your partitions will be unbalanced, and some of them may still be a performance problem.
  4. Cross-partition queries: Queries that span multiple partitions can become slower, because the DBMS now needs to run the operation on each partition and merge the results at the end.
Looking at these examples is interesting for theoretical learning, but now let's get a little hands-on.

Practical Exercise—Your first partitioning

I am currently using a MacBook Pro with only 20GiB of free disk, so I am not able to create a 100GiB+ scenario. Instead, I will build a smaller one just to explore the impact of Partitioning and how to approach it. As mentioned before, Partitioning might not be a good strategy when working with little data; consider other strategies first, including proper indexation.
  1. Creating our sales table
The first thing we will do for this exercise is create the table we saw in the examples. I will use Postgres for all the steps here because it supports declarative partitioning out of the box. We will create our table like this:
CREATE TABLE sales (
    sale_id SERIAL PRIMARY KEY,
    product_id INT NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    created_at DATE NOT NULL
);
This will create a regular table, not a partitioned table. But we will use both to compare the query performance on them.
2. Insert some data into it
Let's write a script to insert 10,000,000 rows into our database. It should result in around 4.5GiB of data.

DO $$
DECLARE
    batch_size INT := 100000;  -- Number of rows per batch
    num_batches INT := 100;    -- Total number of batches (100 batches * 100,000 rows = 10,000,000 rows of sales)
    i INT;
BEGIN
    FOR i IN 1..num_batches LOOP
        INSERT INTO sales (product_id, price, created_at)
        SELECT
            (random() * 1000) * i * (random() * 100),    -- Arbitrary spread of product IDs
            (random() * 1000)::NUMERIC(10, 2),           -- Random price
            DATE '2020-01-01' + (random() * 1460)::INT   -- Random date between 2020-01-01 and 2023-12-31 (4 years)
        FROM generate_series(1, batch_size);
    END LOOP;
END $$;
3. Query your brand-new table
We will start by summing up all the sales that happened in the year 2020.
select sum(price) from sales where sales.created_at between '2020-01-01' and '2021-01-01'
To get more insights into it, we will run an EXPLAIN ANALYZE command. If you are not used to this command, I recommend you once again to check the first article.
EXPLAIN ANALYZE select sum(price) from sales where sales.created_at between '2020-01-01' and '2021-01-01'
This was the return from the explain analyze command:
Finalize Aggregate  (cost=129807.80..129807.81 rows=1 width=32) (actual time=252.765..253.219 rows=1 loops=1)
  ->  Gather  (cost=129807.58..129807.79 rows=2 width=32) (actual time=252.699..253.213 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=128807.58..128807.59 rows=1 width=32) (actual time=238.940..238.941 rows=1 loops=3)
              ->  Parallel Seq Scan on sales  (cost=0.00..126194.19 rows=1045354 width=6) (actual time=0.086..200.745 rows=836168 loops=3)
                    Filter: ((created_at >= '2020-01-01'::date) AND (created_at <= '2021-01-01'::date))
                    Rows Removed by Filter: 2497166
Planning Time: 0.141 ms
Execution Time: 253.257 ms
We can pay attention to the execution time; it is currently 253.257ms.
4. Create a partitioned table to hold the yearly reports
Now, we can use range partitioning on the created_at column to create yearly partitions from 2020 to 2023. I will divide it into 4 partitions:
CREATE TABLE sales_partitioned (
    sale_id SERIAL,
    product_id INT NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    created_at DATE NOT null,
    PRIMARY KEY (sale_id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE sales_2020 PARTITION OF sales_partitioned
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

CREATE TABLE sales_2021 PARTITION OF sales_partitioned
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

CREATE TABLE sales_2022 PARTITION OF sales_partitioned
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');

CREATE TABLE sales_2023 PARTITION OF sales_partitioned
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
5. Bring the data from the full table to the partitioned table
INSERT INTO sales_partitioned (sale_id, product_id, price, created_at)
SELECT sale_id, product_id, price, created_at FROM sales;
6. Query the partitioned table and compare performance
If we run the same query on the partitioned table:
EXPLAIN ANALYZE select sum(price) from sales_partitioned where sales_partitioned.created_at between '2020-01-01' and '2021-01-01'
The result from the EXPLAIN ANALYZE is:
Finalize Aggregate  (cost=71983.75..71983.76 rows=1 width=32) (actual time=174.616..175.600 rows=1 loops=1)
  ->  Gather  (cost=71983.53..71983.74 rows=2 width=32) (actual time=174.552..175.593 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=70983.53..70983.54 rows=1 width=32) (actual time=168.801..168.802 rows=1 loops=3)
              ->  Parallel Append  (cost=0.00..68371.27 rows=1044904 width=6) (actual time=0.089..135.930 rows=836168 loops=3)
                    ->  Parallel Seq Scan on sales_2021 sales_partitioned_2  (cost=0.00..31576.31 rows=2742 width=6) (actual time=0.155..138.045 rows=6832 loops=1)
                          Filter: ((created_at >= '2020-01-01'::date) AND (created_at <= '2021-01-01'::date))
                          Rows Removed by Filter: 2495297
                    ->  Parallel Seq Scan on sales_2020 sales_partitioned_1  (cost=0.00..31570.44 rows=1042162 width=6) (actual time=0.064..58.711 rows=833890 loops=3)
                          Filter: ((created_at >= '2020-01-01'::date) AND (created_at <= '2021-01-01'::date))
Planning Time: 0.258 ms
Execution Time: 175.638 ms
We can see that the time dropped from 253.257ms to 175.638ms. One detail worth noticing in the plan: because BETWEEN includes both endpoints, rows dated exactly 2021-01-01 live in sales_2021, so the planner had to scan that partition too. Now, remember that this is a small database, and partitioning does not make a lot of sense in these cases. However, we can already see that there is an improvement in doing it. Think about the impact on a large-scale database.
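A side note: since BETWEEN is inclusive on both ends, the query above also matches rows dated exactly 2021-01-01, which belong to the sales_2021 partition. Rewriting it as a half-open range should let the planner prune down to a single partition:

EXPLAIN ANALYZE
SELECT sum(price) FROM sales_partitioned
WHERE created_at >= '2020-01-01' AND created_at < '2021-01-01';
-- The plan should now reference only the sales_2020 partition.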
7. I do not see any difference between the execution times (edge-case)
If you followed all the steps and still see the same or very close execution times, check that partition pruning is enabled. Pruning lets the planner examine the definition of each partition (in our example, the date ranges) and scan only the partitions that can contain matching data; without it, the DBMS iterates over every partition. It is on by default in modern Postgres versions, but to make sure you are using it, run:
SET enable_partition_pruning = on;

Partitioning maintenance

One of the advantages of partitioning data is having good control over your partitions. For example, it allows you to delete old records that will never be used again, or at least detach them from the parent table.
DROP TABLE sales_2021;
The command above is straightforward: it drops the sales_2021 partition and every row in it, removing all the sales from 2021 at once, far faster than a row-by-row DELETE.
ALTER TABLE sales_partitioned DETACH PARTITION sales_2021;
The code above would detach that partition from the sales_partitioned table, but we would still be able to perform some operations before dropping it.
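A detached partition becomes a regular standalone table: you can still query it, dump it, or archive it. If needed, you can also re-attach it later:

ALTER TABLE sales_partitioned ATTACH PARTITION sales_2021
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');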

Another scenario

Let's take the EdTech scenario from the last article as an example again: you work for a company that stores the daily educational data of students who study online. To obtain the final results of a specific student, we do not need data from past years, because schools usually compute students' grades from the current year, or even per two-month period. Now, imagine this system is only 5 years old and has 500,000 students. Let's suppose each student answered at least 20 tests per period, 4 periods a year, during all these years. We would have something like this:
500,000 students * 20 tests * 4 periods per year * 5 years = 200,000,000 rows
Now imagine all the schools are trying to get students' grades at the end of the year, and they are all querying against these 200 million rows, that would be pretty slow at some point.
To solve it, we could create partitions based on the date range of the answers for each test and divide them into partitions that make sense to the business (bimonthly might be a good solution).
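A minimal sketch of what that could look like, assuming a hypothetical test_answers table partitioned by the answer date (the table and column names here are made up for illustration):

CREATE TABLE test_answers (
    answer_id BIGSERIAL,
    student_id INT NOT NULL,
    test_id INT NOT NULL,
    grade DECIMAL(5, 2),
    answered_at DATE NOT NULL,
    PRIMARY KEY (answer_id, answered_at)
) PARTITION BY RANGE (answered_at);

-- One partition per two-month period, e.g. the first period of 2025:
CREATE TABLE test_answers_2025_p1 PARTITION OF test_answers
    FOR VALUES FROM ('2025-01-01') TO ('2025-03-01');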

Conclusion

Now you know more about Database Partitioning, and you know that there are two different types of it:
  1. Horizontal: break rows.
  2. Vertical: break columns.
You also know that there are different types of partitioning: range, list, hash… And that each type will be chosen based on your context.
You will use partitioning when:
  1. You have a large table and fields that allow a good distribution when divided into parts.
  2. You need to constantly delete/archive old data.
  3. Your queries can be sped up by partition pruning and by parallel processing across partitions.
You will avoid partitioning when:
  1. Your dataset is small, and the complexity of implementing partitioning outweighs its benefits.
  2. Your data does not work well with partitions, and they are unbalanced.
Thanks for reading one more part of this series of articles about Database Performance! I hope it helped you to get a better understanding of what Partitioning is.

This article was originally written by Luã Enrique Zangrande and published on Medium.