
Tips to improve database performance (In case you do not even know where to start) — Part 1

Learn practical techniques to enhance database performance from a software engineer's real industry experience.

JAN. 2, 2025
3 Min Read
I have been a software engineer for almost nine years now, and one of the challenges I have faced at some point in all seven companies I have worked for is how to improve database performance.
At the beginning of my career, I did not pay much attention to it, but in my fourth year of software development, I joined an EdTech company. It runs one of Brazil's biggest and most used educational systems, and I joined during the pandemic. Classes were happening online, and the system was used more and more as time passed. It was a legacy system with badly structured tables and confusing relationships between them, the kind of environment that forces you to learn while solving the challenges around it. And that is what happened.
During that time, I had the chance to work with the Database Administrator, and he shared a lot of knowledge with me that is quite simple to apply. Yet every company I joined afterwards either was not applying it or had people who did not know how to approach these kinds of database performance problems. My aim with this article is to share the simple things I have been applying since then, which gave me and these companies great results in database performance.

Indexation

It is impossible to talk about database performance without mentioning indexes. An index is a database structure that works like a pointer, speeding up data retrieval. To make an analogy, imagine you are reading a 2,500-page book with no index page; you would have to go page by page until you finally found what you were looking for. A database index works the same way: it gives your database a fast way of querying this book, which in this case is your data. Now, there are several kinds of indexes you can create, and the right one varies based on the situation you are facing, but I will mention the two most used ones:
  1. B-Tree Index: A B-Tree Index is built on a balanced tree structure (despite the name, not a binary tree: each node can hold many keys), which means its lookup complexity in Big O notation is O(log n), a pretty fast way of querying data. B-tree indexes are perfect for range comparisons and good for equality comparisons, such as BETWEEN and IN clauses, <, >, <=, >=, =, and so on. In relational databases, they are usually the most used ones.
  2. Hash Index: Hash indexes use hash tables as their base structure, which makes lookups even faster (O(1), effectively constant-time access). So why not only use hash indexes, then? Because they are made specifically for direct comparisons, which means they will not be used on range searches (BETWEEN, IN, <, >, >=, <=...), and they also do not help with ORDER BY operations, since a hash table does not preserve any ordering.
These are the two main indexes, but it is important to mention that more indexes are available, and each DBMS might have some additional indexes that can be useful in specific situations (e.g., GIN and GiST in Postgres https://www.postgresql.org/docs/current/textsearch-indexes.html).
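To make this concrete, here is a minimal sketch of how the two types are declared, assuming Postgres (we will create a "real" index in the next section; the table and column names here are hypothetical):
-- Hypothetical table used only for illustration.
CREATE TABLE sessions (
    id SERIAL PRIMARY KEY,
    token TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL
);

-- B-tree is the default index type: good for =, <, >, BETWEEN, and ORDER BY.
CREATE INDEX idx_sessions_created_at ON sessions (created_at);

-- Hash index: helps only equality lookups, e.g. WHERE token = '...'.
CREATE INDEX idx_sessions_token ON sessions USING HASH (token);
A query filtering by token = 'abc' can use the hash index, while a range filter on created_at has to rely on the B-tree.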

Creating your first index

Ok, now that we know a little about indexation, it is time to create our first index. But we need to be careful about which columns we choose when creating an index. Even though indexation is a good thing, it comes with some costs as well:
  1. UPDATEs, DELETEs, and INSERTs will become slower. All three operations change the data directly, so the index needs to be updated too (e.g., if you add a new row to a customers table, your B-tree index needs a new node for this new customer; consequently, the tree may need to be rebalanced, and that takes time).
  2. Indexes occupy disk space (see the sketch just below this list).
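The disk-space cost is easy to check for yourself. Assuming Postgres, you can ask for an index's size directly (the index and table names here are hypothetical):
-- Shows how much disk space a single index occupies.
SELECT pg_size_pretty(pg_relation_size('idx_user_id'));

-- Or list every index on a table together with its size.
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE relname = 'users_departments';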
Then, what can we consider good columns to index? It varies case by case, but these are great candidates:
  1. Columns used in WHERE clauses: as mentioned above, indexes help with comparisons, and that is pretty much what we do in WHERE clauses.
  2. Columns used in JOIN clauses: while WHERE clauses are constantly considered for indexation, we sometimes forget to look at the JOIN clauses of our queries. If your queries are slow, you might want to check whether the JOIN columns are indexed as well.
  3. Columns used in GROUP BY and ORDER BY clauses: most applications suffer from slow sorting on created_at/updated_at columns. Creating an index for those columns can be useful in some cases.
  4. Columns with UNIQUE constraints
Here are some examples of indexation:
SELECT department_id FROM users_departments WHERE user_id = 1;
This is a good example of a non-indexed field being used in a WHERE clause. A simple query like this can be slowed down if the foreign key user_id has no index. The reason is that, without an index on user_id, the DBMS will perform a full table scan. Now, if you have some experience with databases, you might think that all foreign keys in a table are automatically indexed. That might be true for some DBMSs, but for Postgres, it is false. This is another important thing to consider: you need to understand the DBMS you are using. In Postgres, foreign keys are constraints but are not indexed by default. If you want an index on one, you need to create it manually. You can achieve that by doing this:
CREATE INDEX idx_user_id ON users_departments(user_id);
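To confirm the index is actually being picked up, you can compare the query plan before and after creating it (we will cover EXPLAIN in detail later in this article). A minimal sketch, assuming Postgres; the plan lines in the comments are illustrative, and the exact costs will differ on your data:
EXPLAIN SELECT department_id FROM users_departments WHERE user_id = 1;
-- Before the index, the plan shows a sequential (full table) scan:
--   Seq Scan on users_departments  (cost=0.00..431.00 rows=5 width=4)
--     Filter: (user_id = 1)
-- After creating idx_user_id, the same EXPLAIN shows an index scan:
--   Index Scan using idx_user_id on users_departments  (cost=0.29..8.36 rows=5 width=4)
--     Index Cond: (user_id = 1)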
But, of course, this is a very simple example with a single column. Indexes can be created using multiple columns. Consider this scenario:
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INT,
    status TEXT,
    order_date DATE
);
Supposing we are always filtering orders by customer_id and status together, instead of creating separate indexes for them, we can create a composite index that handles both at the same time. We can achieve it by doing this:
CREATE INDEX idx_customer_status ON orders(customer_id, status);
In this case, it will be more effective than having a single index for each field. It is important to know that the order of the fields matters: the index is built from left to right. Here, the index is very useful for queries filtered by customer_id alone or by customer_id and status together, but it will not really help when filtering only by status. The reason is that, within the index, status is ordered relative to customer_id, so the index can only be used when the leftmost column is part of the filter. So, if your application is constantly filtering by status, and only sometimes by status and customer_id, it would be worth considering the opposite order, like this:
CREATE INDEX idx_status_customer ON orders(status, customer_id);
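To make the left-to-right rule concrete, here is a quick sketch of which filters each composite index can serve (the queries themselves are hypothetical):
-- Served by orders(customer_id, status):
SELECT * FROM orders WHERE customer_id = 42;                      -- leftmost column alone
SELECT * FROM orders WHERE customer_id = 42 AND status = 'paid';  -- full prefix

-- Served by orders(status, customer_id):
SELECT * FROM orders WHERE status = 'paid';                       -- leftmost column alone
SELECT * FROM orders WHERE status = 'paid' AND customer_id = 42;  -- full prefix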
I have shown a couple of examples with WHERE clauses, but it is important to remember that these indexes are useful for GROUP BY and ORDER BY clauses, too.
Alright, but sometimes indexation alone does not do the job. And we do not want to spend all our time combing through every query in our system, checking each column in a WHERE, GROUP BY, ORDER BY, or JOIN to see if any of them is not indexed yet. In real scenarios, when a query runs slow, we usually want to look at its execution plan and analyze it using the EXPLAIN command.

EXPLAIN

This command returns how your query is being executed, and its output shows where the bottlenecks are, along with useful information about the query's execution plan. It returns a lot of information, but as this might be your first time running it, let's focus on some important things you might want to look for:
  1. Nested Loops
  2. Access Types
How do you run it? It is really simple; here is an example:
EXPLAIN SELECT * FROM employees WHERE department_id = 10;
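Depending on your DBMS, there are useful variations. In MySQL and MariaDB, EXPLAIN FORMAT=JSON produces the structured output you will see in the example below; in Postgres, EXPLAIN ANALYZE actually executes the query and reports real row counts and timings:
EXPLAIN FORMAT=JSON SELECT * FROM employees WHERE department_id = 10;

-- Postgres equivalent (runs the query and shows actual timings):
EXPLAIN ANALYZE SELECT * FROM employees WHERE department_id = 10;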
To illustrate its usage, I will take a real example I faced at my current job four weeks ago and walk through what we did to solve the problem using the two things mentioned above.

Real-life scenario

We have two different applications, one for admin and another one for regular users and customers. Inside the admin panel, one of our queries was consuming too much time to execute, leading to timeouts and high CPU usage. The first thing we did was take the whole query and run an EXPLAIN command on it, which returned something like this:
{
  "query_block": {
    "select_id": 1,
    "const_condition": "1",
    "filesort": {
      "sort_key": "maskedTableName.`id` desc",
      "temporary_table": {
        "table": {
          "table_name": "maskedTable",
          "access_type": "const",
          "possible_keys": ["PRIMARY"],
          "key": "PRIMARY",
          "key_length": "162",
          "used_key_parts": ["id"],
          "ref": ["const"],
          "rows": 1,
          "filtered": 100
        },

        ... // A LOT MORE INFORMATION THAT WE CAN IGNORE

        "table": {
          "table_name": "maskedTableName2",
          "access_type": "const",
          "possible_keys": [
            "masked_field_key_foreign",
            "masked_field_key_perf"
          ],
          "rows": 1,
          "filtered": 100
        },
        "table": {
          "table_name": "maskedTableName3",
          "access_type": "ref",
          "possible_keys": ["masked_field_key_id_foreign"],
          "key": "masked_field_key_id_foreign",
          "key_length": "162",
          "used_key_parts": ["masked_field_id"],
          "ref": ["const"],
          "rows": 1,
          "filtered": 100
        },
        "block-nl-join": {
          "table": {
            "table_name": "maskedTableName11",
            "access_type": "ALL",
            "rows": 13850,
            "filtered": 100
          },
          "buffer_type": "flat",
          "buffer_size": "256Kb",
          "join_type": "BNL",
          "attached_condition": "trigcond(trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0)) and trigcond(0)"
        },
        "subqueries": [
          {
            "query_block": {
              "select_id": 4,
              "table": {
                "table_name": "maskedTableName12",
                "access_type": "ref",
                "possible_keys": [
                  "maskedTableName12_id_forein",
                  "maskedTableName12_idx"
                ],
                "key": "maskedTableName12_id_forein",
                "key_length": "163",
                "used_key_parts": ["masked_field_id"],
                "ref": ["const"],
                "rows": 1,
                "filtered": 100,
                "index_condition": "maskedTableName12_id_forein.field_id = 'dbde90fecec23126'",
                "attached_condition": "maskedTableName12_id_forein.`event` = 'event.canceled'"
              }
            }
          },
          {
            "query_block": {
              "select_id": 3,
              "table": {
                "table_name": "maskedTable55",
                "access_type": "ALL",
                "rows": 193,
                "filtered": 100,
                "attached_condition": "'312890132' = tablemasked.provider_id and tablemasked.deleted_at is null"
              }
            }
          }
        ]
      }
    }
  }
}
Our first action with this output was to look at every access type returned and find the ones that were ALL instead of const. There are a few different access types, and they indicate how the data is being joined and queried. const means the DBMS can resolve the row in constant time (for example, through a primary key lookup), which is a good thing. ALL means the DBMS is doing a full table scan, which is, most of the time, what causes slowness in these queries.
The access types are system, const, eq_ref, ref, fulltext, ref_or_null, index_merge, unique_subquery, index_subquery, range, index, and ALL.
If you want details about each of them, I highly recommend taking a look at the documentation: https://dev.mysql.com/doc/refman/8.4/en/explain-output.html. For this article, we will keep it simple and restrict ourselves to const and ALL.
Well, if we look at one of the first items returned by the EXPLAIN, the maskedTable is not the problem. Take a look at its return:
"table": {
    "table_name": "maskedTable",
    "access_type": "const",
    "possible_keys": ["PRIMARY"],
    "key": "PRIMARY",
    "key_length": "162",
    "used_key_parts": ["id"],
    "ref": ["const"],
    "rows": 1,
    "filtered": 100
  },
The access type is const; the lookup is handled by the indexed PRIMARY KEY, and a single row is returned. The problem does not reside there. Moving on to maskedTableName2 and maskedTableName3, they have the same characteristics, so they are not candidates to worry about. But when we move to the next one, we see this:
"block-nl-join": {
  "table": {
    "table_name": "maskedTableName11",
    "access_type": "ALL",
    "rows": 13850,
    "filtered": 100
  },
  "buffer_type": "flat",
  "buffer_size": "256Kb",
  "join_type": "BNL",
  "attached_condition": "trigcond(trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0) and trigcond(0)) and trigcond(0)"
},
In the access type, we have ALL, which means that joining this table requires a full table scan (reading the whole book to get the information). When we check the number of rows, we see 13850, and the block-nl-join entry tells us a Block Nested Loop join is happening. Nested loops are something we need to pay attention to: they are the equivalent of a for loop inside another for loop, the kind of operation we know to avoid. That makes this a great candidate for us to work on. The last one we can make changes to is this one:
"query_block": {
    "select_id": 3,
    "table": {
      "table_name": "maskedTable55",
      "access_type": "ALL",
      "rows": 193,
      "filtered": 100,
      "attached_condition": "'312890132' = tablemasked.provider_id and tablemasked.deleted_at is null"
    }
  }
But if we check the number of rows, this one is not as bad as the first. So our priority should be to handle maskedTableName11 first, and that is what we did. In that case, we saw that the problem resided in the way we were calling our ORM: it was doing a LEFT JOIN on a false condition, returning the wrong data all the time and slowing down the query. After fixing it, the query was faster, and we also added proper indexes, as shown above. Then, for maskedTable55, the problem was just missing indexation. We added the index, and the query sped up again. In this real scenario, we went from 40s to 2.5s in production.
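Since the real table and column names are masked, here is only a hedged sketch of what the maskedTable55 fix looked like. The slow subquery filtered on provider_id and deleted_at IS NULL, so an index covering both columns removes the full table scan:
-- Illustrative: the actual names in production were different.
CREATE INDEX idx_maskedtable55_provider
    ON maskedTable55 (provider_id, deleted_at);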
This is just a simple example of how you could investigate and speed up queries in a real-life scenario. Of course, this is not how you will approach every single situation, and indexation is not the only thing that will improve performance in your project. This is just the first part of a series of articles I intend to write about this subject, but I hope it helps you at least start taking a look at it.

Additional methods that can improve database performance

  1. Partitioning
  2. Normalization/Denormalization
  3. Caching
  4. Index Rebuilding
  5. Replication
  6. Sharding
Thanks for reading this article. The next parts will be focused on different topics to help you improve your database performance.
This article was originally written by Luã Enrique Zangrande and published on Medium.