How to optimise your SQL Database to handle a million records — part 1


Data handling can be a mess, especially when we deal with a huge amount of data. Over the years, we have realised that the bottleneck of a project is usually not in the application layer but in the database layer.

I know we can fix this by:

  • Upgrading our databases
  • Scaling our databases up to multiple read-only nodes
  • Replacing them with distributed databases (Vitess, CockroachDB, TiDB, etc.)

But are these necessary most of the time? Even GitHub handles its load with a MySQL cluster and partitioning; you can check it out here.

In my opinion, unless your traffic is growing uncontrollably (maybe a million QPS), you probably do not need the options above yet. Is there a cost-effective way to optimise your SQL database? Yes, the answer is “Avoid toxic queries”.

Having said that,

Prevention is always better than cure.

No matter how strong your database is, one toxic query can bring it to an unusable state. There are some guidelines we follow all the time to keep our database healthy and performing even better than usual, and you may refer to them as well:

Offset-based pagination is not scalable compared to cursor-based pagination; it will degrade your query performance over time, especially when the records keep increasing without housekeeping.
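A minimal sketch of the difference, assuming MySQL and a hypothetical orders table with an ever-increasing id:

-- Offset-based: the server still reads and discards the first 100000 rows before returning 20
SELECT id, total FROM orders ORDER BY id LIMIT 20 OFFSET 100000;

-- Cursor-based: the client remembers the last id it saw, so the server can seek straight to it
SELECT id, total FROM orders WHERE id > 100020 ORDER BY id LIMIT 20;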


Most systems today do not target only a single region; they may serve a global market. An auto-increment primary key can no longer fulfil our needs, because its values are generated by a single database instance and cannot be reused or merged across regions. To be future-proof, I would suggest going for a UUID or a timestamp-based id.
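A minimal sketch, assuming MySQL 8+ and a hypothetical orders table; the UUID is stored in a compact BINARY(16) column instead of relying on AUTO_INCREMENT:

CREATE TABLE orders (
    id         BINARY(16) NOT NULL PRIMARY KEY,
    created_at TIMESTAMP  NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- UUID_TO_BIN(uuid, 1) moves the timestamp bits to the front so inserts stay roughly sequential
INSERT INTO orders (id) VALUES (UUID_TO_BIN(UUID(), 1));

SELECT BIN_TO_UUID(id, 1) AS id, created_at FROM orders;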

Joining is bad, especially without a data range (record limit). As your records increase, the join operation takes extra milliseconds. We never join our records; we push the operation to the application layer and utilise goroutines, as sketched below.
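A minimal sketch with hypothetical orders and users tables: instead of one unbounded JOIN, run two small indexed queries and merge the rows in the application layer (concurrently, if you use goroutines):

-- Instead of: SELECT o.id, o.total, u.name FROM orders o JOIN users u ON u.id = o.user_id;

-- Query 1: fetch a bounded page of orders
SELECT id, user_id, total FROM orders WHERE created_at >= '2024-01-01' ORDER BY id LIMIT 100;

-- Query 2: fetch only the users referenced by that page, then merge the results in code
SELECT id, name FROM users WHERE id IN (/* user_id values collected from query 1 */);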

Yes, SQL locking is a good mechanism to ensure data consistency. But in high-throughput scenarios, the database starts to take a performance hit when there are many concurrent locks. We might consider replacing it with a conditional update or a Redis distributed lock.
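A minimal sketch of a conditional update, with a hypothetical orders table: instead of locking the row up front, let the WHERE clause guard the state transition and check the affected-row count in the application:

UPDATE orders
SET    status = 'PAID', paid_at = NOW()
WHERE  id = 42 AND status = 'PENDING';
-- 1 row affected  => this request won the race and may proceed
-- 0 rows affected => another request already changed the order, so back off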

SUM, AVG, COUNT, or any similar aggregate function is very toxic in SQL databases; you need special treatment for these operations. Try to handle them with your Elasticsearch or wide-column database instead.
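A hypothetical example of the problem: the query below has to scan every matching row on each request, so its cost grows with the table. Serving such figures from a search or wide-column store keeps that load off the SQL database:

-- Re-counts all paid orders on every call
SELECT COUNT(*) FROM orders WHERE status = 'PAID';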

The query below is pretty bad when you have a million rows of records in table A. Try to replace it with a generated column (precompute the value); otherwise your SQL database will execute the SUBSTR function a million times at runtime.

SELECT SUBSTR(ColumnA, 1, 3) FROM TableA;
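A minimal sketch of the precomputed version, assuming MySQL 5.7+ (the ColumnA_prefix name is hypothetical): a stored generated column is computed once on write, so reads no longer call SUBSTR at all:

ALTER TABLE TableA
    ADD COLUMN ColumnA_prefix VARCHAR(3)
    GENERATED ALWAYS AS (SUBSTR(ColumnA, 1, 3)) STORED;

SELECT ColumnA_prefix FROM TableA;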

OK, a lot of you may ask: if we are not going to use this and are going to skip that, why do SQL databases offer these operations natively? Yes, they offer them, but you need to use them wisely based on your use cases, not simply because you can.

For more techniques, you may follow part 2.

To be continued…
