Jauhar's Blog

Don’t Use Auto Increment ID By Default


Often, when building a software, we assign auto-increment id to our entities. Somehow, this is a very common pattern when working with relational database such as MySQL and PostgreSQL. For most of the time, it’s not a big problem, until things started to scale.

Suppose you build a website where the users can sign up to your website. If you are using relational database like MySQL or PostgreSQL, the most straightforward way to store the user information is by having a users table with auto increment id for the user id. The thing is, you actually don’t need the id to be auto incremented. Most of the time, you just want it to be unique. Sure, auto increment is technically unique, but it has a constraint where it has to be incremental, which you don’t need.

Why do we add a constraint that we don’t need in the first place? Why not just use a random number to identify the users? Adding this incremental constraint has some minor problems, such as:

  1. People can estimate the number of entities you have. If let’s say you create an order in a marketplace website and get an ID 1284, you can tell that there are roughly 1284 orders that have been created in that marketplace. This information shouldn’t be leaked to public, but since the table uses auto increment id, people can see the pattern.
  2. People can guess the id of the entity. In Youtube, there are some videos that can only accessed using a link and won’t appear in search results. This feature will be useless if Youtube used auto increment id in their URL. Let me give you and example. Suppose you have a youtube video with id 51298. Now, don’t you wonder, what is the video with id 51299? If the youtube video URL was using an auto increment id, you can guess the Youtube video link and eventually open a link that supposed to be private.
  3. It’s not globally unique. Sometimes, it is convenient to have a unique id globally. If you are using auto increment id, you will see your id appear in multiple tables. Which is not a big problem, but can be inconvenient. For example when you want to update an entity from a table but you typed the wrong table name, the operation will succeded. Whereas, if you have unique id globally, the operation will failed.

All of those three problems are minor problems that you can technically solve without changing your id. For problem 1 and 2, you can encrypt the id before you send it to public. By doing this, people won’t be able to see the actual id we have in our system. For problem number 3, I actually doubt people will update a record manually in the production database without proper review. And if they do, I think they will do it inside a transaction. So, they can rollback if they made mistake.

The thing that I find more problematic is the fact that we are tightly coupled with our database system just to generate an id. Before you insert your entity to database, you cannot know what the id will be. You can only know the id after the operation is complete and your database system tell you what was the id. And not only that, having an auto incremnent id can be hard to scale in the future.

Hard To Scale

To generate auto increment id, we usually uses a single server to perform this task. If you are using MySQL, there will be a single MySQL server that generates the auto increment id. Even if you are using master-slave pattern in your MySQL, the id will still be generated from a single instance. Now, what happen when your data get bigger? To increase the performance, you might want to partition your data into several database instances. Now, who should generates the id? If the id generation is done by a single instance, then that single instance will become a bottleneck since all operations need to go there first. And it is hard to split the id generation into multiple instance since they have to coordinate to guarantee the uniqueness and incremental property.

Tight Coupling To Databse

When using an auto generated id, usually the flow of your application goes like this:

  1. The request comes from front-end application to your application server.
  2. The application server performs some application logics.
  3. The application sends an SQL query to insert the data into the database system.
  4. The database returns the generated id of your data.
  5. The application server sends the response containing the id to the front-end.

This is not a problem for most application. But, what if we change the step 3 of our flow? Now, you are not only want to insert the data into the database, you also want to put it in the cache and update your search index. Just to make things more clear, let’s just say you want to insert the user data to MySQL databse for persistence, Memcached for caching, and Elasticsearch for searching. You want an eventual consistent result. You don’t want your data to be inserted to Memcached and Elasticsearch, but not the MySQL database.

Now, achieving eventual consistency is not super easy. If we perform all of the three inserts operation from the application server, we can reach inconsistent state if our application crash in the middle of the operation. As a result, we might not inserted the data into all three services we want. To achive eventual consistency, we can use Kafka. Instead of directly inserting all the three records from our application server, we can send it to Kafka. Then, there will be three Kafka consumers that consume our messages:

  1. One consumer will consume the message and insert it to MySQL.
  2. One consumer will consume the message and insert it to Memcached.
  3. One consumer will consume the message and insert it to Elasticsearch.

By doing this, we can achieve eventual consistency. If the application server crash, the message will be either sent or not sent. If the consumer crash, they will just reboot and restart the operation. So, eventually, we will get a consistent state even though all the parties can fail in the middle.

Now, what is the problem with this approach? Well, this approach is impossible to do since we don’t know the id of our data. We can only know the id after we insert it into MySQL. Technically, we can fix this by split the work into 2 phase. The first phase is to insert it to MySQL and the second phase is to insert it into Memcached and Elasticsearch. Well, even though you could do that, it is just not simple and add extra latency.

Then How?

Well, most of the times we only need the uniqueness property of our ids and don’t need the incremental property. What we can do is just generates the id randomly from the application server. Using UUID is one way to do this. Another way is just to use timestamp combined with big random number. Yes, technically it can collides, but it is very very very unlikely. If we do this, the application layer won’t be tightly coupled with the database layer for just generating the id.

There are a lot of problems when dealing with auto increment id. This particular problems shows two problems with auto increment id: scaling problem due to the need of synchronization and the tight coupling beteen application and database layer just to generates an id.

The thing is, we rarely need an incremental id. Most of the time, we just need the uniqueness property of an id. To achieve uniqueness, the UUID is pretty much cover that. Sometimes people do need an incremental id, but it is rare. And when they do need an incremental id, it usually doesn’t have to be strictly incremental. Often, UUID is enough to make it incrementally unique id. What bothers me is that: when using relational database, people tend to use auto increment id by default. Unlike mongodb that uses some kind of UUID. I think, people should use UUID first before they decided to use auto increment id, not the other way around.