Jauhar's Blog


Intro To High Performance Golang: Redis Clone

2025/02/09

I will show you how to improve your Golang application performance by building Redis from scratch. I use Redis as the case study since it’s well-known, fairly easy to implement, and quite challenging to make perform better. I did a live stream demonstrating all of this a while back, but it was in Bahasa Indonesia. This blog post aims to make the content easier to consume and reach a wider audience by presenting it in a written format and in English. All the code presented here is available, and feel free to ask questions by opening an issue: goredis.

Note that this article is just an introduction, and I have made some simplifications to make it more consumable. There are some important aspects that I have ignored, such as graceful shutdown, handling lost connections, and more realistic benchmarks. Running it on different machines will also affect the results. I wrote this article with more emphasis on the experience of improving Golang’s performance rather than mimicking what typically happens in a real production environment. However, I hope you can still benefit from the techniques presented in this article.

Redis

Redis is an in-memory key-value database. A key-value database is very simple. At its core, there are two kinds of queries: SET and GET. You can SET a key to a certain value, and you can GET a key to return its value. Of course, in practice, we need more capabilities than just SET and GET, such as delete, timeout, scan, etc. But for simplicity, I’m going to focus on GET and SET. Redis is a very popular key-value database, and its protocol is very simple.

Redis has a reputation for being very fast because it stores all of its data in main memory. While accessing main memory is indeed fast, it’s also quite expensive, which is why we often use Redis as a cache, storing only the hot data for fast access while using other databases like MySQL or PostgreSQL to store the actual data. Apart from that, Redis also has other use cases, such as queuing systems and cardinality counting.

To interact with Redis, we can use commands. Redis provides a bunch of commands, but we will only focus on GET and SET. We will implement these two commands and improve their performance. If you look at the SET command documentation, there are actually quite a lot of options available. But for simplicity, we won’t implement all of those options and will only focus on using SET to insert a key-value pair.

To get Redis, you can clone the Redis repository and build it from source using make:

git clone https://github.com/redis/redis.git --depth=1 --branch 7.4.2
cd redis
make -j 24

Once successful, you will get a bunch of executables. We will focus on redis-server and redis-benchmark. These binaries are located in the ./src folder after you successfully build Redis. You can use redis-server to run the Redis server and redis-benchmark to benchmark your Redis server.

You can start the Redis server by just running the redis-server executable. To interact with the Redis server, you can use redis-cli, which is also compiled when you run make.

# you can specify the port using --port flag
❯ ./src/redis-server --port 6380
...
1038846:M 23 Jan 2025 21:35:57.171 * Server initialized
1038846:M 23 Jan 2025 21:35:57.172 * Ready to accept connections tcp

Then you can use redis-cli to interact with the Redis server:

❯ ./src/redis-cli -p 6380 SET the_key_1 "hello world"
OK
❯ ./src/redis-cli -p 6380 GET the_key_1
"hello world"

Now, this is just one way to interact with Redis. There are a bunch of other ways as well. You can interact with Redis from Go using a library like go-redis. There are also graphical user interfaces to interact with Redis. It’s just like other databases such as MySQL and PostgreSQL; you can use the CLI, call it from Go, or open it using tools like DBeaver.
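For example, a minimal sketch of the go-redis route might look like this (this assumes go-redis v9 and the server we started above on port 6380; it's just an illustration, not code used later in this article):

package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

func main() {
    ctx := context.Background()
    // connect to the Redis server started earlier on port 6380
    client := redis.NewClient(&redis.Options{Addr: "localhost:6380"})

    // SET the_key_1 "hello world" (0 means no expiration)
    if err := client.Set(ctx, "the_key_1", "hello world", 0).Err(); err != nil {
        panic(err)
    }

    // GET the_key_1
    value, err := client.Get(ctx, "the_key_1").Result()
    if err != nil {
        panic(err)
    }
    fmt.Println(value) // hello world
}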

Redis Protocol

All the libraries and tools that can interact with Redis follow the same protocol specified by Redis called RESP. Fortunately, they have done a good job writing the details of RESP in their documentation. It’s actually a very simple protocol, but we won’t go through all of its specifications. We will only implement the protocol enough so that we can serve SET and GET commands.

RESP is quite simple. A Redis client connects to the Redis server through TCP. Once connected, the client can send a command to the server, and the server replies with the response to that command. For a command like GET, the response might be the value you are trying to fetch from Redis. For a command like SET, the response is just the string "OK" if it succeeds. The command and the response themselves are just ordinary data structures like strings, integers, and arrays. When you run GET the_key, your request is just an array of strings, where the first element is "GET" and the second element is "the_key". The same goes for the response: it’s just an ordinary data structure. In the case of GET, the response is either a string containing the value fetched for the user, or NULL if the key doesn’t exist.

In the RESP documentation, it is explained how to represent each data type. There are a bunch of data types supported by Redis, but we will only care about four of them: bulk strings, simple strings, arrays, and nulls.

  1. Bulk string will be represented like this: $<length>\r\n<data>\r\n
  2. Simple string will be represented like this: +<the_string>\r\n
  3. Arrays will be represented like this: *<number-of-elements>\r\n<element-1>...<element-n>.
  4. Nulls will be represented by _\r\n.

The graph below will demonstrate what happens in the network when you interact with Redis server through the redis-cli.

Redis Protocol Example

One interesting property of RESP is that its encoding was designed to be human-readable. So, if you print the raw bytes exchanged between the client and the server, you can actually read them as plain text without additional tools.
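For example, here is roughly what travels over the wire for the SET and GET we ran earlier (with \r\n shown literally):

client -> server: *3\r\n$3\r\nSET\r\n$9\r\nthe_key_1\r\n$11\r\nhello world\r\n
server -> client: +OK\r\n

client -> server: *2\r\n$3\r\nGET\r\n$9\r\nthe_key_1\r\n
server -> client: $11\r\nhello world\r\n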

Implementing Basic TCP Server

Let’s start simple by implementing a basic echo-tcp-server in Go. The implementation is available in this commit. An echo server is just a server that receives a message from a client and sends back the exact same message to the client.

I like to represent the whole Redis server as a struct called server. This struct will contain all the state we need to be a Redis server, including the list of connected clients and the key-value data itself. I keep it separate from the main package because I want to be able to use this struct as a library; that way we can test it without actually running the program.

type server struct {
    listener net.Listener
    logger   *slog.Logger
    started  atomic.Bool
}

func NewServer(listener net.Listener, logger *slog.Logger) *server {
    return &server{
        listener: listener,
        logger:   logger,
        started:  atomic.Bool{},
    }
}

The server will have Start and Stop methods to start and stop the server. Starting the server is pretty straightforward; we just need to start accepting TCP connections from clients by calling Accept. Calling Accept blocks the current goroutine until a client requests to connect. Once connected, we will handle the client in a separate goroutine. You almost always want to handle each client in its own goroutine because most of the operations you do will block the current goroutine: reading the client’s request or sending a response blocks, and the goroutine can’t do anything else while blocked. That’s why you need one goroutine that accepts new clients, and separate goroutines to handle those clients. The method handleConn will handle a single connected TCP client. Because we want to run it on a separate goroutine, we write go s.handleConn(conn).

func (s *server) Start() error {
    if !s.started.CompareAndSwap(false, true) {
        return fmt.Errorf("server already started")
    }
    s.logger.Info("server started")

    for {
        conn, err := s.listener.Accept()
        if err != nil {
            // TODO: handle shutdown.
            // at this point, the server might be shutting down.
            // or there is a network problem that is causing the
            // server to be unable to accept new connections.
            // If it's the former, we just need to do some cleanup.
            // If it's the latter, we just need to return the error,
            // nothing we can do anyway.
            break
        }
        go s.handleConn(conn)
    }
    return nil
}

To handle a single client, we just need to reserve a buffer to hold the message from the client and send that exact same buffer back to the client. If the client closes the connection, calling Read will return an io.EOF error (or zero bytes read), and we can use this to detect that the client has disconnected.

Note that this is a very simplified version. There could be a case where the client is lost due to a network failure or misconfiguration. In that case, the client can’t tell the server that it has disconnected; from the server’s point of view, the client just stalls and never sends anything. To avoid this problem, you need to go further and implement some kind of timeout that closes the connection if the client doesn’t send anything for a long time (a minimal sketch of this follows the code below).

func (s *server) handleConn(conn net.Conn) {
    for {
        buff := make([]byte, 4096)
        n, err := conn.Read(buff)
        if err != nil {
            // TODO: handle the error
            break
        }
        if n == 0 {
            // TODO: this means the client is closing the connection
            break
        }
        if _, err := conn.Write(buff[:n]); err != nil {
            // TODO: handle the error
            break
        }
    }
}
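Here is a minimal sketch of that timeout idea, assuming an arbitrary 5-minute idle limit and the time package imported; the method name handleConnWithTimeout is made up for illustration and is not part of the article's code:

const idleTimeout = 5 * time.Minute

func (s *server) handleConnWithTimeout(conn net.Conn) {
    defer conn.Close()
    for {
        // If the client sends nothing for idleTimeout, the next Read fails
        // with a timeout error and we drop the connection.
        if err := conn.SetReadDeadline(time.Now().Add(idleTimeout)); err != nil {
            break
        }
        buff := make([]byte, 4096)
        n, err := conn.Read(buff)
        if err != nil {
            // err is io.EOF when the client closed the connection, or a
            // net.Error with Timeout() == true when it has been idle too long.
            break
        }
        if _, err := conn.Write(buff[:n]); err != nil {
            break
        }
    }
}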

Stopping the server is quite straightforward; we just need to call Close on the TCP listener to stop listening. This is not ideal, though, because there might be ongoing connections that are still processing a request. When you stop the server like this, those connections get cut off and the clients receive errors. That’s not acceptable in a production environment, because it means a lot of requests can fail whenever you restart the Redis server.

func (s *server) Stop() error {
    if err := s.listener.Close(); err != nil {
        s.logger.Error(
            "cannot stop listener",
            slog.String("err", err.Error()),
        )
    }

    // Notice that we haven't closed all remaining active connections here.
    // In Linux and in this particular case, it's technically not necessary
    // because closing the listener will also close all remaining connections.
    // But, in some cases, we might need some kind of graceful shutdown logic
    // when closing the client.

    return nil
}

Now, let’s think about a better shutdown mechanism. To shut down a server, we could just stop everything and exit; the operating system will clean up all the resources we hold, like open sockets, files, and allocated memory. The thing is, this might not be the ideal way to shut down, especially when you are running a service like Redis. A TCP connection in your process might be in the middle of a request: maybe a client sent you a command and you exit the program before the response is sent back. When this happens, the client won’t receive the response and will treat it as an error. It would be better to have a mechanism that waits until all in-flight requests are finished while no longer accepting new ones. Think of it like the last order in a restaurant: when a restaurant is closing, they usually announce a last order, after which people can no longer order food, but the people who have already ordered can still wait for their food and eat there. This is called graceful shutdown, and it’s a common thing to implement when building a service like Redis.

You might wonder: wait, if I shut down my server, those clients won’t be able to connect to it anyway. What’s the point of doing all of that work if, in the end, those clients will get an error because the server itself is gone? Well, usually, when you shut down a server, it is because you want to migrate to a new version or to a new machine with more capacity. To do this, you can run two servers concurrently and shut down the old one afterwards. Because of this, we still need to shut down gracefully so that clients can move to the new server with zero downtime.

Some other kinds of services such as relational databases might have their own shutdown mechanisms. For example, databases like MySQL typically have a log that records all the operations that happened on the database like when a row is added, deleted, and updated. When the database is restarted, it will replay all the operations recorded in the log to get to the same state as before. Doing this is time-consuming when the log is very large. To reduce the amount of replay needed, they usually perform a snapshot of their data periodically. By doing this, they only need to replay the operations that have been done after the last snapshot. If you are building a service like this, you may want to run the snapshot mechanism when you shutdown the service. By doing that, when you start your service again, it will start instantly because there are no operations that need to be replayed.

I won’t implement those graceful shutdown mechanisms here for simplicity reasons. Just be aware that in real systems, this is something that you should think carefully about.
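For reference, one common pattern is to track in-flight connections with a sync.WaitGroup and wait for them in Stop. This is only a sketch of the idea, not the code used in the rest of the article; the type name, the closing flag, and the field layout are all made up:

type gracefulServer struct {
    listener net.Listener
    wg       sync.WaitGroup
    closing  atomic.Bool
}

func (s *gracefulServer) Start() error {
    for {
        conn, err := s.listener.Accept()
        if err != nil {
            if s.closing.Load() {
                // normal shutdown: stop accepting, but keep serving
                // the clients that are already connected
                return nil
            }
            return err
        }
        s.wg.Add(1)
        go func() {
            defer s.wg.Done()
            s.handleConn(conn)
        }()
    }
}

func (s *gracefulServer) Stop() error {
    s.closing.Store(true)
    err := s.listener.Close() // "last order": no new clients
    s.wg.Wait()               // wait until all in-flight clients are done
    return err
}

func (s *gracefulServer) handleConn(conn net.Conn) {
    // same per-connection logic as before
    _ = conn.Close()
}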

The next step is to write the main function and run our server. The main function is pretty simple; we just need to prepare the logger and listener for our server. I will use Golang’s slog package for the logger, and the Redis server will listen on port 3100. There is no particular reason why I use 3100; it’s just a number I chose. Once the server is created, I will run it on a separate goroutine while the main goroutine waits for a signal to terminate the server. Once the user sends a termination signal, we call server.Stop to stop the server.

func main() {
    logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
    address := "0.0.0.0:3100"
    logger.Info("starting server", slog.String("address", address))
    listener, err := net.Listen("tcp", address)
    if err != nil {
        logger.Error(
            "cannot start tcp server",
            slog.String("address", address),
            slog.String("err", err.Error()),
        )
        os.Exit(-1)
    }
    server := goredis.NewServer(listener, logger)

    go func() {
        if err := server.Start(); err != nil {
            logger.Error("server error", slog.String("err", err.Error()))
            os.Exit(1)
        }
    }()

    c := make(chan os.Signal, 1)
    signal.Notify(c, os.Interrupt)
    <-c

    if err := server.Stop(); err != nil {
        logger.Error("cannot stop server", slog.String("err", err.Error()))
        os.Exit(1)
    }
}

Up to this point, we already have a TCP echo server. We can test it by running our program and connecting to it. You can check the full source code on this commit.

❯ go run ./goredis/main.go
time=2025-02-03T03:31:03.218+07:00 level=INFO msg="starting server" address=0.0.0.0:3100
time=2025-02-03T03:31:03.218+07:00 level=INFO msg="server started"

Once you’ve started the server, you can connect to it. To connect to our server, we can use nc. You can run nc localhost 3100 in your terminal and start typing what you want to send to the server there. After you press enter, the message you typed will be sent to the server and the server will send back the exact same message. You can type Ctrl+C once you’re done. Pressing Ctrl+C will send an interrupt to nc and nc will treat it as a termination signal and exit the process.

❯ nc localhost 3100
Hi, there
Hi, there
How are you?
How are you?

Implement RESP

Now that we have a working TCP server, let’s start serving Redis requests. As described by the RESP documentation, a Redis command is just an array of strings encoded using the RESP specification. Let’s write a function that takes bytes and parses those bytes into a Go data structure. So, an encoded string becomes a Go string and an encoded array becomes a Go slice.

Before that, I want to point out that instead of taking a byte slice and returning a Go data type, we may want to take an io.Reader instead. By taking an io.Reader, we can work with more input sources than just byte slices. If you look at the io.Reader interface, it’s basically just an interface for reading a stream of bytes. If you want to pass bytes directly, you can still use bytes.Buffer. In our case, when we accept a client connection, we get a net.Conn, which also implements io.Reader. Accepting io.Reader makes it easier for us to work with multiple data sources, which is handy for testing. In our real service, the io.Reader is backed by net.Conn, but if we want to test our implementation, we can just pass a bytes.Buffer to it and it will work. We don’t have to make a real TCP connection just to test our RESP parser.

Ok, first, let’s implement a helper function to read a single byte from an io.Reader.

func readOne(r io.Reader) (byte, error) {
    buff := [1]byte{}
    n, err := r.Read(buff[:])
    if err != nil {
        return 0, err
    }
    // n represents the number of bytes read when we call `r.Read`.
    // since `buff`'s length is one, we should read one byte. Otherwise,
    // the client must've been disconnected.
    if n != 1 {
        return 0, io.EOF
    }
    return buff[0], nil
}

Parsing the RESP data type is quite simple. The first byte of the stream basically gives us information about what data type we should read next. If it’s a *, then you should parse an array. If it’s a $, then you should parse a string. There are other data types such as integers, booleans, maps, etc., but we won’t be focusing on those.

func readObject(r io.Reader) (any, error) {
    b, err := readOne(r)
    if err != nil {
        return nil, err
    }
    switch b {
    case '*':
        return readArray(r, false)
    case '$':
        return readBulkString(r)
    default:
        return nil, fmt.Errorf("unrecognized character %c", b)
    }
}

Next, we just need to parse it as described by the standard. If it’s a string, we just need to parse its length and read that many bytes.

func readBulkString(r io.Reader) (string, error) {
    size, err := readLength(r)
    if err != nil {
        return "", err
    }
    buff := make([]byte, size)
    if _, err := io.ReadFull(r, buff); err != nil {
        return "", err
    }
    b, err := readOne(r)
    if err != nil {
        return "", err
    }
    if b != '\r' {
        return "", fmt.Errorf("expected carriage-return character for array, but found %c", b)
    }
    b, err = readOne(r)
    if err != nil {
        return "", err
    }
    if b != '\n' {
        return "", fmt.Errorf("expected newline character for array, but found %c", b)
    }
    return string(buff), nil
}

func readLength(r io.Reader) (int, error) {
    result := 0
    for {
        b, err := readOne(r)
        if err != nil {
            return 0, err
        }
        if b == '\r' {
            break
        }
        result = result*10 + int(b-'0')
    }
    b, err := readOne(r)
    if err != nil {
        return 0, err
    }
    if b != '\n' {
        return 0, fmt.Errorf("expected newline character for length, but found %c", b)
    }
    return result, nil
}

If it’s an array, we just need to read its length and recursively parse the array items as many times as the length of the array.

func readArray(r io.Reader, readFirstChar bool) ([]any, error) {
    if readFirstChar {
        b, err := readOne(r)
        if err != nil {
            return nil, err
        }
        if b != '*' {
            return nil, fmt.Errorf("expected * character for array, but found %c", b)
        }
    }
    n, err := readLength(r)
    if err != nil {
        return nil, err
    }
    result := make([]any, n)
    for i := 0; i < n; i++ {
        val, err := readObject(r)
        if err != nil {
            return nil, err
        }
        result[i] = val
    }
    return result, nil
}
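As a quick illustration of the io.Reader flexibility mentioned earlier, the parser can be exercised against a bytes.Buffer in a plain unit test, without any TCP connection involved. This is only a sketch placed in the same package's test file; the test name is made up:

func TestReadArrayFromBuffer(t *testing.T) {
    // RESP encoding of the command: SET key1 value1
    raw := "*3\r\n$3\r\nSET\r\n$4\r\nkey1\r\n$6\r\nvalue1\r\n"
    cmd, err := readArray(bytes.NewBufferString(raw), true)
    if err != nil {
        t.Fatal(err)
    }
    if len(cmd) != 3 || cmd[0] != "SET" || cmd[1] != "key1" || cmd[2] != "value1" {
        t.Fatalf("unexpected result: %v", cmd)
    }
}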

We also need a way to encode our response in RESP format. However, encoding the response is way easier than parsing it. We could just use fmt.Sprintf or fmt.Fprintf to do that easily. For example, if you want to encode a string, just write fmt.Sprintf("$%d\r\n%s\r\n", len(theString), theString). Because of that, we don’t need a helper to do this.
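Purely as an illustration, the reply shapes used in this article could be written like this (the handlers in the next section simply inline these literals; there is no such helper in the actual code):

// writeGetReply is a hypothetical helper showing the reply shapes we use:
// a bulk string when the key exists, and a null when it doesn't.
func writeGetReply(w io.Writer, value string, found bool) error {
    if !found {
        _, err := w.Write([]byte("_\r\n")) // null
        return err
    }
    _, err := fmt.Fprintf(w, "$%d\r\n%s\r\n", len(value), value) // bulk string
    return err
}

// writeOK sends the simple string reply used for a successful SET.
func writeOK(w io.Writer) error {
    _, err := w.Write([]byte("+OK\r\n"))
    return err
}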

Implement SET and GET

Now that we already know how to parse RESP messages, we can use them to read user’s commands and handle them. Instead of just echoing whatever the client sends to our server, now we will parse it and handle it. We can use the readArray helper function above to read the command from the user, take its first element as the command name, and handle them based on the command.

Check out this commit for the full code.

func (s *server) handleConn(clientId int64, conn net.Conn) {
    ...
    for {
        request, err := readArray(conn, true)
        ...
        commandName, ok := request[0].(string)
        ...
        switch strings.ToUpper(commandName) {
        case "GET":
            s.logger.Debug("handle get command", slog.Int64("clientId", clientId))
        case "SET":
            s.logger.Debug("handle set command", slog.Int64("clientId", clientId))
        default:
            s.logger.Error("unknown command", slog.String("command", commandName), slog.Int64("clientId", clientId))
        }
        ...
        conn.Write([]byte("+OK\r\n"))
    }
    ...
}

As you can see, we just need to read the client message, parse it using readArray, take the first element and treat it as the command name. In the code above, we just log the command; next, we will write a function for each command we want to handle. In this case, we have a function that handles the SET command and a function that handles the GET command.

func (s *server) handleConn(clientId int64, conn net.Conn) {
    ...
    switch strings.ToUpper(commandName) {
    case "GET":
        err = s.handleGetCommand(clientId, conn, request)
    case "SET":
        err = s.handleSetCommand(clientId, conn, request)
    }
    ...
}

func (s *server) handleGetCommand(clientId int64, conn net.Conn, command []any) error {
    if len(command) < 2 {
        _, err := conn.Write([]byte("-ERR missing key\r\n"))
        return err
    }
    key, ok := command[1].(string)
    if !ok {
        _, err := conn.Write([]byte("-ERR key is not a string\r\n"))
        return err
    }
    s.logger.Debug("GET key", slog.String("key", key), slog.Int64("clientId", clientId))
    // TODO: the GET logic goes here
    _, err := conn.Write([]byte("_\r\n"))
    return err
}

func (s *server) handleSetCommand(clientId int64, conn net.Conn, command []any) error {
    if len(command) < 3 {
        _, err := conn.Write([]byte("-ERR missing key and value\r\n"))
        return err
    }
    key, ok := command[1].(string)
    if !ok {
        _, err := conn.Write([]byte("-ERR key is not a string\r\n"))
        return err
    }
    value, ok := command[2].(string)
    if !ok {
        _, err := conn.Write([]byte("-ERR value is not a string\r\n"))
        return err
    }
    s.logger.Debug(
        "SET key into value",
        slog.String("key", key),
        slog.String("value", value),
        slog.Int64("clientId", clientId),
    )
    // TODO: the SET logic goes here
    _, err := conn.Write([]byte("+OK\r\n"))
    return err
}

There is nothing magical in the above code. We just write functions to handle each command and write a few lines of code to fetch its parameters. In the case of the GET command, we want to fetch the key, and in the case of the SET command, we want to fetch the key and value. We just return OK for the SET command and (nil) for the GET command. Next, we will implement the actual logic to handle SET and GET.

You can think of Redis as a map[string]string. SET and GET commands are basically just commands to interact with that map. Running SET user1 Andy is basically just saying m["user1"] = "Andy". And running GET user1 is basically just like m["user1"]. So, let’s just do that: create a map in the server struct, handle SET by inserting into that map, and handle GET by fetching a key from that map. The problem is, you need to make sure that there are no conflicting operations happening concurrently. You can’t have two goroutines accessing the same map when one of them is a write operation. This is what we call a race condition: a condition where two threads concurrently access the same memory region and their operations conflict. When this happens, you might end up with corrupted data structures. In Go, the runtime detects unsynchronized concurrent map access and crashes the program with a fatal error when it happens.

Race Condition: to understand this concept, imagine having a stack of integers. You can push to the stack like this:

1type stack struct {
2    data []int
3    top  int
4}
5
6func (s *stack) Push(value int) {
7    s.data[s.top] = value
8    s.top += 1
9}

It looks simple and correct, but it’s actually not thread-safe and can break if you call Push from two concurrent goroutines. Imagine you have goroutine A and goroutine B, the stack is empty, and both of them call Push at the same time. It just so happens that both A and B execute line 7 at the same time, so both of them write to s.data[0]. Then goroutine A might execute s.top += 1 first followed by goroutine B, or the other way around. Either way, once both goroutines finish, the stack’s top field has been incremented twice, yet s.data[1] was never filled and s.data[0] was written twice. This is an example of a race condition: two conflicting goroutines touching the same memory region can corrupt each other’s work.

To avoid race conditions, we can use sync.Mutex or sync.RWMutex. The way we avoid race conditions is by grabbing a lock before executing critical sections. We can write our code such that every time we access the map, a lock has to be acquired first. After we finish accessing the map, we can release the lock. Other goroutines that might want to access the map will also need to acquire the lock first. If a lock is already acquired by another goroutine, acquiring the lock will block the goroutine until the one that holds the lock releases it. To acquire a lock, you can call the Lock method on sync.Mutex, and you can use the Unlock method to release the lock.

Now, if you think about it, when two or more goroutines are reading the same memory region, it shouldn’t be a problem because they’re just reading it. The race condition will only happen when those goroutines are conflicting. Reading is not conflicting with another reading. You will only conflict when one of the operations is a write operation. Because of that, instead of having a simple locking mechanism, you want the lock to be smart enough so that if the goroutine that holds the lock only uses it to read data from your map, another goroutine can also acquire the lock if it’s just for reading the data. This is where sync.RWMutex can help you. Now, you have 4 methods: Lock, Unlock, RLock, RUnlock. Lock and Unlock behave similarly to the sync.Mutex. You can use RLock to acquire the lock for reading the map. When you hold the lock for reading, another goroutine that calls RLock won’t be blocked.

type server struct {
    ...
    dbLock   sync.RWMutex
    database map[string]string
}

func (s *server) handleGetCommand(clientId int64, conn net.Conn, command []any) error {
    ...
    s.dbLock.RLock()
    value, ok := s.database[key]
    s.dbLock.RUnlock()
    ...

    var err error
    if ok {
        resp := fmt.Sprintf("$%d\r\n%s\r\n", len(value), value)
        _, err = conn.Write([]byte(resp))
    } else {
        _, err = conn.Write([]byte("_\r\n"))
    }
    return err
}

func (s *server) handleSetCommand(clientId int64, conn net.Conn, command []any) error {
    ...
    s.dbLock.Lock()
    s.database[key] = value
    s.dbLock.Unlock()
    ...
}

You can find the code in this commit.

Testing And Benchmark

Congratulations, we now have a very simple Redis server. We can start our Redis server by running go run ./goredis/main.go.

❯ go run ./goredis/main.go
time=2025-01-31T22:23:54.458+07:00 level=INFO msg="starting server" address=0.0.0.0:3100
time=2025-01-31T22:23:54.458+07:00 level=INFO msg="server started"
...

We can now run commands from redis-cli:

❯ ./redis-cli -h localhost -p 3100 SET key1 value1
OK
❯ ./redis-cli -h localhost -p 3100 GET key1
"value1"
❯ ./redis-cli -h localhost -p 3100 GET key2
(nil)

As you can see, our Redis server works nicely.

Ideally, you want to test your implementation very often. Every time you make a change, you need to make sure that you don’t break the Redis server. To be sure, you need to test it every time you make changes. Testing manually like this is tedious, boring, and prone to error. To make things easier and more reliable, you can write an automated test. Then, you can run your automated test every time you make a change and make sure that the tests don’t break. You can check this commit, this commit, and this commit to see how I wrote the automated tests.
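To give a flavor of what such a test can look like, here is a rough sketch that starts the server on a random local port and speaks raw RESP to it. It assumes NewServer initializes the key-value map as in the linked commits, and that each short reply arrives in a single Read; it is an illustration, not the tests from those commits:

func TestSetAndGet(t *testing.T) {
    listener, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        t.Fatal(err)
    }
    logger := slog.New(slog.NewTextHandler(io.Discard, nil))
    server := NewServer(listener, logger)
    go server.Start()
    defer server.Stop()

    conn, err := net.Dial("tcp", listener.Addr().String())
    if err != nil {
        t.Fatal(err)
    }
    defer conn.Close()
    buff := make([]byte, 64)

    // SET key1 value1 should be answered with +OK\r\n
    conn.Write([]byte("*3\r\n$3\r\nSET\r\n$4\r\nkey1\r\n$6\r\nvalue1\r\n"))
    n, err := conn.Read(buff)
    if err != nil || string(buff[:n]) != "+OK\r\n" {
        t.Fatalf("unexpected SET reply: %q (err=%v)", buff[:n], err)
    }

    // GET key1 should return the bulk string value1
    conn.Write([]byte("*2\r\n$3\r\nGET\r\n$4\r\nkey1\r\n"))
    n, err = conn.Read(buff)
    if err != nil || string(buff[:n]) != "$6\r\nvalue1\r\n" {
        t.Fatalf("unexpected GET reply: %q (err=%v)", buff[:n], err)
    }
}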

Now, let’s try to see the performance of our server. We are going to benchmark the server using the redis-benchmark tool that we’ve compiled before.

❯ ./redis-benchmark -h localhost -p 3100 -r 100000000000 -c 50 -t SET,GET -q
WARNING: Could not fetch server CONFIG
SET: 104602.52 requests per second, p50=0.263 msec
GET: 110132.16 requests per second, p50=0.247 msec
...

In the command above, we run redis-benchmark with 100 billion possible keys and 50 concurrent connections. After you run that command, your server will be hit by 100,000 requests for each command, and the result is reported. As you can see in the result above, our server can handle roughly 104K SET operations per second and 110K GET operations per second. It also reports that the SET latency has a median of about 0.26 ms and the GET latency a median of about 0.25 ms. You might get different results depending on the machine you’re running on.

Note that these numbers might not be very useful in the real world. In the real world, the access pattern is usually not uniform like this; some keys are accessed far more often than others, for example, and the timing is rarely this regular. We also rarely use the median to measure latency; a high percentile such as the 95th is more representative. Having a median of 0.26 ms only means half of your SET requests are served in under 0.26 ms, which says nothing about the other half. What people usually want is good latency for 95%, 99%, or even 100% of the requests. But again, we are going to stick with this benchmark mechanism for simplicity’s sake. If you want to do a proper benchmark, you should think about your production environment and try to mimic its access pattern as closely as possible.

Looking at the benchmark result, it’s hard to know whether our Redis server has good performance. In order to evaluate this number, we can compare it to the actual Redis implementation. First, let’s start the redis-server.

❯ ./redis-server --port 3200
182161:C 31 Jan 2025 22:50:45.805 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
182161:C 31 Jan 2025 22:50:45.805 * Redis version=7.4.2, bits=64, commit=a0a6f23d, modified=0, pid=182161, just started
182161:C 31 Jan 2025 22:50:45.805 * Configuration loaded
182161:M 31 Jan 2025 22:50:45.805 * monotonic clock: POSIX clock_gettime
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis Community Edition
  .-`` .-```.  ```\/    _.,_ ''-._     7.4.2 (a0a6f23d/0) 64 bit
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 3200
 |    `-._   `._    /     _.-'    |     PID: 182161
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           https://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'

182161:M 31 Jan 2025 22:50:45.806 * Server initialized
182161:M 31 Jan 2025 22:50:45.806 * Loading RDB produced by version 7.4.2
182161:M 31 Jan 2025 22:50:45.806 * RDB age 105 seconds
182161:M 31 Jan 2025 22:50:45.806 * RDB memory usage when created 16.70 Mb
182161:M 31 Jan 2025 22:50:45.846 * Done loading RDB, keys loaded: 199983, keys expired: 0.
182161:M 31 Jan 2025 22:50:45.846 * DB loaded from disk: 0.041 seconds
182161:M 31 Jan 2025 22:50:45.846 * Ready to accept connections tcp
...

And then, we can re-run the benchmark command above but for port 3200 since we use that port to run the redis-server.

❯ ./redis-benchmark -h localhost -p 3200 -r 100000000000 -c 50 -t SET,GET -q
SET: 136612.02 requests per second, p50=0.183 msec
GET: 122399.02 requests per second, p50=0.223 msec

As you can see, our performance is similar to that of the actual Redis. We didn’t use any tricks here; what we’ve done is roughly the same as what Redis does.

Redis can actually perform better than that. During benchmarking, there is some overhead in the communication. Each time the client (in this case redis-benchmark) sends a command to the Redis server, the command needs to be copied from the client’s memory to kernel memory and copied again from kernel memory to the Redis server’s memory. There is a lot of copying between user space and kernel space here, and it affects performance. To reduce this overhead, we can use a pipelined benchmark. Without pipelining, sending N requests to the Redis server means sending a command and waiting for its response N times, which results in roughly 2*N copies between user space and kernel space. With pipelining, the client sends those N requests at once and waits for the N responses at once as well, which minimizes the number of round trips between user space and the kernel.

Redis Benchmark Pipelining

Again, note that most real-world applications don’t do pipelining, so what we do here doesn’t really depict what will happen in a production environment. We do pipelining to make things more interesting and to see how high a throughput and how low a latency we can get.

Let’s run the benchmark again for the actual Redis server with pipeline enabled.

❯ ./redis-benchmark -h localhost -p 3200 -r 100000000000 -c 50 -t SET,GET -P 10000 -q
SET: 2130160.75 requests per second, p50=3.079 msec
GET: 2652540.00 requests per second, p50=14.255 msec

By adding -P 10000, we tell redis-benchmark to pipeline 10,000 requests into a single message. As you can see, we now have much higher throughput: about 2.1 million SET operations per second and 2.6 million GET operations per second, roughly 15-20x higher. But you might notice that the latency increased as well. The median SET latency has now risen to about 3 ms from only 0.18 ms before. This is the effect of pipelining: when we batch requests together, some requests end up waiting longer. Previously, to execute a command, you just sent it to the Redis server and waited for the response. With pipelining, a command is not sent immediately; it has to wait for up to 9,999 other commands so they can be sent together as a batch.

Let’s try benchmarking our Redis server written in Go using pipelining:

❯ ./redis-benchmark -h localhost -p 3100 -r 100000000000 -c 50 -t SET,GET -P 10000 -q
WARNING: Could not fetch server CONFIG
SET: 103413.09 requests per second, p50=6.135 msec
GET: 109966.45 requests per second, p50=5.279 msec

As you can see, the throughput doesn’t change that much. This makes sense because we don’t have any special code to handle pipelined requests: as far as our server is concerned, the requests arrive one by one and we handle them one by one, so pipelining gains us nothing. The latency also increases, for the same reason it increased in the actual Redis server benchmark.

Now, instead of using redis-benchmark, we will benchmark our Redis server implementation using Golang’s benchmarking framework. By writing our own benchmark, we can have more flexible benchmarking logic. We can, for example, simulate a load whose access pattern resembles the one in our actual production environment. We can also run a profiler to trace our CPU and memory usage, which is very useful for finding our app’s bottlenecks and building a faster application. For simplicity’s sake, our benchmark will just use uniformly distributed keys for the SET command, just like redis-benchmark. We are not going to write a benchmark test for the GET command; there is nothing special about it, so it is left as an exercise for the reader.

To create a benchmark test, you just need to write a function with Benchmark prefix, accept a *testing.B as its parameter, and return nothing. This is the convention used by Golang’s testing framework. I’ll put the benchmark logic in server_test.go:

func BenchmarkRedisSet(b *testing.B) {
    ...
}

To do the benchmark, first we will initialize the server. Ideally, we want our benchmark code to be efficient so that it doesn’t skew the measurement of the code we’re testing. To reduce the TCP overhead, we are going to use a Unix domain socket (a sockfile) instead of a TCP socket. With a sockfile, the communication between the client and the server bypasses the TCP/IP stack and is handled entirely inside the kernel.

const sockfilePath string = "/tmp/redisexperiment.sock"

func BenchmarkRedisSet(b *testing.B) {
    _ = os.Remove(sockfilePath)
    listener, err := net.Listen("unix", sockfilePath)
    if err != nil {
        b.Errorf("cannot open sock file: %s", err.Error())
        return
    }
    noopLogger := slog.New(slog.NewTextHandler(io.Discard, nil))
    server := NewServer(listener, noopLogger)
    go func() {
        if err := server.Start(); err != nil {
            b.Errorf("cannot start server: %s", err.Error())
        }
    }()

    // you need to reset the timer here to make sure that
    // the initialization code above won't be considered as
    // part of the benchmark.
    b.ResetTimer()
    ...
}

Now, we just need to spawn some goroutines to simulate the clients; one goroutine will simulate a single client connection. Instead of spawning goroutines manually, we can use the b.RunParallel helper provided by Golang’s testing framework. It is quite handy: it gives us a way to distribute the load across multiple goroutines efficiently. Suppose you want to hit your server with 100,000 requests. You need to make sure that the total number of requests sent by all of your spawned goroutines adds up to 100,000, and b.RunParallel takes care of this for you. It passes your function a *testing.PB that you use to check whether your goroutine should send another request: pb.Next() returns true if another request is needed, and false once the target (100,000 requests in this example) has been reached and you can stop. Additionally, with b.RunParallel you can change the number of goroutines without changing the code; you can configure it from the command line when you run the benchmark. You could do all of this manually, but in most cases it’s more convenient to use b.RunParallel.
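For example, the number of goroutines b.RunParallel spawns defaults to GOMAXPROCS, which go test lets you vary per run with the -cpu flag (shown here only as an aside, with arbitrary values):

# run the benchmark three times, with GOMAXPROCS set to 4, 8, and 24 in turn
go test . -bench=BenchmarkRedisSet -benchtime=10s -cpu=4,8,24 -run="^$"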

func BenchmarkRedisSet(b *testing.B) {
    ...
    id := atomic.Int64{}
    b.RunParallel(func(pb *testing.PB) {
        // First, you need to create a connection to the server
        client, err := net.Dial("unix", sockfilePath)
        ...
        randomizer := rand.New(rand.NewSource(id.Add(1)))
        pipelineSize := 100
        buff := make([]byte, 4096)
        writeBuffer := bytes.Buffer{}
        count := 0

        // pb.Next() returns true if you still need to send a request
        // to the server. When it returns false, it means you already
        // reached your target and you can stop sending requests.
        for pb.Next() {
            // These lines generate SET command with randomized key and value.
            // We put the command to a buffer called `writeBuffer`. After the buffer
            // is big enough, we send it to the server in one go. That's how we
            // pipeline the requests.
            writeBuffer.WriteString("*3\r\n$3\r\nset\r\n$12\r\n")
            for i := 0; i < 12; i++ {
                writeBuffer.WriteByte(byte(randomizer.Int31()%96 + 32))
            }
            writeBuffer.WriteString("\r\n$12\r\n")
            for i := 0; i < 12; i++ {
                writeBuffer.WriteByte(byte(randomizer.Int31()%96 + 32))
            }
            writeBuffer.WriteString("\r\n")

            // After reaching a certain threshold, we send the generated requests
            // to the server and wait for the response. We don't do anything
            // with the response other than read it.
            count++
            if count >= pipelineSize {
                if _, err := writeBuffer.WriteTo(client); err != nil {
                    b.Errorf("cannot write to server: %s", err.Error())
                    return
                }
                if _, err := io.ReadFull(client, buff[:5*count]); err != nil {
                    b.Errorf("cannot read from server: %s", err.Error())
                    return
                }
                count = 0
            }
        }

        if count > 0 {
            if _, err := writeBuffer.WriteTo(client); err != nil {
                b.Errorf("cannot write to server: %s", err.Error())
                return
            }
            if _, err := io.ReadFull(client, buff[:5*count]); err != nil {
                b.Errorf("cannot read from server: %s", err.Error())
                return
            }
            count = 0
        }
        ...
    })
    ...
}

Don’t forget to close the server after the test is finished.

func BenchmarkRedisSet(b *testing.B) {
    ...
    b.StopTimer()
    if err := server.Stop(); err != nil {
        b.Errorf("cannot stop server: %s", err.Error())
        return
    }

    // We can also add our own metrics to the benchmark report like this:
    throughput := float64(b.N) / b.Elapsed().Seconds()
    b.ReportMetric(throughput, "ops/sec")
}

You can find the full benchmark code here.

To run the benchmark, you can execute this command:

# -bench=BenchmarkRedisSet is used to specify which benchmark function you want to run
# -benchmem                is used to tell the framework that you want to benchmark the memory usage
# -benchtime=10s           is used to specify the duration of the benchmark
# -run="^$"                is saying that you don't want to run any unit test
❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=10s -run="^$"
goos: linux
goarch: amd64
pkg: github.com/jauhararifin/goredis
cpu: AMD Ryzen 9 7900X 12-Core Processor
BenchmarkRedisSet
BenchmarkRedisSet-24             9774513              1289 ns/op            776083 ops/sec           585 B/op         42 allocs/op
PASS
ok      github.com/jauhararifin/goredis 14.199s

Once the benchmark finishes, you can see the result. As shown above, the benchmark ran for about 14 seconds and sent about 9.7 million requests. The throughput of the system is roughly 776K requests per second, and each request is handled in 1,289 nanoseconds on average. Since we added -benchmem, the result also shows how much memory and how many allocations each request costs.
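As a quick sanity check, the two numbers agree with each other: 1 s / 1,289 ns per operation ≈ 776,000 operations per second, which is exactly the ops/sec metric we compute ourselves as b.N divided by the elapsed wall-clock time.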

You might wonder: how come we are able to handle 776K requests per second when running from Golang’s benchmark framework, but only about 103K requests per second with redis-benchmark? The latency also differs quite a lot; redis-benchmark reports about 5 ms, while Golang’s testing framework reports about 1,289 nanoseconds. For latency, they use different metrics: redis-benchmark shows the median, while Golang’s testing framework shows the average. Some difference is expected, but in our case it’s huge, which is a little suspicious.

TCP Overhead

Let’s analyze what is happening here. How can benchmarking with Golang’s framework outperform redis-benchmark by this much? In theory, it shouldn’t, since both are practically doing the same thing. We can start by asking what is different between our benchmark and redis-benchmark. One difference is that our benchmark uses a unix sockfile instead of an actual TCP connection, which removes some of the overhead of TCP. Let’s try using a TCP connection instead of a unix sockfile.

func BenchmarkRedisSet(b *testing.B) {
    ...
    listener, err := net.Listen("tcp", "0.0.0.0:3900")
    ...
    b.RunParallel(func(pb *testing.PB) {
        client, err := net.Dial("tcp", "localhost:3900")
        ...
    })
}

Now, let’s run our benchmark again:

❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=10s -run="^$"
goos: linux
goarch: amd64
pkg: github.com/jauhararifin/goredis
cpu: AMD Ryzen 9 7900X 12-Core Processor
BenchmarkRedisSet
BenchmarkRedisSet-24             7970709              1620 ns/op            617151 ops/sec           615 B/op         42 allocs/op
PASS
ok      github.com/jauhararifin/goredis 14.783s

As you can see, the throughput drops a little: previously it was 776K ops/sec, now it is 617K ops/sec, a drop of about 20%. However, it’s still much higher than the result from redis-benchmark, which only reached about 103K ops/sec, roughly 6x lower than our current benchmark. Let’s analyze further.

Logging Overhead

The second thing that differentiates between our benchmark and redis-benchmark is that we don’t log anything in our test. If you notice, in the benchmark logic above, I initialize the logger like this:

noopLogger := slog.New(slog.NewTextHandler(io.Discard, nil))

You see, I put io.Discard as the log destination. This means whenever we log something, it will just be dropped and not written anywhere. This is not what we did when benchmarking with redis-benchmark. If you notice, in our main function, I initialize the logger by doing this:

logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))

When we benchmark using redis-benchmark, we use os.Stderr as the log destination. And if you notice, every time we handle a SET and GET request, we log its parameters. This means when we benchmark our code using redis-benchmark, we add extra overhead by logging it, while we don’t have that overhead in our benchmark code that uses Golang’s testing framework.

Let’s see how our benchmark performs if we also use os.Stderr to print our log. We also use slog.LevelDebug as the log level so that our debug print will be printed.

noopLogger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=10s -run="^$"
...
time=2025-02-03T04:01:09.250+07:00 level=DEBUG msg="SET key into value" key=z&k1:4QJIKHh value=PJNL-B4HTz@ clientId=14
time=2025-02-03T04:01:09.250+07:00 level=INFO msg="closing client" clientId=14
BenchmarkRedisSet-24             1434164              8348 ns/op            119787 ops/sec           601 B/op         43 allocs/op
PASS

Now our benchmark result matches the one we did using redis-benchmark. We still have different results for the latency, but this is expected because we are reporting different metrics here. Here, the latency is the average latency while redis-benchmark reports the median of the latency. Neither metric is a good measure of latency, but we won’t be focusing on that too much for now.

I will keep using the unix sockfile and turn off logging for benchmarking purposes. This is because we want to focus on Redis itself rather than on the overhead of TCP and logging. In the real world, we don’t usually enable debug logs; we usually only log when a client connects or disconnects. So, disabling the logs here makes sense because it mimics what we’d do in a production environment. We also use a sockfile instead of a TCP connection because there is not much we can do about the TCP overhead anyway. In the end, the performance of TCP depends on many external factors, such as the cable we are using, how far apart the client and server are, etc. There are techniques to reduce the overhead of a TCP connection, like using io_uring, but I won’t cover that in this article because it’s quite involved and this article is supposed to be an introduction.

Profiling

To improve the performance of an application, you need to know where the bottleneck is. You need to identify what makes your app slow or gives it low throughput, because you don’t want to spend effort improving things that don’t matter. Applications consist of multiple parts working together, and sometimes it’s not obvious which component to improve. Improving a component that plays only a small role in the overall performance won’t give us a meaningful improvement. For example, our Redis server has logic to accept client connections, read requests, process them, and send back responses. Improving something like the connection-accepting logic or the shutdown logic won’t increase our throughput or reduce our latency.

The more complex your application, the harder it is to find the performance bottleneck. Fortunately, we can profile our code to find it. We will focus on CPU profiling first; there are other types of profiling like memory profiling, block profiling, mutex profiling, etc. Conceptually, what CPU profiling does is sample our CPU periodically. When we run our benchmark with the CPU profiler turned on, Golang’s runtime periodically snapshots where each CPU (thread) currently is, recording which function it is executing. After the benchmark finishes, we have a file containing the list of samples; each sample is basically a stack trace of what the CPU was executing at that moment. The idea is: if a particular function is sampled many times compared to other functions, the CPU is spending a lot of its time there, so that function is probably the bottleneck. This is not always the case, though. Some applications are bottlenecked by the network, for example; when that happens, the CPU isn’t doing anything meaningful while waiting, and the result of the CPU profiler won’t tell you much.
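As a side note (not used in the rest of this article), the same kind of profile can also be collected from a long-running server by exposing the standard net/http/pprof handlers; a minimal sketch:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // In a real setup this would run alongside the Redis listener. A 10-second
    // CPU profile can then be fetched with:
    //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=10
    log.Println(http.ListenAndServe("localhost:6060", nil))
}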

Most programming languages have ways to help us profile our code. Profiling in Golang is quite straightforward as well. You can add -cpuprofile=cpu.out to enable CPU profiling during benchmark test. Once the benchmark is finished, you will have a cpu.out file containing all the sampled CPU positions.

❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=10s -cpuprofile=cpu.out -run="^$"
goos: linux
goarch: amd64
pkg: github.com/jauhararifin/goredis
cpu: AMD Ryzen 9 7900X 12-Core Processor
BenchmarkRedisSet
BenchmarkRedisSet-24             8788383              1436 ns/op            696278 ops/sec           600 B/op         42 allocs/op
PASS
ok      github.com/jauhararifin/goredis 14.241s
❯ ls -l cpu.out
-rw-rw-r-- 1 jauhararifin jauhararifin 52565 Feb  3 06:47 cpu.out

As you can see above, after you run the benchmark test, you will have a file called cpu.out. This file contains the result of the CPU profiling. There are a bunch of ways to analyze this file. Go itself ships with tooling for this: you can execute go tool pprof cpu.out to explore it interactively:

❯ go tool pprof cpu.out
File: goredis.test
Type: cpu
Time: Feb 3, 2025 at 6:50am (WIB)
Duration: 12.97s, Total samples = 251.16s (1935.99%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Once you run the above command, you can interact with the pprof tool to analyze the profile. For example, you can type top 10 to see the 10 functions sampled most often by the CPU profiler.

(pprof) top 10
Showing nodes accounting for 196.87s, 78.38% of 251.16s total
Dropped 298 nodes (cum <= 1.26s)
Showing top 10 nodes out of 70
      flat  flat%   sum%        cum   cum%
   166.71s 66.38% 66.38%    166.71s 66.38%  runtime/internal/syscall.Syscall6
     5.21s  2.07% 68.45%      5.21s  2.07%  runtime.(*timeHistogram).record
     3.92s  1.56% 70.01%      3.92s  1.56%  runtime.procyield
     3.66s  1.46% 71.47%      4.68s  1.86%  runtime.deductAssistCredit
     3.52s  1.40% 72.87%     12.30s  4.90%  runtime.exitsyscall
     3.47s  1.38% 74.25%    164.78s 65.61%  internal/poll.(*FD).Read
     3.08s  1.23% 75.48%    136.93s 54.52%  github.com/jauhararifin/goredis.readOne
     3.07s  1.22% 76.70%     11.85s  4.72%  runtime.mallocgc
     2.16s  0.86% 77.56%      7.49s  2.98%  runtime.casgstatus
     2.07s  0.82% 78.38%      4.45s  1.77%  runtime.reentersyscall

The result above contains 10 lines of output, each line representing a sampled function. The first line is for the function runtime/internal/syscall.Syscall6, which is the function Go’s runtime uses to perform syscalls. There are 5 numbers to the left of each function name, and they all derive from two quantities: flat and cum. flat% is simply flat expressed as a percentage of the total, and the same applies to cum and cum%.

flat represents how long your CPU spends inside that function itself. A higher number means your CPU spends a lot of time executing that function, but it doesn’t count time spent in any function it calls. So, if you have a function foo that calls another function bar, the execution of bar won’t be counted in foo’s flat. For example, say you have a function that takes a slice of integers and filters the prime numbers from that slice; you might write it like this:

func filterPrime(s []int) []int {
    result := make([]int, 0, len(s))
    for _, x := range s {
        if isPrime(x) {
            result = append(result, x)
        }
    }
    return result
}

func isPrime(x int) bool {
    ...
}

The execution of isPrime won’t be counted in the flat number. So if filterPrime takes 10 seconds to finish, but isPrime itself runs for 8 of those seconds, the flat number for filterPrime will be 2 seconds.

flat% is just flat as a percentage of total CPU time. In the report above, the header says there are 251.16 seconds of samples. This may look odd because we only ran the benchmark for about 14 seconds, but CPU time is not the same as wall-clock time: it measures the time used by all CPUs combined, not elapsed time. My PC has 24 logical CPUs, so each CPU accounts for roughly 10 seconds of that total. Of course, this is just an approximation; one CPU might execute more instructions than the others. Back to flat%: the flat number of the first function is 166.71 seconds, and dividing it by the total profiling time (251.16 seconds) gives 66.38%, which is exactly its flat% value.

sum% is just a running total of flat%: the sum% value on the 5th row is the sum of the flat% values of rows 1 through 5.

1            flat%   sum%
2        +--66.38% 66.38%                    runtime/internal/syscall.Syscall6
3        |   2.07% 68.45%                    runtime.(*timeHistogram).record
4sum up -+   1.56% 70.01%                    runtime.procyield
5to      |   1.46% 71.47%                    runtime.deductAssistCredit
672.87%  +---1.40% 72.87%                    runtime.exitsyscall
7                  ^
8                  |
9                  +-- here 72.87% was the sum of 66.38% + 2.07% + 1.56% + 1.46% + 1.40%

Now, let’s talk about cum. cum is actually pretty straightforward. It’s the cumulative sample from a function and any other function it called. Let’s use the filterPrime function above. While flat shows how long the CPU spends its time on that function only, cum shows how long the CPU spends its time on that function plus any other function called by it. So, if our filterPrime runs for 10 seconds, and the isPrime alone takes 8 seconds, then the cum will be 10 seconds.

There are a bunch of other commands besides top 10 that you can use to analyze the profiling result. One of my favorites is the web command. If you run the web command, a browser window opens showing the result as a graph.

CPU Profiling 1

Similarly to the top output, each box in the graph represents a single function. The redder a box, the bigger its cum number; greyer boxes have smaller cum numbers. Each box also contains some numbers. For example, in the goredis.readObject box there is 0.36s (0.14%) of 135.92s (54.12%); these are its flat and cum numbers. While the color shows the magnitude of the cum value, the font size shows the magnitude of the flat value, so bigger flat numbers are drawn in a bigger font.

Now, apart from the box, if you use graphical representation, you will also see arrows pointing from one box to another box. These arrows represent function calls. Take a look at the picture below:

CPU Profiling 2

In the above picture, there are 3 arrows coming out of goredis.readObject, and those arrows point to runtime.convTstring, goredis.readBulkString, and goredis.readOne. This means that goredis.readObject calls those 3 functions. A function might not be shown in the graph if it’s too small. So, be aware that if you don’t see a function in the profiling graph, it doesn’t mean it’s not being called.

Now, notice that each arrow has a text containing a duration. This duration represents the time our CPU takes to execute the function called by goredis.readObject. If you sum up the flat number of goredis.readObject and the numbers of the arrows, you will get: 0.36s + 1.64s + 117.21s + 16.71s = 135.92s which is the same as the cum number of goredis.readObject.

The arrows also have different thicknesses and colors. Red arrows mean the CPU takes more time executing that call path, so you can think of the arrow’s color as representing its duration. Likewise, thicker arrows mean the CPU spends more time on that path.

Using the graph representation can help us trace the origin of our bottleneck. We can see here that the syscall takes a lot of CPU time and that it originates from readArray.

There is one more way you can analyze the CPU profiling result: you can use speedscope. You can drag and drop your cpu.out file to this app and it will show you a flamegraph.

CPU Profiling 3

Similarly, each block in this graph represents a function. If you notice, the blocks are stacked on top of each other. The stack represents a function call. Wider blocks mean the CPU takes more time executing that block. You can also find the cum and flat values in the bottom left corner of the graph.

Well, there are still other ways to visualize the CPU profiling result, but we will focus on the three methods above.

Syscall

Let’s go back to our CPU profile graph.

CPU Profiling 4

You will see that the box with the biggest font is syscall.Syscall6. This indicates that our CPU spends a lot of time there. And if you look carefully, this syscall originates from readArray, which calls readObject, readBulkString, readLength, readOne, and then calls (*net.conn).Read. So, the CPU spends a lot of time reading the client’s request and parsing it. The parsing part is actually not that bad, judging from the flat value of readArray, readObject, readBulkString, readLength, and readOne. All of them executed less than 1% of the CPU time. The biggest contribution is actually from syscall.Syscall6. What is this syscall.Syscall6?

You can think of a syscall as an API for the kernel. Your computer most likely runs on an operating system. The kernel is part of the operating system. It manages the resources in your computer. CPU, memory, and storage are some of the resources it handles. It also helps us talk to other hardware like monitors, printers, and keyboards through drivers. Our application is like an isolated environment where we perform some basic tasks like adding two numbers, looping, calling a function, etc. But when it comes to allocating memory, showing something on the screen, listening to a socket, or sending data to other processes, we can’t do those alone without the kernel. Syscall is an API to talk to the kernel to do all of that. There are syscalls for allocating memory, writing to a file, receiving bytes from other processes, sending TCP packets through network interfaces, etc. In Go, syscall.Syscall6 has this signature:

1func Syscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2 uintptr, err Errno)

It accepts a bunch of integers and returns two integers and an error code. The first parameter, trap, identifies which syscall we want to make: there is one number for writing to a file, another for allocating memory, and so on. The remaining parameters are the arguments of that syscall; they might contain which file we want to read, which address we want to write to, etc. We won’t go into too much detail on these parameters. You just need to know they’re there.
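
To make this a bit more concrete, here is a minimal, Linux-only sketch (not from the goredis repo) that writes a string to stdout by invoking the write syscall directly through syscall.Syscall, the three-argument sibling of Syscall6. In normal code you would never do this yourself; the os and net packages do it for you.

 1package main
 2
 3import (
 4    "syscall"
 5    "unsafe"
 6)
 7
 8func main() {
 9    msg := []byte("hello from a raw syscall\n")
10
11    // write(2): trap selects the syscall we want, the rest are its
12    // arguments: fd 1 (stdout), a pointer to the buffer, and its length.
13    n, _, errno := syscall.Syscall(
14        syscall.SYS_WRITE,
15        uintptr(1),
16        uintptr(unsafe.Pointer(&msg[0])),
17        uintptr(len(msg)),
18    )
19    if errno != 0 {
20        panic(errno)
21    }
22    _ = n // number of bytes actually written
23}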

Now, the problem with syscalls is that they are quite slow. It won’t be that bad if you call a syscall fifty or a hundred times. But if you call them millions of times, the overhead adds up. Unlike an ordinary function call, a syscall has extra overhead: the CPU needs to switch into kernel mode, which means your application’s state has to be preserved during the syscall and restored after it finishes. The CPU also needs to switch to a higher privilege level, which has its own cost. Some syscalls, like reading data from a socket, additionally need to copy data from a kernel buffer to a user buffer. There are other overheads I won’t mention here, but the point is that a syscall is far more expensive than an ordinary function call.

Buffered Reader

One trick to improve performance is by making expensive operations less frequent. We can consider syscalls as expensive operations here. In our code, when we parse the request from the client, we use the readOne helper function that reads one byte from net.Conn. Reading a byte from net.Conn requires one syscall. If we read 10 bytes, then we need 10 syscalls. This overhead adds up and makes our application slower. To reduce the overhead, we can batch bytes together. For example, instead of asking for one byte at a time, we read 1000 bytes at once using one syscall. We can store those 1000 bytes in a []byte in our app. Whenever we need to read a byte, we just check our local []byte. If it has a byte, then we read from it. But if it’s empty, we use syscall again to read another 1000 bytes. By doing this, reading 2000 bytes doesn’t have to use 2000 syscalls; we can just use 2 syscalls. That sounds great, but why limit it to 1000? Why don’t we just read 1 million bytes in one syscall? Well, smaller batch sizes mean more syscalls, which means a lot of overhead. But bigger batch sizes mean we need more memory to store it, and a single syscall might take longer. If we use a very big batch size, the available bytes might not be able to catch up to our consumption. We might read it faster than the client sends it, which is not really a problem, but we will waste some memory. People often use 4096 as the batch size. This is because that’s usually the minimum size of the buffer that resides in the kernel. Also, it’s the size of a single memory page, which means we can map it into a single entry in the page table. I personally don’t have a very strong opinion on this number. If you want to get the best number, you can just experiment with it. If your server and client are in the same data center, you might have more throughput with a bigger batch size because your bandwidth is bigger.

Buffered Reader
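
Here is a minimal sketch of the batching idea described above, using a hypothetical bufferedReader type (it only needs the io package; imports are omitted as in the other snippets). The standard library’s bufio.Reader, introduced next, does the same thing properly.

 1// bufferedReader wraps an io.Reader and serves single-byte reads from a
 2// local buffer, so the underlying Read call (and its syscall) only
 3// happens when the buffer runs out.
 4type bufferedReader struct {
 5    r    io.Reader
 6    buf  [4096]byte
 7    pos  int // next unread byte in buf
 8    size int // number of valid bytes in buf
 9}
10
11func (b *bufferedReader) ReadByte() (byte, error) {
12    if b.pos >= b.size {
13        // The buffer is empty: refill it with a single Read call.
14        n, err := b.r.Read(b.buf[:])
15        if n == 0 {
16            return 0, err
17        }
18        b.pos, b.size = 0, n
19    }
20    c := b.buf[b.pos]
21    b.pos++
22    return c, nil
23}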

Fortunately, Go has a really nice standard library that can help us with this. We can use bufio.Reader to do this. bufio.Reader does what I described above. It wraps an io.Reader and buffers it. By default, bufio.Reader uses 4096 bytes to buffer the read bytes. To use bufio.Reader, we can just call bufio.NewReader(...) and pass our net.Conn (remember, net.Conn is an io.Reader as it implements the Read function).

 1func (s *server) handleConn(clientId int64, conn net.Conn) {
 2    ...
 3    // We wrap the conn with bufio.NewReader.
 4    // Now, every `Read` call will use the buffer provided by bufio.Reader
 5    // until there are no more available bytes. Once the buffer is empty,
 6    // it will call `conn.Read` to fill in the buffer again.
 7    reader := bufio.NewReader(conn)
 8    for {
 9        // Now we use the wrapper connection instead of the original `conn`.
10        request, err := readArray(reader, true)
11        ...
12    }
13    ...
14}

That’s it. We just wrap the conn with bufio.Reader and it should be done. You can check the code in this commit. Next, let’s run the benchmark again to see the result.

1❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=10s -cpuprofile=cpu.out -run="^$"
2goos: linux
3goarch: amd64
4pkg: github.com/jauhararifin/goredis
5cpu: AMD Ryzen 9 7900X 12-Core Processor
6BenchmarkRedisSet
7BenchmarkRedisSet-24            11713911               903.7 ns/op         1106589 ops/sec           566 B/op         42 allocs/op
8PASS
9ok      github.com/jauhararifin/goredis 11.868s

Now, our throughput is about 1.1 million SET operations per second. That’s a 40% improvement from the previous 776K SET operations per second.

CPU Profiling 5

If you check the profiling result again, readArray is no longer the bottleneck.

Premature Optimization: Buffered Writer

Remember, the key to improving your app’s performance is identifying your bottleneck first. Improving the performance in places that are not bottlenecks won’t give you much performance improvement. There is nothing wrong with improving things out of context if it’s a low-hanging fruit. Maybe you can improve your app performance by 1% by just adding one line of code; that’s fine. It’s not a difficult thing to do, so you might as well do it. But if the improvement is very complex and it is not the bottleneck, you might end up wasting a lot of time for just 1% of performance improvement.

On some occasions, improving your performance by 1% can matter, especially when you’re working on a very performance-critical system like a database or operating system.

Since our read is now buffered, we may think that it makes sense to buffer our write as well. When we send a response to the client, we can put it in a buffer first and send it only when the buffer is full. By doing this, we will have a smaller number of syscalls to send the response.

However, unlike the buffered reader, a buffered writer is not as simple as wrapping our connection with bufio.Writer. The way bufio.Writer works is similar to bufio.Reader: it allocates a slice of bytes in our application’s memory and writes the data there instead of directly to the underlying writer. In our case, if we wrap our connection with bufio.Writer, we will write to the local bytes first and only send them through the socket once the buffer is full. This is not what we want. If we do it naively like this, some responses might never be sent to the socket. Imagine our bufio.Writer uses 4096 bytes for its buffer size. Now, let’s say a client sends SET "key1" "value" to our server, and our server needs to send back OK. Instead of sending it through the socket, we write it to our local buffer. Because the OK response is only a few bytes (nowhere near 4096), bufio.Writer won’t send it through the socket connection and we’re stuck here.

As an analogy, imagine you want to ride a roller coaster in a theme park. The roller coaster runs for about 5 minutes. To maximize utility, you want to wait until the roller coaster is full before you run it. This works pretty well when it’s crowded. But imagine when the theme park is not very crowded; there might be only 2 people who want to ride the roller coaster. In this case, the roller coaster will never be full and it will never run.

In order to solve the above problem, we need to have some timeout. Instead of waiting until the buffer is full, we need to limit our waiting time to some timeout, let’s say 10ms. So, if in 10ms the buffer is not full, we should send it anyway. This design, however, has some downsides. Now, for a response to be sent, it has to wait at most 10ms. If our traffic is high, our buffer might be full faster than 10ms and we get a lot of improvement there. But if our traffic is very low, a single request might take 10ms before the response itself is sent. This tradeoff is something you should be aware of before doing performance improvement. You should ask yourself whether it’s okay to increase the latency a little bit in order to get more throughput. In some cases, it might not be okay. In other cases, it might be okay.

There is a similar algorithm in computer networking called Nagle’s algorithm. Basically, when sending a packet through TCP, we can delay it first so we can send more packets. The details are a little bit different. It doesn’t use timeout as a fallback mechanism, but the idea is similar.

Let’s start by creating a buffered writer struct:

 1type bufferWriter struct {
 2    w    io.Writer
 3    buff []byte
 4    n    int
 5}
 6
 7func newBufferWriter(w io.Writer) *bufferWriter {
 8    return &bufferWriter{
 9        w:    w,
10        buff: make([]byte, 4096),
11        n:    0,
12    }
13}

Let’s continue implementing the Write method:

 1func (b *bufferWriter) Write(buff []byte) (int, error) {
 2    totalWrite := 0
 3    for len(buff) > len(b.buff)-b.n {
 4        var n int
 5        if b.n == 0 {
 6            n, _ = b.w.Write(buff)
 7        } else {
 8            n = copy(b.buff[b.n:], buff)
 9            b.n += n
10            _ = b.flush()
11        }
12        totalWrite += n
13        buff = buff[n:]
14    }
15    n := copy(b.buff[b.n:], buff)
16    b.n += n
17    totalWrite += n
18    return totalWrite, nil
19}
20
21func (b *bufferWriter) flush() error {
22    panic("todo")
23}

In the above code, the logic is pretty straightforward. We copy the data in the buff parameter into our internal buffer. As long as buff doesn’t fit in the remaining space, we copy just enough of it to fill the internal buffer and then flush. There is a special case where the internal buffer is empty: in that case we can skip the copy and write the whole buff directly to the underlying writer. The rest of the code simply copies the remaining buff into the internal buffer and updates the metadata.

Implementing the flush function is quite straightforward as well:

 1func (b *bufferWriter) flush() error {
 2    if b.n == 0 {
 3        return nil
 4    }
 5
 6    n, err := b.w.Write(b.buff[:b.n])
 7    if n < b.n && err == nil {
 8        err = io.ErrShortWrite
 9    }
10    if err != nil {
11        if n > 0 && n < b.n {
12            copy(b.buff[0:b.n-n], b.buff[n:b.n])
13        }
14        b.n -= n
15        return err
16    }
17    b.n = 0
18    return nil
19}

To flush our internal buffer to the socket, we just need to call b.w.Write and pass our buffer. This might work well if our underlying writer is a socket. But let’s think bigger and assume that the internal writer can be any writer. When we call b.w.Write, it returns an integer n indicating how many bytes are written. This n is not guaranteed to be as big as the bytes we pass to b.w.Write; it can be less. When this happens, we want to return an error, io.ErrShortWrite to be precise, and update our internal buffer so that it only includes the bytes that failed to be written. This is technically unnecessary for our use case, but it’s nice to have more generic functions like this.

Now, so far we’ve implemented the buffered writer like bufio.Writer. We haven’t added the timeout yet. We can spawn a separate goroutine that will flush our buffer every 10ms. Note that I pick 10ms as the timeout out of thin air here. You might wonder, why not 1ms or 20ms? Well, there is no particular reason for that. You want the number to be big enough so that it can gather a lot of bytes before it sends them to the socket. But it also needs to be small enough so that your latency won’t be that bad.

One thing that you need to be careful about here is race conditions. Now because we spawn a new goroutine that flushes the buffer periodically, there is now a chance that two goroutines will access our local buffer concurrently. You need a lock to avoid this problem.

Periodically calling flush also has another problem. Now, your CPU needs to wake up every 10ms and check whether there are bytes in the buffer that can be sent through the socket. You might think that 10ms is not that bad, but actually if you have 1000 clients, your CPU might need to wake up way more often than 10ms. So, instead of doing that, let’s make the CPU completely sleep when the buffer is empty. Only when someone fills it with some bytes do we want to start the 10ms timer to send the message. So, if nothing is written to the local buffer, we just sleep endlessly. Once something is written to the local buffer, we will wake up, wait for 10ms and send the buffer through the socket if the buffer is not empty.

To implement the above idea, we can use the sync.Cond utility provided by Go. A sync.Cond wraps a Locker in its L field (you supply one, usually a *sync.Mutex, via sync.NewCond): you acquire the lock with cond.L.Lock() and release it with cond.L.Unlock(). Additionally, it provides two more methods, Wait() and Signal(). It actually has one more method called Broadcast, which is similar to Signal, but we won’t be using that.

So, what does Wait() do? It’s a little bit complex, so buckle up. Before calling Wait, you must hold the lock (cond.L); if you call Wait without holding it, a panic will be raised. Calling Wait releases the lock and pauses the current goroutine until someone calls Signal or Broadcast. So, once you call Wait, the lock is released and you sleep until Signal wakes you up. When you are woken up, Wait reacquires the lock before returning. Notice that you hold the lock before you call Wait() and you hold it again after Wait() returns. This property turns out to be very useful for waiting on a condition, which is why the type is called Cond and the method is called Wait.

What do I mean by “wait for some condition”? Let’s use our example. Remember that we want to spawn a goroutine that periodically flushes the buffer? Let’s call it the “flusher goroutine”. As an optimization, we want the flusher goroutine to just sleep while the buffer is empty. Only when the buffer is filled do we want to wait 10ms and flush its content. So the condition we keep sleeping on is b.n == 0: as long as it holds, the flusher has nothing to do. To sleep while a condition holds, you can use this pattern:

1cond.L.Lock()
2for checkTheCondition() {
3    cond.Wait()
4}
5// At this point the condition no longer holds; do whatever you need to do here
6cond.L.Unlock()

Later, somewhere, you should call Signal if you change the condition. So for our case, the flusher goroutine will look like this:

 1func (b *bufferWriter) Start() error {
 2    b.cond.L.Lock()
 3    for b.n == 0 {
 4        b.cond.Wait()
 5    }
 6
 7    time.Sleep(10 * time.Millisecond)
 8
 9    // Note that because flush accesses b.buff, we must call it
10    // while holding the lock
11    if err := b.flush(); err != nil {
12        b.cond.L.Unlock()
13        return err
14    }
15    b.cond.L.Unlock()
16    return nil
17}

In the code above, we sleep as long as b.n == 0. Once someone signals us, we check again whether b.n == 0; if it’s still true, we go back to sleep. Eventually someone signals us while b.n > 0, and then we sleep for 10ms and flush the buffer. There is a small problem here: we call time.Sleep while holding the lock. That’s not okay, because while we are sleeping no other goroutine can do anything that requires the lock, and we don’t actually need the lock just to sleep. To fix that, we can release the lock before sleeping:

 1func (b *bufferWriter) Start() error {
 2    b.cond.L.Lock()
 3    for b.n == 0 {
 4        b.cond.Wait()
 5    }
 6
 7    b.cond.L.Unlock()
 8    time.Sleep(10 * time.Millisecond)
 9    b.cond.L.Lock()
10
11    // Note that because flush accesses b.buff, we must call it
12    // while holding the lock
13    if err := b.flush(); err != nil {
14        b.cond.L.Unlock()
15        return err
16    }
17    b.cond.L.Unlock()
18    return nil
19}

Unfortunately, the above code still has a problem. It only runs once. We want it to run infinitely. Let’s fix that:

 1func (b *bufferWriter) Start() error {
 2    for {
 3        b.cond.L.Lock()
 4        for b.n == 0 {
 5            b.cond.Wait()
 6            b.cond.L.Unlock()
 7            time.Sleep(10 * time.Millisecond)
 8            b.cond.L.Lock()
 9        }
10        if err := b.flush(); err != nil {
11            b.cond.L.Unlock()
12            return err
13        }
14        b.cond.L.Unlock()
15        // This unlock is immediately followed by Lock on line 3 above
16        // Can you improve this function so that we don't have to Unlock here?
17    }
18}

This implementation is still not perfect because we haven’t implemented graceful shutdown, but let’s not make it more complicated than it needs to be.

Next, we just need to update our Write method so that it calls Signal whenever it writes a message that makes the buffer not empty.

 1func (b *bufferWriter) Write(buff []byte) (int, error) {
 2    // notice that since now our buffer can be accessed by
 3    // the flusher goroutine, we need to acquire the lock first
 4    // before accessing it
 5    b.cond.L.Lock()
 6    defer b.cond.L.Unlock()
 7    ...
 8    // now, after every write, we need to call Signal so that
 9    // our flusher goroutine will wake up from its sleep, check
10    // whether the buffer is empty, and start the 10ms timer to
11    // flush the buffer if it's not empty.
12    b.cond.Signal()
13    return totalWrite, nil
14}

Now we just need to wire our buffered writer to the TCP connection:

 1func (s *server) handleConn(clientId int64, conn net.Conn) {
 2    ...
 3    writer := newBufferWriter(conn)
 4    go func() {
 5        // We haven't handled the shutdown here. Ideally, when the client is
 6        // disconnected, we should call writer.Shutdown() and cleanup the
 7        // flusher goroutine so that it won't cause memory leak.
 8        if err := writer.Start(); err != nil {
 9            s.logger.Error(
10                "buffered writer error",
11                slog.Int64("id", clientId),
12                slog.String("err", err.Error()),
13            )
14        }
15    }()
16    ...
17}

Let’s try to benchmark our redis server again.

1❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=10s -cpuprofile=cpu.out -run="^$"
2goos: linux
3goarch: amd64
4pkg: github.com/jauhararifin/goredis
5cpu: AMD Ryzen 9 7900X 12-Core Processor
6BenchmarkRedisSet
7BenchmarkRedisSet-24            13776495               963.5 ns/op         1037839 ops/sec           640 B/op         42 allocs/op
8PASS
9ok      github.com/jauhararifin/goredis 14.556s

Now we get 1.037 million SET operations per second. That’s not much of an improvement, is it? In fact, it’s slightly worse. If you run the benchmark multiple times, some runs will be a bit better and some a bit worse, but the difference is not significant. What’s going on here? Well, writing was never the bottleneck. We never checked what the bottleneck of our application actually was and jumped straight into the idea of buffering the write operations. There are other, bigger bottlenecks in our application, and the overhead of the write operations is very small compared to them. Improving the writes won’t give you more throughput if you don’t address the bigger bottleneck. All of that work was wasted without any meaningful performance improvement, which is why profiling your application before making performance improvements is so important.

Lock Contention

Let’s actually find the bottleneck, shall we? Let’s open the cpu.out on the speedscope app.

CPU Profiling 6

Now, you can see that things are not as straightforward as before. There are several blocks that look large, but none that is overwhelmingly large. Among the topmost boxes, the largest is handleConn, which makes sense because that’s the thing we are trying to optimize. Looking at handleConn alone won’t help us much, though; there’s a lot going on inside it, so which part is the bottleneck? To answer that, look at the boxes below handleConn. The biggest one is handleSetCommand, which also makes sense, as that’s the command we’re currently benchmarking. You can double-click handleSetCommand to zoom in.

CPU profiling of handleSetCommand

You can see in the flamegraph that the CPU spends most of its time on sync.(*RWMutex).Lock, runtime.mapassign_faststr, and sync.(*RWMutex).Unlock. We know sync.(*RWMutex).Lock is the function we use to acquire the mutex lock and sync.(*RWMutex).Unlock is the function we use to release the lock. Apparently, when we handle the SET command, we spend most of our time locking and unlocking the mutex. We can consider this as one of our potential bottlenecks. I use the word “potential” here because sometimes there are things that we have to do and can’t be removed. We don’t know yet whether we can improve or remove the mutex locking here. Spoiler alert: we can. But sometimes, knowing the bottleneck itself is just half of the story. Addressing the bottleneck is another story. OK, so we know one potential bottleneck, i.e., locking the mutex.

At this point, you may ask why locking the mutex is so expensive. To answer that question, we need to look into how it works behind the scenes. We know that accessing a memory region concurrently can cause race conditions, and it’s not OK. So, things like sync.Mutex might seem like an impossible data structure to implement. Well, actually, accessing the same memory region can be safe if you’re using atomic operations. There are a bunch of atomic operations provided by sync/atomic. These are special functions that will be translated into special instructions in the CPU. With these operations, you can concurrently access the same memory region safely. Unfortunately, there are not a lot of atomic instructions we can use. We will focus on one instruction for now, i.e., CompareAndSwapInt32. The logic of CompareAndSwapInt32 is something like this:

 1// This is the illustration of how CompareAndSwapInt32 behaves.
 2// This is not the actual implementation of CompareAndSwapInt32
 3// because the implementation below can cause data race when
 4// called concurrently. In practice, every call to
 5// CompareAndSwapInt32 will be translated into special CPU
 6// instruction that performs the operations below, but atomically.
 7func CompareAndSwapInt32(addr *int32, oldV, newV int32) bool {
 8    if *addr == oldV {
 9        *addr = newV
10        return true
11    }
12    return false
13}

The above code shows roughly the internal logic of CompareAndSwapInt32. However, those operations happen atomically. So calling CompareAndSwapInt32 on the same address concurrently will never cause a race condition. Suppose we have an int32 variable called x with a value of 100. Two goroutines call CompareAndSwapInt32 on variable x:

 1var a, b bool
 2
 3// goroutine A:
 4go func() {
 5    a = atomic.CompareAndSwapInt32(&x, 100, 200)
 6}()
 7
 8// goroutine B:
 9go func() {
10    b = atomic.CompareAndSwapInt32(&x, 100, 300)
11}()

If both goroutine A and B happen to execute CompareAndSwapInt32 at the same time, one of them will succeed and one of them will fail. The successful call returns true and the failed one returns false. So, in the above example, the final outcome after both goroutines finish is either a = true, b = false, and x = 200 (goroutine A won), or a = false, b = true, and x = 300 (goroutine B won).

You can think of it as if either goroutine A’s operation executed first or goroutine B’s operation executed first. This is called sequential consistency, and the Go memory model guarantees sequentially consistent behavior for these atomic operations. Other programming languages like C++ and Rust expose more flexible (weaker) memory models.

So, let’s get back to the question of how mutex works behind the scenes. Well, we can have a single integer that represents the mutex state; it can be locked or unlocked. To lock the mutex, we can use atomic.CompareAndSwapInt32(&the_state, 0, 1) to switch the mutex from unlocked state to locked state. If the CompareAndSwapInt32 returns true, that means we’ve successfully locked the mutex. In this case, we can just return. This is called the fast-path of mutex locking. CompareAndSwapInt32 is pretty fast; it’s almost like just assigning a value to a variable. If we don’t get the lock right away, we will lock the mutex in the slow-path.

 1type Mutex struct {
 2    state int32
 3}
 4
 5func (m *Mutex) Lock() {
 6    // Fast path: grab unlocked mutex.
 7    if atomic.CompareAndSwapInt32(&m.state, 0, 1) {
 8        return
 9    }
10    m.lockSlow()
11}
12
13func (m *Mutex) lockSlow() {
14    ...
15}

What happens in the slow path? Well, one way we can do it is just check the state over and over again and try to swap it from 0 to 1 until we succeed.

1func (m *Mutex) lockSlow() {
2    for {
3        if atomic.CompareAndSwapInt32(&m.state, 0, 1) {
4            break
5        }
6    }
7}

However, this is a very bad implementation of a mutex. This implementation is called a spin-lock, and it’s bad because it wastes CPU cycles just checking over and over whether the mutex has been released so we can lock it. To improve this, we could just sleep until we are woken up by the goroutine that unlocks the mutex. But it’s not that simple: what if the mutex is unlocked right before we get the chance to sleep? Our goroutine would end up sleeping forever. To address this, Go uses a semaphore. A semaphore is like a counter that we can wait on: waiting decreases the counter by one and puts us to sleep if the new counter is negative, and this happens atomically. On the other side, signaling the semaphore increases the counter and wakes up a sleeper. We can use this to unlock the mutex. I won’t go into more detail on the code because it’s quite involved; you can check the lockSlow implementation yourself for the details.

There is one more problem: making a goroutine sleep is not easy. Our goroutine is managed by the Go runtime; behind the scenes there is a thread that executes it, and when our goroutine is sleeping, that thread should execute another goroutine. But which goroutine should it execute next? It should pick one that is not sleeping and is ready to run, and finding that next goroutine is not free either. If two goroutines go to sleep at the same time, the threads might contend over which runnable goroutine to pick up next, and Go needs to protect itself from race conditions here as well. On top of that, the runtime needs to remember that our goroutine is sleeping so it can be woken up later when the semaphore counter is increased. All of this is quite involved and costs a fair number of CPU cycles. This process of saving the state of the current goroutine, finding the next goroutine to execute, and starting to execute it is called context switching.

To avoid the above problem, Go actually uses a hybrid strategy. It tries the spin-lock approach a few times first: spinning once or twice is cheaper than putting the goroutine to sleep because we avoid a context switch. But the longer you spin, the more CPU cycles you burn. At some point spinning is no longer worth it, and you might as well sleep until whoever holds the lock wakes you up by calling Unlock. So that’s how the slow path works: it spins a few times and goes to sleep if it still fails to acquire the lock.
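
To make the hybrid idea concrete, here is a rough, heavily simplified sketch. It is not Go’s actual implementation: the real runtime spins with a CPU pause instruction and sleeps on an internal runtime semaphore, while this sketch (which needs the runtime and sync/atomic imports) uses runtime.Gosched for the spin and a one-slot channel as a stand-in for the semaphore.

 1// hybridMutex is a toy mutex illustrating the "spin a few times, then
 2// sleep" strategy. It is not how Go's runtime implements sync.Mutex.
 3type hybridMutex struct {
 4    state int32         // 0 = unlocked, 1 = locked
 5    wake  chan struct{} // one-slot channel standing in for the semaphore
 6}
 7
 8func newHybridMutex() *hybridMutex {
 9    return &hybridMutex{wake: make(chan struct{}, 1)}
10}
11
12func (m *hybridMutex) Lock() {
13    // Fast path: a single cheap compare-and-swap.
14    if atomic.CompareAndSwapInt32(&m.state, 0, 1) {
15        return
16    }
17    m.lockSlow()
18}
19
20func (m *hybridMutex) lockSlow() {
21    for {
22        // Spin a few times first; this is cheaper than sleeping if the
23        // lock is released very quickly.
24        for i := 0; i < 4; i++ {
25            if atomic.CompareAndSwapInt32(&m.state, 0, 1) {
26                return
27            }
28            runtime.Gosched()
29        }
30        // Still locked: block until Unlock sends a wake-up token,
31        // then try again.
32        <-m.wake
33    }
34}
35
36func (m *hybridMutex) Unlock() {
37    atomic.StoreInt32(&m.state, 0)
38    // Wake up at most one sleeping goroutine, if any.
39    select {
40    case m.wake <- struct{}{}:
41    default:
42    }
43}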

The condition where you can’t acquire the mutex lock because it’s being held by another goroutine is called lock contention. As you see above, higher lock contention means more CPU cycles wasted for spinning and context switching. If you take a look at the profiling result below (the same as above), you will see the runtime.mcall box is quite large.

CPU Profiling 6

If you look at the flamegraph, runtime.mcall calls runtime.park_m, runtime.schedule, runtime.findRunnable, and so on. These functions perform the context switching we just talked about. runtime.findRunnable is the function the Go runtime uses to find the next runnable goroutine to execute; here, you can think of a goroutine as “runnable” if it’s not sleeping.

As you can see here, the overhead of the lock contention is quite high. The total CPU time spent in locking, unlocking, and context switching is about 35%.

sync.Map

Identifying the bottleneck of our application is the first step in improving the performance of our app. Knowing the bottleneck helps us find the correct improvements to apply to our application. However, it might not be the whole story. Often, we also need to know the access pattern of our application to choose the right strategy that can improve our application.

Ok, so now we know that we have high lock contention. How do we reduce it? Well, we can ask ourselves why we need that lock in the first place. We need it to protect our map from concurrent access. Fortunately, Go has a sync.Map data structure that we can use. From the documentation, it says that this particular map is just like map[any]any, but safe for concurrent access without additional locking or coordination. Loads, stores, and deletes run in amortized constant time. That sounds like the map we need. Let’s just swap our map implementation with that.

 1type server struct {
 2    ...
 3    database sync.Map
 4}
 5
 6func (s *server) handleGetCommand(clientId int64, conn io.Writer, command []any) error {
 7    ...
 8    valueObj, ok := s.database.Load(key)
 9    if ok {
10        value := valueObj.(string)
11        ...
12    }
13    ...
14}
15
16func (s *server) handleSetCommand(clientId int64, conn io.Writer, command []any) error {
17    ...
18    s.database.Store(key, value)
19    ...
20}

You can find the full code here.

Let’s run our benchmark again.

1❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=10s -cpuprofile=cpu.out -run="^$"
2goos: linux
3goarch: amd64
4pkg: github.com/jauhararifin/goredis
5cpu: AMD Ryzen 9 7900X 12-Core Processor
6BenchmarkRedisSet
7BenchmarkRedisSet-24             9882496              1237 ns/op            808707 ops/sec           626 B/op         47 allocs/op
8PASS
9ok      github.com/jauhararifin/goredis 13.738s

Surprisingly, our throughput drops quite a lot, by about 27%. What’s going on here? Well, before reaching for sync.Map, we should have looked at its limitations.

From the Go documentation, you can see what the sync.Map is optimized for:

The Map type is optimized for two common use cases: (1) when the entry for a given key is only ever written once but read many times, as in caches that only grow, or (2) when multiple goroutines read, write, and overwrite entries for disjoint sets of keys. In these two cases, use of a Map may significantly reduce lock contention compared to a Go map paired with a separate Mutex or RWMutex.

Those two cases are not the ones we’re testing. In our benchmark, each client generates random keys and calls SET with those keys. This is not case (1) from the Go documentation above: a single key is not written once and read many times; it is written many times. It isn’t case (2) either, because each client can set any key: the keys are random, so two different clients might end up writing to the same key. This is why sync.Map doesn’t give us any improvement here; it just adds the overhead of using any instead of string.

However, sync.Map might be more relevant in a real production environment. Usually, when we have an application, most of the data is never accessed. It could be that 80% of traffic accesses only 20% of the total data. On top of that, an entry in the map is rarely changed but often read. Imagine you are building a social media platform. You can store the user profile in Redis using a key like this: user/<the_user_id>. If this user is a very popular person, this entry will be read many times but rarely changed. This entry may change occasionally when the user updates their profile information, which is rare.

This is why understanding our access pattern is important. Knowing the access pattern helps us find the best strategy to improve the performance of our app.

You might wonder how Go implements a data structure like sync.Map. I won’t go into detail here, but at a high level, sync.Map combines a plain map, a mutex, and atomic operations. Reads and updates of keys that already exist go through a read-only view of the map that is loaded atomically, so the value of an existing key can be swapped with a compare-and-swap operation without taking the lock at all. Remember, compare-and-swap is quite cheap, so this path performs well. Only when a key is not present yet does sync.Map fall back to a mutex-protected “dirty” map, which is why adding a new key is comparatively expensive: it has to acquire the lock exclusively.

Sharding

Ok, it seems sync.Map doesn’t work well in our case. We can try something else.

The most common trick for performance optimization is to make expensive operations less frequent; I mentioned this in a previous section as well. In this case, our expensive operation is the mutex locking. Or, to be more precise, the expense comes from the fact that many goroutines try to lock the mutex concurrently, which causes a single lock acquisition to fall into the slow path more often. Remember how the mutex works: if we’re lucky, acquiring the lock is as simple as swapping the mutex state from unlocked to locked. Only when the lock is held by another goroutine do we fail at this and have to take the slow path, i.e., spin a few times and then sleep until woken up when the mutex is unlocked.

So, following the principle of making expensive operations less frequent, we can try to make it less often for the mutex to pick the slow path. To do this, we can do sharding. Instead of having one map and one mutex, we can have 1000 maps and mutexes. We can do this because most of the SET operations are independent of each other. If we send a SET command for keys a and b, we can perform the insertion for key a on one map, and insertion for key b on another map. Now, what we can do is assign each key to one of 1000 maps. For example, key a goes to map number 12, key b goes to map number 930, key c goes to map number 8, etc. By doing this, we distribute the load into 1000 maps.

Distributing the load into 1000 maps might not seem to have any benefit. At the end of the day, inserting 1 million keys means locking mutexes 1 million times, and inserting to maps 1 million times as well. But here’s the thing: the chance that there are two concurrent accesses to the same map is smaller now. So, yes, we are going to lock the mutexes 1 million times, but most of those operations will fall into the fast-path because it’s less likely that 2 goroutines lock the same mutex.

As an analogy, imagine a busy 10-lane highway with a single toll gate. If the traffic is low, this is not a problem; when a car arrives at the gate, they can pay quickly and move on. This whole process might finish before the next car arrives at the gate. So, serving a single car is fast because they don’t have to wait for another car. Now, imagine when the traffic is high. Cars arrive at the gate faster than the gate can handle. As a result, there will be a lot of cars waiting for each other. Because of this, now a single car takes longer to serve. To solve this, you can add more gates. You can imagine, instead of having one gate, we can have 15 gates. Now, each car can choose which gate they want, and the load will be distributed to 15 gates instead of one. By doing this, the waiting time for a car will also be reduced.

To implement sharding, first we need a way to choose which shard a key will go to. The same key has to go into the same shard, because otherwise we can lose it. If you use map number 412 for key user/1, then you should always use that same map for key user/1. If you insert the key user/1 to map 412, but fetch it from map 989, you won’t get the key. So, we need a function that maps a string to an integer in the range from 0 to 999 that deterministically tells us which shard should get the key. We can use a hash function to do this. In fact, our map actually already does this, that’s why it’s called a hashmap.

Essentially, a hash function is just a function that takes something and produces an integer. In our case, the hash function will take a string (our key) and return an int64. There are a lot of hash functions out there. Different hash functions are used for different purposes as well. For example, there are hash functions used for cryptography. Those hash functions are optimized to be practically collision-free and reasonably fast. In our case, we want the hash function to be very fast and uniformly distributed. We want our hash function to be fast because, of course, we want our server to be fast as well. We don’t want this hash function to become the next bottleneck. We want it to be uniformly distributed so that each shard has a uniformly distributed load as well. We don’t want one shard to be the bottleneck. We don’t want one of the shards to handle 80% of the traffic while the other 999 shards handle 20% of the traffic.

Fortunately, Go’s standard library provides us with a hash function that is ready to use. It’s called fnv. There are other hash functions too, and you are free to explore all the options, but I will use fnv here. We can use the hash/fnv package to use this hash function:

1const nShard = 1000
2
3func calculateShard(s string) int {
4    hasher := fnv.New64()
5    _, _ = hasher.Write([]byte(s))
6    hash := hasher.Sum64()
7    return int(hash % uint64(nShard))
8}

In the above code, calculateShard is the function that maps our key to the shard index. Since our hash function returns a 64-bit integer, we need to convert it so that it ranges from 0 to 999. To do that, we just need to mod the hash result with the number of shards we have, i.e., 1000.
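
If you want to convince yourself that the load really does spread out evenly, a quick check like the sketch below (reusing calculateShard and nShard from above, plus the fmt package) counts how many of a million generated keys land in each shard; with a well-behaved hash, the minimum and maximum counts per shard should both be close to 1000.

 1func main() {
 2    // Count how many generated keys land in each shard, reusing the
 3    // calculateShard function and nShard constant defined above.
 4    counts := make([]int, nShard)
 5    for i := 0; i < 1_000_000; i++ {
 6        counts[calculateShard(fmt.Sprintf("key-%d", i))]++
 7    }
 8
 9    minCount, maxCount := counts[0], counts[0]
10    for _, c := range counts {
11        if c < minCount {
12            minCount = c
13        }
14        if c > maxCount {
15            maxCount = c
16        }
17    }
18
19    // With 1,000,000 keys and 1000 shards we expect roughly 1000 keys
20    // per shard, so both numbers should be in that ballpark.
21    fmt.Println("min per shard:", minCount, "max per shard:", maxCount)
22}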

Now, we need to make 1000 maps and mutexes instead of just one.

1type server struct {
2    ...
3    dbLock   [nShard]sync.RWMutex
4    database [nShard]map[string]string
5}

And don’t forget to update the SET and GET handlers to use the correct shard.

 1func (s *server) handleGetCommand(clientId int64, conn io.Writer, command []any) error {
 2    ...
 3    shard := calculateShard(key)
 4    s.dbLock[shard].RLock()
 5    value, ok := s.database[shard][key]
 6    s.dbLock[shard].RUnlock()
 7    ...
 8}
 9
10func (s *server) handleSetCommand(clientId int64, conn io.Writer, command []any) error {
11    ...
12    shard := calculateShard(key)
13    s.dbLock[shard].Lock()
14    s.database[shard][key] = value
15    s.dbLock[shard].Unlock()
16    ...
17}

You can find the full source code here.

Let’s run our benchmark again. But instead of running the benchmark for 10 seconds, I will run it for exactly 20 million SET operations by passing -benchtime=20000000x instead of -benchtime=10s. The reason is that when you set the benchmark time to 10 seconds, Go’s test framework doesn’t literally run it for 10 seconds; it first runs your benchmark a few times to get a rough estimate of its throughput, and from that estimate it deduces how many operations to run so that the test lasts about 10 seconds. The problem is that our throughput is sometimes higher at the beginning and slowly drops as the program runs, so Go can misestimate the number of operations needed. Running exactly 20 million operations avoids this problem. You might wonder why our program has higher throughput during startup but slowly drops; there is a reason for that, but it would take quite a while to explain, so I won’t cover it in this article. Ok, let’s run our benchmark.

1❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=20000000x -cpuprofile=cpu.out -run="^$"
2goos: linux
3goarch: amd64
4pkg: github.com/jauhararifin/goredis
5cpu: AMD Ryzen 9 7900X 12-Core Processor
6BenchmarkRedisSet
7BenchmarkRedisSet-24            20000000               179.2 ns/op         5581010 ops/sec           580 B/op         42 allocs/op
8PASS
9ok      github.com/jauhararifin/goredis 3.926s

Wow, the throughput increased more than 5x 🤯. That is quite some improvement, isn’t it? Now, if you open the CPU profiling result on speedscope, you won’t see big boxes of runtime.mcall, sync.(*RWMutex).Lock, sync.(*RWMutex).Unlock anymore.

CPU Profiling 7

Runtime Functions

If you take a look at the profiling result, now there are only two topmost boxes with significant size, i.e., handleConn and runtime.gcBgMarkWorker. We know handleConn is our handler and its bottleneck is the handleSetCommand. If you look closely at the handleSetCommand, you can see that now the biggest bottleneck is not the Lock and Unlock anymore, but the runtime.mapassign_faststr.

CPU Profiling 8

If you look closely again, you will see that the second bottleneck of handleSetCommand is runtime.convT, which calls runtime.mallocgc. So, now we have 3 candidate bottlenecks we can analyze, i.e., runtime.mapassign_faststr, runtime.convT, and runtime.gcBgMarkWorker. What are these things?

mapassign_faststr

Let’s start with runtime.mapassign_faststr. What is this? We never wrote any function with this name, so where does it come from? It’s provided by Go’s runtime, which is why it lives in the runtime package. You can think of functions in the runtime package as functions provided by the Go runtime itself. And what does this particular one do? From the name, we can tell that it assigns a value to a map. When you assign a value to a map, the Go compiler rewrites your code behind the scenes into a call to a runtime function; when the map’s key type is string, it uses the specialized mapassign_faststr, and there are other variants for other key types such as structs. So, the profiling result tells us that the next potential bottleneck is assigning values to a map.

Now, you might wonder, how do I know all these things? Well, you can actually experiment with it yourself. Try to write code that inserts a key to a map of string, and debug it. Run the code line by line using the debugger and see which function is called when you insert a value to a map. Let’s try that. First, let’s write a main function that inserts a value to a map.

1package main
2
3func main() {
4    m := make(map[string]int)
5    m["hello"] = 10
6    println(m["hello"])
7}

Now, we can save it as map.go, build it into an executable called map, and use delve to debug it. If you don’t have delve yet, you can follow their installation guide.

1❯ go build -o map map.go
2❯ dlv exec ./map
3Type 'help' for list of commands.
4(dlv)

You can type help to get the list of commands for operating the debugger. One of the most useful things you can do in a debugger is add breakpoints. To add a breakpoint, use break <the_function_name>; we can use break main.main to add a breakpoint to our main function. When we add a breakpoint to a function, the debugger stops when that function is executed, so when we later run the program inside the debugger, it will stop at main.main. You may ask why we need to do this. Isn’t main the first function we call, and can’t the debugger just run the program line by line from the start? Well, main is actually not the first thing that runs when you start your program; the Go runtime has to set up the stack, garbage collection, and so on before main is called. Also, most programs are complex enough that you rarely want to break on main; usually you want to break at the specific function you’re debugging.

Ok, now let’s run our program. We can type continue to start running the program inside the debugger.

 1❯ dlv exec ./map
 2Type 'help' for list of commands.
 3(dlv) break main.main
 4Breakpoint 1 set at 0x457676 for main.main() ./map.go:3
 5(dlv) continue
 6> main.main() ./map.go:3 (hits goroutine(1):1 total:1) (PC: 0x457676)
 7Warning: debugging optimized function
 8     1: package main
 9     2:
10=>   3: func main() {
11     4:         m := make(map[string]int)
12     5:         m["hello"] = 10
13     6:         println(m["hello"])
14     7: }
15     8:

As you can see, now the program is paused at main.main. Next, we can type step twice so that we are about to execute m["hello"] = 10.

 1(dlv) c
 2> main.main() ./map.go:3 (hits goroutine(1):1 total:1) (PC: 0x457676)
 3Warning: debugging optimized function
 4     1: package main
 5     2:
 6=>   3: func main() {
 7     4:         m := make(map[string]int)
 8     5:         m["hello"] = 10
 9     6:         println(m["hello"])
10     7: }
11     8:
12(dlv) step
13> main.main() ./map.go:4 (PC: 0x45767d)
14Warning: debugging optimized function
15     1: package main
16     2:
17     3: func main() {
18=>   4:         m := make(map[string]int)
19     5:         m["hello"] = 10
20     6:         println(m["hello"])
21     7: }
22     8:
23(dlv) step
24> main.main() ./map.go:5 (PC: 0x4576c9)
25Warning: debugging optimized function
26     1: package main
27     2:
28     3: func main() {
29     4:         m := make(map[string]int)
30=>   5:         m["hello"] = 10
31     6:         println(m["hello"])
32     7: }
33     8:

Now, here is the tricky part. If you use step, delve will step over runtime functions like runtime.mapassign_faststr. We need to use step-instruction to step one machine instruction at a time instead of one Go source line.

 1(dlv) step
 2> main.main() ./map.go:5 (PC: 0x4576c9)
 3Warning: debugging optimized function
 4     1: package main
 5     2:
 6     3: func main() {
 7     4:         m := make(map[string]int)
 8=>   5:         m["hello"] = 10
 9     6:         println(m["hello"])
10     7: }
11     8:
12(dlv) step-instruction
13> main.main() ./map.go:5 (PC: 0x4576d0)
14Warning: debugging optimized function
15        map.go:4        0x4576b8        4889442438      mov qword ptr [rsp+0x38], rax
16        map.go:4        0x4576bd        0f1f00          nop dword ptr [rax], eax
17        map.go:4        0x4576c0        e89be9feff      call $runtime.fastrand
18        map.go:4        0x4576c5        89442434        mov dword ptr [rsp+0x34], eax
19        map.go:5        0x4576c9        488d05f06c0000  lea rax, ptr [rip+0x6cf0]
20=>      map.go:5        0x4576d0        488d5c2428      lea rbx, ptr [rsp+0x28]
21        map.go:5        0x4576d5        488d0debef0000  lea rcx, ptr [rip+0xefeb]
22        map.go:5        0x4576dc        bf05000000      mov edi, 0x5
23        map.go:5        0x4576e1        e8ba5efbff      call $runtime.mapassign_faststr
24        map.go:5        0x4576e6        48c7000a000000  mov qword ptr [rax], 0xa
25        map.go:6        0x4576ed        488d05cc6c0000  lea rax, ptr [rip+0x6ccc]

Now, instead of Golang’s code, you see a bunch of assembly code. This is because we chose to step one instruction instead of one line. If you look at 3 lines below the current instruction, you will see an instruction that looks like this: call $runtime.mapassign_faststr. This assembly instruction basically tells us to call a function called runtime.mapassign_faststr. That’s how we know that inserting a string to a map will call runtime.mapassign_faststr.

You can think of assembly code as the actual code that your CPU executes. CPUs don’t understand our Go code; they only understand machine instructions. Machine instructions are tedious to write by hand for several reasons: different CPUs have different instruction sets, and a single instruction is very primitive, with no high-level concepts like loops and variables. It is the job of the Go compiler to translate our Go code into machine instructions. You can learn more about x86 assembly here.

With a similar technique, you can also see that fetching the value from a map of string will call runtime.mapaccess1_faststr.

 1> main.main() ./map.go:6 (PC: 0x4576ed)
 2Warning: debugging optimized function
 3     1: package main
 4     2:
 5     3: func main() {
 6     4:         m := make(map[string]int)
 7     5:         m["hello"] = 10
 8=>   6:         println(m["hello"])
 9     7: }
10     8:
11(dlv) si
12> main.main() ./map.go:6 (PC: 0x4576f4)
13Warning: debugging optimized function
14        map.go:5        0x4576d5        488d0debef0000  lea rcx, ptr [rip+0xefeb]
15        map.go:5        0x4576dc        bf05000000      mov edi, 0x5
16        map.go:5        0x4576e1        e8ba5efbff      call $runtime.mapassign_faststr
17        map.go:5        0x4576e6        48c7000a000000  mov qword ptr [rax], 0xa
18        map.go:6        0x4576ed        488d05cc6c0000  lea rax, ptr [rip+0x6ccc]
19=>      map.go:6        0x4576f4        488d5c2428      lea rbx, ptr [rsp+0x28]
20        map.go:6        0x4576f9        488d0dc7ef0000  lea rcx, ptr [rip+0xefc7]
21        map.go:6        0x457700        bf05000000      mov edi, 0x5
22        map.go:6        0x457705        e8d65afbff      call $runtime.mapaccess1_faststr
23        map.go:6        0x45770a        488b00          mov rax, qword ptr [rax]
24        map.go:6        0x45770d        4889442420      mov qword ptr [rsp+0x20], rax

You can continue experimenting with it. Try to see which runtime function is called when you create a slice, append to a slice, send to a channel, etc.
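
For example, a small throwaway program like the one below (a hypothetical exercise file, not part of the goredis repo) gives you plenty to poke at: stepping through it instruction by instruction, you should land in runtime functions such as runtime.makeslice, runtime.growslice, runtime.makechan, and runtime.chansend1, though the exact calls can vary with compiler optimizations.

 1package main
 2
 3func main() {
 4    // Creating and growing a slice: look for calls to runtime.makeslice
 5    // and runtime.growslice.
 6    s := make([]int, 0, 1)
 7    for i := 0; i < 10; i++ {
 8        s = append(s, i)
 9    }
10
11    // Creating a channel and sending on it: look for runtime.makechan
12    // and runtime.chansend1.
13    ch := make(chan int, 1)
14    ch <- s[0]
15
16    println(len(s), <-ch)
17}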

Memory Allocations

Now, let’s talk about runtime.convT. From the profiling result, we know handleSetCommand calls runtime.convT somewhere and it takes a lot of our CPU time. You can follow the method I used in the previous section to know which part actually calls runtime.convT. However, it’s quite tedious to actually run the server, trigger the SET command, and step the code line-by-line until the runtime.convT is called. To save more time, I’ll show you another way to find which line triggers runtime.convT. First, you can build your project and open it with delve.

1❯ go build -o server goredis/main.go
2❯ dlv exec server
3Type 'help' for list of commands.
4(dlv)

But now, instead of setting a breakpoint and running it, we can just disassemble the code. To do that, we can use the disassemble command. We want to disassemble the handleSetCommand function in the goredis package. To do that, we can type disassemble -l goredis.handleSetCommand.

 1TEXT github.com/jauhararifin/goredis.(*server).handleSetCommand(SB) /home/jauhararifin/Repositories/goredis/server.go
 2        server.go:244   0x4f4620        4c8da42450ffffff                lea r12, ptr [rsp+0xffffff50]
 3        server.go:244   0x4f4628        4d3b6610                        cmp r12, qword ptr [r14+0x10]
 4        server.go:244   0x4f462c        0f86e0040000                    jbe 0x4f4b12
 5        server.go:244   0x4f4632        55                              push rbp
 6        server.go:244   0x4f4633        4889e5                          mov rbp, rsp
 7        server.go:244   0x4f4636        4881ec28010000                  sub rsp, 0x128
 8        server.go:244   0x4f463d        4889b42458010000                mov qword ptr [rsp+0x158], rsi
 9        server.go:244   0x4f4645        48898c2448010000                mov qword ptr [rsp+0x148], rcx
10        server.go:244   0x4f464d        4889bc2450010000                mov qword ptr [rsp+0x150], rdi
11...

That prints a lot of assembly code. You can type /convT to search for any occurrence of the text convT in the output.

1        server.go:264   0x4f4800        e81b73f1ff                      call $runtime.convT
2        ...
3        server.go:265   0x4f4827        e8f472f1ff                      call $runtime.convT
4        ...
5        server.go:266   0x4f484e        e8cd72f1ff                      call $runtime.convT

As you can see, it shows where runtime.convT is called: at server.go:264, 265, and 266. Those lines are:

262s.logger.Debug(
263    "SET key into value",
264    slog.String("key", key),
265    slog.String("value", value),
266    slog.Int64("clientId", clientId),
267)

If you look at the function signature of Debug, it looks like this:

1func (l *Logger) Debug(msg string, args ...any)

This indicates that the Debug function expects any, but slog.String returns slog.Attr. runtime.convT is called whenever a concrete type needs to be converted to an interface.
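To see that conversion in isolation, here is a minimal sketch (takesAny is a hypothetical stand-in for Debug’s ...any parameter):

package main

import "log/slog"

// takesAny mimics Debug's signature: it accepts ...any, so every argument
// must be boxed into an interface value before the call.
func takesAny(args ...any) int { return len(args) }

func main() {
    attr := slog.String("key", "value") // concrete slog.Attr value

    // The conversion from slog.Attr to `any` happens here; this boxing is
    // what shows up as runtime.convT in the profile.
    takesAny(attr)
}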

Now, that’s actually not the most important part. If you look at the profiling result, you’ll see that most of runtime.convT’s CPU time is spent calling runtime.mallocgc. What is this runtime.mallocgc?

CPU Profiling 9

runtime.mallocgc is another internal function used by the Go runtime. From the name, it’s clear what it does: it allocates memory, just like malloc in C. The difference is that the lifetime of the allocated object is also managed by the Go runtime, so you don’t have to free the memory once you’re done using it. Go’s garbage collector will deallocate those objects for you.

In this article, I’ll assume that you already know what memory is and how our variables are stored in memory.

Now, the thing is, converting a concrete type to an interface will most likely allocate an object. You can think of an interface in Go as a struct containing a pointer to the real object and a bunch of function pointers. Let me give you an example. You know that bufio.Reader implements the io.Reader interface. You can think of an io.Reader interface, behind the scenes, as just a struct that looks like this:

1package io
2
3type ReaderInterface struct {
4    theObject uintptr
5
6    Read func(theObject uintptr, buff []byte) (n int, err error)
7}

And bufio.Reader is just an ordinary struct:

1package bufio
2
3type Reader struct{...}
4
5func (*Reader) Read(buff []byte) (n int, err error) { ... }

But behind the scenes, you can think of bufio.Reader as something that looks like this:

1package bufio
2
3type Reader struct{...}
4
5// the Read method is just a function that takes `bufio.Reader` as its first parameter.
6func Reader_Read(theObject uintptr, buff []byte) (n int, err error) { ... }

When you call the Read method, what happens behind the scenes is something like this:

1// originally, calling Read method is written like this
2r := bufio.Reader{...}
3r.Read(...)
4
5// but, you can think, internally, it looks like this
6r := bufio.Reader{...}
7Reader_Read(&r, ...) // calling a method is just calling a function
8                     // and passing the object as its first argument

Normally, you can convert a bufio.Reader to io.Reader by doing this:

1r := bufio.Reader{...}
2var s io.Reader = r

But you can think that behind the scenes, what happens is something like this:

1r := bufio.Reader{...}
2var s io.ReaderInterface = io.ReaderInterface {
3    theObject: &r,
4    Read: bufio.Reader_Read,
5}

I need to note here that this is not exactly what happens; it’s more of an analogy. The point is that an interface is just another data structure that contains information about the underlying value and the list of methods it has (the vtable), and a method is just an ordinary function whose first parameter is the object it operates on. Notice that in my analogy I turned the Read method of the Reader struct into Reader_Read. This is called name mangling, and in reality it’s more complicated than that: all functions and methods end up as plain functions with mangled names. For example, a method w of type x in package y/z will be mangled into a function called y/z.(x).w. The mangling rules of Go are not really standardized, so we can’t rely on them.

Now, because an interface needs a pointer to its underlying value, that value usually has to be moved to the heap. Not always, but most of the time. That’s why converting a struct to an interface might call mallocgc. The Go compiler actually has a technique called escape analysis to remove the heap allocation when it can prove it’s safe. In our case, the compiler can’t be sure it’s safe to remove the heap allocation when we pass slog.Attr to Debug, so the runtime.mallocgc is still there.
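As a small illustration of escape analysis (a sketch, not from the goredis code), compare these two functions; building with go build -gcflags "-m -l" reports a heap move only for the first one:

package main

// newCounter returns a pointer to a local variable, so the compiler has to
// move c to the heap; escape analysis reports "moved to heap: c".
func newCounter() *int {
    c := 0
    return &c
}

// localCounter never lets c leave the function, so it can live on the stack
// and costs no heap allocation at all.
func localCounter() int {
    c := 0
    c++
    return c
}

func main() {
    println(*newCounter(), localCounter())
}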

Allocating memory is actually quite a complex task. There are a bunch of libraries and algorithms that try to do it. Go uses an algorithm similar to TCMalloc. You can read the implementation of mallocgc in malloc.go. There is a comment explaining that the algorithm was based on tcmalloc.

1// Memory allocator.
2//
3// This was originally based on tcmalloc, but has diverged quite a bit.
4// http://goog-perftools.sourceforge.net/doc/tcmalloc.html

The allocation algorithm is quite complex because it has to handle multiple goroutines allocating objects concurrently, and the objects being allocated come in different sizes, which can lead to memory fragmentation that slows down your application. That’s why it’s best not to allocate too much. It’s better to allocate large, long-lived objects once in a while than to allocate small, short-lived objects very often.
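One common pattern for this (not the approach this article takes later, which uses a per-client buffer instead) is sync.Pool, which lets goroutines reuse temporary objects instead of allocating fresh ones on every call; a minimal sketch:

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufferPool hands out reusable buffers instead of allocating a new one
// for every request.
var bufferPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func handleRequest(payload string) {
    buf := bufferPool.Get().(*bytes.Buffer)
    buf.Reset()               // clear leftover data from a previous use
    defer bufferPool.Put(buf) // return the buffer so it can be reused

    buf.WriteString(payload)
    fmt.Println("processed", buf.Len(), "bytes")
}

func main() {
    handleRequest("hello world")
}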

Garbage Collection

Now, let’s talk about runtime.gcBgMarkWorker. Well, this is the Go garbage collection process. Go uses a concurrent mark-and-sweep garbage collector (unlike some other languages, it is not generational). The way it works is, first it marks all objects that are reachable from our variables as “live” objects. Whichever objects are not marked “live” are considered dead and should be deallocated. This phase is called the mark phase. The second phase frees all of the dead objects. This phase is called the sweep phase. That’s why the algorithm is called mark and sweep. From the name, we can assume runtime.gcBgMarkWorker is the function that runs the first phase (the mark phase). You can also read the source code to be sure:

 1func init() {
 2    go forcegchelper()
 3}
 4
 5func forcegchelper() {
 6    ...
 7    gcStart(...)
 8    ...
 9}
10
11func gcStart(trigger gcTrigger) {
12    ...
13    gcBgMarkStartWorkers()
14    ...
15}
16
17// gcBgMarkStartWorkers prepares background mark worker goroutines. These
18// goroutines will not run until the mark phase, but they must be started while
19// the work is not stopped and from a regular G stack. The caller must hold
20// worldsema.
21func gcBgMarkStartWorkers() {
22    // Background marking is performed by per-P G's. Ensure that each P has
23    // a background GC G.
24    //
25    // Worker Gs don't exit if gomaxprocs is reduced. If it is raised
26    // again, we can reuse the old workers; no need to create new workers.
27    for gcBgMarkWorkerCount < gomaxprocs {
28        go gcBgMarkWorker()
29        ...
30    }
31    ...
32}

As you can see, the runtime spawns gcBgMarkWorker as background goroutines (note the go gcBgMarkWorker() call, one per P) and these goroutines carry out the mark phase.

So, the way it works is, when your memory allocation reaches a certain threshold, Go garbage collection will be triggered. And what this means is that the more often you allocate, the more often the GC (garbage collector) runs. So, the fact that runtime.mallocgc is one of our bottlenecks means we allocate a lot and it affects the GC as well. That’s why, usually, when your CPU spends a lot of time on runtime.mallocgc, it will also spend a lot of time on runtime.gcBgMarkWorker.
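You can observe this relationship yourself with a small sketch like the one below (or by running any program with GODEBUG=gctrace=1): the more we allocate, the higher runtime.MemStats.NumGC climbs.

package main

import (
    "fmt"
    "runtime"
)

// sink keeps each allocation reachable so the compiler cannot keep it on
// the stack; every make below really hits the heap.
var sink []byte

func main() {
    var stats runtime.MemStats

    runtime.ReadMemStats(&stats)
    fmt.Println("GC cycles before:", stats.NumGC)

    // Allocate many short-lived objects; once the heap grows past the
    // trigger threshold, the collector runs and NumGC goes up.
    for i := 0; i < 1_000_000; i++ {
        sink = make([]byte, 1024)
    }

    runtime.ReadMemStats(&stats)
    fmt.Println("GC cycles after:", stats.NumGC)
}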

For a more detailed guide on the Go garbage collector, you can read A Guide to the Go Garbage Collector.

Reducing Allocations

Let’s try to reduce our allocations. By reducing allocations, we should reduce the CPU time spent on runtime.mallocgc, which in turn reduces the CPU time spent on runtime.gcBgMarkWorker. But before that, let me bring back the last benchmark result.

1❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=20000000x -cpuprofile=cpu.out -run="^$"
2goos: linux
3goarch: amd64
4pkg: github.com/jauhararifin/goredis
5cpu: AMD Ryzen 9 7900X 12-Core Processor
6BenchmarkRedisSet
7BenchmarkRedisSet-24            20000000               179.2 ns/op         5581010 ops/sec           580 B/op         42 allocs/op
8PASS
9ok      github.com/jauhararifin/goredis 3.926s

Notice that since we added the -benchmem flag, the benchmark no longer reports only latency and throughput; it also reports the number of bytes allocated per operation and the number of allocations per operation. We can see that we allocate 580 bytes and make 42 allocations per operation. Later, we can use these numbers to confirm that our improvement indeed reduces the number of allocations. We will focus only on the number of allocations per operation. We won’t analyze the number of bytes allocated per operation since it’s already quite small; at this size, it usually matters much less than the number of allocations.
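As a side note, you can get the same allocation columns without the flag by calling b.ReportAllocs() inside the benchmark. Here is a toy benchmark (not part of goredis) showing where allocs/op comes from:

package example

import (
    "strconv"
    "testing"
)

// sink prevents the compiler from optimizing the work away.
var sink string

// BenchmarkConcat builds a new string on every iteration, so each iteration
// shows up as at least one allocation in the allocs/op column.
func BenchmarkConcat(b *testing.B) {
    b.ReportAllocs() // equivalent to passing -benchmem on the command line
    for i := 0; i < b.N; i++ {
        sink = "key-" + strconv.Itoa(i)
    }
}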

In the previous section, we’ve established that the runtime.mallocgc comes from the Debug function that we use to print debug logs. Let’s just try to remove that debug log.

 1func (s *server) handleConn(clientId int64, conn net.Conn) {
 2    ...
 3-   s.logger.Debug(
 4-       "request received",
 5-       slog.Any("request", request),
 6-       slog.Int64("clientId", clientId),
 7-   )
 8    ...
 9}
10
11func (s *server) handleGetCommand(clientId int64, conn io.Writer, command []any) error {
12    ...
13-   s.logger.Debug("GET key", slog.String("key", key), slog.Int64("clientId", clientId))
14    ...
15}
16
17func (s *server) handleSetCommand(conn io.Writer, command []any) error {
18    ...
19-   s.logger.Debug(
20-       "SET key into value",
21-       slog.String("key", key),
22-       slog.String("value", value),
23-       slog.Int64("clientId", clientId),
24-   )
25    ...
26}

Let’s try benchmarking our code again:

1❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=20000000x -cpuprofile=cpu.out -run="^$"
2goos: linux
3goarch: amd64
4pkg: github.com/jauhararifin/goredis
5cpu: AMD Ryzen 9 7900X 12-Core Processor
6BenchmarkRedisSet
7BenchmarkRedisSet-24            20000000               139.0 ns/op         7192789 ops/sec           316 B/op         36 allocs/op
8PASS
9ok      github.com/jauhararifin/goredis 3.075s

Nice, now we only have 36 allocations per op; we’ve removed about 6 allocations per operation. To be honest, I’m not sure why it’s exactly 6. By my count, we removed 2 debug logs on the SET operation’s path, and those 2 debug logs should correspond to 5 allocations. I’m not sure yet where the extra allocation comes from.

Now, let’s try to reduce the allocations further. Let’s open the profiling result again, this time in the “Sandwich” tab, and look for runtime.mallocgc. The sandwich view shows the call stacks around runtime.mallocgc, so you can see both who calls it and which functions it calls.

runtime.mallocgc Call Stack

The flamegraph shows that the function readOne calls runtime.mallocgc, which makes sense: if you look at that function, we allocate a one-byte buffer on every call.

 1func readOne(r io.Reader) (byte, error) {
 2    buff := [1]byte{}
 3    n, err := r.Read(buff[:])
 4    if err != nil {
 5        return 0, err
 6    }
 7    if n != 1 {
 8        return 0, io.EOF
 9    }
10    return buff[0], nil
11}

We can actually use the Go compiler to find out where the allocations happen in our code. We can run the go build command with -gcflags "-m -l" to print the compiler’s escape analysis.

 1❯ go build -gcflags "-l -m" *.go
 2# command-line-arguments
 3./buff_writer.go:17:22: leaking param: w
 4./buff_writer.go:18:23: &sync.Mutex{} escapes to heap
 5./buff_writer.go:19:9: &bufferWriter{...} escapes to heap
 6./buff_writer.go:22:13: make([]byte, 4096) escapes to heap
 7./buff_writer.go:91:7: leaking param content: b
 8./buff_writer.go:28:7: leaking param content: b
 9./buff_writer.go:59:7: leaking param content: b
10./buff_writer.go:59:30: leaking param: buff
11./resp.go:82:14: leaking param: r
12./resp.go:83:2: moved to heap: buff
13./resp.go:94:17: leaking param: r
14...
15./resp.go:48:24: ... argument does not escape
16./resp.go:48:79: b escapes to heap
17./resp.go:51:16: string(buff) escapes to heap
18...

There is one line saying ./resp.go:83:2: moved to heap: buff. This line tells us that our buff variable escapes to the heap and needs an allocation each time we call readOne. You can use this information to find out where your allocations are and remove them.

It is unfortunate that the buff in readOne needs a heap allocation. The reason it allocates is that Read accepts a slice instead of a single byte: we only want to read one byte, but the API requires a slice, and that slice escapes into the Read call.
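One way to sidestep that particular allocation is to read through a *bufio.Reader and use its ReadByte method, which hands back a single byte without a caller-supplied slice; this is exactly what the messageReader below will rely on. A minimal sketch:

package main

import (
    "bufio"
    "fmt"
    "strings"
)

// readOneByte is a sketch of an allocation-free alternative to readOne:
// bufio.Reader.ReadByte returns a single byte, so no temporary []byte escapes.
func readOneByte(r *bufio.Reader) (byte, error) {
    return r.ReadByte()
}

func main() {
    r := bufio.NewReader(strings.NewReader("$5\r\nhello\r\n"))
    b, err := readOneByte(r)
    fmt.Println(string(b), err) // "$" <nil>
}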

On top of that, in the readBulkString function, we allocate another slice of bytes to read the raw string and then convert it to a string before returning. As you can see in the analysis above, the line ./resp.go:51:16: string(buff) escapes to heap tells us that converting []byte to string triggers another allocation.

 1func readBulkString(r io.Reader) (string, error) {
 2    ...
 3    buff := make([]byte, size)
 4    if _, err := io.ReadFull(r, buff); err != nil {
 5        return "", err
 6    }
 7    ...
 8
 9    // the line below triggers another allocation
10    return string(buff), nil
11}

To eliminate those allocations, we can use some kind of long-lived variable to store the []byte we use for reading input from the client. Instead of allocating a new []byte each time we read the user input, we allocate it once and reuse it on every read. We can do this because the []byte is just a temporary buffer: in readOne, the []byte is no longer used after the function returns, and in readBulkString, the []byte is converted into a string and never used again. However, we can’t simply make it a global variable because multiple clients can read input concurrently. So, instead, we are going to have a buffer for each client.

Now, let’s refactor resp.go. First, we create a new struct called messageReader that will handle all the input reading. One instance of this struct will be created per client, and this is where we put our buffer.

 1type messageReader struct {
 2    r             *bufio.Reader
 3    temp          []byte
 4    stringBuilder strings.Builder
 5}
 6
 7func newMessageReader(r io.Reader) *messageReader {
 8    return &messageReader{
 9        r:             bufio.NewReader(r),
10        temp:          make([]byte, 4096),
11        stringBuilder: strings.Builder{},
12    }
13}

Instead of purely using []byte, I will also use strings.Builder here to let you know that such a thing exists. strings.Builder is a convenient helper to build a string without allocating a new string each time we append to it. Internally, strings.Builder uses []byte to store the temporary string. We can call the String() method to convert those bytes into a string.
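If you haven’t used it before, here is a small standalone sketch of the strings.Builder API we rely on:

package main

import (
    "fmt"
    "strings"
)

func main() {
    var sb strings.Builder

    sb.Grow(11) // pre-allocate when the final length is known up front
    sb.WriteString("hello ")
    sb.WriteString("world")

    s := sb.String() // returns the accumulated bytes as a string without copying them again
    fmt.Println(s, len(s))

    sb.Reset() // clear the builder so old data won't leak into the next string
}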

Next, let’s implement the ReadString method to read a string from the input.

 1func (m *messageReader) ReadString() (string, error) {
 2    b, _ := m.r.ReadByte()
 3    if b == '$' {
 4        length, _ := m.readValueLength()
 5
 6        // Reset resets the strings.Builder. We do this so that the old data
 7        // won't affect our new string
 8        m.stringBuilder.Reset()
 9
10        // If we know that the string we're going to receive will be of a
11        // certain length, we can force the strings.Builder to pre-allocate
12        // a buffer with our known length.
13        m.stringBuilder.Grow(length)
14        if err := m.pipeToWriter(&m.stringBuilder, length); err != nil {
15            return "", err
16        }
17
18        // This is how we build the string without additional allocation
19        s := m.stringBuilder.String()
20        if err := m.skipCRLF(); err != nil {
21            return "", err
22        }
23
24        return s, nil
25    } else {
26        return "", fmt.Errorf("malformed input")
27    }
28}
29
30func (m *messageReader) readValueLength() (int, error) {
31    b, _ := m.r.ReadByte()
32    result := int(b - '0')
33
34    for b != '\r' {
35        b, _ = m.r.ReadByte()
36        if b != '\r' {
37            result = result*10 + int(b-'0')
38        }
39    }
40
41    b, _ = m.r.ReadByte()
42    if b != '\n' {
43        return 0, fmt.Errorf("malformed input")
44    }
45
46    return result, nil
47}
48
49func (m *messageReader) pipeToWriter(w io.Writer, length int) error {
50    remaining := length
51    for remaining > 0 {
52        n := remaining
53        if n > len(m.temp) {
54            n = len(m.temp)
55        }
56        io.ReadFull(m.r, m.temp[:n])
57        remaining -= n
58        w.Write(m.temp[:n])
59    }
60
61    return nil
62}

I omit a lot of error handling to make the code more concise.

Writing the rest of the functionality of messageReader is quite straightforward. You can check this commit to see the full implementation.

Next, we need to wire the messageReader to our handler:

 1func (s *server) handleConn(clientId int64, conn net.Conn) {
 2    ...
 3-   reader := bufio.NewReader(conn)
 4+   reader := newMessageReader(conn)
 5    ...
 6    for {
 7-       request, err := readArray(reader, true)
 8+       length, err := reader.ReadArrayLen()
 9        ...
10-       commandName, ok := request[0].(string)
11+       commandName, err := reader.ReadString()
12+       unsafeToUpper(commandName)
13        ...
14
15        switch commandName {
16        case "GET":
17-           err = s.handleGetCommand(writer, request)
18+           err = s.handleGetCommand(reader, writer)
19        case "SET":
20-           err = s.handleSetCommand(writer, request)
21+           err = s.handleSetCommand(reader, writer)
22        ...
23        }
24
25    }
26}
27
28func (s *server) handleGetCommand(reader *messageReader, conn io.Writer) error {
29-   ...
30+   key, err := reader.ReadString()
31    ...
32}
33
34func (s *server) handleSetCommand(reader *messageReader, conn io.Writer) error {
35-   ...
36+   key, err := reader.ReadString()
37-   ...
38+   value, err := reader.ReadString()
39    ...
40}

You may notice a new function called unsafeToUpper. This function converts a string to uppercase. It’s similar to strings.ToUpper, but instead of returning a new string, unsafeToUpper mutates the passed string in place. This is unsafe because, in Go, strings are supposed to be immutable, yet here we mutate one. We can do this with the unsafe package: unsafe.StringData gives us the pointer to the raw bytes of the string, unsafe.Slice converts that pointer into a []byte, and then we iterate through the bytes and convert every lowercase byte to uppercase.

 1func unsafeToUpper(s string) {
 2    bytes := unsafe.Slice(unsafe.StringData(s), len(s))
 3    for i := 0; i < len(bytes); i++ {
 4        b := bytes[i]
 5        if b >= 'a' && b <= 'z' {
 6            b = b + 'A' - 'a'
 7            bytes[i] = b
 8        }
 9    }
10}

Ok, that’s quite some changes. Let’s run the benchmark again.

1❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=20000000x -cpuprofile=cpu.out -run="^$"
2goos: linux
3goarch: amd64
4pkg: github.com/jauhararifin/goredis
5cpu: AMD Ryzen 9 7900X 12-Core Processor
6BenchmarkRedisSet
7BenchmarkRedisSet-24            20000000                99.89 ns/op       10011157 ops/sec           164 B/op          4 allocs/op
8PASS
9ok      github.com/jauhararifin/goredis 2.245s

Wow, look at that: we are down to 4 allocations per operation, and we reach 10 million operations per second. That is impressive, isn’t it? If you look at the profiling result again, you won’t see runtime.mallocgc anymore because we only make 4 allocations per operation. We still have runtime.gcBgMarkWorker, though, but it’s not as bad as before. I think we could get down to zero allocations per op, but that would be quite complex to explain here.

CPU Profiling 11

Swisstable

So far, this article has used Go 1.21, where the built-in map is not implemented using a swisstable. In Go 1.24, the map will be implemented using a swisstable.

From the profiling result, it’s quite clear that our next bottleneck is runtime.mapassign_faststr. To improve our map operation, clearly, we can just use a faster map. But, before that, let’s talk about how map works internally.

In Go, as in other programming languages, maps are usually implemented using a hash table data structure. Internally, a hash table is just an array. When you insert a key xyz into a map, an algorithm inside the map implementation decides which index this key should go to. That algorithm uses a hash function to convert xyz into an integer and uses it to determine the index where xyz should be placed.

Hash Table
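As a rough sketch of that mapping (bucketIndex is a hypothetical helper, using the same fnv hash from the standard library that we used earlier):

package main

import (
    "fmt"
    "hash/fnv"
)

// bucketIndex shows how a hash table maps a key to an array index:
// hash the key, then reduce the hash modulo the array size.
func bucketIndex(key string, numBuckets uint64) uint64 {
    h := fnv.New64a()
    h.Write([]byte(key))
    return h.Sum64() % numBuckets
}

func main() {
    fmt.Println(bucketIndex("xyz", 8)) // which of the 8 slots "xyz" lands in
}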

What happens if two different keys have the same hash value? This is called a collision: two different keys hash to the same value. There are two major techniques to resolve collisions: chaining and open addressing. Up until version 1.23, Go used chaining. In chaining, each element in the array is effectively a linked list, so when two keys collide, both of them are appended to the same list.

Collision on Chaining Hash Table

Because of this, the more collisions you have, the slower your operation will be.

Ideally, you want a hash function that has a low collision rate. There are a bunch of hash functions out there such as MurmurHash and xxHash. In the previous section, we also used a hash function called fnv provided by the Go standard library. Those hash functions have good enough collision rates.

There are also hash functions used in cryptography, like MD5 and SHA-1. Cryptographic hash functions typically have extremely low collision rates, but they are much slower. We want our hash function to be super fast because we are going to call it millions of times: every time we insert a key into a map or fetch a key from it, we hash the key first. That makes cryptographic hash functions unsuitable for our purposes. For them, it’s fine to be slower in exchange for a lower collision rate; for us, it’s fine to have a slightly higher collision rate in exchange for speed.
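To get a feel for the difference, here is a toy benchmark (hypothetical, not part of goredis) comparing fnv against SHA-1 on a single short key:

package hashes

import (
    "crypto/sha1"
    "hash/fnv"
    "testing"
)

var key = []byte("the_key_1")

// Package-level sinks keep the results alive so the work isn't optimized away.
var sink64 uint64
var sink20 [sha1.Size]byte

// BenchmarkFNV hashes the key with a fast non-cryptographic hash.
func BenchmarkFNV(b *testing.B) {
    for i := 0; i < b.N; i++ {
        h := fnv.New64a()
        h.Write(key)
        sink64 = h.Sum64()
    }
}

// BenchmarkSHA1 hashes the same key with a cryptographic hash.
func BenchmarkSHA1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink20 = sha1.Sum(key)
    }
}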

Choosing a good hash function is not enough. The more keys you insert into the hash table, the more likely they are to collide with each other. If you represent the hash table using an array of 100 elements, no matter how good your hash function is, you will always have collisions when you have 101 unique keys.

To reduce the collision rate, a hash table is typically designed to grow as the number of elements gets bigger. In Go, when the ratio between the number of entries in the map and the size of the internal array (the load factor) reaches a certain threshold, the map grows to twice its size. By doing this, the collision rate can be kept low.

The problem with the chaining method is that it’s not cache-friendly when a collision occurs. When your key collides, you need to iterate through a linked list, which means you access objects that are placed far from each other in memory. This is where open addressing can perform better than chaining.

In open addressing, when a collision occurs, instead of iterating through a linked list, we just insert the key into the next index. This is called probing.

Collision on Open Addressing Hash Table

By doing this, when a collision occurs, we just need to check the next slot. Checking the next slot is cheaper than iterating through a linked list because there is a high chance that the next slot element is in the same cache line as the original slot.

Of course, the next slot might already be occupied by another key. In that case, you check the slot after that, and so on until you find an empty slot. This is called linear probing. There are other probing techniques; one of them is quadratic probing, which tends to handle clustering better, but I won’t go into much detail on probing here.
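To make the idea concrete, here is a toy open-addressing table with linear probing (a sketch only; it assumes the table never fills up and is nothing like Go’s or SwissMap’s actual implementation):

package main

import (
    "fmt"
    "hash/fnv"
)

type slot struct {
    key, value string
    used       bool
}

// openTable is a toy open-addressing hash table with linear probing.
type openTable struct {
    slots []slot
}

func newOpenTable(size int) *openTable {
    return &openTable{slots: make([]slot, size)}
}

func (t *openTable) index(key string) int {
    h := fnv.New64a()
    h.Write([]byte(key))
    return int(h.Sum64() % uint64(len(t.slots)))
}

func (t *openTable) Put(key, value string) {
    i := t.index(key)
    // On a collision, keep checking the next slot until we find the key
    // itself or an empty slot (assumes the table is never completely full).
    for t.slots[i].used && t.slots[i].key != key {
        i = (i + 1) % len(t.slots)
    }
    t.slots[i] = slot{key: key, value: value, used: true}
}

func (t *openTable) Get(key string) (string, bool) {
    i := t.index(key)
    for t.slots[i].used {
        if t.slots[i].key == key {
            return t.slots[i].value, true
        }
        i = (i + 1) % len(t.slots)
    }
    return "", false
}

func main() {
    t := newOpenTable(8)
    t.Put("the_key_1", "hello world")
    fmt.Println(t.Get("the_key_1"))
}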

We can improve the probing performance by using SIMD. Modern CPUs typically have SIMD instructions, which let us perform one operation on multiple pieces of data at once. We can utilize these instructions for probing: we can check the current slot and the next few slots with a single instruction. Open addressing allows us to do this because of the way it handles collisions; we can’t easily utilize SIMD in chaining-based hash tables.

There is a hash table design called Swisstable that uses open addressing and utilizes SIMD instructions. There is also a Go library that already implements it, called SwissMap. The Go team also plans to use a swisstable for the built-in map in Go 1.24.

As of now, Go doesn’t expose SIMD operations, so to use SIMD you need to write assembly code by hand.

Let’s try to replace our map with the swisstable version. I will copy the implementation of SwissMap and modify it a little bit to match our needs. Their implementation uses generics to allow any key and value type. But, as of Go 1.21, using generics has its own performance overhead.

 1type server struct {
 2-   database [nShard]map[string]string
 3+   database [nShard]*swisstable.Map
 4}
 5
 6func (s *server) handleGetCommand(reader *messageReader, conn io.Writer) error {
 7    ...
 8-   value, ok := s.database[shard][key]
 9+   value, ok := s.database[shard].Get(key)
10    ...
11}
12
13func (s *server) handleSetCommand(reader *messageReader, conn io.Writer) error {
14    ...
15-    s.database[shard][key] = value
16+    s.database[shard].Put(key, value)
17    ...
18}

Let’s run our benchmark again:

1❯ go test . -v -bench=BenchmarkRedisSet -benchmem -benchtime=20000000x -cpuprofile=cpu.out -run="^$"
2goos: linux
3goarch: amd64
4pkg: github.com/jauhararifin/goredis
5cpu: AMD Ryzen 9 7900X 12-Core Processor
6BenchmarkRedisSet
7BenchmarkRedisSet-24            20000000                75.42 ns/op       13258318 ops/sec           149 B/op          4 allocs/op
8PASS
9ok      github.com/jauhararifin/goredis 1.724s

Wow, that’s almost 33% improvement in throughput 🤯. Fortunately, Go 1.24 will use this technique to power Golang maps. So, we don’t have to bother implementing this ourselves.

What’s The Catch

While our implementation achieved impressive performance numbers, there are several important caveats to consider:

First, our benchmark is highly artificial. We used uniformly distributed keys, which is rarely the case in real-world applications. In practice, some keys are accessed much more frequently than others, often following patterns like the Pareto principle, where 20% of keys receive 80% of the accesses. This access pattern can significantly impact performance, especially with our sharding strategy.

Second, the benchmarks were run on a high-end desktop processor (an AMD Ryzen 9 7900X with 12 cores and 24 threads) with high single-thread performance and high clock speeds. Most servers use processors with different characteristics:

  1. Lower clock speeds to maintain reasonable power consumption and heat generation
  2. Higher core counts optimized for parallel workloads
  3. Different cache hierarchies and memory architectures
  4. Power and thermal constraints that can affect sustained performance

Third, our architecture differs significantly from Redis in important ways:

  1. Redis is primarily single-threaded, which simplifies implementation of features like transactions and atomic operations
  2. Our multi-threaded approach adds complexity for handling concurrent access
  3. We haven’t implemented key features like expiration which would add overhead and complexity
  4. Adding expiration would require additional data structures and background processing to track and remove expired keys
  5. Our implementation lacks many Redis features like persistence, replication, and the full command set

The key takeaway is that raw performance numbers in isolation don’t tell the whole story. Real-world performance depends heavily on actual usage patterns, hardware characteristics, and required feature set.

Where To Go From Here

If you’re interested in diving deeper into the topics covered, here are some excellent resources and topics to explore:

  1. A Guide to the Go Garbage Collector.
  2. What Every Programmer Should Know About Memory.
  3. Lock-Free Data Structures.
  4. Systems Performance by Brendan Gregg.
  5. Designing a SIMD Algorithm From Scratch.