Migrating millions of Redis keys without downtime
Last year in September I joined the Job Patterns team at Shopify.
The mission of the team is to provide a stable platform so that developers can write their background jobs to power one of the biggest e-commerce platforms in the world.
Since I joined, I have been gathering context around the various components that create the unique Shopify ecosystem.
To provide some context, Shopify is a massive Ruby on Rails monolith, and its background job architecture consists of ActiveJob, Resque, and Redis. Besides the functionality that those libraries provide by default, we have created many additional modules that allow developers to define custom behaviour for their jobs: Status, and many more.
At Shopify, we have many Redis instances. Each instance stores information that belongs to a different part of the platform.
This post is going to focus on how we managed to migrate millions of keys from one of our Redis instances to another without downtime or incidents.
The module we’ll be discussing here is the Locking module. Developers use this module to prevent multiple jobs of the same class with the same arguments from being executed by multiple processes at the same time. It provides the same functionality as unique jobs for Sidekiq.
Before enqueuing the job, it checks for the existence of the lock key. If the lock key does not exist, we acquire it until the job is done and finally release it. If the lock key does exist at the time of enqueueing it means that another job already exists, so we do not enqueue the new job.
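The enqueue-time check described above can be sketched as follows. This is a minimal illustration rather than Shopify's implementation: `FakeRedis` is a hypothetical in-memory stand-in so the example runs without a server, and `lock_key_for`, `enqueue_unique` and `perform_next` are names invented for this sketch. It relies on the fact that a Redis `SET` with the `NX` option only writes when the key is absent, so the existence check and the acquisition happen in one call.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for the few Redis calls used here.
# The ex: (TTL) option is accepted but not enforced.
class FakeRedis
  def initialize
    @data = {}
  end

  # Mimics SET key value NX: refuses to overwrite an existing key.
  def set(key, value, ex: nil, nx: false)
    return false if nx && @data.key?(key)

    @data[key] = value
    true
  end

  # Returns the number of keys deleted, like Redis DEL.
  def del(key)
    @data.delete(key) ? 1 : 0
  end
end

# Hypothetical helper: one lock key per job class and argument list.
def lock_key_for(job_class, args)
  "lock:#{job_class}:#{args.inspect}"
end

# Enqueues only when no job with the same class and arguments holds the lock.
# Returns true when the job was enqueued, false when it was skipped.
def enqueue_unique(redis, queue, job_class, args, duration: 60)
  key = lock_key_for(job_class, args)
  return false unless redis.set(key, SecureRandom.uuid, ex: duration, nx: true)

  queue << [job_class, args]
  true
end

# Runs the next job and releases its lock afterwards, even if the job raises.
def perform_next(redis, queue)
  return if queue.empty?

  job_class, args = queue.shift
  yield job_class, args
ensure
  redis.del(lock_key_for(job_class, args)) if job_class
end

redis = FakeRedis.new
queue = []
enqueue_unique(redis, queue, "SyncInventoryJob", [42]) # enqueued
enqueue_unique(redis, queue, "SyncInventoryJob", [42]) # skipped, lock is held
perform_next(redis, queue) { |job_class, args| }       # runs, then releases
enqueue_unique(redis, queue, "SyncInventoryJob", [42]) # enqueued again
```

The expiry passed as `ex:` matters in a real deployment: if a worker dies before releasing the lock, the key still disappears once the duration elapses.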
At the current growth rate of Shopify, we are looking into multiple ways to optimize the background jobs infrastructure for performance.
To reduce the load on the single Redis instance that holds the job queues, we plan on deploying more Redis instances so we can multiplex both the enqueue and dequeue operations.
A blocker for this idea is that we would need to know at all times where the unique locks are stored 🤔.
We decided to move the lock keys from the Redis instance holding the queue information to a separate Redis instance. That way, we know at all times where the lock keys are stored, unlike the job queues, which could span multiple Redis instances in the future.
We process hundreds of thousands of jobs per minute, and those jobs are time-sensitive, so stopping the system, migrating the keys and deploying the changes is not a possible solution for us. We, therefore, had to perform the migration without a maintenance window or downtime.
How did we manage to achieve this?
We devised a 3-step plan that would allow us to do it. All steps required code changes in the application, so the full migration took roughly 2 weeks.
Let’s introduce our Locking module. The following is a simplified version of the one currently maintained at Shopify:

    require "securerandom"

    class Locking
      AlreadyAcquireLockError = Class.new(StandardError)

      attr_reader :lock_key, :token

      def initialize(lock_key, token: SecureRandom.uuid)
        @lock_key = lock_key
        @token = token
        @have_lock = false
      end

      def have_lock?
        @have_lock
      end

      def acquire(duration)
        raise AlreadyAcquireLockError if have_lock?

        # SET with NX only succeeds when the key does not exist yet.
        @have_lock = redis.set(lock_key, token, ex: duration, nx: true)
      end

      def release
        redis.del(lock_key)
        @have_lock = false
      end

      def locked?
        # exists? returns a boolean (redis-rb 4.2+); plain exists returns an
        # integer count in newer versions, and 0 is truthy in Ruby.
        redis.exists?(lock_key)
      end

      private

      def redis
        Resque.redis
      end
    end
In the following steps, we will refer to the Redis instance holding the queue information as the jobs Redis (the source of the migration), and the Redis instance holding the locks information as the locks Redis (the destination of the migration).
In the first step, we modify the locked? method to check the locks Redis and then the jobs Redis. With this change, the functionality stays the same, but we introduce the locks Redis as a new dependency.

    def locked?
      # True as soon as either instance holds the key.
      redises.any? { |redis| redis.exists?(lock_key) }
    end

    private

    def redis
      Resque.redis
    end

    def redises
      [Lock.redis, redis]
    end
In the second step, we start acquiring the lock key on the locks Redis. The release method tries to release the lock from the locks Redis first and, if no key was deleted there, from the jobs Redis. The locked? method stays the same as in the first step.

    def acquire(duration)
      raise AlreadyAcquireLockError if have_lock?

      @have_lock = lock_redis.set(lock_key, token, ex: duration, nx: true)
    end

    def release
      redises.each do |redis|
        # redis returns the number of keys deleted
        if redis.del(lock_key) > 0
          @have_lock = false
          break
        end
      end
    end

    private

    def redis
      Resque.redis
    end

    def lock_redis
      Lock.redis
    end

    def redises
      [lock_redis, redis]
    end
Note: After deploying this change, we monitored the platform for a couple of days to make sure everything was working as expected (meaning, lock keys were being acquired and released without any issue).
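To make the behaviour of this intermediate step concrete, here is a small simulation. It is a sketch under assumptions, not the production module: `FakeRedis` is a hypothetical in-memory stand-in for the two instances, and `StepTwoLock` takes both clients as constructor arguments (the real module reads them from `Lock.redis` and `Resque.redis`). It shows that a lock acquired on the jobs Redis before the deploy is still seen and released, while new locks land only on the locks Redis.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for a Redis instance (TTLs are ignored).
class FakeRedis
  def initialize
    @data = {}
  end

  def set(key, value, ex: nil, nx: false)
    return false if nx && @data.key?(key)

    @data[key] = value
    true
  end

  def del(key)
    @data.delete(key) ? 1 : 0
  end

  def exists?(key)
    @data.key?(key)
  end
end

# Mirrors the second migration step with injected clients (an assumption
# made so the sketch is testable without global state).
class StepTwoLock
  attr_reader :lock_key, :token

  def initialize(lock_key, lock_redis:, jobs_redis:, token: SecureRandom.uuid)
    @lock_key = lock_key
    @token = token
    @lock_redis = lock_redis
    @jobs_redis = jobs_redis
    @have_lock = false
  end

  # New locks are written to the locks Redis only.
  def acquire(duration)
    @have_lock = @lock_redis.set(lock_key, token, ex: duration, nx: true)
  end

  # Deletes from the locks Redis first, then falls back to the jobs Redis.
  def release
    redises.each do |redis|
      if redis.del(lock_key) > 0
        @have_lock = false
        break
      end
    end
  end

  # A lock on either instance counts as held.
  def locked?
    redises.any? { |redis| redis.exists?(lock_key) }
  end

  private

  def redises
    [@lock_redis, @jobs_redis]
  end
end

jobs_redis = FakeRedis.new
locks_redis = FakeRedis.new

# A lock taken by the pre-migration code lives on the jobs Redis.
jobs_redis.set("lock:old", "old-token", nx: true)
old_lock = StepTwoLock.new("lock:old", lock_redis: locks_redis, jobs_redis: jobs_redis)
old_lock.locked?  # still detected through the fallback
old_lock.release  # removed from the jobs Redis

# A lock taken after the deploy lives on the locks Redis.
new_lock = StepTwoLock.new("lock:new", lock_redis: locks_redis, jobs_redis: jobs_redis)
new_lock.acquire(60)
```

Because nothing writes lock keys to the jobs Redis any more, the keys left there can only shrink in number as jobs finish or their expiry elapses.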
In the third and final step, we change the code so that the only Redis instance involved with the Locking module is the locks Redis. Acquiring, releasing and checking lock keys now all happen there.

    private

    def redis
      Lock.redis
    end
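The post does not describe how the team decided it was safe to drop the fallback to the jobs Redis. One possible check, sketched here purely as an assumption, is to count the lock keys still present on the jobs Redis and cut over only once none remain. The `lock:*` key pattern and the `remaining_lock_keys` and `safe_to_cut_over?` helpers are hypothetical; `FakeKeyspace` is an in-memory stand-in that imitates the `scan_each(match:)` method of the redis-rb client so the example runs without a server.

```ruby
# Hypothetical in-memory stand-in exposing a scan_each(match:) method shaped
# like the one in redis-rb: yields matching keys, or returns an Enumerator
# when no block is given.
class FakeKeyspace
  def initialize(keys)
    @keys = keys
  end

  def scan_each(match: "*")
    return to_enum(:scan_each, match: match) unless block_given?

    @keys.each { |key| yield key if File.fnmatch(match, key) }
  end
end

# Counts keys matching the (assumed) lock key pattern using SCAN-style
# iteration, which avoids blocking the server the way KEYS would.
def remaining_lock_keys(redis, pattern: "lock:*")
  redis.scan_each(match: pattern).count
end

# The fallback can be removed once the old instance holds no lock keys.
def safe_to_cut_over?(jobs_redis, pattern: "lock:*")
  remaining_lock_keys(jobs_redis, pattern: pattern).zero?
end

draining = FakeKeyspace.new(["queue:default", "lock:SyncInventoryJob:[42]"])
drained = FakeKeyspace.new(["queue:default"])

remaining_lock_keys(draining) # one lock key is still present
safe_to_cut_over?(drained)    # no lock keys left on the jobs Redis
```

Since every lock is written with an expiry, a check like this would eventually pass even if some jobs never released their locks explicitly.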
With these steps, we were able to migrate the lock keys successfully without impacting the platform 🎉 🎉.
Before starting the migration we asked ourselves questions like: Would the locks Redis be able to handle the load? Is the locks Redis a single point of failure?
The changes weren’t as straightforward as described above. There were other components involved, many tests to modify, and some infrastructure changes to make in other areas for this to happen, but those are outside the scope of this post.
Of course, there is no simple, one-size-fits-all solution, but I wanted to share our approach with everyone, and hopefully, if you encounter a similar situation this could be of use.
If you have any thoughts or questions, please share, and I will be happy to answer in the comments.