Sending an e-mail to millions of users

November 20, 2017 | Adrien Siami | 7-minute read

Recently, we had to send an e-mail to all our active users. For cost reasons, we decided to invest a bit of tech time and to go with transactional e-mails instead of using an e-mail marketing platform.

While it would certainly be quite straightforward for, say, hundreds or even thousands of users, it starts to get a bit more complicated for larger user bases.

In our case, we had to send the e-mail to ~1.5 million e-mail addresses.

In this blog post, I’ll quickly explain why a standard approach is not acceptable and go through the solution we chose.

A naive solution

Let’s implement a very naive way to send an e-mail to all our users. We’re going to create a job that loops through all the users and enqueues an e-mail.

class MassEmailJob < ApplicationJob

  queue_as :default

  def perform
    User.find_each do |user|
      Notifier.some_email(user).deliver_later
    end
  end

end

Now let’s see what could go wrong.

Your job might get killed

Looping through millions of users is not free and it will most likely take a fair amount of time.

During that time, you may well deploy, which restarts your job manager and kills your job. You then have no way of knowing which users have received the e-mail and which have not.

One easy fix would be to run the code outside of your job workers, maybe in a rake task, but you have to make sure it won’t get killed, or that if it’s killed, you can resume it without any issue.

You are going to get blacklisted from e-mail providers

E-mail providers don’t like spam. If you send thousands of e-mails from the same IP in a short time, you are very likely to get throttled or even blacklisted.

Therefore, it is necessary to space out the e-mails a bit, for example, adding a 30s delay every 100 e-mails.
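To get a feel for what such spacing means at scale, here is a rough back-of-the-envelope sketch in plain Ruby. The slice size and delay are just the example values above, and the method names `delay_for` and `total_spread` are made up for this illustration.

```ruby
# Illustrative only: 100 e-mails per slice and 30 seconds between slices
# are the example values from the text, not a provider recommendation.

# Delay (in seconds) before the e-mail at the given 0-based position goes out.
def delay_for(position, slice_size: 100, slice_delay: 30)
  (position / slice_size) * slice_delay
end

# Time (in seconds) between the first and the last slice of `total` e-mails.
def total_spread(total, slice_size: 100, slice_delay: 30)
  return 0 if total.zero?
  delay_for(total - 1, slice_size: slice_size, slice_delay: slice_delay)
end

puts delay_for(0)                      # first slice, no delay
puts delay_for(250)                    # third slice
puts total_spread(1_500_000) / 3600.0  # hours needed for 1.5 million e-mails
```

With these example values, 1.5 million e-mails would be spread over roughly 125 hours, so the slice size and delay are worth tuning against what your provider tolerates.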

You are going to congest your job queue

Every e-mail to be sent equals a job run in your job queue: if you enqueue millions of jobs in the same queue you use for other operations, you’re going to create a lot of congestion.

Therefore, you’d probably want to have a special queue only for your sending with a dedicated worker.

Our solution

First, let’s list the requirements we had in mind:

  • We wanted to be able to enqueue as many or as few e-mails as we liked, to test the waters first (check deliverability, congestion) and then scale up.
  • We wanted to easily tell apart the users for whom we had already scheduled an e-mail and those who were still waiting.
  • We had to be able to stop sending e-mails quickly in case something went wrong, and to resume without losing data.

Redis to the rescue

Redis is an amazing multi-purpose tool: it can be used to store short-lived data such as a cache, as a session store, etc. It offers a multitude of useful data structures; the one we’re going to use today is the sorted set.

A sorted set is a bit like a hash / dictionary / associative array: it contains a collection of unique values (members), each of which has a score, and it keeps them ordered by that score.

Redis offers very useful commands to deal with sorted sets. Let’s have a look at one in particular.

ZRANGEBYSCORE

This command returns the elements of the sorted set whose score is between min and max (inclusive), and its LIMIT option caps the result at n elements. Can you see where this is going? :)

We’re going to store all our user ids in a sorted set, with a score of 0, and change that score to 1 when we enqueue an e-mail for them.

Then, it’s really easy to ask for any number of users for whom we haven’t enqueued the e-mail, using ZRANGEBYSCORE.
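The idea can be sketched without a Redis server at all. The snippet below is a toy, in-memory model built on a plain Ruby hash; `toy_zadd` and `toy_zrangebyscore` are hypothetical helpers that only mimic the behaviour described above, not the Redis client API. Real Redis returns members as strings and orders members with equal scores lexicographically, whereas this toy keeps integers.

```ruby
# Toy stand-in for a sorted set: a hash of member => score.
# A member appears at most once; adding it again just updates its score.
def toy_zadd(set, score, member)
  set[member] = score
end

# Mimics ZRANGEBYSCORE key min max LIMIT 0 count: members whose score is
# within [min, max], ordered by score then member, capped at `count`.
def toy_zrangebyscore(set, min, max, count)
  set.select { |_member, score| score.between?(min, max) }
     .sort_by { |member, score| [score, member] }
     .first(count)
     .map(&:first)
end

set = {}
(1..5).each { |id| toy_zadd(set, 0, id) }  # nobody has been e-mailed yet

batch = toy_zrangebyscore(set, 0, 0, 2)    # pick two pending users
batch.each { |id| toy_zadd(set, 1, id) }   # mark them as enqueued

p batch                                    # ids picked for this batch
p toy_zrangebyscore(set, 0, 0, 10)         # ids still waiting
```

Because the score doubles as a status flag, asking for pending users never requires scanning or remembering where the previous batch stopped.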

Building the sorted set

Let’s create a rake task to populate a sorted set with our user ids.

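task :populate_users_zset => :environment do
  redis = Redis.new(YOUR_CONFIG)
  # Walk the users table in batches, loading only the id column.
  User.select('id').find_each.each_slice(100) do |users|
    # Send each slice of 100 ZADDs as a single MULTI/EXEC transaction.
    redis.multi do |transaction|
      users.each do |user|
        # nx: only add new members, so a re-run never resets a score back to 0.
        transaction.zadd('mass_email_user_ids', 0, user.id, nx: true)
      end
    end
  end
end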

Here I’m using MULTI to add the user ids to the set in transactions of 100, to go easy on Redis.

While this task may take quite some time, it is safe to re-launch if killed.

Enqueuing a number of e-mails for sending

Now that we have our sorted set, let’s write another task. This one will pick a given number of user ids from the set and enqueue an e-mail for them, while spacing out the sends in time a bit.

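task :send_email_batch, [:batch_size] => :environment do |t, args|
  redis = Redis.new(YOUR_CONFIG)
  # Rake passes task arguments as strings, so convert the batch size explicitly.
  batch_size = Integer(args.batch_size)
  # Pick up to batch_size ids that still have a score of 0 (not enqueued yet).
  ids = redis.zrangebyscore('mass_email_user_ids', 0, 0, limit: [0, batch_size])
  delay = 30.seconds

  ids.each_slice(100) do |ids_slice|
    ids_slice.each do |id|
      Notifier.some_email(User.find(id)).deliver_later(wait: delay)
      # xx: only update an existing member; a score of 1 means "enqueued".
      redis.zadd('mass_email_user_ids', 1, id, xx: true)
    end
    # Each slice of 100 is scheduled 30 seconds after the previous one.
    delay += 30.seconds
  end
end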

Here I get as many user ids as requested thanks to ZRANGEBYSCORE and its limit option. I then iterate over the ids and enqueue the jobs 100 by 100, while delaying the sending by 30 seconds each time.

And that’s it! Thanks to this system you can gradually increase your e-mail batches while keeping an eye on deliverability.

Send 100 mails to test it out:

rake 'your_namespace:send_email_batch[100]'

Everything looks good? Send 1,000, then 10,000, etc.

Then it’s easy to know how many e-mails are left to be scheduled: just open a Redis console and ask away using ZCOUNT!

Remaining e-mails to schedule:

ZCOUNT mass_email_user_ids 0 0

E-mails already scheduled or sent:

ZCOUNT mass_email_user_ids 1 1

Cons

Obviously there is no perfect solution. Here are a few downsides:

Quite a few manual actions

This is clearly not a fire-and-forget solution. It needs a developer’s attention for a while: enqueuing the sends, monitoring, waiting for a batch to finish before sending another one, etc.

However, this kind of sending is usually rare but important, so having it done right is worth the effort.

Stopping the machine is possible, but at a cost

If you enqueue a lot of small batches, you’re going to be fine, but at some point you are going to enqueue batches of 100k e-mails or even more.

What if something goes wrong (deliverability dropping, etc.) and you want to stop everything to have a look? You would need to stop the dedicated worker, but the jobs are already enqueued. If you don’t resume for a long time, their scheduled times will have passed by the time you start over, so they will all run without any delay and you may experience congestion or throttling from your e-mail provider.
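A small simulation makes the cost of a pause more concrete. It assumes the same schedule as the task above (slices of 100, the first one due after 30 seconds, then one more every 30 seconds) and a worker that sends every due e-mail instantly; `due_at` and `backlog_on_resume` are hypothetical helper names used only for this sketch.

```ruby
# Number of e-mails whose scheduled time has passed `now` seconds after the
# batch was enqueued. Slice i (0-based) is due at (i + 1) * slice_delay.
def due_at(now, batch_size, slice_size: 100, slice_delay: 30)
  slices_due = now / slice_delay # integer division
  [slices_due * slice_size, batch_size].min
end

# E-mails that become runnable all at once when a worker stopped at
# `paused_at` is started again at `resumed_at` (both in seconds).
def backlog_on_resume(paused_at, resumed_at, batch_size)
  due_at(resumed_at, batch_size) - due_at(paused_at, batch_size)
end

# A batch of 100k e-mails, with the worker stopped 10 minutes in.
puts backlog_on_resume(600, 600 + 60, 100_000)       # after a 1-minute pause
puts backlog_on_resume(600, 600 + 6 * 3600, 100_000) # after a 6-hour pause
```

Under these assumptions, a one-minute pause releases only a couple of slices on resume, while a six-hour pause releases most of the batch in one burst, which is exactly the spacing the delays were meant to provide.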

This is a risk we were willing to take and that we mitigated with strong monitoring and cautious batching.

Conclusion

This solution worked well for our needs, but as always, your mileage may vary!

Sending millions of e-mails is tricky, but it is an interesting problem to solve. Thanks to a bit of custom dev and Redis, we were able to send our e-mail in a reasonable amount of time with excellent deliverability.
