Reading Rails - How Do Batched Queries Work?


Once again, we shall dig into the wonderful world of Rails using my patented “Pick Something At Random” technique. This time, we’ll investigate find_each, Rails’ mechanism for performing batches queries.

Batched Queries

When you query for records, ActiveRecord instantiates those records as models. Generally, this is exactly what you want, however there are times when you do not want all the records instantiated at once.

Instantiating records takes memory, if you just need to iterate over the records and either collect some information from them, or perhaps serialize them to JSON, you can decrease the total memory footprint by working in batches. This allows the garbage collector to clean up while you process those records.

The second reason, you may want to process records in batches if you are doing a long running task in the background. In this case you might want ActiveRecord to return a batch, do some work on it, and then sleep for a bit so that your background job doesn’t dominate the database.

How does ActiveRecord perform these batched queries? Before looking at the code, let’s look at an example, and see what SQL gets executed:


# Without using a batched query:
Hedgehog.all.map{|p| p.to_json }
# Hedgehog Load (0.8ms) SELECT "hedgehogs".* FROM "hedgehogs"

# With a batched query:
Hedgehog.find_each(batch_size: 50).map{|p| p.to_json }
# Hedgehog Load (0.7ms) SELECT "hedgehogs".* FROM "hedgehogs" 
#   ORDER BY "hedgehogs"."id" ASC LIMIT 50
# Hedgehog Load (0.6ms) SELECT "hedgehogs".* FROM "hedgehogs" 
#   WHERE ("hedgehogs"."id" > 50) ORDER BY "hedgehogs"."id" ASC LIMIT 50
# Hedgehog Load (0.3ms) SELECT "hedgehogs".* FROM "hedgehogs" 
#   WHERE ("hedgehogs"."id" > 100) ORDER BY "hedgehogs"."id" ASC LIMIT 50

You may notice is that find_each requires more time querying the database. Keep this in mind if you use it, as developers we need to be aware of the trade offs we make.

The first example is quite straightforward, every record is selected at once and returned in an array. The second example using find_each issues queries that dutifully limit the result set to the batch_size we requested. Notice that id is used to order these queries. This ensures that the same record is never returned twice.

Try playing around with find_each a bit in your console. Try querying with a where clause, how does that affect id condition? Try imposing a sort order, what happens, why?

How it Works

Pop open ActiveRecord’s source and take a look at lib/relation/batches.rb.

def find_each(options = {})
  if block_given?
    find_in_batches(options) do |records|
      records.each { |record| yield record }
    end
  else
    enum_for :find_each, options do
      options[:start] ? where(table[primary_key].gteq(options[:start])).size : size
    end
  end
end

The if statement switches the behavior of this method based on whether or not you pass in a block. Let’s investigate the else first. In this case Rails calls enum_for :find_each. This method will call whatever method is passed in as a symbol, and return an Enumerator object that you can pass around. You can see this in action by calling an enumerable method without passing a block:

pairs = [1,2,3].each_cons(2) #=> #<Enumerator: [1, 2, 3]:each_cons(2)> 
pairs.map{|i,j| i + j }      #=> [3,5]

In the example above, each_cons(2) returns an Enumerator that will return consecutive pairs on demand. Let’s see this in action in Rails by calling find_each without a block:

Hedgehog.find_each(batch_size: 50) #=>  #<Enumerator: #<ActiveRecord::Relation [...]>

Yup, there’s the Enumerator. This explains why I was able to call map on the initial example. It may seem odd that this method essentially calls itself again using enum_for, but this is actually a common idiom for Enumerable methods, and the call made by enum_for passes in a block to be executed.

Te block being passed to enum_for is a bit confusing. Take another look at it:

enum_for :find_each, options do
  options[:start] ? where(table[primary_key].gteq(options[:start])).size : size
end

At first glance, one might think this is the block that should be applied to find_each, but it appears to return a count of all the records. Enumerable objects may actually have an infinite number of items in them, Ruby can’t assume that it can iterate over every item and count them up. Instead enum_for allows a block to be passed in that will calculate the total number of items. What happens if we just call enum_for ourselves and don’t pass in a block:

# With the block for find_each:
Hedgehog.find_each.size            #=> 103
# (0.4ms)  SELECT COUNT(*) FROM "hedgehogs"

# With no block:
Hedgehog.enum_for(:find_each).size #=> nil

As you can see without the block, Ruby doesn’t know how it should count the results of find_each, but with the block, it performs a SQL count.

Now lets look at what happens once a block is present:

find_in_batches(options) do |records|
  records.each { |record| yield record }
end

ActiveRecord calls find_in_batches, and then yields once for each record in each batch. This lets us use find_each without worrying about the batches or their size. To the caller, it just looks like we get a nice set of records we can enumerate over. Although a small touch, it makes find_each much easier to use.

Now let’s dig into find_in_batches, it’s a larger method, so we’ll just pick out the most interesting bits, and see what we can learn. At the very beginning of the method, you’ll see a call to options.assert_valid_keys.

def find_in_batches(options = {})
  options.assert_valid_keys(:start, :batch_size)

Passing around hashes of options in Ruby is convenient, but error prone. assert_valid_keys raises an exception if one of the options isn’t in the list:

{:batch_size => 10}.assert_valid_keys(:start, :batch_size)
# => {:batch_size => 10}
{:size => 10}.assert_valid_keys(:start, :batch_size)
# => ArgumentError: Unknown key: :size. Valid keys are: :start, :batch_size

This is a nice way to ensure the caller doesn’t accidentally misname and option. Assuming the options are valid, Rails will now start iterating over batches of records. It does this by taking the relation you called find_each on, Hedgehog in my example, and setting an explicit sort order and limit.

relation = relation.reorder(batch_order).limit(batch_size)

batch_order is defined below as the primary key in ascending order. With the relation configured, it will now start fetching batches of records and yielding them:

records = relation.to_a

while records.any?
  records_size = records.size
  primary_key_offset = records.last.id
  
  yield records

  break if records_size < batch_size

  records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
end

Calling to_a on an ActiveRecord relation like Hedgehog.reorder(batch_order).limit(batch_size) will trigger a SQL select and fetch the records. This ensures that nothing else modifies relation.

The control flow is interesting because it shows off a few things you don’t tend to see as often in Ruby. First, the while loop will keep running until records.any? is false. This is a slightly unusual usage of Enumerable#any. Typically you call it with a block, however when it has no block, it behaves like !records.empty?:

![1,2,3].empty?         #=> true
[1,2,3].any?            #=> true
[1,2,3].any?{|i| i > 4} #=> false

There are actually two ways this loop can stop, one is in the while condition, but the second is break if records_size < batch_size. Calling break in Ruby will break out of a loop early. I generally feel it’s best to use while and break sparingly, but in this case we’ll see it helps avoid needless queries.

The logic here is pretty simple. For each batch, ActiveRecord gets the number of returned rows and the last id it saw. That whole batch is then yielded. Next comes the break. If there were fewer records than a full batch, we know we’re done. Otherwise the next batch of records is queried for by asking for all the records after the last id ActiveRecord saw.

Recap

Now we know how ActiveRecord implements find_each and find_in_batches, we also came across a few other neat things:

  • enum_for will return an Enumerable you can pass around
  • enum_for takes a block for calculating the size of an Enumerable
  • ActiveSupport defines Hash#assert_valid_keys for checking options
  • break can be used to exit a loop early

Thirsty for more? Read up on Object#enum_for, or poke around ActiveSupport and see how assert_valid_keys is implemented.

More articles in this series