Reading Rails - How Do Batched Queries Work?
Once again, we shall dig into the wonderful world of Rails using my patented “Pick Something At Random” technique. This time, we’ll investigate find_each
, Rails’ mechanism for performing batches queries.
Batched Queries
When you query for records, ActiveRecord instantiates those records as models. Generally, this is exactly what you want, however there are times when you do not want all the records instantiated at once.
Instantiating records takes memory, if you just need to iterate over the records and either collect some information from them, or perhaps serialize them to JSON, you can decrease the total memory footprint by working in batches. This allows the garbage collector to clean up while you process those records.
The second reason, you may want to process records in batches if you are doing a long running task in the background. In this case you might want ActiveRecord to return a batch, do some work on it, and then sleep
for a bit so that your background job doesn’t dominate the database.
How does ActiveRecord perform these batched queries? Before looking at the code, let’s look at an example, and see what SQL gets executed:
# Without using a batched query:
Hedgehog.all.map{|p| p.to_json }
# Hedgehog Load (0.8ms) SELECT "hedgehogs".* FROM "hedgehogs"
# With a batched query:
Hedgehog.find_each(batch_size: 50).map{|p| p.to_json }
# Hedgehog Load (0.7ms) SELECT "hedgehogs".* FROM "hedgehogs"
# ORDER BY "hedgehogs"."id" ASC LIMIT 50
# Hedgehog Load (0.6ms) SELECT "hedgehogs".* FROM "hedgehogs"
# WHERE ("hedgehogs"."id" > 50) ORDER BY "hedgehogs"."id" ASC LIMIT 50
# Hedgehog Load (0.3ms) SELECT "hedgehogs".* FROM "hedgehogs"
# WHERE ("hedgehogs"."id" > 100) ORDER BY "hedgehogs"."id" ASC LIMIT 50
You may notice is that find_each
requires more time querying the database. Keep this in mind if you use it, as developers we need to be aware of the trade offs we make.
The first example is quite straightforward, every record is selected at once and returned in an array. The second example using find_each
issues queries that dutifully limit the result set to the batch_size
we requested. Notice that id
is used to order these queries. This ensures that the same record is never returned twice.
Try playing around with find_each
a bit in your console. Try querying with a where clause, how does that affect id
condition? Try imposing a sort order, what happens, why?
How it Works
Pop open ActiveRecord’s source and take a look at lib/relation/batches.rb
.
def find_each(options = {})
if block_given?
find_in_batches(options) do |records|
records.each { |record| yield record }
end
else
enum_for :find_each, options do
options[:start] ? where(table[primary_key].gteq(options[:start])).size : size
end
end
end
The if statement switches the behavior of this method based on whether or not you pass in a block. Let’s investigate the else
first. In this case Rails calls enum_for :find_each
. This method will call whatever method is passed in as a symbol, and return an Enumerator
object that you can pass around. You can see this in action by calling an enumerable method without passing a block:
pairs = [1,2,3].each_cons(2) #=> #<Enumerator: [1, 2, 3]:each_cons(2)>
pairs.map{|i,j| i + j } #=> [3,5]
In the example above, each_cons(2)
returns an Enumerator that will return consecutive pairs on demand. Let’s see this in action in Rails by calling find_each
without a block:
Hedgehog.find_each(batch_size: 50) #=> #<Enumerator: #<ActiveRecord::Relation [...]>
Yup, there’s the Enumerator. This explains why I was able to call map
on the initial example. It may seem odd that this method essentially calls itself again using enum_for
, but this is actually a common idiom for Enumerable methods, and the call made by enum_for
passes in a block to be executed.
Te block being passed to enum_for
is a bit confusing. Take another look at it:
enum_for :find_each, options do
options[:start] ? where(table[primary_key].gteq(options[:start])).size : size
end
At first glance, one might think this is the block that should be applied to find_each
, but it appears to return a count of all the records. Enumerable objects may actually have an infinite number of items in them, Ruby can’t assume that it can iterate over every item and count them up. Instead enum_for
allows a block to be passed in that will calculate the total number of items. What happens if we just call enum_for ourselves and don’t pass in a block:
# With the block for find_each:
Hedgehog.find_each.size #=> 103
# (0.4ms) SELECT COUNT(*) FROM "hedgehogs"
# With no block:
Hedgehog.enum_for(:find_each).size #=> nil
As you can see without the block, Ruby doesn’t know how it should count the results of find_each
, but with the block, it performs a SQL count.
Now lets look at what happens once a block is present:
find_in_batches(options) do |records|
records.each { |record| yield record }
end
ActiveRecord calls find_in_batches
, and then yields once for each record in each batch. This lets us use find_each
without worrying about the batches or their size. To the caller, it just looks like we get a nice set of records we can enumerate over. Although a small touch, it makes find_each
much easier to use.
Now let’s dig into find_in_batches
, it’s a larger method, so we’ll just pick out the most interesting bits, and see what we can learn. At the very beginning of the method, you’ll see a call to options.assert_valid_keys
.
def find_in_batches(options = {})
options.assert_valid_keys(:start, :batch_size)
Passing around hashes of options in Ruby is convenient, but error prone. assert_valid_keys
raises an exception if one of the options isn’t in the list:
{:batch_size => 10}.assert_valid_keys(:start, :batch_size)
# => {:batch_size => 10}
{:size => 10}.assert_valid_keys(:start, :batch_size)
# => ArgumentError: Unknown key: :size. Valid keys are: :start, :batch_size
This is a nice way to ensure the caller doesn’t accidentally misname and option. Assuming the options are valid, Rails will now start iterating over batches of records. It does this by taking the relation you called find_each
on, Hedgehog in my example, and setting an explicit sort order and limit.
relation = relation.reorder(batch_order).limit(batch_size)
batch_order
is defined below as the primary key in ascending order. With the relation configured, it will now start fetching batches of records and yielding them:
records = relation.to_a
while records.any?
records_size = records.size
primary_key_offset = records.last.id
yield records
break if records_size < batch_size
records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
end
Calling to_a
on an ActiveRecord relation like Hedgehog.reorder(batch_order).limit(batch_size)
will trigger a SQL select and fetch the records. This ensures that nothing else modifies relation
.
The control flow is interesting because it shows off a few things you don’t tend to see as often in Ruby. First, the while
loop will keep running until records.any?
is false. This is a slightly unusual usage of Enumerable#any
. Typically you call it with a block, however when it has no block, it behaves like !records.empty?
:
![1,2,3].empty? #=> true
[1,2,3].any? #=> true
[1,2,3].any?{|i| i > 4} #=> false
There are actually two ways this loop can stop, one is in the while
condition, but the second is break if records_size < batch_size
. Calling break
in Ruby will break out of a loop early. I generally feel it’s best to use while
and break
sparingly, but in this case we’ll see it helps avoid needless queries.
The logic here is pretty simple. For each batch, ActiveRecord gets the number of returned rows and the last id
it saw. That whole batch is then yielded. Next comes the break
. If there were fewer records than a full batch, we know we’re done. Otherwise the next batch of records is queried for by asking for all the records after the last id
ActiveRecord saw.
Recap
Now we know how ActiveRecord implements find_each
and find_in_batches
, we also came across a few other neat things:
enum_for
will return an Enumerable you can pass aroundenum_for
takes a block for calculating the size of an Enumerable- ActiveSupport defines
Hash#assert_valid_keys
for checking options break
can be used to exit a loop early
Thirsty for more? Read up on Object#enum_for, or poke around ActiveSupport and see how assert_valid_keys
is implemented.
More articles in this series
- Reading Rails - HTTP DELETEs With a Link
- Reading Rails - Time Travel
- Reading Rails - TimeWithZone
- Reading Rails - How Does MessageEncryptor Work?
- Reading Rails - How Does MessageVerifier Work?
- Reading Rails - How Do Batched Queries Work?
- Reading Rails - The Adapter Pattern
- Reading Rails - Errors and Validators
- Reading Rails - How Validations Are Configured
- Reading Rails - Concern
- Reading Rails - More Migrations
- Reading Rails - Migrations
- Reading Rails - Attribute Methods
- Reading Rails - Change Tracking
- Reading Rails - Handling Exceptions