Enumerating over large datasets in Ruby

Posted on March 18, 2019 by wjwh


The Enumerable module in Ruby is one of the most useful in the language. Just by defining the each method on your container-like class and including the module, you get a huge number of methods “for free”. Many often-used classes in Ruby include Enumerable, like Array, Hash, Set and many more. However, for some data sets it is not practical to load all elements into memory, such as large database tables, big files or a Redis instance with a lot of keys. Luckily, these systems and the Ruby gems used to access them often provide a similar interface for iterating over all their elements.
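As a quick refresher, a minimal sketch of what that looks like (the NumberBag class and its contents are made up purely for illustration):

class NumberBag
  include Enumerable

  def initialize(*numbers)
    @numbers = numbers
  end

  # Defining each is all that Enumerable needs from us.
  def each(&block)
    @numbers.each(&block)
  end
end

bag = NumberBag.new(3, 1, 2)
puts bag.sort.inspect            # => [1, 2, 3]
puts bag.map {|n| n * 2 }.inspect # => [6, 2, 4]

With only that each method in place, sort, map, select and the rest of the Enumerable methods all work on the new class.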

Every once in a while a situation pops up where you have to resort to somehow touching all the keys/rows/elements in a datastore to fix whatever is wrong with them. For example, some time ago a coworker forgot to set an expiry for a subset of new keys in a Redis instance we use as a write-through cache, and regrettably we only discovered this a month later. We quickly fixed the problem for all new keys, but were stuck with about 10 million keys that would never expire and were just taking up space. Redis will gladly tell you how many keys do or do not have an expiry attached, but does not provide a simple way to access all of them. In the end, we had to resort to checking every key for an existing expiry and setting one if it was missing.

Enumerating large database tables

One case of “large data set that does not easily fit in memory” is the database table. These can grow very big indeed, and reading in a few hundred million rows would not only take a relatively long time (with all the locking and timeout issues that might ensue), it would also use far more memory than needed. Therefore, a query like MyModel.all.each {|mymodel| mymodel.do_stuff! } would probably be a bad idea.

Instead, ActiveRecord provides a find_each method, which will scan through your data in batches of 1000 rows at a time and invoke the passed block for each row, like so:

MyModel.find_each {|mymodel| mymodel.do_stuff! }

This query will have the same result as the first one, but never loads more than 1000 rows into memory at a time. It works with all the usual selectors, so something like MyModel.where('created_at < ?', '2019-01-01').find_each { ... } will do what you expect. Internally it relies on the fact that the primary key (which ActiveRecord creates by default for every table) must contain only unique values, so it can “climb” that column batch by batch.
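The batch size is also configurable if 1000 rows per query does not suit your workload. A minimal sketch (MyModel and do_stuff! are the same placeholders as above, and the SQL in the comment is only an approximation of what ActiveRecord generates):

# After the first batch, every subsequent batch is fetched with a query roughly like:
#   SELECT * FROM my_models WHERE created_at < '2019-01-01' AND id > <last id of previous batch>
#   ORDER BY id ASC LIMIT 500
MyModel.where('created_at < ?', '2019-01-01').find_each(batch_size: 500) do |mymodel|
  mymodel.do_stuff!
end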

Enumerating Redis keys

Redis does not really have the concept of a sorted index for keys, but it does provide a way to iterate over the data it holds using the SCAN command. This command may return some data and also returns a cursor that holds the state of the scan. You are meant to call the command repeatedly, passing in the cursor from the previous call (and receiving some items each time), until the cursor becomes zero, which indicates that the scan is complete.
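A minimal sketch of that raw cursor loop, using the low-level scan method of redis-rb (the “abc*” pattern and count of 1000 are just placeholders):

r = Redis.new
cursor = "0"
loop do
  # Each call returns the next cursor plus a batch of matching keys.
  cursor, keys = r.scan(cursor, :match => "abc*", :count => 1000)
  keys.each {|key| puts key }
  break if cursor == "0" # a cursor of "0" means the scan has come full circle
end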

The redis-rb gem abstracts this pattern with the scan_each method, which can be used in a very similar manner to the ActiveRecord example above. For example, to set a new expiry of 10 seconds for every key in the dataset that starts with the characters “abc”:

r = Redis.new
r.scan_each(:match => "abc*") {|key| r.expire(key, 10) }

There are similar commands for iterating over the items in Redis Sets, Hashes and SortedSets, sketched below. These commands provide only a few guarantees with regards to the keys returned, which are enumerated in the documentation. As an added warning, you almost never want to use redis.keys.filter { ... }, since this will translate to issuing the (in)famous KEYS * command. For Redis instances with a sizable number of keys, this will block the Redis server for a (potentially) long time as it enumerates and sends all the keys in the dataset to your client. The client will also use much more memory than needed in such a case.
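For completeness, the per-datatype variants in redis-rb look like this (a sketch; the key names are placeholders assumed to exist in your dataset):

r.sscan_each("some_set")        {|member| puts member }
r.hscan_each("some_hash")       {|field, value| puts "#{field}: #{value}" }
r.zscan_each("some_sorted_set") {|member, score| puts "#{member} has score #{score}" }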

Enumerating S3 buckets

The (excellent) AWS SDK gem for S3 also provides a method for enumerating the objects in a bucket:

s3 = Aws::S3::Client.new
# Each response yielded here is one page of (up to 1000) objects, not a single object:
s3.list_objects(bucket: 'my-bucket').each do |response|
  puts response.contents.map(&:size)
end
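If you prefer the higher-level resource interface of the same gem, it exposes a similar per-object enumerator that handles the pagination behind the scenes (a sketch; the bucket name is a placeholder):

require 'aws-sdk-s3'

s3 = Aws::S3::Resource.new
s3.bucket('my-bucket').objects.each do |object|
  # object is an Aws::S3::ObjectSummary with key, size, last_modified, etc.
  puts "#{object.key}: #{object.size} bytes"
end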

Since S3 provides very little in the way of consistency guarantees, if there are people or processes adding and deleting objects while your iteration is ongoing, it is generally undefined whether you will “see” a key that is created after the underlying AWS iterator has already passed its location (or that is deleted before the iteration gets to it).

Enumerating big files

As a final example, the IO class in the Ruby standard library comes with an each_line method that will give you lines one by one without ever loading the entire file into memory. The interface is probably familiar by now:

file = File.new('huge_log_file.txt')
file.each_line {|line| line.do_stuff! }
file.close
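If you do not want to manage the file handle yourself, the class-level File.foreach shortcut does the same thing and closes the file for you once iteration ends (do_stuff! is the same placeholder as above):

File.foreach('huge_log_file.txt') {|line| line.do_stuff! }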

Even though a file is a very different abstraction than a database or an S3 bucket, the interface for the user is remarkably similar!

Conclusion

In some cases the data set is so large that it becomes really important to pull out the bigger toys like parallelisation, clusters of background workers, MapReduce frameworks and the like. Especially if you have to perform a specific operation often, this might be the best choice.

However, there are also plenty of ‘one-off’ situations where the operation to be performed is straightforward, but the data set is too large to just load into memory all at once. In such cases, the various Enumerable-style iteration strategies described above provide a useful pattern, no matter what shape the actual data store takes.