The Test That Cried Fail


Let’s start with a story.

Once there was a developer who was eager and new. He wrote a feature. He wrote a test. He ran the test suite.

His test passed, and the feature seemed to work, but one test failed. Then it passed. His coworkers nodded sagely, “Yes”, they said, “that happens sometimes.” So he committed and carried on.

Weeks later, he fixed a bug, and lo, the test failed once again, but he knew this test, it had a penchant for failure. So he carried on.

That evening, news came from on high, there was a horrible flaw that must be fixed. He wrote a fix, and ran his tests. All but one test passed, but he knew this fickle test, and heeded it not.

Eager to prove his worth, he deployed the code. When the clock struck midnight, every user account was deleted.

There was much gnashing of teeth.

Hyperbole? Perhaps a little, indeterminate failures are costly. When taken seriously they rob us of time, and when ignored, they lull us into false security. So why do tests fail intermittently, and what can we do about it?

There are four major causes of undue headaches: time, external state, internal state, and unspecified behavior. By isolating our code from these, we can prevent many intermittent test failures.

Isolating Time

If a test depends on time, then time’s passage may introduce unexpected results. Let’s consider a simple function that returns the number of weekdays remaining this week.

def weekdays_remaining
  w = Date.today.wday # Sunday = 0, Saturday = 6 
  w < 5 ? 5 - w : 0
end

So if today is Wednesday, then Thursday and Friday remain in this week.

def test_weekdays_remaining
  remaining = weekdays_remaining
  assert_equal 2, remaining
end

This test passes today, but will fail tomorrow, and then pass again next Wednesday. Although contrived, there are lots of situations where we may use Time.now or Date.today, for instance you might send out emails on Fridays, or prevent some operations on weekends.

To test time dependent code, you need to control time. Actually, you need to control Ruby’s two clocks Time.now and Date.today. Controlling just one means that if your code, or any of your dependencies, call the other, you’ll end up in a very strange world where Time.now.to_date no longer equals Date.today.

Luckily there are plenty of libraries that can help you isolate time.

Let’s rewrite this test with ActiveSupport’s time helper:

def test_weekdays_remaining_on_wednesday
  travel_to(Date.new(2015, 10, 7)) do
    remaining = weekdays_remaining
    assert_equal 2, remaining
  end
end

Now we have isolated our test from time, and if we run it on Thursday, the test will pass. Since we are now masters of time, we can easily add tests for other days like Saturday and Sunday.

Symptoms: Tests fail when run at specific times.
Solution: Control time in your tests.

Isolating External State

Databases, file systems, and web services, are all external to your code, but can cause test failures if not controlled.

Consider this set of fictitious tests:

def test_items_should_have_names
  item = Item.find(13)
  assert !item.name.blank?
end

def test_destroying_items_is_logged
  Item.find(13).destroy
  assert LogEntry.where(item_id: 13, action: "destroyed").exists?
end

If the tests execute in the order listed, and your database actually has an Item 13, the tests will pass. What happens if the tests run in the reverse order?

If you use ActiveRecord, you have probably never run into this problem. Let’s consider why. By default, Rails runs every test in a database transaction, and then rolls that transaction back at the end of the test. So when test_destroying_items_is_logged finishes, Item 13 shows up again.

Your database is only one source of state, consider this insidious test:

def test_creating_users_increments_cached_user_count
  User.create! # Assume this increments "user_count"
  count = User.count
  cached_count = redis.get("user_count")
  assert_equal count, cached_count
end

Run this test once, and it will pass. Run the test again, and you may see a message like Expected 1 to equal 2. Although the database rolls back between tests, Redis is a separate system which does not roll back.

If the data you store in Redis is ephemeral, you can probably get away with clearing it after each test, and bring it back to a known state. In Rails you can achieve this with a setup hook:

setup do
  redis.flushdb
end

In a similar vein, consider the file system. If your code interacts with the file system, you may run into some unexpected behavior. Imagine running this test:

def test_generating_reports
  report = SampleReport.new
  report.run
  
  assert File.exist?("reports/sample.csv")
end

What happens if you change SampleReport to generate JSON, and write to "reports/sample.json" instead? The test will continue to pass on your machine until you delete "reports/sample.csv", but developers with a fresh checkout will see a failing test. In the simple case, you can just clean up after your test:

def test_generating_reports
  expected_path = "report/sample.csv"
  begin
    report = SampleReport.new
    report.run
  
    assert File.exist?(expected_path)
  ensure
    FileUtils.rm_f(expected_path)
  end
end

Symptoms: Tests fail when run multiple times.
Solution: Reset servers or filesystem to a known good state.

In each of these cases, you own the source of the external state. It’s your database or filesystem, so you can safely reset it to a known state. It is becoming increasingly common though to rely on third party services. If you’re lucky the service may provide a developer sandbox for you to experiment with. Unfortunately, this makes for a poor testing situation.

Consider the following test:

def test_charging_customer_records_balance
  user = User.find(13)
  
  user.charge(10.00) # issues HTTP POST to example.com
  user.charge(20.00) # issues HTTP POST to example.com
  balance = user.balance
  
  assert_equal 30.00, balance
end

What happens if this test is run twice? What happens if another developer runs this test while you run it? What if the service is down for maintenance?

Just as we controlled time, controlling the network can isolate your tests from this external state. Libraries such as WebMock allow you to declare what requests to make, and what responses they should yield. Instead of relying on the real service, you declare how the service should respond:

def test_charging_customer_records_balance
  stub_request(:post, "www.example.com/api/user/13/charge"))
    .to_return(
      {status: 200, body: '{id: 1, amount: 10.0}'},
      {status: 200, body: '{id: 2, amount: 20.0}'}
    )
   
  user = User.find(13)

  user.charge(10.00) # issues HTTP POST to example.com
  user.charge(20.00) # issues HTTP POST to example.com
  balance = user.balance

  assert_equal 30.00, balance
end

Assuming the responses are realistic, you can now test your application’s logic without relying on external state. This also makes testing failure cases simpler:

def test_balance_only_changes_on_successful_charges
  stub_request(:post, "www.example.com/api/user/13/charge"))
    .to_return(
      {status: 200, body: '{id: 1, amount: 10.0}'},
      {status: 422 }
    )
   
  user = User.find(13)

  user.charge(10.00)    # issues HTTP POST to example.com
  user.charge("$30.00") # issues HTTP POST to example.com
  balance = user.balance

  assert_equal 10.00, balance
end

Isolating tests from third party services avoids unpredictable behavior.

Symptoms: Tests fail when multiple people run them at the same time.
Solution: Mock out HTTP requests.

Isolating Internal State

Just because something is within the boundaries of your code, doesn’t mean it can’t introduce intermittent test failures. Here’s an example of a failure we recently had a whole lot of fun tracking down.

It’s a good idea to set limits, imagine you have a config file describing those limits. There’s a simple way to test those limits, just create a bunch of things:

def test_items_have_a_limit
  max_item_count = MyApp::Application.config.limits.max_items
  max_item_count.times { Item.create! }
  
  assert_raises(LimitException) do
    Item.create!
  end
end

Sooner or later, someone will gripe. Why spend several minutes creating tons of pointless objects? It would be more efficient to simply change the limit:

def test_items_have_a_limit
  MyApp::Application.config.limits.max_items = 1
  
  Item.create!

  assert_raises(LimitException) do
    Item.create!
  end
end

Brilliant! Now your tests run much more quickly. Except there’s one problem. Other tests seem to randomly fail now. Unlike your database, Redis, or that web service you mocked out, MyApp::Application.config is shared global state, and by changing it in a test, you have changed it for all the tests run after it. This is a particularly sneaky cause of indeterminate failures because the new test will never fail.

This isn’t just isolated to configuration data, you may also run into these types of issues if you change any object that persists outside of a given test case such as object caches or dynamically generated code.

To avoid this problem, you’ll want to make sure you always set any shared state back. Here’s a handy method you could use:

# Temporarily override an attribute
def temporarily_set(object, attribute, new_value)
  # Store the original value
  old_value = object.send(attribute)
  
  begin
    # Set the new value, and yield to the test
    object.send("#{attribute}=", new_value)
    yield
  ensure
    # Ensure that the old value is replaced
    object.send("#{attribute}=", old_value)
  end
end

Rewriting the test so that it doesn’t pollute other tests is simple:

def test_items_have_a_limit
  temporarily_set(MyApp::Application.config.limits, :max_items, 1) do  
    Item.create!

    assert_raises(LimitException) do
      Item.create!
    end
  end
end

Alternatively, this is a great time to use a generalized mocking library like Mocha or FlexMock. Let’s rewrite this test using Mocha’s stub method to temporarily override the limit:

def test_items_have_a_limit
  MyApp::Application.config.limits.stubs(max_items: 1)
  
  Item.create!

  assert_raises(LimitException) do
    Item.create!
  end
end

This might look similar to our initial test, but Mocha isolates these changes by tracking the objects it has modified, and undoing the magic at the end of each test.

Symptoms: Adding a test caused other tests to fail.
Solution: Isolate changes to objects that persist beyond the test.

Avoiding Unspecified Behavior

Most operations are specified. Asking your computer to add the integers 1 and 2 will result in 3, or at least it should. There are however operations that are valid, but unspecified. These are the source of some ridiculously intermittent failures.

Consider this simple test verifying that items send an email when marked done:

def test_marking_item_done_sends_email
  item = Items.first
  item.done!
  
  sent = ActionMailer::Base.deliveries.present?
  assert sent, "Expected marking an item done to send an email"
end

Run this a hundred times on your machine, and it may well pass. Run it on your build server, and maybe it fails. Perhaps most of the time the item fetched is not done, but once in a while it is already marked done, and no email is generated. How could this be? Well if you’re using Postgres (or most SQL databases), it’s because we didn’t specify a sort order:

If sorting is not chosen, the rows will be returned in an unspecified order. — PostgreSQL 9.4.5 Documentation

It happens to be that unspecified order order tends to be the insertion order. Things that “tend” to behave one way should set off alarm bells when writing tests.

There are a few ways to make our test more resilient. One is to be more specific, we want an item that is not done, so specify that:

def test_marking_item_done_sends_email
  item = Items.where(done: false).first
  # ...
end

That’s better, but it might be more clear to either use a fixture, or a test factory like Factory Girl. Both approaches help ensure you test a known object. Rails’ fixture data lets you reference records by name, so you know exactly what you’re getting:

def test_marking_item_done_sends_email
  item = items(:undone_item)
  # ...
end

Test factories, like Factory Girl, make it easy to build records suited to your test:

def test_marking_item_done_sends_email
  item = create(:item, done: false)
  # ...
end

Both approaches have merits, and most importantly, both help you avoid unspecified behavior.

A big theme in unspecified behavior is ordering, be it sorting records with non unique values, or the completion order of asynchronous jobs. If you can, make your tests resilient, or be explicit and avoid unspecified behavior all together.

Symptoms: Tests tend to pass in one environment, and fail in another.
Solution: Avoid unspecified behavior, especially ordering.

Recap

There are a lot of reasons why tests may fail one day, but not the next.

  • Isolate time when tests depend on the system clock.
  • Isolate resources you control by resetting them to a known good state.
  • Isolate resources you don’t control by mocking them out.
  • Mock out changes to shared values (config, constants, etc.)
  • Prefer explicit tests, and avoid unspecified behavior.

These are just a few common causes of intermittent failures. If you’re interested we can explore some more horrible examples in the future. Until then, do you have any favorite examples?

More articles in this series