Automated end-to-end tests work great to validate real-world behavior, but tend to fail at random times. How can we reduce their flakiness?
As a software engineer, few things are more frustrating than committing code changes that cause your automated tests to fail on your continuous integration (CI) systems. After working on a task for hours or even days, seeing a failing test can quickly deflate one’s motivation. When this happens, we immediately start looking for the reasons why that test failed and asking questions like:
Did the most recent code I merged change how things work? How could that one change I made on this section of the codebase break that other completely unrelated part that I didn’t touch? Was it carelessness on my end? Do I even know what I’m doing?
After second-guessing our abilities, an even more frustrating situation is realizing that the failing test isn’t your fault. In fact, running those same tests on your local system works perfectly well, and running the automated test suite on CI again shows that everything magically works without a hitch. Now you ask a different question: Why did this test randomly fail for no apparent reason?
If you’ve worked in software for any period, you’ve likely experienced the annoyance of dealing with flaky automated tests. It happens to everyone, whether you’re a scrappy startup or a huge tech conglomerate. These random failures are particularly prevalent in automated end-to-end tests, whose more extensive coverage creates additional points of failure.
Put simply, flakiness in automated testing means an automated test scenario doesn’t produce consistent results. Technically speaking, an automated test should always produce the same results for the current state of the application and its environment. If you run an automated test suite one hundred times, it should give you the exact same results every single time as long as the application’s codebase remains untouched and no changes have occurred on the systems running the tests. But if some test scenarios pass or fail randomly, you’re dealing with a flaky test.
For example, imagine your development team has a continuous integration pipeline that runs a suite of end-to-end tests every night. One Monday morning, the team returns to work to find that the latest test run failed. The test failure happened on a weekend when no team member made any changes to the codebase or the underlying infrastructure where the tests ran. After manually rerunning the pipeline, the failing test now passes as if nothing happened. As you can guess in this scenario, the unpredictability of flaky tests adds an extra layer of difficulty during the development process.
Flaky automated tests can happen anytime, such as running a single test scenario during development, performing smoke tests on a subset of scenarios before deployment, or executing a full-scale overnight test run. It can also happen only on a particular test case or randomly pop up across different test cases. The irregular behavior of flakiness makes it feel like it’s impossible to nail down the root cause.
The most frustrating part about flaky automated tests is that there’s rarely a single reason why they occur in your application. Sometimes, the problem lies in the environment where the tests run. Other times, it’s due to how the team builds and executes the automated tests. You might even begin to think that the flakiness is happening because the sun and the moon have aligned at that particular moment, just because you can’t explain the randomness of the test failures.
As mentioned earlier, end-to-end tests cover larger portions of your application’s architecture, meaning that more sections are involved in executing them compared to unit, functional and other lighter forms of automated testing. Although automated end-to-end tests are notorious for producing unexpected results from one test run to the next, there are a few primary suspects involved that typically cause these automated tests to fail randomly:
Most applications need to access data from a database, file system or other data store to work correctly, and specific test scenarios will also require information related to what it validates during a test run. Setting up the data to prepare your application for automated end-to-end tests requires some forethought to build the strategy for managing the data throughout the test run.
To understand how a poorly planned test data strategy leads to flaky tests, let’s say you have an end-to-end test to verify that your application’s sign-up page works, and it relies on using a unique email address to complete the process successfully. As part of preparing for the automated test run, the team populates the application database with a list of test users containing random email addresses. If the test generates its email address with the same pattern that was used to populate the database, there’s a chance the address already exists in the database, which causes the sign-up to fail.
The nature of end-to-end tests is that they will manipulate the application’s state, and if you’re not mindful of how the data changes throughout the test run, you can end up with a test that intermittently fails. These small oversights demonstrate the importance of planning how to manage test data to avoid flakiness.
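One way to sidestep the collision described above is to generate test addresses that cannot overlap with the seeded data. The following is a minimal sketch, not a prescribed approach: the `unique_test_email` helper, the `example.test` domain, and the seed pattern are all hypothetical names chosen for illustration.

```python
import uuid


def unique_test_email(domain: str = "example.test") -> str:
    """Return an email address that will not collide with seeded users."""
    # uuid4 carries 122 random bits, so a clash with seeded data or with
    # another test run is practically impossible.
    return f"signup-{uuid.uuid4().hex}@{domain}"


# Hypothetical seed data that follows a small, predictable pattern.
seeded_emails = {f"user{n}@example.test" for n in range(1000)}

email = unique_test_email()
assert email not in seeded_emails
```

The distinct `signup-` prefix also makes it easy to spot and clean up records created by the test run.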
The order in which you run end-to-end test scenarios can also influence flakiness during execution. Depending on your test framework, your end-to-end tests can run in a different sequence each time you execute them. This behavior forces developers and testers to write each test scenario so it runs independently and doesn’t affect the outcome of other tests.
Taking the previous example of testing the sign-up process for an application, you can’t use the created account in other tests if they run in random order. So, if you run a login test that relies on the account from the sign-up test, it won’t work predictably since the account isn’t guaranteed to exist for the login.
While most testing tools let you specify the test run order, and some teams advocate doing this for various reasons, you likely want to avoid depending on running tests in a specific sequence because:
An often overlooked part of running end-to-end tests is the environment where the tests get executed. Typically, developers or testers won’t run the full battery of end-to-end tests on their development systems due to the heavy weight of these tests. Instead of having teams wait a long time for an end-to-end test run to complete, they can defer running their tests to a continuous integration service. That way, they can continue working on their tasks while the automation runs in parallel without making anyone wait for the results.
However, most CI services use lower-powered servers with significantly fewer available resources than the average developer’s or tester’s computer, and the gap between these slower systems and local machines can introduce an element of flakiness, especially while running end-to-end tests that need additional resources during test execution.
For instance, web-based end-to-end tests will need to load a browser (usually in “headless” mode). Some browsers, like Google Chrome, love to consume as many system resources as possible, leaving little to none for everything else and causing a test to time out. These failures are especially difficult to debug if you have limited access to the CI server.
A developer or tester might have assumptions about other systems where the tests run or forget to handle specific conditions properly, leading them to write test code that sporadically fails. For example, a test might perform an action that triggers a background task, and the person creating the test halts execution for a few seconds before proceeding. However, there’s no guarantee the background task will always finish in the allotted time, causing a test failure. These pauses in test execution (known as sleep or wait, depending on the framework) are a surefire way to cause flakiness in end-to-end tests.
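Instead of pausing for a fixed number of seconds, a test can poll for the condition it actually cares about and give up only after a generous timeout. Here is a minimal sketch of that idea; `wait_until` is a hypothetical helper, and many end-to-end frameworks ship their own equivalent that would be preferable in practice.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 10.0,
               interval: float = 0.1) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check so a condition that became true during the final
    # sleep is not reported as a timeout.
    return condition()


# Simulated background task that finishes after roughly 0.2 seconds.
task_done = threading.Event()
threading.Timer(0.2, task_done.set).start()

# Returns as soon as the task completes rather than after a fixed pause.
assert wait_until(task_done.is_set, timeout=5.0)
```

The test proceeds the moment the task finishes on a fast machine, and a slow CI server gets the full timeout before the test is declared a failure.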
Another often-neglected source of flakiness is code related to dates and times. I recently worked on an application containing a test that somehow only worked after 4:00 PM. The test never failed for other developers or on the organization’s CI systems. I soon discovered the problem was in the test code. The test checked that a date on a page matched the current date but assumed the application was running on U.S. Pacific Time. Since I live in Japan, the test would pass only after 4:00 PM, when Japan and the United States West Coast share the same date. These examples show how this kind of code can become the culprit for a flaky test.
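The time zone mismatch above can be reproduced in a few lines. This sketch uses fixed UTC offsets to stay dependency-free, which is an assumption for illustration (real code would more likely use named zones that handle daylight saving time), and `displayed_date` is a hypothetical stand-in for whatever the page renders.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed offsets keep the sketch self-contained; they ignore DST changes.
JST = timezone(timedelta(hours=9))    # Japan Standard Time
PDT = timezone(timedelta(hours=-7))   # U.S. Pacific Daylight Time


def displayed_date(now: datetime, app_tz: timezone) -> date:
    """The date a hypothetical page would render for a given instant."""
    return now.astimezone(app_tz).date()


# 10:00 AM in Japan is still the previous evening on the U.S. West Coast.
morning_in_japan = datetime(2024, 6, 11, 10, 0, tzinfo=JST)

# Flaky comparison: the test machine's local date vs. the app's date.
assert morning_in_japan.date() != displayed_date(morning_in_japan, PDT)

# Stable comparison: derive the expected date in the app's time zone.
expected = morning_in_japan.astimezone(PDT).date()
assert expected == displayed_date(morning_in_japan, PDT)
```

Pinning the instant and converting it to the application’s zone gives the same answer no matter where or when the test runs.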
The reasons mentioned above are just a few areas to inspect when dealing with test flakiness in end-to-end tests. In most cases, there isn’t a “one-size-fits-all” approach to figuring out why a test becomes flaky. You’ll have to tackle the problem with a systematic approach, first by identifying the root of the issue and then considering what to do to resolve it quickly. It also helps to build an environment where flakiness isn’t tolerated so these problems happen less frequently in the future.
When faced with test flakiness in your project, here are some steps I recommend taking to fix the problem as soon as possible and make it happen less frequently.
When flakiness happens in a test suite, developers and testers often jump straight into making technical changes, hoping for a quick fix. From experience, this rarely resolves the issue due to the randomness of failures caused by a flaky test, and the same problem may reoccur down the road. Most of us tend to act first and ask questions later, which isn’t practical for problems without straightforward solutions like these. Instead of making hasty adjustments, it’s better to pause and identify the root cause of flakiness.
Checking where the flaky tests tend to happen can yield clues to crack the case. For example, if the flakiness only happens on continuous integration and across random test scenarios, verify that those systems have enough computing power to go through the tests. If the same test case fails often, dig deeper into the application’s state to check if your test data causes the issue. While finding the exact point of failure might be challenging and will take time, eliminating possibilities and focusing on solid leads can save time and effort by targeting the most likely causes and avoiding new problems.
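Even a rough tally of past failures can show which of those two patterns you are facing. The sketch below assumes you can export a failure history from your test runs; the `Failure` record, the test names, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Failure(NamedTuple):
    test: str   # name of the failing test scenario
    env: str    # where it ran, e.g. "ci" or "local"


def summarize(failures: list[Failure]) -> tuple[Counter, Counter]:
    """Count failures per test and per environment."""
    by_test = Counter(f.test for f in failures)
    by_env = Counter(f.env for f in failures)
    return by_test, by_env


# Hypothetical failure history collected from recent test runs.
history = [
    Failure("checkout", "ci"),
    Failure("sign_up", "ci"),
    Failure("search", "ci"),
    Failure("sign_up", "ci"),
    Failure("sign_up", "local"),
]

by_test, by_env = summarize(history)
print(by_test.most_common(1))  # the test that fails most often
print(by_env)                  # where failures concentrate
```

Failures spread thinly across many tests but concentrated in one environment point toward the systems running the tests, while one test dominating the count points toward that test or its data.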
When building an end-to-end test suite, we rarely think about the lifespan of the tests. We’d love for our work to live on for years without modifications, but codebases need to adapt and adjust to the surrounding business environment, including the tests around the application. There may come a time when an existing test has outlived its usefulness, and it’s not worth the effort to fix a flaky one that no longer serves a strong purpose.
A team I helped recently had stability problems in their end-to-end tests that created a massive bottleneck in their development workflow. After looking at their automated test suite, I noticed that the team had built a collection of API tests that covered the same business logic as some of their flaky end-to-end tests. We quickly determined that it wasn’t worth maintaining the ones slowing the team down, and removing them substantially improved their development speed. Applications like Telerik Test Studio can help with this transition by combining both functional UI testing and API testing in a single package so that you can select the right tool for the job.
Taking the time to reevaluate and prune your automated tests, especially problematic test cases, can keep your workflow lean and running smoothly.
One of the reasons why flakiness is so persistent in end-to-end testing is that it’s easy to ignore. All it takes is rerunning the test suite and poof, the problem is gone, at least until it happens again.
Unfortunately, this only makes the problem worse, and Murphy’s law will eventually come into play. Your test runs will cause something to go wrong at the worst possible time, like being unable to deploy your application before an important product demo or working late into the weekend because you can’t figure out whether you’re dealing with a flaky test or a legit bug.
The best solution is to address flakiness as soon as it happens. Setting up alerts when a test fails in your continuous integration system gives you an opportunity to see problems occur in real-time. Using the functionality built into your testing tools also helps to smoke out these problems with ease. Telerik Test Studio, for instance, lets your team monitor test results through its Executive Dashboard and provides easy access to uncover which tests failed and why. The key is to focus on fixing your tests so they don’t snowball into an unreliable test suite that no one wants to use.
If you’re the only person on your team who cares about fixing flakiness, your job will be much more challenging. To get the most out of the process, having a solid testing culture in your organization will make all your efforts much more manageable. In testing, it’s dangerous to go alone. Instilling the habit throughout the entire team of determining the causes of flaky tests, evaluating existing tests, and taking swift action to correct issues inevitably reduces flakiness.
Admittedly, building a culture around testing for any software development team is easier said than done. Developers are notorious for bypassing testing for various reasons, so getting them on board will take some effort. I’ve found that educating them and showing them the tangible, positive effects of fixing a flaky end-to-end test go a long way.
It might take much longer than you’d like, but establishing solid testing habits across the team is worth it in terms of faster development, higher-quality applications, and fewer headaches around QA.
No matter how careful you are when building an automated end-to-end test suite, you’ll run into flaky tests: test scenarios that fail randomly for no apparent reason in one test run, only to work again on the next run. The causes behind flakiness can span different areas, like a lack of strategy around test data, underpowered systems that run the tests or unexpected behavior in the test itself. Whatever the reason, it can quickly derail the work done during the development process since the team won’t know whether there’s a legitimate problem or if it’s just the test suite acting up again.
There’s no silver bullet for eliminating flaky end-to-end tests, but you can take steps to reduce them so they’re no longer a threat. Step back and attempt to understand why a flaky test appeared before jumping in with a solution so you can whittle down the possibilities. Figure out if the test still holds value and remove it if it doesn’t. Take action quickly since ignoring the situation makes things worse. Use tools like Telerik Test Studio to fix flakiness and improve your test automation processes. Finally, work on making your team understand the importance of resolving flakiness to help everyone do their best work.
Flakiness is inevitable, and the only thing we can do as developers and testers is to devise a strategy to resolve the issue before it becomes a bigger problem. These steps serve as a guide toward delivering high-quality products with fewer hassles along the way.
Dennis Martinez is a freelance automation tester and DevOps engineer living in Osaka, Japan. He has over 19 years of professional experience working at startups in New York City, San Francisco, and Tokyo. Dennis also maintains Dev Tester, writing about automated testing and test automation to help you become a better tester. You can also find him on LinkedIn and his website.