Data Driven Testing: What’s a Good Dataset Size?

Friday, January 25, 2013 by Jim Holmes

I thought I’d follow up on that last post, Using Data Driven Testing Wisely, with something specific about the size of the dataset for a data driven test (DDT).

What’s a good size for a DDT? As with everything in software engineering/testing, the answer is “42.” That, or “It depends.”

In all seriousness, the right size of a dataset for a carefully thought out scenario does indeed depend. My payroll algorithm in the last post was a simple test set. You may be working on something much more complex relating to finance, rocket science, or environmental controls.

Every situation’s different, but I can tell you that you need to re-examine how you’ve built your dataset and test script if you’re in the hundreds or thousands of rows of data.

I’m specifically not saying you’ll never need huge numbers of iterations of data. What I am saying is you need to re-evaluate your dataset if it’s that big. If you’ve done your due diligence around planning your data and you’re still that large, then fine—at least you’ve carefully thought things out and applied some of the steps I talked about in the previous post.
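To make that concrete, here's a minimal sketch of what a trimmed-down dataset might look like. Everything in it is an assumption for illustration: the `compute_pay` function and its overtime rule (time and a half past 40 hours) are hypothetical stand-ins, not the payroll algorithm from the previous post. The idea is one row per boundary or class of input, each with a stated reason for existing, rather than hundreds of near-identical rows.

```python
# Hypothetical payroll rule, assumed for illustration only:
# hours over 40 are paid at 1.5x the hourly rate.
def compute_pay(hours, rate):
    regular = min(hours, 40)
    overtime = max(hours - 40, 0)
    return regular * rate + overtime * rate * 1.5


# One row per equivalence class or boundary, not one row per possible input.
PAY_CASES = [
    # (hours, rate, expected_pay, why this row exists)
    (0, 10.0, 0.0, "no hours worked"),
    (39, 10.0, 390.0, "just under the overtime boundary"),
    (40, 10.0, 400.0, "exactly at the boundary"),
    (41, 10.0, 415.0, "just over the boundary"),
    (60, 10.0, 700.0, "well into overtime"),
]


def run_data_driven_test(cases):
    """Run every row through compute_pay and return descriptions of failing rows."""
    failures = []
    for hours, rate, expected, why in cases:
        actual = compute_pay(hours, rate)
        if abs(actual - expected) > 1e-9:
            failures.append(f"{why}: expected {expected}, got {actual}")
    return failures


if __name__ == "__main__":
    print(run_data_driven_test(PAY_CASES) or "all rows passed")
```

If you can't write down a "why" for a row, that row is a good candidate for removal.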

I’d also love to hear back from this blog’s readers:

What sizes of datasets (number of rows, number of parameters/columns) are you generally dealing with? What’s your typical set size? What’s the largest dataset you’ve pushed through a DDT, and why?

About the author

Jim Holmes has around 25 years of IT experience. He is co-author of "Windows Developer Power Tools" and Chief Cat Herder of the CodeMash Conference. He's a blogger and evangelist for Telerik’s Test Studio, an awesome set of tools to help teams deliver better software. Find him as @aJimHolmes on Twitter.

2 comments

  • James Higginbotham 25 Jan 2013
    Hi Jim,

    My experience is that DDT needs to happen with datasets that are a few orders of magnitude above the expected data size. Systems behave differently at each level, particularly databases, and testing with a much larger dataset will provide some basic evidence on how it will perform at each stage and where the stress points may be. It can also help with monitoring for those inflection points and planning accordingly.

    While the last few projects I have worked on could likely be classified as "little big data", they generally exceed the amounts of data many developers are used to dealing with, so those developers never consider these issues when designing and building their systems. 10-100 million rows in a single table is not uncommon for me now, with 8-12 columns per table using a variety of data types.

    Issues I have run across include databases that can no longer perform joins in memory and start working with temp files for join processing once you reach a specific breaking point. The number varies based on the hardware, database vendor, and usage patterns. The threshold often decreases when deploying onto cloud hardware where virtualization and I/O contention become more of an issue. There are also aggregation queries that look innocent enough when displayed on a page but start to break down once data size increases beyond some threshold.

    As a result, I often push 1 million rows at a minimum for systems that expect to see at least 100k rows in at least one table, and often push up to 10 million rows when I test. This helps me to catch poorly crafted queries, naive assumptions in the UI/data loading procedures, and important inflection points that can impact a system's performance.
  • Jim Holmes 25 Jan 2013
    @james: This is a great comment! I was focusing on functional testing using DDT, and you're bringing stress/load testing into the picture. You're absolutely right to call this out, and I should have made the differentiation.

    Thanks for a great response.
