There are various challenges to overcome when you implement a big data solution, but perhaps the biggest problem, and one that’s often overlooked, is how to create accurate test data. You’re implementing a new system to handle a massive amount of data, perhaps because your relational database can’t cope with the volume, so it’s vitally important to test the new system properly and ensure that it doesn’t fall over as soon as the data floods in.
To test thoroughly and identify bottlenecks and problems, you need a large volume of test data. If you don’t have access to real data, then you’re going to have to create it. But for this test data to do its job, it has to closely emulate the actual data that the system will be processing.
Test Data Management Challenges
When you’re selecting or creating test data you have to keep several factors in mind. It’s very important to maintain the integrity of your data and make sure that you can validate it and trace it through the system. Consider that data ages and can become obsolete. Privacy is vital if you intend to use any real data, so make sure that it is masked or anonymized wherever appropriate in accordance with government or industry privacy regulations.
The test data you select must be representative of the data that will be in use every day when the system goes live. You need to mimic the production data as closely as possible. Make sure that you understand the context of different data ‘states’ and that you’re able to trace them back to business workflows.
Select carefully, or draw up specific criteria for creating representative data. The primary test of completeness is that every category of data is covered. Every application and its workflows have special cases, and those special and boundary cases must be represented alongside the typical scenarios so that data quality is tested with full coverage.
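As a rough illustration, the Python sketch below mixes routinely generated records with a hand-maintained list of special and boundary cases, so every such case is guaranteed to appear in the dataset. The field names and boundary values are purely illustrative assumptions; yours would come from profiling your own workflows.

```python
import random

# Illustrative special and boundary cases -- the exact values would come
# from your own business workflows and data profiling.
BOUNDARY_CASES = [
    {"customer_id": 0, "order_total": 0.00, "country": "US"},               # minimum values
    {"customer_id": 2**31 - 1, "order_total": 999999.99, "country": "US"},  # maximum values
    {"customer_id": 1, "order_total": 10.50, "country": ""},                # missing country
    {"customer_id": 1, "order_total": -5.00, "country": "GB"},              # refund / negative amount
]

def typical_record():
    """Generate a record that resembles everyday production data."""
    return {
        "customer_id": random.randint(1, 1_000_000),
        "order_total": round(random.uniform(5.0, 500.0), 2),
        "country": random.choice(["US", "GB", "DE", "FR", "IN"]),
    }

def build_dataset(n_typical, boundary_cases=BOUNDARY_CASES):
    """Combine typical records with every special/boundary case."""
    records = [typical_record() for _ in range(n_typical)]
    records.extend(boundary_cases)   # guarantee coverage of the edge cases
    random.shuffle(records)          # interleave them as they would arrive
    return records
```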
Make sure that you also test at the peak expected production data volume, so you can confirm that this amount of data can be loaded within the agreed timeframe.
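One simple way to check this is to time a bulk load of peak-volume data against the agreed window. The sketch below assumes a hypothetical load_batch function standing in for whatever actually writes to your platform, and the window and volume figures are placeholders.

```python
import time

AGREED_LOAD_WINDOW_SECONDS = 4 * 60 * 60   # assumed 4-hour load window
PEAK_RECORD_COUNT = 50_000_000             # assumed peak daily volume

def timed_load(records, load_batch, batch_size=100_000):
    """Load records in batches and report whether the agreed window was met.

    `load_batch` is a placeholder for whatever actually writes to your
    big data platform (a bulk insert, an HDFS put, a streaming producer...).
    """
    start = time.monotonic()
    for i in range(0, len(records), batch_size):
        load_batch(records[i:i + batch_size])
    elapsed = time.monotonic() - start
    print(f"Loaded {len(records):,} records in {elapsed:,.0f}s "
          f"(window: {AGREED_LOAD_WINDOW_SECONDS:,}s)")
    return elapsed <= AGREED_LOAD_WINDOW_SECONDS
```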
Building Your Test Data
Ideally you’ll have a test team with deep expertise in the business data and the applications involved, because creating data for testing is quite a different prospect from regular software testing and requires a different approach. Testers should identify their test data requirements from the test cases, which means they must capture the end-to-end business process and the associated data for testing. This could involve a single application or multiple applications.
Start by assessing the system. How will data come in? There may be millions of data points arriving from different sources. Where are the patterns? What are the parameters of this data? The aim is to build test data that looks like the real thing and to feed it into your big data solution in the same way that real data will come in.
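For example, if the production feed arrives as files dropped into a landing directory, a paced writer like the sketch below can push generated records in at a realistic rate. The landing-directory pattern and the rate are assumptions; substitute your own ingest mechanism (message queue, API, bulk file transfer) as appropriate.

```python
import json
import time
from pathlib import Path

def feed_records(records, landing_dir, records_per_second=1000):
    """Write generated records as newline-delimited JSON, paced to mimic
    the arrival rate of the real feed.
    """
    landing_dir = Path(landing_dir)
    landing_dir.mkdir(parents=True, exist_ok=True)
    out_path = landing_dir / f"feed_{int(time.time())}.jsonl"
    interval = 1.0 / records_per_second
    with out_path.open("w") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
            time.sleep(interval)   # crude pacing to approximate arrival rate
    return out_path
```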
Creating realistic data patterns can be a major challenge. You may even need to reverse engineer the application systems to understand the structure. Analyze the patterns to learn how data typically enters the system, how it relates, and what a natural flow looks like. If you can identify trends and emulate them, then you can find a way of generating realistic data.
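As a minimal sketch of this idea, the snippet below learns a crude statistical profile from a (masked) sample of real records and then generates synthetic records that follow the same distributions. The field names are assumptions, and a real profile would cover far more than a mean, a standard deviation, and category frequencies.

```python
import random
import statistics
from collections import Counter

def learn_profile(sample_rows):
    """Derive a simple statistical profile from a (masked) sample:
    mean/stdev for a numeric field, observed frequencies for a categorical one.
    Assumes the sample contains at least two rows.
    """
    totals = [row["order_total"] for row in sample_rows]          # field names are assumptions
    countries = Counter(row["country"] for row in sample_rows)
    return {
        "total_mean": statistics.mean(totals),
        "total_stdev": statistics.stdev(totals),
        "country_weights": countries,
    }

def generate_from_profile(profile, n):
    """Emit synthetic records that follow the learned distributions."""
    countries = list(profile["country_weights"].keys())
    weights = list(profile["country_weights"].values())
    return [
        {
            "order_total": round(random.gauss(profile["total_mean"],
                                              profile["total_stdev"]), 2),
            "country": random.choices(countries, weights=weights, k=1)[0],
        }
        for _ in range(n)
    ]
```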
Testing, Maintaining, and Updating Data
The best approach is to create criteria and automate test data selection or creation, so that you can draw the test data you need when you need it. It’s important to consider potential changes in the future and include room to reconfigure your test data based on new criteria.
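One way to keep this automation reconfigurable is to express each criterion as a simple predicate, so new rules can be added later without rewriting the selection logic. The criteria and field names in the sketch below are invented for illustration.

```python
# Criteria are plain predicates; adding a new rule means adding one entry here.
# Field names and thresholds are illustrative assumptions.
CRITERIA = {
    "high_value_orders": lambda r: r["order_total"] > 1000,
    "international": lambda r: r["country"] != "US",
    "missing_country": lambda r: not r["country"],
}

def select_test_data(pool, wanted, per_criterion=100):
    """Pull records matching each requested criterion from a data pool."""
    selection = []
    for name in wanted:
        predicate = CRITERIA[name]
        matches = [r for r in pool if predicate(r)][:per_criterion]
        if not matches:
            raise ValueError(f"No records satisfy criterion '{name}'")
        selection.extend(matches)
    return selection

# Usage: select_test_data(pool, ["high_value_orders", "missing_country"])
```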
Consider also how you are going to store this data. If you create a large volume of test data, it has to be stored somewhere. You’ll also want to be able to refresh and update it regularly and migrate it into the system smoothly. Plan the logistics at the beginning to avoid problems later.
Make sure that you test data integrity and trace it end-to-end through the workflow to ensure nothing is being lost. Can you re-use the same data in multiple tests? You should have checks and balances in place to ensure that test data remains relevant and provides full test coverage.
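A lightweight way to check this is to compute a record count and an order-independent checksum over the traced fields at each stage of the workflow and compare them back to the source, along the lines of the sketch below. The key fields you trace are, of course, specific to your own data.

```python
import hashlib
import json

def fingerprint(records, key_fields):
    """Build a count and an order-independent checksum over the key fields,
    so the same data can be compared at each stage of the workflow.
    """
    digest = 0
    for record in records:
        key = json.dumps({f: record[f] for f in key_fields}, sort_keys=True)
        digest ^= int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return {"count": len(records), "checksum": digest}

def verify_stage(source_records, stage_records, key_fields):
    """Compare a downstream stage against the source; returns True if nothing
    was lost or altered in the traced fields."""
    return fingerprint(source_records, key_fields) == fingerprint(stage_records, key_fields)
```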
During the testing process, test data often diverges from the baseline, resulting in a less-than-optimal test environment, so refreshing the test data regularly can improve testing efficiency.
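If the test data lives on disk, a refresh can be as simple as restoring a baseline snapshot; the sketch below assumes a directory-based layout, which is only one possible arrangement.

```python
import shutil

def snapshot_baseline(test_data_dir, baseline_dir):
    """Capture the current test data set as the known-good baseline."""
    shutil.copytree(test_data_dir, baseline_dir, dirs_exist_ok=True)

def refresh_from_baseline(test_data_dir, baseline_dir):
    """Discard drifted test data and restore the baseline before the next run."""
    shutil.rmtree(test_data_dir, ignore_errors=True)
    shutil.copytree(baseline_dir, test_data_dir)
```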
Creating accurate test data is challenging, but it’s not an insurmountable problem; it just takes some upfront consideration and planning. Study the data flow in your system and take the time to understand it, so that you can properly emulate the production environment. Consider the infrastructure support you’re going to need and assign the necessary resources. If you can create high-quality test data, then you can really put your new big data solution through its paces and iron out any problems long before it goes live.