Data Warehouse Testing Quick Reference

A Data Warehouse (DW) stores data where it can be analyzed, archived, and secured. DW products are used to manage large amounts of data (now commonly called Big Data). Business intelligence software, in turn, transforms that data into information that can be used to make decisions. With mobile devices, social networks, and web-based software collecting so much data, many organizations have begun to scrub, analyze, and filter it. We’ve recently worked on several data warehouse testing projects and want to share a few quick QA tips specific to this domain.

  • Design a basic test data unit: Build a small, well-understood dataset and design test cases against that known data, verifying results incrementally, step by step. Use a script to restore the basic test data unit to a known state so regression runs always start from the same baseline (a restore sketch appears after this list).
  • Design an exception data unit: Exception data are inputs deliberately constructed to cause an exception. Create a data unit in which the wrong conditions are applied and verify how the bad records affect the data (see the exception-data sketch below).
  • Data integrity: Decide which data needs to be tested and how to test it. Domain knowledge is often necessary to understand a query and the expected results from both the source system(s) and the data warehouse. Data needs to be checked across all dimensions and time units. For complex calculations, a sample of test data should be validated outside the warehouse environment (a reconciliation sketch follows the list).
  • Data quality: We usually execute three kinds of data quality tests: the historical data load test, the basic test data unit test, and the real data test. Historical data load tests serve as the mainstay of data validation. Basic test data unit tests validate every component. Finally, the live data test is run in an operational environment as a smoke/regression pass.
  • Operating process test: This test integrates all components and ensures that the whole system operates as expected. For example, verify that the ETL process starts at the right time, and that the OLAP multi-dimensional data units are compiled and aggregated as expected. This means first checking that the underlying data has been updated correctly, and then verifying each aggregation level across the appropriate time dimensions (see the aggregation check below).
  • Live test: This test should simulate the real data environment and work environment. It may last several weeks and should span at least one operational cycle. For instance, most reports are produced monthly or quarterly, so a live test period needs to cross those time boundaries.
  • Performance test: Here, focus mainly on data loading and querying. When testing the data loading process, the simplest approach is to serialize the main components so that all ETL work completes before the multi-dimensional routines run and reports are created. With the test steps clearly separated, we can locate problems quickly whenever the process fails. Query performance testing is more difficult because we don’t know in advance what data users will query, aggregate, or report on. We usually mine the log files for the top 10 queries and use those as a starting point (a log-mining sketch closes the examples below).
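
The sketches below illustrate several of these tips in Python against an in-memory SQLite database; every table, column, and function name is an illustrative assumption rather than a reference to any particular product. First, restoring the basic test data unit so regression runs always start from the same baseline:

```python
# A minimal sketch, assuming a staging table named stg_orders; adapt the
# schema and rows to your own basic test data unit.
import sqlite3

BASIC_TEST_DATA = [
    # (order_id, customer_id, amount, order_date) -- small, hand-verified rows
    (1, 101, 250.00, "2023-01-15"),
    (2, 102, 75.50, "2023-01-16"),
    (3, 101, 120.25, "2023-02-01"),
]

def restore_basic_test_data(conn: sqlite3.Connection) -> None:
    """Drop and reload the known baseline so every regression run starts identically."""
    conn.execute("DROP TABLE IF EXISTS stg_orders")
    conn.execute(
        "CREATE TABLE stg_orders (order_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", BASIC_TEST_DATA)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    restore_basic_test_data(conn)
    (count,) = conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()
    assert count == len(BASIC_TEST_DATA)  # baseline in place for the next test step
    print(f"Restored {count} baseline rows")
```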
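
Next, the exception data unit: rows deliberately built to violate the load rules, each paired with the rejection we expect. The validate_row function and its two rules are hypothetical stand-ins for whatever rejection logic the real ETL applies:

```python
# A sketch only: the validation rules here are assumptions, not a real
# product's API.
from datetime import datetime

def validate_row(row: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the row would load."""
    errors = []
    if row.get("amount") is None or row["amount"] < 0:
        errors.append("amount must be a non-negative number")
    try:
        datetime.strptime(row.get("order_date", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("order_date must be YYYY-MM-DD")
    return errors

# Each exception row is paired with the field we expect it to fail on.
EXCEPTION_DATA = [
    ({"amount": -10.0, "order_date": "2023-01-15"}, "amount"),     # negative amount
    ({"amount": 50.0, "order_date": "15/01/2023"}, "order_date"),  # wrong date format
    ({"amount": None, "order_date": "2023-01-15"}, "amount"),      # missing amount
]

for row, expected_field in EXCEPTION_DATA:
    errors = validate_row(row)
    # Every planted defect must be caught, and for the reason we planted it.
    assert errors and expected_field in errors[0], f"{row} slipped through: {errors}"
print("All exception rows were rejected as designed")
```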
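
For the data integrity tip, a minimal reconciliation sketch: compare row counts and amount totals between an assumed source table (src_orders) and warehouse fact table (fact_orders), grouped by month so the check also covers the time dimension:

```python
# A minimal sketch, assuming both tables carry order_date (YYYY-MM-DD) and
# amount columns; real reconciliation would cover more measures.
import sqlite3

def monthly_totals(conn: sqlite3.Connection, table: str) -> dict:
    """Map month -> (row count, amount total) for the given table."""
    sql = (f"SELECT substr(order_date, 1, 7) AS month, COUNT(*), SUM(amount)"
           f" FROM {table} GROUP BY month")
    return {month: (n, total) for month, n, total in conn.execute(sql)}

def reconcile(conn: sqlite3.Connection) -> list[str]:
    """Report every month where source and warehouse disagree."""
    source = monthly_totals(conn, "src_orders")
    warehouse = monthly_totals(conn, "fact_orders")
    return [
        f"{m}: source={source.get(m)} warehouse={warehouse.get(m)}"
        for m in sorted(set(source) | set(warehouse))
        if source.get(m) != warehouse.get(m)
    ]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    rows = [("2023-01-15", 250.0), ("2023-01-16", 75.5), ("2023-02-01", 120.25)]
    for table in ("src_orders", "fact_orders"):
        conn.execute(f"CREATE TABLE {table} (order_date TEXT, amount REAL)")
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)  # a clean load
    problems = reconcile(conn)
    assert not problems, problems
    print("Source and warehouse agree for every month")
```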
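
For the operating process test, a sketch of checking one aggregation level against the detail it rolls up from, here a hypothetical monthly summary table against daily facts:

```python
# A sketch with made-up fact_daily and agg_monthly tables; the first INSERT
# stands in for the warehouse's own aggregation job.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_daily (day TEXT, amount REAL)")
conn.execute("CREATE TABLE agg_monthly (month TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_daily VALUES (?, ?)",
                 [("2023-01-15", 250.0), ("2023-01-16", 75.5), ("2023-02-01", 120.25)])
conn.execute("INSERT INTO agg_monthly"
             " SELECT substr(day, 1, 7), SUM(amount) FROM fact_daily GROUP BY 1")

# The check itself: recompute the rollup from detail and diff it against the
# stored aggregate; any row returned means the aggregation is broken or stale.
mismatches = conn.execute(
    "SELECT m.month, m.amount, d.amount FROM agg_monthly m"
    " JOIN (SELECT substr(day, 1, 7) AS month, SUM(amount) AS amount"
    "       FROM fact_daily GROUP BY 1) d ON m.month = d.month"
    " WHERE ABS(m.amount - d.amount) > 1e-9"
).fetchall()
assert not mismatches, mismatches
print("Monthly aggregates match the daily detail")
```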
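
Finally, for query performance testing, a sketch of mining a log for the top 10 statements to use as test candidates. The one-statement-per-line log format and the literal-collapsing normalization are assumptions; a real log needs its own parser:

```python
# A sketch assuming one SQL statement per log line; normalize() folds literals
# so queries that differ only in their parameters count as one.
import re
from collections import Counter

def normalize(sql: str) -> str:
    """Fold literals and whitespace so similar statements group together."""
    sql = re.sub(r"'[^']*'", "'?'", sql)  # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)    # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()

def top_queries(log_lines, n=10):
    """Return the n most frequent normalized statements with their hit counts."""
    counts = Counter(normalize(line) for line in log_lines if line.strip())
    return counts.most_common(n)

if __name__ == "__main__":
    sample_log = [  # stand-in for reading the real query log file
        "SELECT * FROM sales WHERE region = 'EAST'",
        "SELECT * FROM sales WHERE region = 'WEST'",
        "SELECT SUM(amount) FROM sales WHERE year = 2023",
    ]
    for query, hits in top_queries(sample_log):
        print(f"{hits:>4}  {query}")
```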

Data warehouse testing will only grow in importance as we amass larger and larger amounts of data stored in various formats and dimensions. Of course, this is not an exhaustive list of tips, but it’s a start. It also doesn’t cover industry-specific data warehouse testing, where special rules govern how data is accumulated, filtered, or scrubbed. We’ve been testing pharmacy and healthcare software, for instance, which carries many rules for data reporting as well as privacy. In a domain as complex as healthcare, data warehouse testing definitely takes time, because industry knowledge has to be accumulated.