Test Your Test Metrics





Date: Thursday, June 26, 2008

It might sound obvious, but one of the most important tasks of the software quality assurance life cycle is the collection and analysis of test results. Lacking metrics, stakeholders would have no idea whether they were making progress in removing software defects. Yet, as any experienced test specialist will testify, “Test metrics are similar to corporate balance sheets; what they show is very interesting, but what they hide is vital.”

Collecting test metrics is anything but simple, and ensuring that you are collecting the right metrics and interpreting them correctly can be even more challenging. For instance, your team might report that all defects are logged with adequate comments and that defect severity levels are trending down over time. Yet when the application is released to production, its performance proves dismal and users unexpectedly report many bugs. In effect, the surgery was successful, but the patient died.

Consequently, the burden is on the test manager to understand what test results reveal and, more importantly, what they don’t reveal. To provide an idea of what to look for, let’s review some examples of how commonly used metrics sometimes generate misleading results, and then discuss how to overcome these shortfalls.

The Defect-to-Remark Ratio – Test Cycles

Used for measuring how well defects are documented, this metric is calculated either as a ratio or as a percentage. Organizations typically categorize issues identified and logged by test teams as remarks. Once remarks are logged, the team validates them and eliminates duplicates and defects that are impossible to reproduce. All valid remarks are then marked as defects. Ideally, all remarks should be converted to defects.

At first glance, the chart above shows a promising trend: while only 40 percent of remarks were converted into defects during test cycle 1, by the last cycle the ratio had increased to 95 percent.
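
As a rough illustration of the calculation, the ratio can be sketched in a few lines of Python. The function name and the remark and defect counts below are hypothetical; the counts were chosen only so that the results match the 40 percent and 95 percent figures quoted above.

```python
def defect_to_remark_ratio(remarks_logged: int, valid_defects: int) -> float:
    """Return the percentage of logged remarks that were confirmed as defects."""
    if remarks_logged <= 0:
        raise ValueError("remarks_logged must be positive")
    if not 0 <= valid_defects <= remarks_logged:
        raise ValueError("valid_defects must be between 0 and remarks_logged")
    return 100.0 * valid_defects / remarks_logged


# Hypothetical counts per cycle: (remarks logged, remarks confirmed as defects).
cycles = {"cycle 1": (50, 20), "last cycle": (40, 38)}

for name, (remarks, defects) in cycles.items():
    print(f"{name}: {defect_to_remark_ratio(remarks, defects):.0f}%")
```

Note that the result depends only on what was logged; nothing in the calculation accounts for defects the team never found.
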
However, this chart may not reflect the real story. As part of the analysis, managers should consider factors such as:

* Test Coverage: Inadequate test coverage can deceive. For instance, a 95 percent ratio with 70 percent coverage is worse than an 80 percent ratio with 90 percent coverage.
* Number of Defects: The chart only covers remarks for known defects. Just as poor test coverage limits the value of a high ratio, so does a failure by the team to identify most of the defects.
* Defect Classification: The metric does not specify whether these were technical defects (for example, coding errors, an incorrect database schema, or the wrong tables being updated), functionality defects, or user interface (UI) defects.

Achieving a high Defect-to-Remark Ratio could provide a false sense of accomplishment. Interpreted in isolation, Defect-to-Remark Ratios provide only partial insight into software quality.

The Defect Severity Index

The Defect Severity Index (DSI) is a common method of measuring a weighted average of defect severity. Lower severity numbers (for example, Severity 1 or S1) denote the highest severity, so higher DSI values indicate lower average severity.

In this sample, the DSI improved by the second round of testing. But the overall average hides the fact that, although Severity 1 defects remained flat, Severity 2 defects spiked by 50 percent during the second cycle of testing.

The question then becomes: is an application with a lower DSI necessarily better than one with a higher DSI? DSI is meaningful only if test coverage is adequate and defect severities are assigned appropriately.

Mean Time to Find a Defect

The Mean Time to Find a Defect metric measures the amount of time that elapses before a new defect is detected; it is also called Mean Time to Failure (MTTF). Theoretically, the further into the test cycle, the more time should elapse before the next defect is discovered. Yet this metric alone doesn’t tell the full story either.
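
To make the definition concrete, here is a minimal Python sketch that averages the gaps between consecutive defect detections within one test cycle. The function name and the timestamps are hypothetical and are not taken from the article's data.

```python
from datetime import datetime, timedelta


def mean_time_to_find_defect(detected_at: list[datetime]) -> timedelta:
    """Average gap between consecutive defect detections in one test cycle."""
    if len(detected_at) < 2:
        raise ValueError("need at least two detections to measure a gap")
    ordered = sorted(detected_at)
    # Pair each detection with the next one and measure the time between them.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical detections at 0, 5, 11 and 16 minutes into a test session.
start = datetime(2008, 6, 26, 9, 0)
detections = [start + timedelta(minutes=m) for m in (0, 5, 11, 16)]

print(mean_time_to_find_defect(detections))  # 0:05:20
```

The average says nothing about which tests were running or how severe each defect was, so values from different cycles are comparable only when those conditions match.
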
For instance, in this sample, defects were discovered every 5-6 minutes during the first test cycle, compared with every 30 minutes by cycle 6. But this trend is meaningful only if test coverage is adequate. It could also be skewed by relative defect severity and by the types of tests that are conducted.

The team must know what kinds of tests were conducted in each cycle. For instance, a full regression test in cycle 1 would likely discover defects more rapidly than a reliability test during cycle 10. The trend is meaningful only if you compare apples with apples.

Test Coverage

Test Coverage is typically measured on two parameters: the number of requirements that are covered by test cases, and the amount of code that is exercised by testing. It can also be tracked by the degree to which documented test cases are actually executed, or to which planned tests are actually run. In reality, few testing teams ever achieve 100 percent coverage across all of these criteria.

In the example above, the team’s 70 percent requirements coverage would be considered satisfactory, in that most of the planned and documented tests were executed. The weak spot, however, was code coverage: less than half the code was exercised.

As a result, it pays to check whether the most important requirements and functionality are really tested. Did the tests cover the most critical requirements, or settle for the most elementary? Similarly, did the tests cover only explicit requirements, or implicit ones as well? What types of requirements were tested: functional or non-functional? None of these criteria is reflected in the above graph.

On the other hand, while the code coverage of less than 50 percent appears disappointing at first glance, it makes sense to dig deeper. For example, did the tests focus primarily on user experience elements, including JavaScript and HTML, or on the Java or C# code for the back-end business logic?
The answers to these questions should be considered in light of the criticality of the application itself; obviously, code coverage should be more comprehensive for core systems than for peripheral applications. Finally, documented or planned test cases must account for the importance of the tests: if actual test coverage is measured against an inadequate test plan, even 90–100 percent test coverage will prove disappointing.

Test Your Numbers!

The simple lesson to be learned from the preceding discussion is this: no single metric adequately measures software quality, so assemble a complementary battery of metrics to get the clearest picture. As the examples above show, the old maxim “statistics don’t lie, but statisticians do” could well apply to test metrics. So, when defining, collecting, and analyzing SQA metrics, keep the following rules in mind:

* Do not focus on a single metric in isolation.
* Choose and analyze the right combination of metrics to paint an accurate picture.

The author is General Manager - QA, Virtusa. He can be reached at venkataramanal@virtusa.com