April 28, 2007

Heuristics In Software Testing

Posted by Ben Simo

"The point of philosophy is to start with something so simple as to seem not worth stating, and to end with something so paradoxical that no one will believe it."
- Bertrand Russell

A heuristic is a commonsense guideline used to increase the probability of solving a problem by directing one's attention to things that are likely to matter. The word is derived from the Greek "heurisko" which simply means "I find". The exclamation "eureka", meaning "I found it!", shares roots with heuristic.

Just as the "Eureka!" screaming forty-niners of California's Gold Rush did not have secret knowledge or tools to tell them exactly where all the gold was buried; software testers don't know exactly where the bugs are going to hide. However, both software testers and gold miners know where bugs and gold have been found before. We can use that knowledge of past discoveries and the nature of what we seek to create heuristics that help us narrow in on areas most likely to contain the treasure.

Gold miners and testers can find treasure by accident. However, intentional exploration for bugs and gold is more likely to produce results than aimless wandering. That last statement is a heuristic: it is true most of the time, but sometimes it can be proven false. Sometimes wandering testers and miners stumble into something very important. I just don't want to do all my testing by accident.

Heuristic-based testing may not give us concrete answers, but it can guide us to the important things to test. Heuristics can also be used in automation to provide information to guide human testers.

There was a time that I told developers and project managers that I could not test their products when the requirements did not include straightforward "testable" criteria. I thought that I could not test without being able to report "pass" or "fail" for each test.

A good example was a requirement that stated something like "the user shall not have to wait an unacceptable amount of time". As a good quality-school tester working in a factory-school organization, I demanded to know how long was acceptable before I could start testing. I wanted to quantify "unacceptable". In this case, the truth is that "unacceptable" varies from user to user. There were no contractual SLAs to satisfy. I may not be able to report that the requirement is met, but I can provide useful information to help the project team determine whether the performance is acceptable.

I have since learned that answers to heuristic questions are useful. It was on that same project that I started applying heuristics to automated data validation. Even without concrete requirements, we testers can provide information that is useful in answering important testing questions -- especially the qualitative ones. (As a side note, I am now amazed at how much we who call ourselves "Quality Assurance" like to focus on quantitative requirements and metrics.)
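
Here is a minimal sketch (in Python) of what I mean by a heuristic check in automation. The thresholds and names are made up for illustration; the point is that the automation reports information for a human to judge rather than a hard pass or fail.

    # A heuristic response-time check (hypothetical thresholds; tune for your
    # context). The output is information for a human tester, not a verdict.

    def classify_response_time(seconds):
        """Classify a response time so a person can decide if it is acceptable."""
        if seconds <= 2.0:
            return "probably fine"
        elif seconds <= 8.0:
            return "worth a closer look"
        else:
            return "likely to feel unacceptable -- investigate"

    # Example: timings collected by an automated script.
    timings = {"login": 1.3, "search": 6.7, "report": 14.2}
    for page, seconds in timings.items():
        print(f"{page}: {seconds:.1f}s -> {classify_response_time(seconds)}")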

To use heuristics in testing, create a list of open-ended questions and guidelines. This will not be a pass/fail list of test criteria. It will not be a list of specific test steps. Instead it can be used to guide your test scripting and exploration for bugs. You will likely develop general heuristics that you can apply to all your testing and specific heuristics that apply to specific applications.

We need to be careful to apply heuristics as heuristics, not as enforceable rules. For example, most people involved in testing web applications have heard the heuristic that every page should be within three clicks of any other page. Applying this "rule" to web application design usually results in better usability. However, it does not always improve usability. Sometimes making every page reachable within three clicks is not reasonable. Adding too many links is likely to confuse users more than it helps. (Another heuristic?) Complex work flows often require that some pages be more than three clicks away from others. Common sense needs to be applied to heuristics to ensure they are used only when they fit the context.

Happy bug prospecting.

Eureka!

Next time.... apply heuristic oracles to test automation.


April 26, 2007

How Doctors Think

Posted by Ben Simo

Michael Bolton recently brought Dr. Jerome Groopman's newest book, "How Doctors Think", to my attention. Michael suggested to the software-testing list that this book contains information relevant to software testing. In the book, Dr. Groopman explores how doctors diagnose medical problems and what can be done to reduce errors in judgment.

Michael Krasny recently interviewed Dr. Groopman on public radio about this book. You can listen to the interview here. Dr. Groopman made a number of statements during the interview that I believe apply to software testers' search for, and diagnosis of, software bugs. I expect to find more in the book.

Here are some gems from the radio interview:

About computer-assisted diagnosis software:

It's a two-edged sword. On one hand, it can be useful if you feed in the most important and relevant symptoms, but the problem is that you need to be able to elicit them effectively from the patient, and so you return to language and you return to some of these thinking errors or these cognitive traps that we can all fall into. So if you put in, for example, the first prominent symptom that you hear from someone into the software package, it may be that the fourth thing that the person complains about is the most relevant one. Recently you may have seen, for example, in reading mammograms, that computer-assisted diagnosis actually generated more false positives. It caused the radiologist to think that shadows which he or she would normally think were benign were in fact cancer, and didn't add to the diagnosis of cancer -- the true positives. So I think this technology is worth exploring, but I think that we have to be very careful about it because in some ways it can seem seductive. You cannot substitute a computer for a human being who listens and filters and judges and elicits from the patient the full story.

The same applies to test automation. It can be a useful tool if we don’t let it lead us astray. We human testers need to be actively involved in listening, filtering, judging, and eliciting the full story.

About time constraints:

It's one of the biggest issues in the current delivery of medical care. You know, everyone is being told as physicians to see patients in 12 minutes. See patients in 11 minutes. So you're sitting there with one eye on the clock, and you can't think well in haste. And there's the natural inclination to anchor onto that very first impression, to stop there and just function as if you're working on an assembly line in a factory. ... patients are being ill-served.

I think everyone in the medical system feels under siege. There's not a lot of satisfaction, but in a way this is penny-wise and pound-foolish because the cost of misdiagnosis is extraordinary. … It also costs in terms of dollars. It's much more expensive to care for and treat an advanced problem than to make a diagnosis early on.

I think we have to force ourselves to resist, and part of that can be done: sometimes we have to extend a visit. But we really are beholden to administrators and managed care plans, so one of the things that I've begun to do is, if I can't get to the bottom of a problem in my 15-minute allotted visit, then I say to the patient, you know, I haven't figured it out. I need to think more. ... reschedule an appointment and spend more time.


Testers also make mistakes under time pressure. We need to be aware that our first impressions may not be right. Testing is not a factory assembly line (even though this is a widely held view). Testers need to be engaged throughout testing, not just mindlessly following test scripts. And sometimes we need to lobby for more time than the schedule originally gives us.


About the art vs. science of diagnosis and improving the effectiveness of that diagnosis:

There’s a seductive nature of numbers, but statistics from clinical studies and so on are just averages and they may not reflect the individual in front of you. ... So real judgment involves customizing medicine: looking at scientific data but also seeing how it applies or doesn’t apply to the person sitting in front of you.
...
It's not technology, but it's language. Language is really the bedrock of medicine. Most of what we do with a doctor – or should be doing with a doctor – is talking… engaging in a dialog.
...
... all of us are wired to make these kinds of snap judgments, to sort of go with our first impressions, to rely very heavily on intuition. … That often works, but in medicine, unfortunately, too often it doesn't work, because we have a tendency to latch on to that first bit of information that we get from a patient and then run with it, as opposed to keeping our minds open and constantly doubting ourselves.

Testing is not an assembly-line process. Testers need to keep their minds engaged throughout the testing process. We should not ignore our snap judgments (blink testing), but we also need to look and think beyond those first impressions. We need to continually question both the software and our own judgment as we test.

We need to communicate more than numbers. We need to communicate stories. We need to translate bugs and metrics into language that matters to the business.


April 21, 2007

How many tests do you need?

Posted by Ben Simo

“Testing is potentially an infinite process.” – James Bach

Nearly 30 years ago, Glenford Myers provided a self-assessment test to software testers in his book The Art of Software Testing. Much of this book is outdated; however, his test is still regularly used to illustrate the complexity of testing software. He asked readers to write down the test cases required to adequately test the following program.

The program reads three integer values from a card. The three values are interpreted as representing the lengths of the sides of a triangle. The program prints a message that states whether the triangle is scalene, isosceles, or equilateral.


Glenford Myers then fills the next page of the book with a list of 14 questions that should be answered by test cases for this simple program. He then makes the point that more complex programs are exponentially more difficult to test.

Elisabeth Hendrickson recently created a modern version of the program (no punch cards required) and challenged testers to create test cases and test her version. See "Testing Triangles: a Classic Exercise Updated for the Web" on her blog.


Last summer, James Bach offered a new challenge to a handful of testers sitting around the table at a Denver eatery. This challenge involves testing an even simpler system. James asked how many times we would need to press the button on the following system to test 85% of the possible results.

You are asked to test a black box with a single push button. Pressing the button spins an internal wheel that randomly stops at one of 100 possible positions. If any of the 100 possible results (otherwise unknown to the user) encounter a bug, the box will burst into flames. The system has no other output.

Have an answer? To help, let's consider a couple other systems.

How many times do you need to flip a coin to get 85% of the possible results? A coin has two possible results. The first flip will get you 50% of the possible results. Each additional flip has a 50% chance of not producing the remaining result. Therefore, it is possible -- although unlikely -- that we will never see both a result of heads and a result of tails.

How many times do you need to roll a die to get 85% of the possible results? A die has six possible results. The first roll will produce a new value. The chance of getting a new value drops as each new value is encountered. As with the coin, it is possible that we will never encounter all of the possible results. The chance that the next roll produces a value we have not yet seen is:

  • after 1 value has been seen: 83% chance (5 of 6)
  • after 2 values: 67% chance (4 of 6)
  • after 3 values: 50% chance (3 of 6)
  • after 4 values: 33% chance (2 of 6)
  • after 5 values: 17% chance (1 of 6)

Let's go back to James' challenge. How many times do you need to press the button to test 85% of the possible results? How about 95% of the results? And I will add the question: Is 100% test coverage feasible?

Click here to test a JavaScript implementation of this black box.

What if this black box had 200 possible values? What if there were 307,200 possible values?
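
If you would rather compute than press a button all day, here is a rough sketch (in Python) of one way to explore the question. It treats the expected number of presses as a partial coupon-collector problem and checks the math with a simple simulation. This is my framing of the question, not an official answer to James' challenge.

    # Sketch: how many presses, on average, to see a given fraction of the
    # equally likely results? Expected value via the partial coupon-collector
    # sum, plus a simulation as a sanity check.

    import math
    import random

    def expected_presses(total_outcomes, coverage):
        """Expected presses to observe `coverage` of the equally likely outcomes."""
        needed = math.ceil(round(total_outcomes * coverage, 9))  # guard float fuzz
        return sum(total_outcomes / (total_outcomes - seen) for seen in range(needed))

    def simulated_presses(total_outcomes, coverage, trials=500):
        """Average presses over random trials until the coverage target is reached."""
        needed = math.ceil(round(total_outcomes * coverage, 9))
        total = 0
        for _ in range(trials):
            seen, presses = set(), 0
            while len(seen) < needed:
                seen.add(random.randrange(total_outcomes))
                presses += 1
            total += presses
        return total / trials

    for outcomes in (2, 6, 100, 200, 307200):
        print(f"{outcomes} outcomes: ~{expected_presses(outcomes, 0.85):.0f} "
              f"presses expected for 85% coverage")

    print("simulated average for the 100-position box:",
          round(simulated_presses(100, 0.85)))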

Have an answer? Please share it.


April 19, 2007

Taxing my badometer

Posted by Ben Simo

I've used great software. I've used some horrible software. I've used lots of software with annoying bugs. I regularly spend a great deal of time working around bugs in software that I use on a daily basis. However, I think I have recently found the worst yet. It is not the worst because it fails to work. It is not the worst because it is missing features I want. It is the worst because it appears to work on the surface yet generates bad data that has a direct impact on my stress level, my finances, and my interaction with the IRS.

Dealing with the IRS -- or even thinking about it -- is already stressful enough. The tax code is complicated enough that it is easy to accidentally report the wrong income or deductions. If it were easy, we'd see accountants lining up at soup kitchens instead of raking in the cash helping us common folks with our tax returns.

For most of the past 10 years, I have elected to use commercial tax preparation software to help me complete and file my tax returns. I've used some tax software that was good, and some that was bad. I've even tried some freebie and online options. The software I used this year is nothing short of a software atrocity.

This software was swarming with bugs -- although most were not obvious. I suspect that most users didn't even notice.

After installing the software, I dutifully entered more financial data than I care to track. I tried a couple of what-if scenarios in the process, such as a last-minute IRA contribution. I entered some estimated values before entering the real data so that I could determine whether I should take one option or another. I also entered some data for deductions that I later removed after learning (by reading the IRS publications) that they did not apply to my situation.

I did not follow the interview process from start to finish. I went back and forth, entering data and revising it as I clarified the rules or gathered up all my evidence for income and deductions. After several hours, I thought I had a complete return. I had entered and verified all my data in the software's "interview" interface.

Then the automated error check failed. It would not let me continue until I fixed data in a form. This was a form that I had accidentally selected somewhere in the process and that contained no data. I tried to delete the form by unchecking the box that had made the form's options available. The software then informed me that I could not delete the form there. I went to the portion of the interview where it told me to delete the form, and there was no delete option. I then discovered that I could delete the form from the program's "forms" list. However, I later found that going back to any point in the interview recreated the form. There was no option in the software to permanently get rid of my accidental selection.

I discovered that there was no way to print my return for review until I got to the last step of the eFile process. This required that I go back in the process and tell it that I wanted to file on paper. I printed the return and was surprised to see that the data on the printed return did not match the current data in the interview interface. Some values that I had entered and then deleted or changed were still in the printed return. My return was inaccurate. I was glad that I went through the tedious process of getting to the print feature. If I had eFiled without first printing, I would have submitted a return that matched neither the data I had entered nor the data shown to me on the screen.

I then spent several hours deleting and reentering data until the printed forms finally matched the data I entered.

In the midst of my review, I noticed that the software had forced me to take an option that resulted in a lower federal tax bill. However, selecting this option instead of another equally valid option significantly increased my state tax bill. The software forced me to have an overall higher tax bill because it only considered the federal return when deciding which option was best for me. After about an hour, I realized that I could remove a deduction in order to force the software to take the alternate option.

Once everything was in order and the displayed and printed data matched, I selected the eFile option to submit my return. I entered bank account information to pay my taxes online. I then discovered that the eFile service I had paid for did not allow me to submit an electronic payment to the state. My returns were submitted. My federal payment was sent electronically. However, my state payment still had to be mailed. I could have filed my state taxes and paid online using the state's web site, as I had done in previous years. Instead, I thought it would be easier to use the integrated system. I wish the software had told me that I could not electronically submit my state payment before it took my money for the eFile service.

And to top it all off, the software did not tell me how to send in my payment by mail. I spent nearly an hour going through the system's help and the vendor's web site seeking guidance. I finally found an option for an online chat with customer service. It then took customer service a half hour to figure out how and where I should send my payment.

I asked if I could have a refund because of the buggy software and an eFile service that was not as good as the state's free eFile service. I was directed to call by phone to get a refund because they could not give refunds via the online chat. I called the number and waited on hold for over half an hour before I was disconnected. I called back and got a recording stating that the customer support office was closed.

The bad software was topped off by bad customer service. I was angry. My badometer was pegged.

I then did a little exploratory testing in the application. I discovered additional bugs that created incorrect tax returns. I noticed places where the software said it would calculate a value for me and then asked me for the value without giving any assistance. I found places where the instructions in the application did not match the IRS publications. I even found a way to get it to calculate a refund of any value on the state return without changing the income or deductions.

I understand that tax software companies have a very small window in which to do their development and testing. I am amazed that this software was released. The poor quality -- and the aggravation it caused -- have ensured that I will not be a returning customer next year.

I suspect that the software passed all the test cases created by its makers. I suspect that it even passed multiple automated tests. This just goes to show that passing all the preconceived scripted tests does not make a quality product. I do not know what methods this company uses for testing their software, but I suspect that it is primarily scripted testing.

I believe that exploratory testing (and model-based automated testing) would have been more likely to find the bugs I encountered than scripted testing.

As we test software, we need to consider the infinite possibilities of data and work-flow variations. We can't test all the variations, but we can vary what we test. It is easy for individual testers to select similar test options when they think they are choosing data randomly. Seek out variety. Use random data generation tools to provide more variety, as in the sketch below.
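
Here is a tiny sketch (in Python) of generating varied test inputs. The field names and ranges are invented for illustration and have nothing to do with any real tax product; a fixed seed keeps a failing data set reproducible.

    # Generate varied, reproducible test inputs (hypothetical fields and ranges).
    import random

    random.seed(2007)  # record the seed so a failing data set can be recreated

    def random_return():
        """One varied set of tax-return-like test inputs."""
        return {
            "wages": round(random.uniform(0, 250000), 2),
            "interest_income": round(random.uniform(0, 5000), 2),
            "ira_contribution": random.choice([0, 500, 4000]),
            "dependents": random.randint(0, 6),
            "filing_status": random.choice(
                ["single", "married_joint", "head_of_household"]),
        }

    for _ in range(3):
        print(random_return())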

How would you test tax software? How would you ensure that bugs like the ones I encountered don't aggravate customers?


April 16, 2007

Performance Testing Lessons Learned

Posted by Ben Simo

Web and client/server load testing can easily become a complex task. Most people I've met got started in load testing with only minimal training in using the test tools. This is how I got started in load testing -- although I had an advantage in that I had been exposed to load testing of communications systems. I also had experience with automated single-user performance testing. I had led some small-scale manual load tests with multiple testers on a conference call hitting the same client-server application at once. (And we found some show-stopping bugs doing that manual testing.) I had watched others perform load tests. I had read numerous load test plans and reports. However, I had never directly participated in executing automated load tests... then I was asked to lead a load testing project.

Through the years, I have made many mistakes designing, scripting, and executing load tests. Load testing easily becomes complex. Tool salespeople sometimes tell us that nearly anyone can create tests with their tools. (Yet buying test tools is sometimes just like buying a new car: the salesman tells you that the car is reliable and has a great warranty; then the finance person warns of everything that could go wrong that isn't covered in the warranty and tries to sell you an extended warranty and maintenance contract.) Learning the mechanics of how to use a tool is often the easy part. It's what you do with the tool that matters.

Here is the short list of some of the important performance/load testing lessons I have learned. Some I learned from my own experience. Some I learned from the failures of others.

  • Bad assumptions waste time and effort
    • Ask questions
    • Performance testing is often exploratory
    • Expect surprises
    • Prepare to adapt
  • Get to know the people in your neighborhood: no single person or group is likely to have all the required information
    • Subject-matter experts
    • Developers
    • System administrators
    • Database administrators
    • Network engineers
  • Don’t script too much too soon: you may end up tossing out much of what you script
    • Applications change
    • Real usage is difficult to estimate
    • Tool limitations may be discovered
  • Different processes have different impacts: what users do can be as important as, or more important than, how many users are doing it
    • Include high-use processes (80/20 rule)
    • Include high-risk processes
    • Include “different” processes
  • Modularize scripts: simplify script maintenance -- but only when you intend to run the script again
  • Data randomization is not always a good thing: randomization can make result comparison difficult
  • Code error detection and handling
    • Don’t assume that your tool will handle errors
    • Catch and report errors when and where they happen (see the sketch after this list)
    • Realize that errors may change simulated user activity
  • Know your tools and test environment
    • Tool’s supported protocols and licensing
    • Load generator and network resource usage
    • Load balancing and caching mechanisms
    • Beware of test and production environment differences
  • Try to translate results into stories that matter to the applicable stakeholders
    • Tests are run to answer questions: don't overwhelm your audience with hundreds of numbers if they just want a "yes" or "no" answer
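
Here is a rough sketch (in Python, rather than any commercial tool's API) of the "catch and report errors when and where they happen" item above. The URL and the success marker are placeholders; the point is that a response can look fine to a tool's default checks and still be the wrong page.

    # One simulated user action that fails loudly, with context, at the point
    # of failure. URL and success marker are placeholders.
    import time
    import urllib.request

    URL = "http://example.com/app/login"   # placeholder
    SUCCESS_MARKER = b"Welcome"            # text expected on a good response

    def one_iteration(user_id):
        start = time.time()
        try:
            with urllib.request.urlopen(URL, timeout=30) as response:
                body = response.read()
                elapsed = time.time() - start
                if SUCCESS_MARKER not in body:
                    # The server answered, but with the wrong page -- a failure
                    # that default tool checks often miss.
                    return ("error", f"user {user_id}: marker missing after {elapsed:.2f}s")
                return ("ok", f"user {user_id}: {elapsed:.2f}s")
        except Exception as exc:
            return ("error", f"user {user_id}: {type(exc).__name__}: {exc}")

    print(*one_iteration(user_id=1))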


and finally…


  • Most performance and load-related problems are due to software code or configuration, not hardware

    • Don’t throw more hardware at a software problem


April 15, 2007

Model-Based Test Engine Benefits

Posted by Ben Simo

A Model-Based Test Engine (MBTE) is a test automation framework that generates and executes tests based on a behavioral model. Instead of performing scripted test cases, an MBTE generates tests from the model during execution. Instead of implementing models in code, an MBTE can process models defined in tables (sketched below). Both human testers and computers can understand models defined in tables. An MBTE can be built on top of most existing GUI test automation tools. Combining good automation framework practices with Model-Based Testing (MBT) can turn some common test automation pitfalls into benefits.

Implementing an MBTE can produce the following:

1. Simplified automation creation and maintenance.
2. Simplified test result analysis.
3. Automatic handling of application changes and bugs.
4. Generation and execution of new tests – and discovery of new bugs.
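
As a taste of the table idea, here is a minimal sketch in Python. The states and actions are invented; a real MBTE would drive the application through a GUI automation tool instead of printing.

    # A behavioral model defined as a table: from this state, this action
    # should lead to that state. Tests are generated by walking the table.
    import random

    MODEL = [
        # (current state, action,        expected next state)
        ("LoggedOut",     "log in",      "Home"),
        ("Home",          "open report", "Report"),
        ("Home",          "log out",     "LoggedOut"),
        ("Report",        "close",       "Home"),
    ]

    def generate_test(steps=6, start="LoggedOut"):
        """Yield (state, action, expected) steps from a random walk of the model."""
        state = start
        for _ in range(steps):
            rows = [row for row in MODEL if row[0] == state]
            if not rows:
                break  # dead end in the model
            state, action, expected = random.choice(rows)
            yield state, action, expected
            state = expected

    for state, action, expected in generate_test():
        # A real engine would execute the action through the GUI tool here and
        # compare the application's actual state to `expected`.
        print(f"In {state}: do '{action}', expect {expected}")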

More to come...


April 7, 2007

Get excited about the negative

Posted by Ben Simo

"It is the peculiar and perpetual error of the human understanding to be more moved and excited by affirmatives than by negatives."

-- Francis Bacon

Selective thinking is the practice of giving more weight and credence to information that confirms our beliefs than to information that may contradict them. We are all guilty of this. We tend to easily believe data that confirms our beliefs and experience. We tend to ignore data that does not.

This confirmation bias can greatly impact our work as software testers. Testers (and developers) often test (and code) to confirm that requirements are met by testing the positives. We often overlook the negatives.

Peter Wason's card problem demonstrates this. The problem involves four cards, each with a number on one side and a letter on the other. They are presented with the following values visible:


A          B          4          7

The other sides of the cards are not shown.

The following claim is made: "If a card has a vowel on one side, then it will have an even number on the other."

The following question is then asked: Which cards do you need to turn over to determine whether the claim is true?

Try to solve the problem before continuing.

Research has shown that most people get the answer wrong. The majority of people believe that A and 4 must be turned over to answer the question. This suggests that most people try to confirm the positive when the question requires that we also try to disprove the statement. Both the positive and the negative need to be checked to answer the question. Click here to see the answer.

When we test software, we need to verify both the positive and the negative of the requirements. We need to ensure that the software does what it should do and does not do what it should not do.
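
Here is a small sketch (in Python, with a made-up validation rule) of what checking both sides looks like: confirming only that valid input is accepted is the "A and 4" trap; we also have to confirm that invalid input is rejected.

    def is_valid_quantity(text):
        """Hypothetical rule: a quantity must be a whole number from 1 to 99."""
        return text.isdigit() and 1 <= int(text) <= 99

    # Positive checks: the software does what it should do.
    assert is_valid_quantity("1")
    assert is_valid_quantity("99")

    # Negative checks: the software does not do what it should not do.
    assert not is_valid_quantity("0")
    assert not is_valid_quantity("100")
    assert not is_valid_quantity("-5")
    assert not is_valid_quantity("ten")
    assert not is_valid_quantity("")

    print("positive and negative checks passed")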


April 6, 2007

You're too negative

Posted by Ben Simo

I believe that some level of pessimism is required to be a good software tester. Some of the best testers I have met are pessimistic towards the systems they test. These black hat testers (of which I am one) consider what might go wrong and ask questions. Although I believe that applied pessimism is necessary for good testing, this negativity can hurt relationships and the project if it is not constructive. Optimistic developers and project managers often have trouble understanding us pessimists. The result of our pessimism needs to be better preparedness, not shared depression for everyone involved in a project. This is the type of pessimism that is the subject of Dr. Julie Norem's book, The Positive Power of Negative Thinking. Dr. Norem describes "defensive pessimism" this way:

Defensive pessimism is a strategy used by anxious people to help them manage their anxiety so they can work productively. Defensive pessimists lower their expectations to help prepare themselves for the worst. Then, they mentally play through all the bad things that might happen. Though it sounds as if it might be depressing, defensive pessimism actually helps anxious people focus away from their emotions so that they can plan and act effectively.

I do not agree that defensive pessimists are necessarily "anxious people". I see the pessimism as a necessary part of good critical thinking.

Think you might be a defensive pessimist? Take the defensive pessimist quiz.

While the pessimist's black hat is a necessary part of testing, Julian Harty argues that we need to try on Edward de Bono's other hats as well, both in analysis and in applied methodology. Take a look at Julian's CAST presentation from last year.
