Managing Risk with Usability Testing
By Aviva Rosenstein, Ph.D.
Usability Services Manager
At a trade show I attended earlier
this month, a young CEO asked me an excellent question about the principles behind
discount usability testing. He told me that he recently got a report on a usability test
conducted by members of his product development team, but was inclined to discount the
results because they weren't statistically significant. He asked, "Why should I pay
attention to usability test results based on the observations of only a few users?
Shouldn't the tests be based on a random sample of the whole population of users, and be
statistically valid?"
I told him,
"Usability tests conducted with a small number of users aren't intended to give you
statistically significant results. But that doesn't mean the test results are useless!
Performing a study with a small number of users can be a cost-effective way of minimizing
the potential risks of developing and launching unusable software or web interfaces -- IF
the participants for the tests are carefully selected, the study is appropriately designed
and conducted, and the data is analyzed methodically and accurately."
I want to share the
explanation I gave him with all of you because understanding how usability tests help you
manage risk can save you a lot of money and time.
Saying that the
findings of a study are "statistically valid" doesn't mean that the results of
the test are more useful than data resulting from other empirical research approaches. It
simply means that the kinds of observations made from the group participating in the study
are likely to be found within the larger population, within a given range of certainty.
While claims of statistical validity may be important for scientific investigators, they
are not always critical or even that relevant for product developers.
This claim may sound
surprising; so let me clarify it a little further. I'll begin by defining some terms often
used in conventional research.
Validity
Validity is the
degree to which a research design accurately measures what it is intended to measure,
while excluding possible alternate explanations for the findings.
Designing a valid
measure is more complicated than it might seem. In usability tests, for example, testers
frequently measure how long it takes each user to complete a given task. But that measure
may not always be an accurate reflection of the usability of the interface.
Here's an
illustration. In the past, I've seen reports based on using web server log data to measure
how long, on average, users took to add something to an online shopping cart and proceed
through a checkout process. While it might seem like "time to completion of
task" is a useful measure of usability, it doesn't account for other possible
explanations for the users' behavior. Some users may be distracted by cross-selling
opportunities and leave the shopping cart task to browse for other items. Other users may
be distracted by other tasks in their environment, such as a phone call or other
interruption. Some users may have opened other browser windows to research prices with
different vendors before completing their purchases. All of these possibilities could lead
to longer "time-to-completion of task" measures, but that doesn't make the
shopping cart interface any less usable! So an average of the time-to-completion data
in the log file may be statistically sound, but it is useless as a usability measure in this case.
Since other possible explanations for the results besides poor usability were not excluded
or accounted for, the time-to-completion measure doesn't actually measure the overall
usability of the shopping cart interface.
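To make the point concrete, here is a minimal sketch with made-up numbers. The durations and the "interrupted" flags are hypothetical; a real server log would record only the durations, which is exactly why the average cannot be interpreted on its own.

```python
from statistics import mean, median

# Hypothetical checkout durations in seconds, as they might be inferred
# from server log timestamps. These are illustrative values, not real data.
durations = [62, 58, 71, 65, 60, 540, 910, 1260]

# Hypothetical ground truth that a log file would NOT contain: whether the
# user was interrupted (phone call, price comparison, cross-sell browsing).
interrupted = [False, False, False, False, False, True, True, True]

# Sessions in which the user worked on the checkout task without a detour.
focused = [d for d, was_interrupted in zip(durations, interrupted)
           if not was_interrupted]

mean_all = mean(durations)
median_all = median(durations)
mean_focused = mean(focused)

print(f"mean, all sessions:     {mean_all:.1f} s")
print(f"median, all sessions:   {median_all:.1f} s")
print(f"mean, focused sessions: {mean_focused:.1f} s")
```

With these invented values, the average over all sessions is roughly six times the average over the uninterrupted ones, even though every user faced the same interface. The log-based average is measuring distraction as much as usability.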
Reliability
Reliability refers
to the extent to which repeated measurements using the research design under similar
conditions produce the same results.
However, reliability
does not guarantee validity. It's entirely possible to design a study that is reliable,
but still does not result in valid data. In the example above, the time-to-completion
average remained consistent in the server log data from week to week. This makes the
measure reliable, but it does not increase its validity, since it is not an accurate
measure of the shopping cart's usability.
Sampling
Sampling refers to
the practice of selecting a representative subset of study participants from the total
population of interest.
Researchers use
sampling because studying everyone in a population is usually far too impractical and
expensive. Theoretically, sampling allows the researcher to generalize their findings from
the subset to the larger population. For example, sampling a subset of potential users
out of the total user population and observing how they navigate a particular interface
allows one to generalize the behavior of the sample to the larger population with a
specified degree of certainty. Most people are familiar with random sampling, in
which every person in the population has an equal chance of being selected for the sample.
However, there are several other types of sampling methods available to the researcher,
each with its own level of precision and difficulty.
The sampling method
chosen for any usability study should balance the costs of the method against the desired
level of precision needed for the study. A "convenience sample" of co-workers
recruited from down the hall might be inexpensive and easy, but it may not provide the
most appropriate level of validity for a particular project. For example, if these testers
are already familiar with the application under development, the test may result in
findings that are not representative of the actual user population. Hence, the validity of
a usability study based on a convenience sample may be questionable. We recommend
different sampling techniques depending on the characteristics of the user population of
interest and the risk levels associated with a specific project. Typically, we use methods
that ensure that test participants reflect the relevant characteristics of the system's
intended user population.
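One way to make participants reflect the user population is proportional stratified sampling. The sketch below is only an illustration, not a description of any particular firm's recruiting process. The function name `stratified_sample`, the `experience` attribute, and the 60/30/10 split are all assumptions invented for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(people, key, n, seed=0):
    """Pick roughly n people so each group's share matches the population.

    Hypothetical helper: groups people by people[i][key], then draws a
    proportional quota at random from each group. Because quotas are
    rounded, the total can differ slightly from n for uneven splits.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for person in people:
        groups[person[key]].append(person)

    sample = []
    for members in groups.values():
        quota = max(1, round(n * len(members) / len(people)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical pool of 100 candidate users: 60 novices, 30 intermediate
# users, and 10 experts.
pool = (
    [{"id": i, "experience": "novice"} for i in range(60)]
    + [{"id": i, "experience": "intermediate"} for i in range(60, 90)]
    + [{"id": i, "experience": "expert"} for i in range(90, 100)]
)

participants = stratified_sample(pool, "experience", n=10)
counts = Counter(p["experience"] for p in participants)
print(dict(counts))
```

For this pool the quotas work out to six novices, three intermediate users, and one expert, so the ten participants mirror the population's mix. A convenience sample of co-workers gives no such assurance.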
Generally speaking,
the larger the size of the sample, the more reliable the study will be. But remember that
reliability does not guarantee validity! The costs of increasing the sample size
(and the reliability of the study) should be balanced against the increase's potential
return on investment. Consider how much that extra reliability is worth to you. In
scientific studies, the confidence level required for making a claim is usually set
very high -- at 95% or 99%. Put into straightforward terms, this means that studies are
designed so that, if the study were repeated with many different samples, the results
would capture the true behavior of the general population at least 95% of the time.
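The trade-off between sample size and precision can be sketched with the standard margin-of-error formula for a proportion, z * sqrt(p(1 - p) / n). This is a rough illustration under stated assumptions: the worst-case proportion p = 0.5, a 95% confidence level, and the normal approximation, which is crude for very small samples. The function name `margin_of_error` is made up for the example.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.95):
    """Approximate margin of error for an observed proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n). The default
    p = 0.5 is the worst case; the approximation is rough when n is small.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)

for n in (5, 20, 100, 400):
    print(f"n = {n:3d}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Under these assumptions, 100 participants give a margin of about 10 percentage points and 400 give about 5. Halving the uncertainty costs four times as many participants, which is the kind of return to weigh against the budget.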
Managing Risk and Maximizing ROI
It's not difficult
to design or conduct usability tests that deliver statistically valid results. In fact, at
Classic System Solutions, we are familiar with a variety of research designs and
probability sampling methods. However, we have found that designing usability studies to
ensure a high level of statistical significance is rarely cost effective for our clients.
Instead, we have found that using a small number of test participants allows us to
significantly reduce risk for the majority of our clients, while maximizing the return on
their investment.
How does this work?
The risks associated
with the user interface of a web page or software application are tremendous. We know, for
example, that usability problems in a system will significantly decrease end user
efficiency and increase support costs. End users also tend to resist adopting systems with
poor usability, so the competitor with the easier user interface gets the edge in the
marketplace. And several studies have shown that incorporating usability checks into the
development process is far less expensive than redesigning a system to make it more usable
late in the development cycle, or even after an initial launch. Any of these factors may
significantly reduce the return on your company's investment in developing software or web
applications.
While it's important
to manage risk in the business world, it's also important to minimize costs. Although it's
not difficult to design usability tests that deliver statistically valid results, the cost
of conducting these tests can be prohibitive due to the large sample sizes required, and
the benefits marginal. Since we recommend that you conduct usability tests early in the
design process to identify areas that need improvement, it's usually much cheaper to just
fix the problems identified by a small group of testers than it would be to statistically
confirm the results of the tests with a larger population of users.
Furthermore,
research has demonstrated that the number of new usability problems found in a design levels
off sharply after the first six testers. The first five test participants typically
discover approximately 85% of the usability problems in a task, but it might take another
ten testers to find the remaining 15%. Consequently, we find that the smaller sample size
is a more cost-effective choice for most development projects, even if it does not
deliver statistical significance.
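The diminishing returns described above can be sketched with a simple discovery model: if each tester independently finds a given problem with probability p, then n testers find a share 1 - (1 - p)^n of the problems. The value p = 0.31 used below is an assumption, not a figure from this article; it is a commonly cited average from the usability literature, and real projects can vary widely. The function name `share_found` is made up for the example.

```python
def share_found(n_testers, p_detect=0.31):
    """Expected share of usability problems found by n_testers.

    Assumes every tester independently detects any given problem with
    probability p_detect (0.31 is an assumed, commonly cited average).
    """
    return 1 - (1 - p_detect) ** n_testers

previous = 0.0
for n in (1, 3, 5, 10, 15):
    found = share_found(n)
    print(f"{n:2d} testers: {found:6.1%} found "
          f"(+{found - previous:.1%} since previous row)")
    previous = found
```

Under this assumption, five testers uncover about 84% of the problems, in line with the figure quoted above, while going from five to fifteen testers adds only about 15 percentage points more.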
Plus, you can use
the money you saved on running that enormous study to run additional test cycles later in
the development process. This not only allows you to validate the design choices made to
fix the initial problems, but also helps you catch any additional usability problems
introduced by the new designs.