“How easy is it to use?” is something we would like to hear from leadership, but ultimately it is a question we keep asking ourselves (as UX designers, analysts, product people…). Measuring usability is a heavily researched subject that we will only briefly touch on, as a way of starting the discussion. What we will actually discuss is how to present usability in a more humanistic way.

Conventionally, when measuring usability we take a number of individual scores given by people who went through the system in question (participants), then we average them. This average is the system’s usability score. A usability score usually carries a meaning: Does the system have good usability? Mediocre usability? Bad usability? The score comes with an approximation of its accuracy, sometimes expressed as the minimum number of participants required to reach a desired level of confidence, other times through statistical measures.

Averages are a great means of summarising a large set of numbers, but although they give a good overview of the system, they take us away from the individual. A system with good usability means that the average user will find it has good usability, but as one finds out sooner or later, *there is no average user* — except perhaps the Everyman. As we increasingly make decisions based on metrics, rethinking those metrics to emphasise the human instead of the system will naturally help us make more humane decisions. So instead of asking *“Does the system have good usability?”* we can investigate *“How likely is it that the next visitor will find our system easy to use?”* or, most importantly, *“How likely is it that the next visitor will **not** find our system easy to use?”*

Let’s first take a look at the simplest scenario: we have performed zero usability measurements of the procedure in question, so we have **maximum uncertainty** about its usability. If we have no previous knowledge, our best option is to treat the event of *“someone going through the procedure and finding it usable”* as completely random. That means it is 50% likely that the next visitor will find our app easy to use.

As you can imagine, this is not the most helpful result, since the humans going through the procedure most likely do not produce random data. Nonetheless, we are getting an answer to our question, despite knowing nothing about what users “think” of the system in question. That’s why it is important to keep in mind the number of users the result was calculated upon (i.e. the certainty). This is a reality we have to deal with in statistics: being able to calculate a number doesn’t mean the number is correct.

Although there are numerous tools for measuring usability, time and cost constraints, the nature of the measuring tool itself, and the unavoidable nature of humans will create uncertainty about the exact results; this is fine and we will leave it as it is for now. However, keep in mind that the following way of presenting usability results is subject to the same uncertainty-based pitfalls as conventional usability presentation.

Let’s say that the procedure we are trying to measure lives in a website — for example, the sign-up flow. The people who will go through that procedure are not randomly selected from a pool of people with evenly distributed characteristics, past experiences, and knowledge. Furthermore, a big percentage of those people work in similar contexts (viewing the webpage on a mobile screen, typing on a keyboard, etc.). The perceived usability of the procedure will differ for each individual, but the perceptions will be related to one another. Except for some outliers, most usability scores will revolve around some specific numbers. If we asked ten participants whether the flow was easy to use on a scale from 1 (hard) to 5 (easy), we would be more likely to get the responses “1, 4, 5, 5, 5, 4, 1, 4, 5, 4” than “1, 5, 1, 5, 1, 5, 1, 5, 1, 5”.
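The clustering intuition above can be checked with a quick spread comparison. The two rating lists are taken from the text; using standard deviation as the measure of spread is my own addition, a minimal sketch rather than anything the article prescribes:

```python
from statistics import mean, pstdev

clustered   = [1, 4, 5, 5, 5, 4, 1, 4, 5, 4]  # realistic: most scores huddle around 4-5
alternating = [1, 5, 1, 5, 1, 5, 1, 5, 1, 5]  # unrealistic: nothing but the two extremes

# Related perceptions show up as a smaller spread around the typical score.
print(mean(clustered), pstdev(clustered))
print(mean(alternating), pstdev(alternating))
```

The clustered responses have a visibly smaller standard deviation than the alternating ones, which is the statistical footprint of participants whose perceptions are related rather than random.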

Another way to view it: if 200 people rated the usability as “good” and 10 as “bad”, we would intuitively expect the next person who comes along to find the procedure’s usability “good” as well.

We now understand that with some initial information we can better estimate the ease of use of the process. Depending on the scope and budget of each study, that can mean anything from a good-enough estimate to a highly confident one. Regardless, we need to recruit a tool for measuring usability. A good tool is reliable and valid; furthermore, it gives some sort of qualitative verdict on the usability of the system in question. Usability questionnaires (like the PSSUQ or the SUS) are a great way to gather the initial data, but a simpler measure like the Net Promoter Score, or any other measurement whose result can be segmented into “good” vs “bad” usability, will do.

Going back to *“it is more likely that the next user will find the procedure of good usability”* if 200 people rated the usability as “good” and 10 as “bad”: we can very well go back in time — to the 18th century, to be exact. Pierre-Simon Laplace, at the time, was trying to answer: “If we repeat an experiment that we know can result in a success or failure, n times independently, and get s successes, and n − s failures, then what is the probability that the next repetition will succeed?” [1] He ended up creating a mathematical formula, which he used to calculate “the likelihood that the sun will rise tomorrow, based on the fact that it has risen each day the previous 5000 years” (around 99.9999453%).
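The sunrise figure can be reproduced in a couple of lines. Laplace’s formula, introduced in detail below, is (s + 1) / (n + 2), and every one of the observed days counts as a success; the exact day count (365 days per year) is an assumption on my part, so the last decimals may differ slightly from the article’s figure:

```python
# Every observed sunrise is a success, so s = n.
# Day count is an assumption: 5000 years at 365 days each.
n = 5000 * 365
p_sunrise = (n + 1) / (n + 2)

print(f"{p_sunrise:.7%}")  # just shy of 100%, but never exactly 100%
```

Note that the probability approaches, but never reaches, 100% no matter how many sunrises are observed — the formula always leaves room for the unseen outcome.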

A person who goes through the procedure we are testing can either find it “good” or “not good” regarding ease of use. How we differentiate “good” from “not good” depends on the tool we used to measure individual perceptions. For example, a System Usability Scale score around 80 indicates a system with good usability; in the NPS, individuals who score 9 or 10 are considered promoters.

First we want to find how frequently individuals rate our system as having “good” usability. Let’s call that number s (for **s**uccess). E.g. if we used the SUS questionnaire and 5 out of 8 individuals rated it around 80, we have 5 “good” observations: s = 5.

The total number of individuals who participated in the study we will call n (for **n**umber); n = 8 in our example.

Then we can calculate the likelihood of the next person finding that the procedure has “good” usability as (s + 1) / (n + 2). That is 6 / 10 = 60% in our example. More importantly, we can calculate the reverse of that: how many people **will not find the process usable**, will experience some hiccup, or are more likely not to complete the process (lost money, which the business surely notices). That is, of course, 100% − 60% = 40% of the individuals.
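The calculation above is small enough to sketch as a function. The function name is mine; the arithmetic is exactly the article’s:

```python
def p_next_good(successes: int, participants: int) -> float:
    """Laplace's Rule of Succession: likelihood that the next
    participant finds the process to have "good" usability."""
    return (successes + 1) / (participants + 2)

# The article's example: 5 out of 8 participants rated the flow "good".
p_good = p_next_good(5, 8)   # (5 + 1) / (8 + 2) = 0.6
p_not_good = 1 - p_good      # the more important, reverse number

print(f"next user finds it usable: {p_good:.0%}")       # 60%
print(f"next user does not:        {p_not_good:.0%}")   # 40%
```

Note that with zero prior observations the same function returns (0 + 1) / (0 + 2) = 50%, which recovers the maximum-uncertainty scenario from earlier.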

A question that might pop into someone’s head (it certainly popped into mine) is: *Why plus 1 and plus 2? Why not 5 / 8 = 62.5%?* And the answer is quite logical (I hope). Five out of eight (5 / 8) is the current state of knowledge: we already know how many people used the process and how many found its usability “good”. What we want to find concerns the future: how likely it is for the **next** user to find the process’s usability “good”. A practical way to think of this is the following:

- Currently, *n* participants gave their feedback.
- Out of those *n*, *s* found the process to have “good” usability. This is our knowledge: s out of n participants find the process usable.
- If a *new* participant uses the application, they can either find the process “good” or “not good”.
- We don’t know which it will be, but we want to know how much our knowledge would change in either case [2].
- So we assume that, of the next 2 participants who give their feedback, one will say the usability is “good” and the other “not good”.
- In that future, *s + 1* participants will have said the process has “good” usability, out of *n + 2* participants in total.
- So, finally, the likelihood of the next participant giving good feedback is (s + 1) / (n + 2).

As you may have guessed, this formula wasn’t derived just for usability studies. It can be applied to any two-outcome system where some prior observations exist but nothing else is known about the likelihood of either event occurring (if generalisation is your thing and you like the video format: you are in for a treat). To quote Laplace’s own words when presenting the example of the probability of the sun rising the next day: *“But this number is far greater for him who, seeing in the totality of phenomena the principle regulating the days and seasons, realizes that nothing at the present moment can arrest the course of it.”*

Indeed, if there were a deterministic way to combine all of a system’s attributes into a single number describing its usability [3], then this metric would be redundant and not that appealing as an option.

The process used to calculate this metric falls under inductive reasoning, and if there is one thing to be careful about in inductive reasoning, it is *anecdotal generalisation*. As we discussed in the “Approaching uncertainty” section, this method will give a result regardless of how *certain* that result is. In other words, one should be cautious about claiming there is a 66.6% likelihood that the next person will find the process usable if the result is based on a single participant (s = 1, n = 1). A way to mitigate this risk is to include a measurement of uncertainty alongside the result, so that at any point the person viewing the report understands the validity of the result.
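The article doesn’t prescribe a specific uncertainty measure. One option — an assumption on my part, not the article’s method — is to report a credible interval from the Beta(s + 1, n − s + 1) posterior that underlies Laplace’s rule (its mean is exactly (s + 1) / (n + 2)), sketched here with a simple simulation:

```python
import random

def credible_interval(successes: int, participants: int,
                      level: float = 0.95, draws: int = 100_000,
                      seed: int = 0) -> tuple:
    """Approximate credible interval for the 'next user finds it usable'
    probability, by sampling the Beta(s + 1, n - s + 1) posterior whose
    mean is Laplace's (s + 1) / (n + 2)."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(successes + 1, participants - successes + 1)
        for _ in range(draws)
    )
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return lo, hi

# One participant: the 66.6% point estimate hides huge uncertainty.
print(credible_interval(1, 1))      # very wide interval
# 150 "good" out of 200: the interval tightens around the estimate.
print(credible_interval(150, 200))  # much narrower interval
```

Reporting the interval next to the point estimate makes it obvious at a glance whether a number like 66.6% rests on one participant or on two hundred.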

As Goodhart famously stated, “When a measure becomes a target, it ceases to be a good measure”. Indeed, this metric didn’t come out of nowhere to be followed blindly. It simply presents the same information in a different way: instead of responding to the question “Does the system have good usability?”, we are responding to “Will the next user find the system usable?”. The metric is independent of the tool used to measure usability and utilises Laplace’s Rule of Succession to calculate a number that we consider more humanistic. Putting the emphasis on **what percentage of users will not find the system usable** could drive research and development towards minimising the usability issues of the people affected by them, rather than finding comfort in a system with “good usability”.
