Using CRANIUM for Empirical Testing

How do you know your work is any good? And I mean, really know? (This is the second post in my Heuristics and Empirics series. You can find the first post here.)

Let’s start as simply as possible, with a list. Why a list? Well, because it's the basic building block of just about anything metadata-ish. You can put lists together to define a metadata model, embed them to make a hierarchy, relate them using taxonomy and ontology, or just create something browseable so people can choose what they want or where they want to go next. Lists abound on websites, intranets, tagging interfaces, menus, folders, and models. And they're basic, which means we all get what I mean when I say "list."

So, this list. How can we determine – empirically, quantitatively, and with sufficient statistical significance – if this list is actually any good?

Before we can measure goodness, we are first challenged to decide what “good” means to us. Are we satisfied if our list is free of typographical errors, or do we need better? We need to do some design thinking (not empirical!) and ask why: Why does the quality of this list matter to my business? Asking why is a tenet of effective problem solving and can lead to creative solutions, but for now we need something way simpler. We need a quantitative measurement of goodness (quality), which means we need some kind of goodness scale. So the better we understand our motivation for quality, the more we understand what qualities we want in our list, and the better our test protocol will be.

“Why?” almost always boils down to trying to increase value: money (making or saving), time (saving), or risk (reducing). Therefore, our list must be good if it leads to more money, more efficient use of time, or less risk. Experience tells us that lists capable of achieving these goals have many of the CRANIUM characteristics:

  • Complete / comprehensive
  • Relevant (appropriately contextual)
  • Accurate
  • Navigable (allows users to find things)
  • Intuitable (i.e., quickly understood)
  • Unambiguous
  • Meaningful

Luckily for us, CRANIUM features are testable characteristics. What’s more, if we’re smart, we can tie these characteristics to a dollar amount, time savings, or risk avoidance ratio, providing ROI and a decent justification for performing the test.

Here are my recommendations for CRANIUM-testing a list of values. None of these approaches requires the tester to speak with participants, or for the participants to answer subjective questions.

  • Completeness. Require participants to apply list values to existing content or tasks. For example, you might need to know if this list works to identify all products in a catalog, or all content in the CMS. We consider the list complete if all content and tasks can be matched to at least one term. Prerequisites for testing include a complete inventory of the content or tasks to be tested.
  • Relevance. Same as for Completeness. We consider the list relevant if every list value is matched (by a majority of test participants) to at least one content item or task.
  • Accuracy. Identify tasks and content items that are already assigned values from the list; these assignments are considered “the truth” for the purpose of this exercise. Then, require participants to again assign terms from the list to the tasks or content. For example, show participants a document and have them decide if it’s an invoice or a memo. If participants make choices that are consistent with prior assignments, the list is sufficiently accurate; however, if participant choices are inconsistent with prior assignments, further testing (or subjective discussion) is required to determine which set of assignments is considered better. Prerequisites for testing include a list of tasks or assets that are already associated with list terms. Note also that participants should come from different audiences and demographics than those who created the original assignments, to avoid selection bias.
  • Navigability. Require participants to find a term you provide, as with a simple matching game, and measure their speed. For example, ask users to find the word marmalade. (In a simple alphabetized list, this would happen quickly, whereas in a list of chronologically ordered items, this might prove difficult.)
  • Intuitability. As participants complete any of the other tests in this list, observe their performance at each step of the test. Poor performance (compared to the overall performance of that participant) is indicative of localized trouble. To obtain highly contextual information, it helps to record the entire experiment for further close study.
  • Unambiguity. Require participants to match list values to content or tasks that you provide. Terms in the list are unambiguous if all users match terms consistently. Prerequisites for testing include a sufficiently large list of relevant tasks (that spans the entire list), plus reasonable foreknowledge of how they are expected to map to your list values.
  • Meaning. Same as for Unambiguity. Terms in the list are meaningful if they aren’t overlooked, such that terms are selected by participants at frequencies consistent with the test questions.
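
To make the scoring concrete, here is a minimal Python sketch of how the Completeness, Relevance, and Unambiguity tests above might be tallied once participants have finished matching. Everything in it is an assumption for illustration: the data layout, the function names, the sample documents and terms, and the reading of "majority" (more than half of participants). It is one possible way to count, not a prescribed protocol.

```python
from collections import Counter

# Hypothetical test data: each participant maps every content item to the
# one list term they chose, or None if no term seemed to fit.
responses = {
    "p1": {"doc1": "Invoice", "doc2": "Memo", "doc3": None},
    "p2": {"doc1": "Invoice", "doc2": "Memo", "doc3": None},
    "p3": {"doc1": "Invoice", "doc2": "Report", "doc3": "Memo"},
}
terms = ["Invoice", "Memo", "Report", "Contract"]
items = ["doc1", "doc2", "doc3"]


def unmatched_items(items, responses):
    """Completeness: items that a majority of participants could NOT match."""
    n = len(responses)
    return [
        item for item in items
        if sum(r.get(item) is not None for r in responses.values()) * 2 <= n
    ]


def unused_terms(terms, responses):
    """Relevance: terms that a majority of participants never applied."""
    n = len(responses)
    users = Counter()
    for r in responses.values():
        # Count each participant once per term, however often they used it.
        for term in set(r.values()):
            if term is not None:
                users[term] += 1
    return [t for t in terms if users[t] * 2 <= n]


def agreement(items, responses):
    """Unambiguity: share of participants who made each item's most common
    choice (assumption: "no match" counts as a choice like any other)."""
    result = {}
    for item in items:
        picks = Counter(r.get(item) for r in responses.values())
        result[item] = picks.most_common(1)[0][1] / len(responses)
    return result


print("Unmatched items:", unmatched_items(items, responses))
print("Unused terms:", unused_terms(terms, responses))
print("Agreement:", agreement(items, responses))
```

An empty "unmatched" result suggests the list is complete for the tested content, an empty "unused" result suggests every term is relevant, and agreement values near 1.0 suggest the terms are unambiguous. Where to draw the pass/fail line is a design decision this sketch leaves open.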

These tests can be applied to hierarchical and polyhierarchical lists as well (e.g., taxonomies), although their complexity makes interpretation of results more challenging. This is because problems at the higher (broader) levels of the hierarchy will affect test results regarding the lower (narrower) levels. For example, participants looking for microphones might get confused by an ambiguous top-level choice between “Electronics” and “Computers” and so fail to find “Microphones” under Electronics. Recognizing that the problem lies at the top level (with Electronics and Computers) and not elsewhere is not intuitive.
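
One way to make that recognition less of a guessing game is to record the path each participant clicks through and note the first level at which it leaves the expected path. The Python sketch below assumes such paths are available as simple lists. The function names and the sample paths (including the "Accessories", "Peripherals", and "Cameras" categories) are hypothetical.

```python
from collections import Counter


def first_divergence(expected_path, chosen_path):
    """Return the depth (0 = top level) where the participant first left
    the expected path, or None if they followed it all the way."""
    for depth, expected in enumerate(expected_path):
        if depth >= len(chosen_path) or chosen_path[depth] != expected:
            return depth
    return None


def failures_by_level(expected_path, chosen_paths):
    """Count failed attempts by the level at which they first went wrong."""
    depths = (first_divergence(expected_path, p) for p in chosen_paths)
    return Counter(d for d in depths if d is not None)


# Hypothetical attempts at the "find microphones" task.
expected = ["Electronics", "Microphones"]
attempts = [
    ["Electronics", "Microphones"],  # success
    ["Computers", "Accessories"],    # wrong turn at the top level
    ["Computers", "Peripherals"],    # wrong turn at the top level
    ["Electronics", "Cameras"],      # wrong turn one level down
]

levels = failures_by_level(expected, attempts)
print(dict(levels))
```

If most failures cluster at depth 0, as they do in this made-up sample, the top-level labels are the likelier culprit, and results for the narrower levels should be read with that in mind.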

Lists and hierarchies, as you might imagine, appear everywhere in information management environments. These structures are used for product categories and product specifications at e-commerce websites, inside SharePoint document management for tagging, as options for search queries and refiners for search results, and, of course, for navigation everywhere.

Given how often they appear, a certain amount of empirical reassurance is always good for the CRANIUM.

Seth Maislin