
Taxonomies, Metadata, & Search

Written by Earley Information Science Team | Aug 25, 2007

The information age poses all kinds of challenges, the most fundamental of which is how to find things. This article explores how a taxonomy, through the application of metadata, can help users find exactly what they need.

Browsing and Searching

When people know what they are looking for (or think they do), they typically enter a search term and then browse through a result set. But if they aren’t sure what they want or have no particular goal in mind, they’ll usually look through navigational links and labels and use them as clues about where interesting information might lie. These methods aren’t mutually exclusive; some users will move back and forth between the two approaches when one isn’t meeting their needs.

The Google Effect

There is a great deal of confusion surrounding the value of metadata and taxonomic terms in organizing documents. In many organizations, there is a line of thought that “if we just get a really good search engine,” then the problem of people not being able to locate information in the context of their work will go away. People will be able to just enter a search term in a Google-like interface, and the precise information they are looking for will appear. This is a typical argument against the process of formally building metadata structures and standards and developing a well thought-out taxonomy.

The answer to this line of thinking is that although algorithms are getting better, it is not yet possible for machines to infer intent. They can count words, look for patterns, derive categories, cluster results, extract entities, compare word occurrences, and apply complex rules and statistical analyses. But they cannot tell what you want to do. They don’t know the context of your work task. They cannot determine what is important to you. One could argue that no one can determine a user’s intent, not even taxonomists and metadata architects. That may be true, but if we know something about who users are and understand how they do their jobs, then we can start to make some assumptions about the information that we think they want. We might also begin to understand both the specific language and terminology that searchers use and their mental models of their world and work tasks.

What is the significance of knowing all this? Well, the more we know about a user’s world, the more precise our assumptions will be about the types of artifacts that user will look for in day-to-day work tasks. If I am a salesperson and am doing some cold calling, I might first look for calling scripts, some articles on the market or customer needs, or perhaps a follow-up presentation or white paper I can send my prospects after I get them on the phone.

If I am a consultant trying to install the latest version of engineering design software, I will want to look for technical bulletins, bug fixes, common installation problems, previous engagements’ lessons learned, customer site histories, specific configuration documents, and so on.

These process steps or work tasks help describe artifacts and the context in which they are used. These descriptions and contexts become the raw material for the taxonomy. They can also be the basis for metadata fields that are applied to documents.

For example, perhaps my work process looks like this:
Scope project
Write proposal
Deliver project
Capture lessons learned
Close project


In the first step, I need to find the following:
Fee worksheets
Prior projects
Example solutions
Scoping worksheets


These artifacts should ideally be labeled, so they can be more easily retrieved.

I might label content in an application according to the process step, so that when I am scoping a project, I can retrieve any documents that are appropriate for the scoping phase.
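As a minimal sketch, labeling by process step might look like the following; the document titles and the "process_step" field name are hypothetical, not a prescribed schema.

```python
# Minimal sketch (hypothetical data): each document carries a
# "process_step" label so it can be retrieved by phase.
documents = [
    {"title": "Standard fee worksheet", "process_step": "Scope project"},
    {"title": "Prior project summary: widget rollout", "process_step": "Scope project"},
    {"title": "Proposal template", "process_step": "Write proposal"},
    {"title": "Lessons learned form", "process_step": "Capture lessons learned"},
]

def documents_for_step(docs, step):
    """Return every document labeled with the given process step."""
    return [d for d in docs if d["process_step"] == step]

# While scoping a project, pull back only scoping-phase material.
for doc in documents_for_step(documents, "Scope project"):
    print(doc["title"])
```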

Now imagine that I serve a number of markets: pharmaceutical, financial services, aerospace, automotive, and high-technology. It may also make sense to allow retrieval of documents in the first process step that are related to my client’s industry.

We could also say that there are documents for different audiences, perhaps technical versus non-technical readers. Documents may also be distinguished as internal or external, or as intended for partners or customers, and so on.

Each of these perspectives represents a different “facet” of the content. Metadata is applied by deriving a list of terms for each facet and using a combination of these terms to describe the exact context of the content. By applying metadata to content in this way, and then letting users select the appropriate terms that describe their tasks, we are in effect letting users describe their intent: they are telling us who they are, what they are attempting to accomplish, and what is important to them.
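To make the idea concrete, here is a small sketch of facet vocabularies and of describing a content item with a combination of facet terms. The facet names and term lists are illustrative only; a real implementation would draw them from the organization’s own taxonomy.

```python
# Illustrative facet vocabularies (controlled term lists).
facets = {
    "process_step": {"Scope project", "Write proposal", "Deliver project",
                     "Capture lessons learned", "Close project"},
    "industry": {"Pharmaceutical", "Financial services", "Aerospace",
                 "Automotive", "High technology"},
    "audience": {"Technical", "Non-technical"},
    "distribution": {"Internal", "External", "Partner", "Customer"},
}

def tag(item, **metadata):
    """Attach facet metadata to a content item, rejecting terms that are
    not in the controlled vocabulary for that facet."""
    for facet, term in metadata.items():
        if term not in facets.get(facet, set()):
            raise ValueError(f"'{term}' is not a valid term for facet '{facet}'")
    item["metadata"] = dict(metadata)
    return item

doc = tag(
    {"title": "Scoping worksheet for aerospace engagements"},
    process_step="Scope project",
    industry="Aerospace",
    audience="Non-technical",
)
print(doc["metadata"])
```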

This is called faceted navigation or faceted search and is really just search on metadata—the old “advanced search” that no one ever used. But now we can “fool” people into thinking that they are navigating instead of searching. This is done through clever user interfaces, like those from companies such as Endeca and Siderean, but it can also be accomplished through “stored searches”—queries that are preconfigured for a particular task. To users, this looks just like navigation: they simply click on a link, the search is executed behind the scenes, and a set of results is presented.
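A stored search can be as simple as a named set of facet filters that is executed when a user clicks a link. The following sketch assumes a hypothetical repository of tagged documents; the query names and filter values are invented for illustration.

```python
# Hypothetical repository of documents already tagged with facet metadata.
repository = [
    {"title": "Pharma scoping worksheet",
     "metadata": {"process_step": "Scope project", "industry": "Pharmaceutical"}},
    {"title": "Aerospace proposal template",
     "metadata": {"process_step": "Write proposal", "industry": "Aerospace"}},
    {"title": "Call center outsourcing best practices (India)",
     "metadata": {"process_step": "Deliver project", "industry": "Financial services",
                  "locale": "India", "doc_type": "Best practice"}},
]

# Preconfigured queries that sit behind navigation links.
stored_searches = {
    "Scoping documents": {"process_step": "Scope project"},
    "Outsourcing best practices (India)": {"locale": "India", "doc_type": "Best practice"},
}

def run_stored_search(name):
    """Execute the facet filters saved under a navigation label."""
    filters = stored_searches[name]
    return [d for d in repository
            if all(d["metadata"].get(f) == v for f, v in filters.items())]

# Clicking the "Scoping documents" link runs the query behind the scenes.
print([d["title"] for d in run_stored_search("Scoping documents")])
```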

This type of search can help precisely distinguish content with fine shades of meaning, especially when there are large numbers of complex documents in a repository. An “outsourcing strategy” can vary widely across industry and process and contain many types of documents and deliverables. Broad searches using ambiguous terms will not zero in on “best practices for telecommunications call center outsourcing strategy for the insurance industry, using service firms located in India” by searching on “deliverables.” However, searching on the metadata facets of “industry,” “process,” “locale,” and “best practices” will yield more appropriate results.

The Tagging Process

Does that mean we have to add meta-tags to everything? When I describe faceted search, many people say “our users won’t tag content” or “that is too expensive.” Do we have to tag all of our content?

The answer is no, not all content, because not all content is of equal value.

You should concentrate on what is important for users in their context, what needs to be easily and quickly accessible. Information required for work tasks (e.g., worksheets) or that is reviewed for timeliness and appropriateness (e.g., best practices) is considered high-value content, since its findability and use directly impact business objectives. Unfiltered information that is less directly involved in specific work processes is generally less valuable to daily operations and should be less of a concern.

This is not to say that unfiltered information has no value. On the contrary, emails and discussions can be rich sources of tacit knowledge, often valuable in complex or novel situations. However, this type of content is generally unstructured and therefore more difficult to organize. And since the information these documents contain is tied less directly to specific work processes, the cost associated with applying a formal tagging structure to this type of information is harder to justify. Content that is already structured or tied to a structured process tends to derive greater value from controlled organization.

So you don’t have to formally tag all your content. But if you do, keep in mind that tagging large volumes of content can be a time-consuming and costly process. You will likely want to tag in phases, prioritizing categories of content based on their relative value to users. It is not unusual to run out of budget or have progress postponed during the course of a tagging project, so you want to be sure you’ve really focused your efforts where they matter most.

Decide what content has the greatest value and prioritize there. You can also set up a “prioritization matrix” that assigns values to various attributes of your project and attempts to place a score on one focus area versus another.
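One way to picture such a matrix: score each candidate content area against a few weighted attributes and rank by the result. The attributes, weights, and scores below are invented for illustration, not a recommended scheme.

```python
# Hypothetical scoring attributes and weights for prioritizing tagging effort.
weights = {"business_impact": 0.5, "usage_frequency": 0.3, "tagging_cost": -0.2}

# Candidate content areas scored 1-5 on each attribute (illustrative numbers).
focus_areas = {
    "Proposal templates":   {"business_impact": 5, "usage_frequency": 4, "tagging_cost": 2},
    "Project deliverables": {"business_impact": 4, "usage_frequency": 3, "tagging_cost": 4},
    "Email archives":       {"business_impact": 2, "usage_frequency": 2, "tagging_cost": 5},
}

def score(attrs):
    """Weighted score: higher value and usage raise priority, cost lowers it."""
    return sum(weights[a] * v for a, v in attrs.items())

# Rank focus areas from highest to lowest priority.
for area, attrs in sorted(focus_areas.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{area}: {score(attrs):.1f}")
```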

Value of Social Tagging

Although you may not be ready to invest in formally tagging all your content, there is a less structured and less expensive approach to adding metadata to content. A social tagging approach gives users the ability to add keyword metadata to content items and is not controlled by any taxonomy or term list. User-generated tagging has some benefits: it is useful for identifying emerging knowledge and terminology, it can take into account multiple perspectives, and it certainly costs less than controlled tagging! However, with social tagging you lose many of the benefits of a controlled approach. Terms may be ambiguous or overly broad, there may be many variants, and terms alone lack the context provided by a structure.

Of course, one approach does not preclude the other. Not all tools lend themselves well to leveraging metadata standards or a taxonomy. Collaborative tools tend to be less structured (along with their content) and focus on knowledge creation rather than access; thus, these tools respond better to an unstructured tagging approach. Tools that support controlled processes tend to focus more on knowledge access and require a more structured and rigorous approach to organization.

You can even use social tagging as a supplement to a controlled vocabulary in some contexts, using it both to raise awareness about how users think about finding information and as a source for tracking vocabulary changes. There are many hybrid approaches possible; it is up to you to decide what suits your context and budget best.
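As one sketch of a hybrid approach, simply counting free-form user tags can surface candidate terms for the controlled vocabulary and flag variants that need reconciling. The tag data here is invented.

```python
from collections import Counter

# Invented free-form tags applied by users to various content items.
social_tags = [
    "offshoring", "outsourcing", "out-sourcing", "call center",
    "call centre", "RFP", "rfp", "lessons learned", "outsourcing",
]

# Normalize lightly, then count: frequent tags are candidates for the
# controlled vocabulary; near-duplicates flag variants to reconcile.
counts = Counter(tag.strip().lower() for tag in social_tags)
for term, n in counts.most_common():
    print(f"{term}: {n}")
```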

The point is that it is not an “either/or” situation. Tagging with controlled metadata is important when a process requires more structured information. One would not use social tagging for validated processes for FDA drug submissions. Those require very precise editing, vetting, and control processes. Less structured information can get by with a less formal process. The goal is to determine both the relative value of content and how users think of that content in the context of their work and then to determine how formal the process of applying metadata needs to be.