Tangled Up in Taxonomies

A recent article by the industry analyst CMS Watch proclaimed the following under the title “Taxonomies are dead.  Long live metadata”:

“With social computing coming to the fore, it's never been more obvious that everyone does not, and will never, categorize things in the same way...... I will assert that the days of the traditional, definitive, and single-hierarchy taxonomy are long behind us.”

New Developments in Taxonomies

In my experience, we have never considered a single, galactic, über taxonomy to classify and organize all things – even for a narrow domain of knowledge.  It is true that people have different mental models and classify things based on their perspectives and experience as well as their understanding of the intent of their ultimate users.

Multiple perspectives have to be combined and synthesized and various stakeholders need to be brought into alignment.  Successful taxonomy projects get people to agree on very nuanced and granular details about how the organization communicates.

Taxonomies are one of those things that seem so simple on the outside – and a well-developed taxonomy can appear very obvious and intuitive.  But elegant solutions typically appear obvious only after the fact.  A simple, elegant set of organizing principles belies the incredible effort that goes into arriving at that outcome.

But you always face the question: How do you leverage a taxonomy after all the hard work of understanding users, analyzing content and processes and getting agreement amongst diverse groups of stakeholders?

Searching and Browsing

When a user seeks information, they are doing one of two things: browsing or searching.  In the old days of taxonomy, most content management and document management systems treated these as distinct functions with differing underlying mechanisms.  Browse was done through a physical layout of links that corresponded to a physical directory on a server.  Search was accomplished through an engine that looked at all the documents sitting on that server and built reverse indexes that compiled lists of terms occurring in documents and pointers to all the documents where those terms lived.
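The search-side index described above can be sketched in a few lines.  This is a minimal illustration, not how any particular engine is built: the sample documents, the whitespace tokenizer and the function names are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, for illustration only.
documents = {
    "doc1": "asbestos removal guidelines",
    "doc2": "product promotion guidelines",
    "doc3": "asbestos safety promotion",
}
index = build_inverted_index(documents)
matches = search(index, "asbestos promotion")
```

Each term points back to the documents where it lives, so a query is answered by intersecting those lists rather than by rereading every document.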

Increasingly sophisticated mechanisms have evolved to improve the relevance and ranking of search results to help divine the user’s intent.  Meanwhile, the old advanced search box, which allowed savvy users to more precisely define what they were looking for, languished.

There were a number of reasons for this.  First, people are lazy.  Why click another link when I can just type my query and go?  Second, advanced search offered choices that meant little to most users: which database to search (why not search all of them, especially since I would not know why more than one existed anyway?), whether to look for a PDF or a DOC file, the date the document was created, even the size of the file.

These were the first attempts at searching on specific metadata contained in documents or web pages.  Other bits of metadata like content type, or topic may have added more flexibility, but many of the facets first used in advanced search interfaces were pretty much unusable or required sophisticated knowledge of the structure and content of the application.

It was also possible to formulate nonsensical queries, like looking for an industrial product in a consumer market or searching for “asbestos” as a topic and “promotion” as a content type.   In other words, if metadata facets are disconnected, combinations that make no sense can be selected.

But even when combinations made sense, there was no way of knowing whether content existed with the unique parameters that users were selecting.  So, it was not uncommon to have zero results in search after search.

Browse Gets Better

Meanwhile, the ability to browse has improved with the use of tools that improve on advanced search concepts but also trick the user into thinking that they are simply browsing.  Now, clicking on a link does not just bring you to a physical location but can execute a search behind the scenes.

There are a number of clever UI tricks that have made this approach – faceted search – as useful and appealing as it is.  One is the intelligence to drop choices that are no longer relevant.  If you are looking for regions in Latin America, there is no need for the US to show up.  If there are no documents that contain certain terms from the taxonomy, those terms are removed from the list of choices.
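The trick of dropping choices that no longer apply can be sketched as follows.  This is a simplified illustration under stated assumptions: the items, field names and function names are hypothetical, and a real faceted search engine would compute these counts from its index rather than by scanning a list.

```python
from collections import Counter

# Hypothetical tagged content, for illustration only.
ITEMS = [
    {"id": 1, "region": "Latin America", "country": "Brazil", "content_type": "Report"},
    {"id": 2, "region": "Latin America", "country": "Chile", "content_type": "Promotion"},
    {"id": 3, "region": "North America", "country": "US", "content_type": "Report"},
]


def apply_selections(items, selections):
    """Keep only the items that match every facet value chosen so far."""
    return [
        item for item in items
        if all(item.get(field) == value for field, value in selections.items())
    ]


def remaining_facets(items, selections, facet_fields):
    """Count facet values among the matching items only.

    Values that occur in no matching item never appear in the counts,
    so a choice that would lead to zero results is not offered.
    """
    matches = apply_selections(items, selections)
    facets = {}
    for field in facet_fields:
        if field in selections:
            continue  # this facet has already been chosen
        counts = Counter(item[field] for item in matches if field in item)
        facets[field] = dict(counts)
    return facets


choices = remaining_facets(
    ITEMS, {"region": "Latin America"}, ["region", "country", "content_type"]
)
```

After the user picks "Latin America", the country facet offers only Brazil and Chile; the US is never shown because no matching item carries it.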

The Role of Metadata

Metadata drives all aspects of content management – from basic housekeeping, to content assembly and reuse, dynamic presentation of content, personalized content, etc.  Metadata also drives faceted search.  A faceted search engine (Endeca, for example) dynamically presents navigation based on choices the user makes according to various parameters.  On an e-commerce site, these are typically price range, size, color, style, brand, model, etc.  Attributes will vary depending on whether you are looking for a digital camera (number of megapixels) or a laptop (screen size, amount of RAM), etc.

But faceted search can be applied to internal processes and knowledge objects.  Content can be organized by process, by product, market, content type, industry, solution, etc.

The Value of Taxonomy

Taxonomies drive all of the organizing principles of content and commerce applications.  New tools have created a greater need for formal taxonomy development, not reduced the need for taxonomies.  Faceted search has become the new standard in content access.  However, there are many examples of faceted search being poorly implemented without the benefit of thorough content analysis, usability testing, user scenarios, and sound library science principles.

In some cases, technology vendors claim that their tools are smart enough to “derive” the taxonomy.  This is not entirely accurate.  Many tools can perform entity extraction and make a guess as to the “is-ness” and “about-ness” of a piece of content.  These entity extraction processes may do a decent job, depending on the quality of the content, and will provide decent results, especially when compared with the usual poor quality of search.  In most cases, e-commerce sites already have metadata defined and applied to content.  But this does not mean that the user interface and the way that facets and attributes are surfaced cannot benefit from tuning.

One has to determine the correct context in which to expose terms.  Terms are usually in a hierarchy for usability purposes, but users can get confused and lose their place if they do not have enough contextual clues.  There are times when you need to break good classification rules, defer to a user’s mental model and apply good navigational constructs that “break” the hierarchy.  (User testing can tell you when the rules need to be broken.)

Searching versus Browsing, Clustering, Query Guidance and Disambiguation

Technology has gotten better at leveraging taxonomies in other ways that help users find information.  In addition to faceted search, integration tools allow a shift between browse and search perspectives.  People are not either browsers or searchers; they are both.  We browse when we don’t know what we want and want to discover what is there.  We search when we think we know what we want and simply want to retrieve information.  But users typically shift back and forth between retrieval and discovery mode.

There are tools that will selectively present a portion of a taxonomy alongside a set of search results in order to provide greater context and help the user zero in on what they want.  (This is a form of results clustering based on the taxonomy.)

Tools can also assist users in formulating more precise queries – by suggesting narrower or related terms.  These mechanisms pull from predefined taxonomies and thesaurus structures.  And of course, Best Bets are directly derived from search log analysis and the taxonomy in order to help disambiguate user queries.
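Query guidance of this kind can be sketched with a small lookup.  The thesaurus entries and the function name below are hypothetical, chosen only to illustrate the idea; a production thesaurus would be far larger and would also handle synonyms and spelling variants.

```python
# Hypothetical thesaurus: each term lists its narrower and related terms.
THESAURUS = {
    "engineering": {
        "narrower": ["electrical engineering", "mechanical engineering"],
        "related": ["manufacturing"],
    },
    "sales": {
        "narrower": ["inside sales", "sales management"],
        "related": ["marketing"],
    },
}


def suggest_terms(query, thesaurus):
    """Return narrower and related terms the user could pick to refine a query."""
    entry = thesaurus.get(query.strip().lower())
    if entry is None:
        # Unknown term: nothing to suggest.
        return {"narrower": [], "related": []}
    return {
        "narrower": list(entry.get("narrower", [])),
        "related": list(entry.get("related", [])),
    }


suggestions = suggest_terms("Engineering", THESAURUS)
```

A user who types a broad term is offered the narrower terms beneath it, plus associative "see also" terms, instead of being left to guess at a more precise query.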

The Challenge of Tagging with a Hierarchy

Another aspect of taxonomy integration is in the tagging interface.  Large taxonomies can become unwieldy.  You cannot present 700 terms for a user to select from.  Some tools present taxonomy values in a hierarchy and then tag the content with all of the terms that form the path to the lowest level node.  In this way you can maintain the context of the term.  The problem is that metadata does not respect a hierarchy.  So when you navigate through a large term set, the terms need to be broken up into categories.  But is this just for the tagger?  What will the system do with tags in categories?  Some content management systems will use the hierarchy to present content in an analogous static navigational structure.  Or a search application can use the hierarchy to allow for dynamic faceted navigation.  Other content management systems will ignore the hierarchy and only capture the lowest level node.  Then the system will use that term as metadata in a presentation layer or for content assembly.
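The two tagging behaviors described above, keeping the whole path versus keeping only the lowest level node, can be compared in a short sketch.  The taxonomy, the top-level "Roles" node and the function names are hypothetical and exist only for illustration.

```python
# Hypothetical taxonomy stored as child -> parent; the root has no parent.
PARENT = {
    "Roles": None,
    "Engineering": "Roles",
    "Sales": "Roles",
    "Electrical Engineer": "Engineering",
    "Inside Sales": "Sales",
}


def path_to_root(term, parent):
    """Return the terms from the root down to the given term."""
    path = []
    current = term
    while current is not None:
        path.append(current)
        current = parent[current]  # raises KeyError for an unknown term
    path.reverse()
    return path


def tags_for(term, parent, keep_path=True):
    """Tag with the full path (context preserved) or with the leaf only."""
    return path_to_root(term, parent) if keep_path else [term]


full_path_tags = tags_for("Electrical Engineer", PARENT)
leaf_only_tags = tags_for("Electrical Engineer", PARENT, keep_path=False)
```

With the full path, content tagged "Electrical Engineer" can also be found under "Engineering"; with the leaf only, that context has to be recovered from the taxonomy at presentation time.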

Another way to leverage a hierarchy is to develop “cascading lists”.  This breaks a two level hierarchy into two separate metadata fields.  But the lists need to be linked.  Tools like SharePoint do not do this automatically.  In this case, each list is presented in its entirety, which creates the same problem at the tagging UI as was described for the old advanced search UI – the tagger is able to choose inconsistent term sets, for example “Engineering” from the first list and “Sales” from the second.  (What is interesting about that example is that one could think of “Sales” for “Engineering” services.  That is not a true taxonomy, however, but an index or an associative relationship from a thesaurus.  In our case, it did not make sense, but one could come up with scenarios that would allow these terms to be coordinated.)  Our list actually contained various types of Engineering specialties (Electrical engineer, Mechanical engineer, etc.) and various Sales roles (Inside Sales, Sales manager, etc.) for this particular organization.  We want the parent term chosen in the first list to drive which values are available in the second list.

Most systems require custom programming to create these relationships.   So again, it is very difficult to leverage a true hierarchy.  Even if we can do this at the tagging interface, information architects will also need to determine (along with the development team) exactly what the system will do with cascading values selected in two metadata fields.
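The linkage between the two lists can be sketched as a simple mapping from each parent term to its permitted child terms.  This is only an illustration of the relationship, using the example terms from above; the function names are hypothetical, and it does not represent how SharePoint or any other product implements linked fields.

```python
# Parent term in the first list -> values allowed in the second list.
CASCADE = {
    "Engineering": ["Electrical engineer", "Mechanical engineer"],
    "Sales": ["Inside Sales", "Sales manager"],
}


def second_list_options(first_choice, cascade):
    """Return only the second-list values that belong to the chosen parent."""
    return list(cascade.get(first_choice, []))


def is_consistent(first_choice, second_choice, cascade):
    """Check that a pair of selections respects the hierarchy."""
    return second_choice in cascade.get(first_choice, [])


options = second_list_options("Engineering", CASCADE)
valid = is_consistent("Engineering", "Electrical engineer", CASCADE)
invalid = is_consistent("Engineering", "Inside Sales", CASCADE)
```

Because the mapping is data rather than code, the term lists can be changed in one place without rewriting the tagging interface.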

Every content management system is different in how it creates the user experience.  Many times, developers will make design decisions without understanding taxonomy principles, library science concepts, classification, categorization, thesaurus relationships, etc.  In our experience, the end result is that things that can be modeled in a semantic framework (by understanding and mapping the relationships of terms, concepts, metadata fields, navigational constructs and search mechanisms) are instead handled in a brute force coding approach.

Communicating the Value

Anything can be accomplished with anything, given enough time, money, manpower, etc.  But the key is elegance and adaptability.  The brute force approach (also achieved through “acts of heroics”) is not scalable, not adaptable and can be extremely expensive.

So without leveraging organizing principles that can be applied consistently across different systems and content, functionality and the user experience are built as a series of “one-offs” – brittle integration, inflexible user interfaces and fragile architectures that are costly to modify and maintain.

Whenever an organization is developing search systems, content management applications, document management tools and even transactional systems, it is essential to develop a big-picture understanding before development begins.  Having someone with a library science and information architecture background on the team can save hundreds of thousands and even millions of dollars.  But more importantly, good classification principles (which apply to reference data and other elements of data architecture as well as user-facing components) will contribute to an adaptable enterprise.  In our world nothing stays the same: competition, markets, products, technologies and customer needs all change.  The key is to abstract organizing principles in such a way that they are consistent across the organization and its infrastructure and can be modified without going into each and every application and recoding functionality.  Are organizations doing this today?  Very few are.  Things like service-oriented architecture are a step in that direction.  But a more immediate step is to get your terminology consistent and agreed upon.  Build a core competency around the ability to derive, apply and adapt the organizing principles of taxonomies and controlled vocabularies.

In this economic climate, organizations are cutting anything that is not essential.  Consultant budgets are the first to go.  But the fact of the matter is, this is precisely the time to make do with fewer resources.  Businesses are cutting people, programs, capital expenditures, etc.  But the people who are left need to make better use of the most important asset that any organization possesses – the intellectual capital and know-how that makes the business run.  Much of that expertise is walking out the door with headcount reductions, divestitures and downsizing.  What remains for the leaner staff is knowledge codified in systems, documents and tools.  Making those systems more usable and effective in leveraging organizational knowledge is the key to surviving and thriving in difficult economic times.

Customers will do business with companies that can solve their problems.  Companies that solve problems do so with the expertise that resides in their people and that is captured and organized in their systems.  Organizing those assets has got to be at the top of the list for any organization wanting to survive and thrive in this downturn.  Many times this requires minuscule investments relative to costs that are already sunk.  And the organizations that spend this time getting their houses in order will be the ones best prepared to grow when the economy does recover.

In conclusion, there is no substitute for human judgment when it comes to improving the value of information.  Search tools and content management systems cannot make information easier to access without the intelligence that classification and taxonomy provide.  There are endless ways of organizing knowledge assets for various individuals performing their many tasks.  Doing so takes the application of specific methodologies for understanding the user, making sense of the content and surfacing content in the context of the problem the user is trying to solve.  Taxonomies are far from dead.  They are evolving, growing, adapting, living entities that make sense of vast seas of information and allow people to solve problems on their jobs and in their lives.  This is all the more important in the current economic climate.

© 2009 Earley & Associates, Inc