Earley & Associates

Search and Taxonomy - Leveraging Metadata to Return Content in Context

Introduction: Search as a Utility

According to a recent nationwide study, roughly 33 million American internet users use a search engine to locate information at least once a day. The only task completed more often was checking email [1]. It is clear that the use of search engines has become a part of daily life and a necessary utility for finding information. However, the question of how best to apply and leverage search in varied information environments, such as databases, portals, and websites, has yet to be adequately answered.

Search engines all attempt to do two basic tasks: (1) interpret a user’s query and (2) bring back the most relevant and accurate results that match it. Different search engines employ different methods for accomplishing these tasks. Most people are familiar with internet search engines, such as Google, Yahoo, All the Web, etc. However, many people are less familiar with search applications designed for internal enterprise use, such as Fast, Autonomy, Endeca, etc. People often assume that enterprise search applications and internet search work in the same way, but differences in the information environment create unique challenges for each type of search. Regardless of the environment, though, some elements of all types of search remain problematic. Users will often search with ambiguous keywords and bring back results that are either far too numerous or not in the proper context of the user’s information need. Often users will not find what they are looking for at all. This article will focus on the problems of search in the enterprise and the ways that search can be improved through integration with a taxonomy and structured metadata.

Search as an Application

Current information environments in many organizations are fractured, segregated and unable to meet the needs of end users. Information is often stored in many different places (databases, intranets, shared drives, etc.), each with different methods and structures for classifying and organizing that information. In a nutshell, many employees simply can’t find the information they need to do their daily work. On the surface, an easy solution to this problem might be to attach a search engine to the company portal or database and consider the problem solved. Experience shows us that the answer is not always that simple. Search cannot simply be placed on top of an existing information environment. It needs to be thought of as an application integrated with supporting systems, tools and processes.

How Search Works:

To understand how search works, it is important to understand metadata. Metadata is simply information that describes a piece of content, e.g.
 
 Title:
 Date Created:
 Author:
 File Type:

Some content might have more elaborate metadata, and some might contain only the most basic as listed above.

All search engines leverage metadata in some form or another. A full text search engine will also derive metadata from the actual text of a document. It will create an index of the terms present in a particular document, and then associate that document with those terms. This process is usually referred to as document processing. Various search engines handle this process in different ways, but in general it involves:

  • Isolating metadata tags such as title, author, etc.
  • Deleting stop words (extremely common words), e.g. and, the, it, etc.
  • Stemming words, which reduces words to a stem form, e.g. “swimming” and “swimmer” are reduced to “swim”
  • Weighting the extracted terms [2].
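
The document processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the stop word list, the `naive_stem` suffix stripper (a crude stand-in for a real stemmer), and the dictionary layout of a document are all hypothetical, and weighting is reduced to a raw term count.

```python
import re
from collections import Counter, defaultdict

# Illustrative stop word list (assumption); real engines use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "it", "of", "to", "in", "is", "for"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "swimm" -> "swim".
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


def process_document(doc):
    """Return (metadata, term weights) for one document dictionary."""
    # 1. Isolate metadata fields (everything except the body text).
    metadata = {key: value for key, value in doc.items() if key != "body"}
    # 2. Tokenize and delete stop words. 3. Stem what is left.
    tokens = re.findall(r"[a-z]+", doc["body"].lower())
    terms = [naive_stem(token) for token in tokens if token not in STOP_WORDS]
    # 4. Weight the extracted terms; here simply by how often they occur.
    return metadata, Counter(terms)


def build_index(docs):
    """Build an inverted index: term -> {document id: weight}."""
    index = defaultdict(dict)
    for doc_id, doc in docs.items():
        _, weights = process_document(doc)
        for term, weight in weights.items():
            index[term][doc_id] = weight
    return index
```

With a single document whose body is “Swimming and the swimmer swims in the pool.”, the index associates that document with the stem “swim” at weight 3 and with “pool” at weight 1.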

Weighting terms is a process that is also handled differently by the various search engines. For example, some search engines may weight a term more heavily if it appears in the “title field” of the document or if it appears 20 times as opposed to 5; others may weight a term based on its proximity to other terms or on other linguistic rules.
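
As a rough sketch of two of these weighting signals, the Python below adds one point for every occurrence of a term in the body and a fixed bonus for every occurrence in the title. The function names and the `TITLE_BOOST` value of 3.0 are assumptions chosen for illustration; proximity and linguistic rules are left out.

```python
import re
from collections import Counter

TITLE_BOOST = 3.0  # assumed bonus for a term appearing in the title field


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weight_terms(title, body, title_boost=TITLE_BOOST):
    """Weight each term by body frequency plus a bonus per title occurrence."""
    weights = Counter()
    for term in tokenize(body):
        weights[term] += 1.0  # frequency: 20 occurrences outweigh 5
    for term in tokenize(title):
        weights[term] += title_boost  # field weighting: title terms count more
    return dict(weights)
```

For a document titled “Notebook care” whose body mentions “notebook” twice, the term “notebook” scores 5.0, while a word that appears once in the body only scores 1.0.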

When a user enters a search query, the engine goes about the task of processing the request. This process is termed query processing. Similar to the document processing phase, most search engines will then:

  • Delete stop words
  • Stem query terms
  • Parse the query into terms and operators (e.g. Boolean, proximity)
  • Weight query terms
  • Match terms against an index

The search engine will then retrieve the results and display them to the user.

So why doesn’t this process work very well in so many situations? Searches often lead either to endless lists of results, which the user must then sort through, or to results that are totally irrelevant to the user’s information need. There are many answers to this question, but arguably the most significant reason is that “search terms are short, ambiguous and an approximation of the searchers real information need.” [3]
 
Consider a basic query like “notebook”. The immediate problem is how a search engine can know whether the user is looking for information on a type of computer or a paper product. The answer is that the search engine can’t know. Relevant search results come from having the correct context: task, audience, process, and perspective. Context comes from understanding what a user is doing and how they think about the answers they seek. When building or configuring a search application, we need to pay close attention to these issues. We need to be able to take ambiguous keywords and present users with possible ways to find the most relevant information. We need to be able to show users the organizational context of the information that is retrieved, and we need to make sure that information is not lost in multiple systems. Combining a search application with a taxonomy is the best way to present users with content in context.
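
One way to picture “content in context” for the “notebook” query is to group matching results by a taxonomy category stored in each item’s metadata, so the user can pick the meaning they intended. The sketch below is purely illustrative: the sample documents, the category labels and the `search_by_context` function are hypothetical, and matching is a plain substring test on the title.

```python
from collections import defaultdict

# Hypothetical content items, each tagged with a taxonomy category.
DOCUMENTS = [
    {"title": "Notebook battery care", "category": "Computers > Laptops"},
    {"title": "Spiral notebook price list", "category": "Office Supplies > Paper"},
    {"title": "Notebook docking stations", "category": "Computers > Laptops"},
    {"title": "Printer paper sizes", "category": "Office Supplies > Paper"},
]


def search_by_context(documents, keyword):
    """Group the titles that match the keyword by their taxonomy category."""
    groups = defaultdict(list)
    for doc in documents:
        if keyword.lower() in doc["title"].lower():
            groups[doc["category"]].append(doc["title"])
    return dict(groups)
```

Calling `search_by_context(DOCUMENTS, "notebook")` yields two groups, one under “Computers > Laptops” and one under “Office Supplies > Paper”, instead of a single undifferentiated list.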

Why a Taxonomy? 

Most organizations use a range of systems to store information. Each of these systems, in turn, uses its own set of categories (metadata) to organize the information it stores. For example, a customer database might use one set of fields to classify documents, a repository of marketing collateral another, and a technical product database another still.

Furthermore, none of these ways of organizing information may map to the ways in which customers and employees view the world. Different groups have different ways of looking at things. Marketing executives and technical experts use widely varying terms to describe the same concepts. When there is a disconnect between the system’s categories and the user’s, important information is not found, and this leads to mistakes, poor decision making, and the needless duplication of information.

Most often, databases and repositories have no means of expressing relationships between information, attributes of information or synonyms. Thus, employees can only find related information if they already know it exists and where it is located. When relationships do exist, they are usually static, and cannot be adapted to the varying needs of users.

A taxonomy is a structure that represents an organization’s understanding of the content it possesses and uses. A taxonomy allows an organization to create and centrally manage the controlled vocabularies and metadata that apply to all of the organization’s content. These vocabularies and metadata can then be consumed by a search application to return content in context.
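
A very small controlled vocabulary might be represented as follows. This is a sketch under assumptions, not a recommended schema: the terms, the `synonyms` and `broader` fields, and the helper functions are all hypothetical. It shows how the varying terms used by different groups can be mapped to one preferred term, and how that term sits within a hierarchy.

```python
# Hypothetical taxonomy: preferred term -> synonyms and broader (parent) term.
TAXONOMY = {
    "computer": {"synonyms": {"pc"}, "broader": None},
    "laptop": {
        "synonyms": {"notebook computer", "portable pc"},
        "broader": "computer",
    },
}


def build_lookup(taxonomy):
    """Map every preferred term and synonym to its preferred term."""
    lookup = {}
    for preferred, entry in taxonomy.items():
        lookup[preferred] = preferred
        for synonym in entry["synonyms"]:
            lookup[synonym] = preferred
    return lookup


def preferred_term(lookup, term):
    """Return the preferred term for a user's term, or None if it is unknown."""
    return lookup.get(term.strip().lower())


def path_to_root(taxonomy, preferred):
    """Return the hierarchy from the broadest term down to the given term."""
    path = []
    while preferred is not None:
        path.append(preferred)
        preferred = taxonomy[preferred]["broader"]
    return list(reversed(path))
```

Here a user’s term such as “Notebook Computer” resolves to the preferred term “laptop”, whose path in the hierarchy is computer, then laptop. A search application could use the same lookup both to tag content and to interpret queries.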

Continue to part 2 > Building a Taxonomy

 

© 2008 Earley & Associates, Inc.

