Text Mining: Search's Silver Lining

Looking for information inside an organization is different from looking for information on the Web. As more people use the search services of Google, Microsoft, and Yahoo!, among others, the assumption spreads that enterprise search will work “just like” the Web search systems of Google, Microsoft, and Yahoo!.

More disconcerting, even when one of these “big three” provides an organization's enterprise search system, intranet search is different. There are many reasons why looking for the information related to a particular invoice is different from seeking information about William Shakespeare. In fact, the differences and the reasons for them are of almost no interest to users of an enterprise search system.

An interesting view of enterprise search emerged from the Enterprise Search Report, 2nd edition. The data revealed that Fortune 500 companies have five or more enterprise search systems. Clearly the phrase enterprise search system does not mean what it says. No single system can meet a large organization's needs. Yet multiple systems increase information technology costs and foster frustration and confusion when an employee has to find “the most recent ABC Company contract and their email about a warranty claim.”

One interview subject at a major consumer products company in New Jersey said, “Most of my colleagues know that Internet search works really well on Google or Yahoo!. But we also know that our internal search systems don’t work very well at all. Everyone is very frustrated because chasing down information costs money, wastes time, and loses some sales every single day.”

User Behavior

When an enterprise search system doesn't meet the needs of its users, those users take one of several all-too-familiar actions:

  1. Work around the search system. Manual filing systems, Post-It notes stuck to monitors, and asking around are expensive, inefficient, and frustrating tactics.
  2. Install a search system for a single user or a department. Google's Desktop Search allows any employee with permission to install software to index the contents of a local hard drive. In some cases, an employee will be clever enough to install the version of Google Desktop Search that can index any shared drive to which that user has access.
  3. Quit. Many organizations assume that an employee's leaving is the result of a better offer or changing interests. Stephen E. Arnold, author of The Enterprise Search Report, said of the focus groups he conducted, “A surprising number of interview subjects linked job frustration with finding the information needed to perform basic work tasks. When work was interrupted by hunts for basic information, employee frustration was often a factor in that employee's decision to change jobs.”

Is search broken beyond repair? No. Advances in search technology have been accelerating in the last 24 months. Search, in fact, is viewed by many professionals as one of the knowledge worker's mission-critical activities. In a company where information is money, risk increases when an employee is unable to access the information needed to close a sale, handle a customer problem, or respond to an inquiry from a government agency.

Search often boils down to two tasks that many employees find difficult or intimidating. The first job is to figure out how to ask a question. Because most enterprise search systems offer a naked “search box”, employees familiar with Google expect a one- or two-word query to return a list of relevant results. Key in Spears on Google and the first result is Britney Spears's Web site. Key in Smith contract in an Autonomy-Verity, FAST Search & Transfer, or Google Appliance system. What happens? The system responds with a laundry list of results that may or may not be related to the specific Smith contract. Locating the documentation associated with the Smith contract is essentially impossible unless the user knows exactly how to hook up the notion of Smith, the specific contract, and the email related to the contract itself.

The second task is figuring out how to craft the query to produce the desired result. Again, Stephen Arnold said, “In our interviews, we learned that more than two-thirds of those using enterprise search systems were not certain about the commands needed to get information from an enterprise search system. Some employees are adept. The majority waste precious minutes trying to come up with the words, phrases, and commands needed to make one or more search systems deliver useful results. A surprising number of interview subjects said, 'I just keep my own paper files.'”

What's the "Fix" for Search?

Experts are beginning to recognize what Yahoo! seems to have known for years. Users like to look at a list of links, explore them, and fall back on search if the links don't deliver exactly what's needed with one click. The links seem to provide the same type of fast access that a movie marquee provides to a filmgoer. One glance speeds up the selection process by orders of magnitude.

Not surprisingly, some enterprise search vendors and specialized software developers have come up with tools to provide this “Yahoo!-style” point-and-click interface to the organization's employees. The solution: let the machines do the reading. Humans can then use their time and expertise to examine only the documents that contain information of significance. How does one get 1s and 0s to know the difference between idle chatter and a nuance about an illegal money transfer? How does a machine differentiate between John Doe the “good guy” and John Doe the “bad guy”?

The answer is one of those maddening paradoxes that managers encounter every time advanced technology is nominated to improve decision making or increase an enterprise's net profit. On one hand, software can turn the mashed potatoes of electronic information into tasty intellectual French fries. On the other hand, the technology is complex, costs money, and is not 100 percent perfect.

Text mining can help organizations with certain information problems. Consider these examples:

A customer support center has a proprietary system that stores email from customers and standard answers to frequently asked questions. A Microsoft SQL Server database maintains customer account feedback. Instead of requiring account representatives to perform a search to locate the “answer” to a caller’s question, software “reads” the inquiry, automatically performs a search for the “answer,” and then responds automatically if the system determines a high-probability match. Alternatively, when the “score” suggests that the answer is not on target, the system forwards the inquiry and the candidate “answers” to an account representative, who can select an answer or write an original response. The original response is added to the support database and linked to the original query. This type of system is available from such companies as Endeca, RightNow, and others.
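The score-based routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer store, the word-overlap score, and the 0.6 cutoff are all hypothetical and do not reflect how Endeca, RightNow, or any other vendor implements the feature.

```python
import re

# Hypothetical store of standard answers, keyed by topic phrase (illustrative data).
STANDARD_ANSWERS = {
    "reset password": "Use the 'Forgot password' link on the sign-in page.",
    "return policy": "Items can be returned within 30 days with a receipt.",
}

# Assumed cutoff for a "high-probability match"; a real system would tune this.
AUTO_REPLY_THRESHOLD = 0.6


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(inquiry, topic):
    """Toy relevance score: fraction of the topic's words found in the inquiry."""
    topic_words = tokenize(topic)
    return len(tokenize(inquiry) & topic_words) / len(topic_words)


def route(inquiry):
    """Answer automatically on a strong match; otherwise forward to a person."""
    ranked = sorted(
        ((score(inquiry, topic), answer) for topic, answer in STANDARD_ANSWERS.items()),
        reverse=True,
    )
    best_score, best_answer = ranked[0]
    if best_score >= AUTO_REPLY_THRESHOLD:
        return {"action": "auto_reply", "answer": best_answer}
    # Low score: hand the inquiry and the candidate answers to a representative.
    return {"action": "forward_to_rep", "candidates": [a for _, a in ranked]}
```

Calling `route("How do I reset my password?")` takes the automatic branch, while an inquiry that matches no topic is forwarded with the candidate answers attached for the representative to review.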

FAST Search & Transfer has implemented a similar system for several of its clients. U.S. government sales representative Bobbie Browning said, “FAST’s technology allows traditional search as well as offering users facets of information. These facets are stored queries that retrieve the most relevant information without the user having to think up a query and manually enter it.”

These approaches extend what Yahoo! has known for more than a decade. Many online users want to recognize a category without having to type a query. However, the option for entering an original query must be offered. So, enterprise search is providing users a choice in order to increase the speed with which the information retrieval process moves forward.

What's immediately noticeable is that “search” has become an option. The main points of access are categories of information and hot links directly to documents. Search is still available, but analysis of user behavior has shown that:

  1. Most visitors can locate specific information by clicking on a link and launching a “saved search.” The hyperlink is actually a single-click way to retrieve information from a query that exists “behind the scenes.”
  2. Certain types of information do not require a search to be launched. A click on a Daily News item displays the document itself; for example, the Daily Bulletin.
  3. Search is used only when a user seeking a specific item cannot locate it through this point-and-click type of interface.
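The three behaviors above map onto a very small amount of logic. The sketch below is a hypothetical illustration, not any vendor's interface: the documents, link labels, and the simple substring search are all assumed for the example.

```python
# Illustrative document collection (assumed data).
DOCUMENTS = [
    {"id": 1, "title": "Daily Bulletin", "text": "Cafeteria menu and office closures"},
    {"id": 2, "title": "Smith contract", "text": "Master services contract with Smith Corp"},
    {"id": 3, "title": "Smith warranty email", "text": "Email about the Smith warranty claim"},
]

# Link label -> stored query that runs "behind the scenes" (hypothetical labels).
SAVED_SEARCHES = {
    "Smith account": ["smith"],
    "Warranty claims": ["warranty", "claim"],
}

# Link label -> a single document that opens directly, with no search at all.
DIRECT_LINKS = {"Daily News": 1}


def search(terms):
    """Return documents whose title or text contains every term (case-insensitive)."""
    hits = []
    for doc in DOCUMENTS:
        haystack = f"{doc['title']} {doc['text']}".lower()
        if all(term.lower() in haystack for term in terms):
            hits.append(doc)
    return hits


def click(label):
    """Resolve a click: open a direct link, or run the saved search for the label."""
    if label in DIRECT_LINKS:
        return [doc for doc in DOCUMENTS if doc["id"] == DIRECT_LINKS[label]]
    return search(SAVED_SEARCHES[label])
```

A user who cannot find what is needed among the links still has `search()` available with terms of his or her own choosing, which corresponds to the third behavior in the list.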

Erik S. Arnold, a senior manager at Vivisimo, adds: “Clustering the ‘hits’ from a query of multiple search systems provides three benefits. First, duplicates can be automatically eliminated so users don't look at the same information twice. Second, visual cues such as file folders with clear text labels allow a worker to see what's been found without scanning dozens of documents. Third, software is better suited to look at multiple repositories because the process can be parallelized, saving minutes, even hours, from a typical enterprise information search.”
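Two of the benefits mentioned in the quote, parallel querying of several repositories and removal of duplicates, can be illustrated with Python's standard library. This is a minimal sketch under assumptions: the two repository functions are hypothetical stand-ins for real search systems, and duplicates are detected by a simple whitespace- and case-normalized text comparison. It does not attempt the clustering or folder labeling itself, and it is not a description of Vivisimo's product.

```python
from concurrent.futures import ThreadPoolExecutor


def search_file_share(query):
    """Hypothetical repository: returns canned hits for the example."""
    return [
        {"source": "file share", "text": "Smith contract, signed copy"},
        {"source": "file share", "text": "Smith warranty terms"},
    ]


def search_email_archive(query):
    """Hypothetical repository that holds a duplicate of one file-share hit."""
    return [
        {"source": "email", "text": "smith  contract, signed copy"},
        {"source": "email", "text": "Re: Smith warranty claim"},
    ]


def federated_search(query, repositories):
    """Query every repository in parallel, then merge and drop duplicate hits."""
    with ThreadPoolExecutor() as pool:
        # pool.map runs the calls concurrently but yields results in input order.
        result_lists = list(pool.map(lambda repo: repo(query), repositories))

    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            # Normalize case and whitespace so trivially different copies collapse.
            key = " ".join(doc["text"].lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged
```

With the canned data above, `federated_search("smith", [search_file_share, search_email_archive])` returns three hits rather than four, because the email copy of the signed contract is recognized as a duplicate.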

What types of text mining technology can an enterprise use today without the time and expense of customized solutions? There are a surprising number of options. The biggest challenge is deciding on a particular vendor for a specific problem.

As suggested elsewhere in this article, text mining refers to one or more computer-based activities that do what ordinary humans cannot do when confronted with large volumes of information in electronic form. Off-the-shelf solutions exist now for:

  • Clustering: Generate a diagram of the main topics included in the documents processed. This diagram can take many forms. The idea is to let a picture communicate the overall topics in the documents processed.
  • Automatic Categorization: Produce a list of the topics and subtopics covered in the documents. This type of listing is similar to the indexing system used in libraries in the 1950s. The idea is that a small number of main headings provide easy access to the subheadings under which specific documents are organized. An example would be the broad category of Marketing and its subheadings such as Direct Mail, Public Relations, and so on.
  • Entity Extraction: Generate a list of the names of the people, places, and things in the document collection. Click on a specific name and the system displays a list of documents in which the name appears. When the user clicks on a specific document, the portion of the document containing the name is highlighted.
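The entity extraction workflow in the last item (list the names, click a name to see its documents, highlight the name in the chosen document) can be sketched as follows. The example assumes a fixed list of known names, which is a deliberate simplification; commercial products use statistical or linguistic methods to discover entities, and the documents and names here are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical list of names to look for; real systems discover these automatically.
KNOWN_ENTITIES = ["John Doe", "ABC Company", "New Jersey"]

# Illustrative document collection (assumed data).
DOCS = {
    "memo-1": "John Doe met ABC Company staff in New Jersey.",
    "memo-2": "The ABC Company contract is ready for review.",
}


def build_entity_index(docs, entities):
    """Map each entity name to the IDs of the documents that mention it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for entity in entities:
            if re.search(re.escape(entity), text, flags=re.IGNORECASE):
                index[entity].append(doc_id)
    return dict(index)


def highlight(text, entity):
    """Wrap each occurrence of the entity in square brackets."""
    return re.sub(
        re.escape(entity),
        lambda match: f"[{match.group(0)}]",
        text,
        flags=re.IGNORECASE,
    )
```

Looking up "ABC Company" in the index returns both memos, and `highlight(DOCS["memo-2"], "ABC Company")` marks the portion of the document that contains the name.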

The Role of a Taxonomy 

What is becoming a major groundswell in the search-and-retrieval business is that a self-evident, easy-to-use interface is removing what might be called “search box anxiety.” The text links or the graphic links allow the system user to go directly to the information needed to answer a question.

Companies such as Endeca, FAST Search & Transfer, Mondosoft, and Autonomy-Verity have subsystems and technology “baked into” their enterprise search systems to allow programmers, designers, and system administrators to speed access to information. “Pointing-and-clicking is simply faster and easier for most employees,” says Laust Sondergaard, the president of Mondosoft. “A human can recognize an item from a list or a meaningful graphic much faster than figuring out a Boolean search query and typing it into a search box with no mistakes.”

The term used to describe this blend of interface and information access is taxonomy. “The $5.00-word is not necessarily the best one,” says Stephen Arnold. “But it's the one we're stuck with at this time.”

A taxonomy is a series of pigeon holes or categories into which similar items can be placed. Most organizations do not have a classification system similar to the one used by the Library of Congress. The raw materials for such a taxonomy exist in the documents that the search system indexes. The documents can provide useful information about the way a particular organization organizes its work processes, documents, and data.

However, generalized tools are likely to lack the fine-tuning controls needed to create a taxonomy that is appropriate for a specialized financial services company or even a company involved in freight forwarding.

Says Stephen Arnold, “Providing a basic search-and-retrieval system with the type of information a taxonomy or classification scheme contains can turn a so-so search system into a pretty good one.” In fact, the good news for organizations with an existing enterprise search system or multiple search systems is that a taxonomy can make an immediate and direct improvement in the quality of the search experience.

The Taxonomy Task

Creating a taxonomy is work, but it is getting easier with each passing month. New companies such as Jarg or Genalytics introduce interesting new tools. Then, in a matter of months, established vendors such as ClearForest, Inxight, and Stratify raise the bar. The cycle then begins again.

Instead of focusing on the tools, consider the steps in creating a taxonomy that can be used to turbocharge an existing enterprise search system:

  1. Look for existing lists of words and phrases. Many organizations have word lists for indexing documents. Examples range from the company's legal department to the engineering units responsible for developing new products. If your organization has an Oracle database license, Oracle provides to its customers controlled term lists and basic classification of these terms.
  2. Assemble a master list and seek the advice of a taxonomy expert.
  3. Ask your search system vendor if the present search system has an automatic classification system or function. If so, your vendor will advise you on how to use the existing system to produce a term list or a classification scheme during the document processing that occurs for the existing search system. If not, ask your vendor or an outside expert to recommend a third-party taxonomy generation tool.
  4. Use the tools (provided by the search vendor or a third party) to refine the classification scheme and the assignment of controlled terms or phrases to specific pigeonholes. This preliminary draft of a classification scheme can then be used as a supplement to key word indexing. Instead of assigning as index terms only the words and phrases found in the document itself, the search system consults the taxonomy and the terms assigned to categories, and then adds these categories and terms to the indexed document.
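The last step, supplementing key word indexing with taxonomy categories, can be sketched as below. The taxonomy, its controlled terms, and the substring matching rule are all assumptions made for illustration; a production system would use the vendor's classification engine rather than this simple lookup.

```python
import re

# Hypothetical taxonomy: category path -> controlled terms assigned to that pigeonhole.
TAXONOMY = {
    "Financial > Accounts Receivable": ["invoice", "past due", "remittance"],
    "Marketing > Direct Mail": ["mailing list", "postcard"],
    "Marketing > Public Relations": ["press release"],
}


def index_document(text, taxonomy=TAXONOMY):
    """Build an index entry from the document's own words plus taxonomy categories."""
    lowered = text.lower()

    # Traditional key word indexing: the words found in the document itself.
    keywords = sorted(set(re.findall(r"[a-z0-9]+", lowered)))

    # Taxonomy lookup: a category applies if any of its controlled terms appears.
    categories = [
        path
        for path, terms in taxonomy.items()
        if any(term in lowered for term in terms)
    ]

    # Main headings are the top level of each matched category path.
    headings = sorted({path.split(" > ")[0] for path in categories})

    return {"keywords": keywords, "categories": categories, "headings": headings}
```

A document that mentions an invoice and a press release is indexed under its own words and also under "Financial > Accounts Receivable" and "Marketing > Public Relations", even though neither category name appears in the text.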

What's the Payoff?

The obvious payoff is that, even with machine-generated systems with minimal editorial review, a document is indexed by:

  1. The words and phrases in the document. (This has always been done.)
  2. One or more classification categories or headings with which the document is tagged.
  3. One or more subdivisions in the taxonomy to which the document is automatically linked.

With these additional chunks of information attached to each document, the person using the system can look at other documents assigned to a particular category such as “Financial > Accounts Receivable” and see that information without keying a single letter.

The Web design can use the classification categories to create a point-and-click interface to explore the categories, drill into a particular category, or link saved searches to a particular icon. A user can then click on an icon, launch the search, and use the links to other related information as a way to chase down an elusive item of information.
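A point-and-click interface of this kind needs little more than a tree built from the category paths. The sketch below assumes documents have already been tagged with two-level paths such as “Financial > Accounts Receivable”; the file names and categories are invented for the example.

```python
from collections import defaultdict

# Hypothetical documents, each already tagged with taxonomy category paths.
TAGGED_DOCS = {
    "inv-1042.pdf": ["Financial > Accounts Receivable"],
    "q3-aging.xls": ["Financial > Accounts Receivable"],
    "budget.doc": ["Financial > Budgets"],
    "spring-mailer.doc": ["Marketing > Direct Mail"],
}


def build_browse_tree(tagged_docs):
    """Group document IDs by main heading and then by subheading."""
    tree = defaultdict(lambda: defaultdict(list))
    for doc_id, paths in tagged_docs.items():
        for path in paths:
            heading, subheading = path.split(" > ")
            tree[heading][subheading].append(doc_id)
    return tree


def drill_down(tree, heading, subheading=None):
    """Return the subheadings under a heading, or the documents under a subheading."""
    if subheading is None:
        return sorted(tree[heading])
    return tree[heading][subheading]
```

Clicking "Financial" corresponds to `drill_down(tree, "Financial")`, which lists its subheadings; clicking "Accounts Receivable" next returns the documents filed there, without the user keying a single letter.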

Wrap up

In closing, a well-shaped taxonomy delivers a number of tangible benefits to individual employees and to the overall enterprise. Among the most important payoffs from industrial-strength taxonomy systems are:

  1. A rapid, systematic way to create an overview of what an information collection is about, its gaps, and its strengths. Researchers, marketers, lawyers, and product developers can reduce their cycle time with focused text mining of information and data.
  2. Locating information about a specific person, technology, or other entity in one or more collections of text. Sales managers, telemarketers, business development staff, and lawyers can use entity extraction to identify a particular person, place, or thing and the information related to that entity.
  3. Road maps that lay out the who, what, and where of documents so the human can take the shortest route to the needed information. Researchers, competitive intelligence professionals, lawyers, and senior managers can determine what gaps exist in currently available information. Acquisition of new information becomes more efficient so that the enterprise's knowledge pool expands instead of duplicating itself.
  4. Use of the outputs of text mining systems as input for existing enterprise search systems. The entities, categories, and other information about particular documents can be used to add subtlety to the often-crude enterprise search string matching technology.

By adding value to the information, a company can make everyone in the organization smarter. Ulla de Stricker, a knowledge management consultant in Toronto, Canada, says: “The benefit from looking at documents organized into logical groupings is an acceleration of the decision making process. Taking that additional information and using it elsewhere in the organization allows software to help employees make vital connections. Those connections put a trampoline in an employee's computer. Day-to-day tasks get more lift and bounce from focused text mining.”

Are text mining and automated discovery the long-sought-after “silver bullet” that solves information problems? The answer is, not surprisingly, “Maybe.” Enterprise search is undergoing an important change. Brute-force matching of the words in a user’s query is giving way to more pliant, discovery-centric systems. The ubiquitous search box will be complemented by links to specific documents germane to the information an employee needs to do his or her job.

If you would like to know more about how a taxonomy can be used to smooth out some of the bumps associated with the search engine used to explore your organization's information superhighway, contact us.

See also reports from Real Story Group (formerly CMS Watch).

Seth Earley
Seth Earley is the Founder & CEO of Earley Information Science and the author of the award-winning book The AI-Powered Enterprise: Harness the Power of Ontologies to Make Your Business Smarter, Faster, and More Profitable. He is an expert with 20+ years of experience in Knowledge Strategy, Data and Information Architecture, Search-based Applications, and Information Findability solutions. He has worked with a diverse roster of Fortune 1000 companies, helping them to achieve higher levels of operating performance.
