The Critical Role of Content Architecture in Generative AI

This article was featured in the Association for Intelligent Information Management (AIIM).  

What Is Generative AI?

Generative AI has caught fire in the industry. Almost every tech vendor has a ChatGPT-like offering (or claims to have one). These vendors say they use the same technology, a large language model (LLM), to access and organize the content knowledge of the enterprise. In fact, there are many LLMs, both open source and proprietary, fine-tuned for various industries and purposes. As with previous new technologies, LLMs are getting hyped. But what is generative AI?

Simply put, generative AI is a technology that responds to natural language questions using algorithms that are “trained” on large amounts of text from across the web. As a result, these algorithms “understand” terminology, concepts, and relationships between concepts to such a degree that they can create a response that sounds as if it came from a human.

Although the technology creates the impression of being sentient, ChatGPT and other LLM-powered applications rely on algorithms based on mathematical analysis. Rather than retrieving an answer from existing content, the algorithm creates a response from its embedded knowledge of language and concepts by predicting the text that should follow. In other words, generative AI creates new content based on its knowledge of language, concepts, and relationships between concepts.

The Importance of Content in Generative AI 

One challenge of generative AI is that these responses are based on patterns learned by ingesting public information, which is necessary in order to get the required large volumes of training content. But what if an organization wants to use generative AI on its internal information? In that case, the organization needs to use a somewhat different approach. Instead of using only embedded knowledge of the language, an enterprise application would need to retrieve information from a knowledge base, content management system, or other data source that is curated with specific answers from the organization. This is referred to as Retrieval Augmented Generation, or RAG.
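The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not a production design: the knowledge base entries, the `retrieve` and `build_prompt` helpers, and the word-overlap scoring are all assumptions made for the example. A real system would use a search engine or vector store for retrieval and would send the resulting prompt to an LLM.

```python
from collections import Counter

# Hypothetical curated knowledge base: entry id -> answer text.
KNOWLEDGE_BASE = {
    "kb-001": "To reset the router, hold the reset button for ten seconds.",
    "kb-002": "The warranty covers parts and labor for two years.",
    "kb-003": "Error E42 means the water filter needs to be replaced.",
}


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    return [w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")]


def retrieve(question, top_k=1):
    """Rank entries by word overlap with the question (a stand-in for real search)."""
    question_words = Counter(tokenize(question))
    scored = []
    for doc_id, text in KNOWLEDGE_BASE.items():
        overlap = sum((question_words & Counter(tokenize(text))).values())
        scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by entry id for repeatability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for overlap, doc_id in scored[:top_k] if overlap > 0]


def build_prompt(question):
    """Combine the retrieved text with the question; this string would go to the LLM."""
    context = "\n".join(KNOWLEDGE_BASE[doc_id] for doc_id in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The point of the sketch is the order of operations: the organization's curated content is retrieved first, and the model is asked to answer from that content rather than from its general training alone.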

Using an LLM on your organization’s data is not as simple as pointing it at that data and letting the model figure out an answer (although that is what many vendors are claiming). Rather than admit that content requires structure and curation through a taxonomy and content architecture, some vendors in the space claim not to use taxonomies or metadata. They will concede only that they have to “label the data”. But labels are metadata. The content needs to contain clues as to its context, and metadata provides those contextual clues for interpreting content.

Imagine that you are using an LLM to support a customer seeking information about a particular product. If the information is nonpublic, sensitive, or proprietary intellectual property, exposing it to a large public language model can compromise corporate IP. (OpenAI is claiming that API-accessed functionality will not compromise corporate IP, but some data is too sensitive to trust that assertion.) Even if it is not sensitive, we still need to be able to retrieve specific information about a particular product, and the instructions needed to support that product.  

Therefore, a piece of content ingested into an LLM needs to be labeled with attributes such as the product name, the product model, any installation instructions, and error codes. The knowledge of the organization must be structured in such a way that the LLM can be used to retrieve that knowledge in the context of the customer's problem, their background, their level of technical proficiency, the exact configuration of their installation, and so on. 
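As a rough illustration of what labeled content might look like, the sketch below attaches the attributes named above to each piece of content and filters on them. The `ContentItem` class, the `find_items` helper, and the sample product data are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """A piece of content plus the metadata that gives it context."""
    text: str
    product_name: str
    product_model: str
    content_type: str  # for example "installation" or "troubleshooting"
    error_codes: list = field(default_factory=list)


def find_items(items, **attributes):
    """Return the items whose metadata matches every requested attribute."""
    matches = []
    for item in items:
        for name, wanted in attributes.items():
            value = getattr(item, name)
            # List-valued attributes match if they contain the wanted value.
            found = wanted in value if isinstance(value, list) else value == wanted
            if not found:
                break
        else:
            matches.append(item)
    return matches


# Invented sample content for a fictional product line.
ITEMS = [
    ContentItem("Mount the bracket before connecting power.", "AquaPure", "X100", "installation"),
    ContentItem("E42: replace the water filter.", "AquaPure", "X100", "troubleshooting", ["E42"]),
    ContentItem("E42: check the inlet hose.", "AquaPure", "X200", "troubleshooting", ["E42"]),
]
```

Calling `find_items(ITEMS, product_model="X100", error_codes="E42")` returns only the X100 troubleshooting entry, even though the same error code appears for another model. Without the model label, the two answers would be indistinguishable.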

How Does Generative AI Work?  

Since generative AI creates (generates) original content, accurate knowledge of the organization needs to be referenced by the technology to prevent “hallucinations”; that is, the made-up answers that sound plausible but are factually incorrect. Here are some additional details on how it works: 

  • Generative AI algorithms learn the underlying patterns and structure of a given dataset to generate new data. The model captures the probability distribution of the training data and generates new content based on those probabilities.  
  • Different probability frameworks are used to explore possible responses, meaning that the algorithm selects a response based on what is statistically most likely to be an appropriate answer (that is, it considers probabilities from a variety of perspectives). 
  • Neural Networks and Deep Learning techniques are used to model complex data relationships.  
  • In pure Generative AI approaches, data is not labeled (no metadata needs to be applied) – the system learns from the data itself without reference. (There are caveats to this that we will explore later in this article.) The issue with unlabeled data is that content can miss important context and detail.  
  • According to ChatGPT, "Generative models require substantial amounts of training data to capture the complexity of the underlying distribution accurately. Large datasets enable the models to learn diverse patterns and generate more realistic and varied outputs."   
  • "Applications of generative AI should be considered in a wider societal context," according to the same source. "Generative AI can be used for beneficial applications like art, entertainment, and research, but it also has the potential for misuse, such as in deepfakes or misleading synthetic media."
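The first point above, learning a probability distribution from training data and generating from it, can be shown with a toy model. The sketch below counts which word follows which in a tiny invented corpus (a bigram model) and samples the next word from those counts. It is a deliberate simplification: LLMs use neural networks over far larger contexts, and the function names here are made up for the example.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the follower counts for a word into a probability distribution."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


def generate(counts, start, length, seed=0):
    """Generate text by repeatedly sampling the next word from the learned counts."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length):
        followers = counts.get(output[-1])
        if not followers:
            break  # no observed continuation for this word
        candidates = list(followers)
        weights = [followers[w] for w in candidates]
        output.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(output)


# Tiny invented corpus standing in for the training data.
CORPUS = "the pump is on the pump is off the valve is on"
MODEL = train_bigrams(CORPUS)
```

In this corpus, "the" is followed by "pump" twice and "valve" once, so the model assigns "pump" a probability of two thirds. The generated text is new in the sense that it is sampled rather than copied, which is also why it can be fluent and still wrong.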

Natural Language Processing (NLP) in Generative AI

Natural language processing (NLP) is extensively used in generative AI. When a user asks a question, the same question can be asked in many ways. In chatbot design, the question or query is called an utterance. Variations in ways of asking a question need to be classified to a single user “intent” that the system can act upon. NLP provides a path to understanding intent.  
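To make the utterance-to-intent idea concrete, here is a minimal sketch that maps differently worded questions to one intent by comparing word overlap with example utterances. The intents, the example phrases, and the `classify` function are assumptions for illustration; production chatbots typically use trained classifiers or embeddings rather than raw word overlap.

```python
import re

# Hypothetical intents, each with a few example utterances.
INTENT_EXAMPLES = {
    "reset_password": [
        "how do i reset my password",
        "i forgot my password",
        "change my login password",
    ],
    "track_order": [
        "where is my order",
        "track my package",
        "when will my order arrive",
    ],
}


def word_set(text):
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def classify(utterance):
    """Return the intent whose example shares the most words (Jaccard overlap)."""
    best_intent, best_score = None, 0.0
    utterance_words = word_set(utterance)
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            example_words = word_set(example)
            union = utterance_words | example_words
            score = len(utterance_words & example_words) / len(union) if union else 0.0
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent  # None when nothing overlaps
```

Both "Forgot my password, help" and "I need to change my password" resolve to the same `reset_password` intent, which is the behavior the paragraph describes: many phrasings, one actionable intent.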

In generative AI, the same thing applies to a phrase or concept. The system is trying to interpret the question and resolve the different ways of asking. Language variations are represented mathematically by ingesting the question into a data store referred to as a vector database.  

Vector databases are different from traditional databases in how they store, process, and retrieve information. A traditional database represents the document or product in rows and the characteristics of that object (price, color, model) in columns. When there are large numbers of descriptors (dimensions) in unstructured content, a traditional database can be more challenging for certain types of queries.  

A vector representation creates a mathematical model of the object (let’s say a document) in a multi-dimensional space. This is a tough concept to wrap our minds around, since we can’t picture more than three dimensions (four if you add time as a dimension). But a vector space allows for as many dimensions as there are attributes, meaning hundreds or thousands. Documents are also ingested into the vector database and can be represented as a single vector or broken up into components, with each component having its own vector representation.
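A small sketch may help show how a document is broken into components that each get their own vector. Here each paragraph is a component, and each dimension of the vector is simply the count of one vocabulary word. This word-count scheme is an assumption chosen for readability; real embedding models learn their dimensions, and the helper names are hypothetical.

```python
def chunk_document(text):
    """Split a document into components, here one per blank-line-separated paragraph."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def build_vocabulary(chunks):
    """Collect every distinct word; each word becomes one dimension of the space."""
    return sorted({word for chunk in chunks for word in chunk.lower().split()})


def to_vector(chunk, vocabulary):
    """Represent a chunk as word counts, one number per vocabulary dimension."""
    words = chunk.lower().split()
    return [words.count(term) for term in vocabulary]


# Invented two-paragraph document.
DOCUMENT = "install the pump\n\nreset the pump"
CHUNKS = chunk_document(DOCUMENT)
VOCABULARY = build_vocabulary(CHUNKS)
VECTORS = [to_vector(chunk, VOCABULARY) for chunk in CHUNKS]
```

Even this two-paragraph document produces a four-dimensional space, one dimension per distinct word. A realistic corpus quickly reaches the hundreds or thousands of dimensions mentioned above.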

This form of representation allows a different approach to analysis, in which the proximity of data points shows how close the different attributes or other data elements are to each other. Since both the query and the content are represented as a vector, at a high level, the vector representation of the query is compared with the vector representation of the content and a response is generated.  
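The comparison between a query vector and content vectors is often done with cosine similarity, which measures how closely two vectors point in the same direction. The sketch below uses that measure as one common choice; the three-dimensional vectors and content names are invented for the example, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def nearest(query_vector, content_vectors):
    """Return the name of the content whose vector is most similar to the query."""
    return max(
        content_vectors,
        key=lambda name: cosine_similarity(query_vector, content_vectors[name]),
    )


# Hypothetical content vectors; in practice these come from an embedding model.
CONTENT_VECTORS = {
    "install-guide": [0.9, 0.1, 0.0],
    "error-codes": [0.1, 0.8, 0.3],
    "warranty": [0.0, 0.2, 0.9],
}
```

With these numbers, a query vector of `[0.2, 0.9, 0.2]` lands closest to `error-codes`, so that content would be the one passed along to generate the response.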

Metadata can be associated with content to provide explicit attributes of the content, which provide context for the query and the response. In the case of a generalized language model, dimensions of the vector space are determined by “learned features” of the information. However, explicit metadata can also be part of the vector representations of content, known as “embeddings”.
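One simple way to picture explicit metadata becoming part of a vector is to append an encoded attribute to the learned features. The sketch below does this with a one-hot encoding of a product model. It is only one possible approach, assumed here for illustration; many systems instead store metadata alongside the vector and filter on it. The model list, the `weight` parameter, and the feature values are hypothetical.

```python
# Hypothetical controlled vocabulary of product models (a taxonomy in miniature).
PRODUCT_MODELS = ["X100", "X200", "Z9"]


def one_hot(value, vocabulary):
    """Encode a metadata value as a vector with a 1.0 in its own position."""
    return [1.0 if value == term else 0.0 for term in vocabulary]


def embed_with_metadata(learned_vector, product_model, weight=1.0):
    """Append the encoded metadata to the learned features to form one vector.

    `weight` controls how strongly the explicit attribute influences proximity.
    """
    metadata_part = [weight * x for x in one_hot(product_model, PRODUCT_MODELS)]
    return list(learned_vector) + metadata_part
```

For example, `embed_with_metadata([0.2, 0.7], "X200")` yields `[0.2, 0.7, 0.0, 1.0, 0.0]`. Two pieces of content with similar wording but different product models now sit farther apart in the vector space than they would on learned features alone.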

 

The Role of Knowledge Management in Generative AI 

The Holy Grail of knowledge management (KM) has been to provide the right information to the right person at the right time. The challenge has always been how to represent that knowledge in a way that is easily retrieved in the context of the user and their task. User context is an understanding of their goal, the specific task, their background and knowledge, expertise, technical proficiency, the nature of the query, and the details of their environment.  

This information makes up the customer or employee's “digital body language”. These are the digital signals that people throw off whenever they interact with an electronic system. Any touchpoint will provide data that can be interpreted as a part of a user’s context. Some organizations will have 50 to 100 systems that together construct a user experience that encourages users to buy a product or complete their task. Those data points provide context about a user’s goal or objective.

However, the customer journey is a knowledge journey. The employee journey is a knowledge journey. At every step of the process, people need answers to questions. KM has always attempted to organize information in a meaningful way that reduces the cognitive load on the human; that is, makes it easier for them to accomplish their task. That knowledge has to be structured and tagged in such a way that it can be easily found through search or browsing and increasingly through the use of chatbots and other cognitive AI applications. 

While the technology is becoming more and more powerful – certainly as exhibited by generative AI – it does not solve the fundamental problem of KM and access by itself because, for internal use, generative AI needs to be trained on organization-specific information.  

Conclusion

Now is the time to get one’s knowledge house in order. Generative AI is an amazing advance, but using the same general language model as your competition will not create competitive advantage. It will provide efficiency through standardization. Organizations differentiate on knowledge, and that knowledge forms your competitive advantage. Accessing the knowledge of the business using Retrieval Augmented Generation sooner than your competitors will provide a clear and measurable marketplace advantage. This is not a nice-to-have. It is a need-to-have in today's fast-moving marketplace.

Seth Earley

Seth Earley is the Founder & CEO of Earley Information Science and the author of the award-winning book The AI-Powered Enterprise: Harness the Power of Ontologies to Make Your Business Smarter, Faster, and More Profitable. He is an expert with 20+ years of experience in Knowledge Strategy, Data and Information Architecture, Search-based Applications, and Information Findability solutions. He has worked with a diverse roster of Fortune 1000 companies, helping them to achieve higher levels of operating performance.