Why AI Initiatives Stall: Data Quality, Knowledge Engineering, and the Path to Operationalizing Machine Learning

Anyone who has spent time in a laboratory understands the frustration of a beautiful theory colliding with ugly reality. In undergraduate chemistry, the textbook reaction looks clean and elegant. At the bench, the result is often something unrecognizable -- a cloudy, ambiguous mess that only vaguely suggests the intended outcome. The science was correct. The execution, however, was another matter entirely.

Working with enterprise data feels remarkably similar. Organizations arrive at AI initiatives carrying high expectations: precision-targeted marketing, predictive customer intelligence, cross-functional operational insight. What they encounter instead is data in inconsistent formats, riddled with gaps, carrying conflicting values across systems. The algorithms are sound. The inputs are not. And that disconnect is where most AI programs lose their momentum.

This article originally appeared in the July/August 2017 issue of IT Pro, published by the IEEE Computer Society.

Clearing Up the Confusion Around Data Variability

A persistent myth in enterprise AI is that variability in data is synonymous with poor quality. These are not the same thing. Variable data simply means data arriving in different structures and formats -- transactional records look nothing like social media activity, and both look nothing like sensor output. Understanding how each type of data functions as a signal within a given problem domain is the real analytical challenge, not an excuse to dismiss the data outright.

Messy data is a separate issue: missing fields, malformed records, formats that resist standard ingestion pipelines. Messy data can still be high-quality data -- it simply requires remediation before it becomes useful. The critical point is that no algorithm, regardless of sophistication, can compensate for inputs that are incorrect or incomplete. As one computer science professor has noted, the idea that raw data can be fed directly into an algorithm and produce meaningful insights is a myth, not a shortcut.
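What remediation looks like in practice can be sketched with a toy example. The records, field names, and format list below are all illustrative assumptions, not any particular system's schema; the point is that each fix is deliberate, and anything that cannot be normalized is surfaced as an explicit gap rather than guessed at.

```python
from datetime import datetime

# Hypothetical raw records: the same facts in inconsistent formats, with gaps.
raw_records = [
    {"customer_id": "C-001", "signup_date": "2017-03-15", "region": "NE"},
    {"customer_id": "c001",  "signup_date": "03/15/2017", "region": ""},
    {"customer_id": "C-002", "signup_date": None,         "region": "SW"},
]

def normalize_date(value):
    """Try the date formats we know about; flag anything else as a gap."""
    if not value:
        return None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # malformed: better an explicit gap than a silent guess

def remediate(record):
    cid = record["customer_id"].upper()
    if not cid.startswith("C-"):
        cid = cid.replace("C", "C-", 1)  # normalize IDs to one convention
    return {
        "customer_id": cid,
        "signup_date": normalize_date(record["signup_date"]),
        "region": record["region"] or None,  # empty string -> explicit gap
    }

clean = [remediate(r) for r in raw_records]
```

Even in this tiny sketch, none of the work is algorithmic sophistication; it is encoding decisions about what the data is supposed to mean.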

The popular "load and go" framing -- the notion that organizations can simply ingest all available data and let it tell the story -- is similarly misleading. "All data" is a conceptual overgeneralization. Which data? From which systems? Covering which time periods and business domains? These questions require deliberate answers before any model can function reliably. Bulk ingestion without curation also creates risk: concentrating data into a single repository without governance makes it a more attractive and vulnerable target.

Where Data Scientists Actually Spend Their Time

Here is an uncomfortable truth about enterprise AI: the majority of what data scientists do every day is not science. It is preparation -- cleaning, linking, reformatting, and structuring data so that it can be processed at all. Industry observers have described this work as data janitorial labor, and that characterization is accurate. One operations leader in the advanced analytics space has estimated that roughly 80 percent of data scientist effort is consumed by data cleaning, linking, and organization -- tasks that are fundamentally information architecture work, not data science.

This has direct implications for how organizations build and staff AI programs. When the bulk of skilled, expensive analytical talent is absorbed by data preparation rather than model development and interpretation, productivity suffers and business value accumulates slowly. The bottleneck is not analytical capability -- it is the state of the underlying data infrastructure.

AI Exists on a Spectrum, Not a Single Point

Discussions of AI often treat it as a monolithic capability, but in practice AI applications span a wide range of complexity and technical demand. At one end of the spectrum sit embedded AI functions that users interact with daily without recognizing them as AI at all: spelling correction, machine translation, search ranking, speech recognition. These capabilities have matured to the point of becoming infrastructure -- reliable, scalable, and largely commoditized.

At the other end are advanced analytical applications requiring deep mathematical expertise, custom algorithm development, and ongoing model refinement. Between these poles lies a broad middle ground of configurable platforms and development environments that organizations can deploy with varying levels of integration effort, depending on the specificity and complexity of their use case.

Understanding where a given initiative sits on this spectrum matters enormously for resource planning, vendor selection, and expectations around time to value.

Cognitive Computing and the Demands of Natural Language

Cognitive computing represents one of the more demanding categories of AI application -- systems designed to allow humans to interact with computers in more natural, contextual ways, and to process information in less rigidly structured formats. A physician using an AI system to evaluate patient observational data against a broader evidence base is using cognitive computing. So is a customer using a virtual assistant to resolve a service issue without navigating documentation.

These applications share a common architecture. Speech recognition converts spoken input to text. Natural language understanding derives the intent behind that input, either through training on phrase variations or through parsing for meaning. Dialog management routes that intent to an appropriate response. Each layer relies on machine learning, and the quality of outputs at each layer depends on the quality of the data and knowledge structures that inform it.
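The layering can be sketched in miniature. This is a deliberately naive version of the "training on phrase variations" approach described above, reduced to keyword matching; the intents, phrases, and responses are invented for illustration, and a production system would use trained models at each layer.

```python
# NLU layer: map known phrase variations to intents.
INTENT_PHRASES = {
    "reset_password": ["reset my password", "forgot password", "can't log in"],
    "check_order":    ["where is my order", "order status", "track my package"],
}

def understand(utterance: str) -> str:
    """Derive the intent behind an utterance from known phrase variations."""
    text = utterance.lower()
    for intent, phrases in INTENT_PHRASES.items():
        if any(p in text for p in phrases):
            return intent
    return "unknown"

# Dialog-management layer: route each intent to an appropriate response.
RESPONSES = {
    "reset_password": "I can help with that. I've sent a reset link.",
    "check_order":    "Let me look up your order status.",
    "unknown":        "Could you rephrase that?",
}

def respond(utterance: str) -> str:
    return RESPONSES[understand(utterance)]
```

The quality ceiling of a pipeline like this is set by its data: the phrase inventory and the response mapping are exactly the kind of curated knowledge the surrounding sections argue must be built deliberately.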

Retrieval Requires More Than Algorithms

When a user asks a virtual assistant a question, the system must do more than understand the words. It must retrieve an answer -- not a list of documents, but a specific, contextually appropriate response. That requires a content corpus that has been parsed, componentized, and tagged with sufficient precision that ranking algorithms can surface the right information at the right moment.

Auto-tagging and auto-classification tools can assist in this process, but they require well-designed scaffolding and iterative refinement with human judgment. Inference can also draw on relationships mapped within an ontology -- for example, connecting a product category to relevant troubleshooting steps, or associating a user's configuration profile with applicable guidance. Some of that knowledge emerges from the data itself; some must be intentionally structured through knowledge engineering.

Ontologies as the Foundation of Enterprise Intelligence

Ontologies are the organizational structures that give AI applications their contextual intelligence. They capture relationships among knowledge elements -- how product categories relate to solution types, how problem taxonomies map to resolution pathways, how user intents connect to appropriate actions. Rather than relying on raw data alone, ontologies encode the institutional knowledge that makes retrieval accurate and contextually relevant.

Consider a customer trying to configure a new device. Instead of navigating a support portal or calling a help desk, they interact with a chat interface. The system interprets their question as an intent, maps that intent against an ontology of product types and common issues, and retrieves the specific guidance that matches their situation. The algorithm handles the matching. The ontology is what makes the match meaningful.
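A minimal sketch can make the division of labor concrete. The product types, issue names, and "is_a" relationship below are invented stand-ins for a real ontology; the retrieval step is trivial precisely because the relationships have already been encoded.

```python
# A toy ontology: relationships among product types and common issues.
# "is_a" lets a product inherit issues from its broader category.
ontology = {
    "wireless_router": {
        "is_a": "network_device",
        "common_issues": ["no_connection", "slow_speed"],
    },
    "network_device": {
        "common_issues": ["firmware_outdated"],
    },
}

guidance = {
    "no_connection":     "Power-cycle the router and check cable seating.",
    "slow_speed":        "Switch to a less congested wireless channel.",
    "firmware_outdated": "Install the latest firmware from the admin console.",
}

def issues_for(product: str) -> list:
    """Collect a product's issues, including those inherited via 'is_a'."""
    issues, node = [], product
    while node in ontology:
        issues.extend(ontology[node].get("common_issues", []))
        node = ontology[node].get("is_a")
        if node is None:
            break
    return issues

def retrieve(product: str) -> list:
    """Follow ontology relationships from a product to specific guidance."""
    return [guidance[i] for i in issues_for(product)]
```

Note where the value sits: the matching logic is a few lines, while everything contextually meaningful lives in the ontology and the curated guidance it points to.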

This approach also addresses one of the most persistent and underappreciated risks in enterprise AI: knowledge loss through staff turnover. When analytical expertise and institutional knowledge live in individual heads or lie buried in code repositories, they disappear when people leave. An ontology that has been deliberately constructed and maintained becomes a durable organizational asset -- one that retains value independent of the individuals who helped build it.

Analytics at Scale Requires a New Developmental Model

Organizations that have successfully moved from AI experimentation to AI production have typically done so by rethinking how the analytical development lifecycle is structured. Rather than building models against raw data from scratch for each use case, they introduce a semantic layer between the raw data environment and the application layer. This intermediary layer preprocesses data to a point of generalized utility -- independent of any specific future use -- and then allows machine learning to handle the conversion from large, varied datasets to the precise, small-data outputs that applications consume.

By wrapping analytical models and tools within this semantic layer, organizations make them discoverable and reusable across teams. The insights generated by one team's work can feed back into the shared ontological structure, progressively enriching it. Over time, the semantic layer becomes a living knowledge asset -- one that reflects accumulated organizational intelligence rather than isolated project outputs.
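One way to picture this arrangement is a shared registry that sits between raw data and applications. The class and method names below are illustrative assumptions, not a reference architecture; the sketch only shows the two properties the text emphasizes: datasets are preprocessed once to generalized utility, and transforms published by one team become discoverable and reusable by others.

```python
class SemanticLayer:
    """A toy registry of reusable, pre-processed datasets and transforms."""

    def __init__(self):
        self._datasets = {}
        self._transforms = {}

    def register_dataset(self, name, records):
        # Data is prepared once, independent of any specific future use.
        self._datasets[name] = records

    def register_transform(self, name, fn):
        # Published transforms become discoverable to every team.
        self._transforms[name] = fn

    def query(self, dataset, transform=None):
        records = self._datasets[dataset]
        if transform:
            return [self._transforms[transform](r) for r in records]
        return records

layer = SemanticLayer()
layer.register_dataset("customers", [{"id": 1, "region": "ne"}])
layer.register_transform(
    "upper_region", lambda r: {**r, "region": r["region"].upper()}
)
result = layer.query("customers", transform="upper_region")
```

A second team querying "customers" here inherits both the cleaned records and the published transform, rather than rebuilding either from raw data.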

This model also reduces the proportion of time data scientists spend on preparation work, freeing capacity for higher-value analytical functions. Treating data as a service and the platform as an orchestration layer -- rather than a collection of bespoke solutions -- is what allows organizations to achieve meaningful productivity gains.

Building Durable Competitive Advantage

As AI tools mature and commoditize, the components that can be purchased off the shelf will be. Speech recognition is already there. General-purpose language models are moving in that direction. The elements that cannot be commoditized -- because they encode something specific to a particular organization -- are the data, content, processes, and semantic translation layers that reflect how that organization actually operates and serves its customers.

A speech recognition agent that understands a company's specific product catalog, customer history, and service vocabulary is more valuable than a generic one. Not because the recognition technology is better, but because the knowledge it draws on is richer and more precisely structured. That structured knowledge -- the ontology, the metadata architecture, the curated content -- is where sustainable competitive differentiation lives.

Organizations that invest now in understanding their own data, articulating the business problems they are solving, and building the semantic layers that enable portability across platforms will be better positioned to take advantage of best-of-breed solutions as the AI vendor landscape continues to evolve. Those that do not will find themselves dependent on specific vendors, constrained in their ability to adapt, and unable to fully operationalize the analytical capabilities they have acquired.

The real problem with AI is not the algorithms. It is the organizational commitment to building the knowledge infrastructure those algorithms require.


This article was originally published in IT Pro by the IEEE Computer Society and has been revised for Earley.com.

 


Meet the Author
Earley Information Science Team

We're passionate about managing data, content, and organizational knowledge. For 25 years, we've supported business outcomes by making information findable, usable, and valuable.