Earley AI Podcast - Episode 33: Knowledge Graphs, Data Modeling, and the Future of AI with Ben Clinch

Why Proprietary Data Models Are the Foundation of Enterprise AI Success

Guest: Ben Clinch, Head of Information Architecture at BT Group 

Hosts: Seth Earley, CEO at Earley Information Science; Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce

Published on: September 8, 2023

 

 

In this episode, Seth Earley and Chris Featherstone speak with Ben Clinch, Head of Information Architecture at BT Group and a leading voice in enterprise data management, knowledge graphs, and AI architecture. Known as "Mr. Data Fabric," Ben brings 23 years of experience across global organizations. They explore why organizations cannot rely solely on generic AI data models, how knowledge graphs supercharge LLM accuracy, and why data modeling, like an org chart, is a non-negotiable asset for any enterprise deploying AI at scale.

 

Key Takeaways:

  • Organizations mistakenly assume generative AI eliminates the need to curate and prepare proprietary data models before deployment.
  • Relying solely on LLM-generated data models means deferring your organizational intelligence to a standard everyone else already uses.
  • The data model is to data what the org chart is to people: both are irreplaceable assets that must reflect your unique organizational reality.
  • Knowledge graphs provide semantic structure and factual grounding that significantly reduce LLM hallucinations in enterprise applications.
  • Retrieval-augmented generation works best when paired with well-structured metadata, componentized content, and tightly controlled data sources.
  • Ontology governance requires versioning, usage metrics, and self-correcting mechanisms to stay relevant and avoid unchecked complexity.
  • AI is a powerful accelerator for building knowledge graphs, but human-in-the-loop oversight remains essential to ensure accuracy and soundness.

Insightful Quotes:

"Companies will realize quickly that they can't do any sensible generative AI without a core of useful referential data to utilize, train on, and not hallucinate." - Ben Clinch

"If you're deferring your data modeling to a standard that everybody else has access to, it might be a good starting point-but it would be a disaster to solely rely on somebody else's model for you." - Ben Clinch

"Taxonomy is a chart of accounts for knowledge-and the project succeeded because they tried doing it so many times before and failed without the core information architecture." - Seth Earley

Tune in to discover how combining knowledge graphs, proprietary data models, and retrieval-augmented generation creates the reliable AI foundation every enterprise needs to succeed.



Podcast Transcript: Knowledge Graphs, Data Modeling, and the Future of Enterprise AI

Transcript introduction

This transcript captures a conversation between Seth Earley, Chris Featherstone, and Ben Clinch about the critical role of proprietary data models and knowledge graphs in enterprise AI success. Ben shares hard-won insights on why organizations cannot defer their data modeling to generic LLM standards, how knowledge graphs reduce hallucinations through semantic grounding, and how governance and human oversight keep ontologies accurate and useful over time.

Transcript

Seth Earley: Good morning. Welcome to today's podcast. My name is Seth Earley.

Chris Featherstone: And I'm Chris Featherstone.

Seth Earley: And today I want to introduce our guest. He's an industry expert in data management, knowledge graphs, and business and data architecture. He brings 23 years of experience in business technology and operations across some of the largest multinational organizations. He has a passion for communicating complex ideas in relatable ways, and teaching how to implement data management at scale.

He's a sought-after public speaker, having spoken at the Gartner Data and AI Summit and Google Next, among many others. He's an advocate of the Enterprise Data Management Council, a nonprofit promoting the highest standards in data management. He's obsessed with knowledge graphs, having started the Graph Guild at BT with 200-plus members. He's Head of Information Architecture for BT Group - known as Mr. Data Fabric - Ben Clinch. Welcome to the show.

Ben Clinch: Thank you, Seth. Thank you, Chris. It's great to be here.

Chris Featherstone: Yeah, thanks, Ben. And it's always a pleasure. All our interactions have been thought-provoking and very gracious. So thanks for the time.

Seth Earley: So let's jump in. Ben, what do you think the biggest misconceptions are about generative AI? I mean, we know you're heavy in data and knowledge graphs and information architecture, data management and data architecture. But the whole thing with generative AI is so front of mind for people. How does that relate? What do you think the biggest misconceptions are these days about generative AI?

Ben Clinch: So I think first of all, obviously generative AI is an incredibly valuable tool in the right hands. But the biggest misconception, from my perspective, is that many people assume that it's not necessary to curate and prepare data now that we have these capabilities. Of course, generative AI models from OpenAI and others have hoovered up huge amounts of data from across the world and filtered out a lot of data as well - they've dismissed data they don't want to utilize. That's very easy to overlook for the many millions of people who've interacted with these large language models through chat interfaces. So the magic is still behind the curtain, so to speak.

Seth Earley: And you had mentioned when we were chatting before that you're kind of leaning on someone else's data model or someone else's knowledge, in a lot of ways, as it is today. Can you say more about that?

Ben Clinch: Yes, absolutely. It wasn't that long ago I was at a conference where someone said, "We no longer need to do any data modeling, because large language models can now create data models entirely for us without us having to do any work." And of course it's a great accelerator for that type of activity. But in reality, it's being trained on huge sets of information like Schema.org and other freely available ontologies - curated and created thoughtfully by brilliant people around the world.

From that perspective, what you're doing is deferring your data modeling to a standard that everybody else is going to have access to. And if no data modeling goes on anymore, it's going to increasingly diverge from reality. So while AI can absolutely accelerate these activities, it would be a disaster to solely rely on somebody else's model for you.

Seth Earley: That's interesting. People used to ask me: don't you just have a taxonomy we can use? And I'd say, look at two stores that sell exactly the same things. Think of your display taxonomy as your store layout, your shelves, your signage. Do two stores look exactly the same? If everything looked the same, there would be no competitive differentiation. Right? When you standardize, that's good for efficiency, but differentiation is what gives you competitive advantage. It sounds like you're saying the same thing.

Ben Clinch: Exactly. It's a good starting point, but it doesn't give you differentiation. And actually there are two angles on that. I always think of the data model being to data what the org chart is to people. They're both incredibly powerful assets - data and people - for any organization.

And it's an analogy that keeps giving in many different ways. You wouldn't go to an external party and download somebody else's org chart and say, "I can't be bothered to work out how to organize my people - I'll defer entirely to somebody else's standard." It might be a starting point, sure. But you would never defer your entire organizational design to somebody else because people are too important and individual.

Chris Featherstone: That's also saying that org chart comes with all the HR responsibilities and pay grades associated with it too. But in that vein, Ben, I'd love to get your take on how you actually get business leadership within your organizations to understand the value of data modeling.

Ben Clinch: So for many organizations, data modeling has been really downplayed over the years. A lot of this has to do with various vendor platforms saying, "You don't need to create schemas - just pile data into my data lake and people can work it out themselves." What that really means is you've got one group of people putting data into the lake who are off the hook from structuring it meaningfully. And then you've got huge teams of data scientists - highly skilled individuals - wading through the data lake and becoming something akin to data janitors, all doing it simultaneously and overlapping. Rather than having one group responsible for curation, you've got loads of highly paid people wasting their time working out what the data even is.

The other thing is the org chart analogy is really powerful here. People say, "What's the ROI on data modeling? When is the data model done?" And I say, "Well, you tell me: what's the ROI on an org chart? Have you ever worked out the return on investment for organizing your people?" They go, "No - why would I?" Exactly. And when is the org chart done? You tell me when it's not going to change again. Of course the data model needs to flex and grow and evolve with the strategy of the organization - just like an org chart. Those analogies really resonate with leadership.

Seth Earley: I also have analogous arguments on the unstructured side. We did this project with Applied Materials - which became a Harvard Business Review article that made it to their best-of-archive edition - on "Is Your Data Infrastructure Ready for AI?" The CFO was asking, "Why do we need taxonomy, ontology, and information architecture?" And I said, "Do you have a chart of accounts for your finance organization?" Of course they do. "Why don't you just get rid of that and use Google?" They said, "That's ridiculous." I said, "Exactly. Because taxonomy is a chart of accounts for knowledge." The project was very successful because they'd tried it so many times before and failed without that core information architecture.

What else do organizations and executives need to know about this related to generative AI? There's the data side and the unstructured content side. How do you see knowledge graphs fitting in - and what are your thoughts on retrieval-augmented generation?

Ben Clinch: I'm actively exploring that, and I'm very passionate about the fact that graphs and generative AI work very, very well together. Generative AI is extremely eloquent, but it is prone to hallucination - which means that in the course of being creative, it can tell very plausible lies. It can also have been trained on information that is misleading or inaccurate, which an organization would not want to relay to others.

The fantastic aspect of a knowledge graph is that it can give structure and semantic context to AI, which enables it to better understand what we mean when we say certain things. But also, it can provide a corpus of knowledge that can be referred to and checked against. The output of a large language model, for example, can then be checked for hallucinations against facts and against reality. So they work really beautifully together in that sense.

Seth Earley: We're seeing more and more of that. There are a lot of different ways to combine an information architecture with a knowledge architecture. You can use the large language model to process the query, retrieve results based on a normalized query, and then process the results conversationally. Temperature is the creativity setting: zero is the least creative, one is more creative and more prone to hallucination. If you turn the temperature down to zero, say "only get information from this knowledge source," and tell it, "if you don't have the answer from this data source, say 'I don't know,'" that vastly restricts hallucinations.

The other thing we found in our research was that adding metadata to the content and componentizing it effectively really did improve accuracy and precision. The tricky thing about generative AI is it may give you an answer that's factually correct and drawn from the right data source, but it's not phrased in a way where you can look at the source and verify that quote. It's already been translated. So that's an interesting challenge.

Ben Clinch: I absolutely agree. And actually, I think one of the aspects of your book is that you make very complicated ideas extraordinarily accessible for business people. I became aware of your book through a book review at BT - we have a wonderful book review culture there. A very brilliant colleague, who doesn't specialize in data, did a review and said, "I really like this book, but I'm not convinced organizations can actually build these ontologies." I said, "That's really interesting, because I'm here to help do exactly that." We had a really involved conversation. He joined the Graph Guild. He became very active and totally sold on it - now he's a proponent of knowledge graphs everywhere. And ontologies are easier to understand for business people than many assume.

Seth Earley: Ben, what are your thoughts on using generative models to generate graphs and these types of knowledge structures off of basic information?

Chris Featherstone: Yeah, these types of technologies off of basic information.

Ben Clinch: Well, I think there's a great opportunity there. But I think it has to be tempered. I've been playing with this - running podcast transcripts through a model, extracting entities, and creating graphs on the fly, which is great fun and quite a powerful tool. That said, you still need to correct it heavily. Even when writing code, you need the domain knowledge associated with not only the structure and content, but how it's been codified to ensure it's a sound ontology. So I would say it's a great accelerator, but the human in the loop is absolutely critical to ensure you get the results you want.

And that's really interesting from an educational perspective. People are saying AI is having an impact on people's interest in learning how to code and develop these skills. But we're not going to be able to correct and advance AI if there's a lack of skilled people who can oversee and be that human in the loop.
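Ben's entity-extraction experiment can be approximated as follows. The triples are illustrative stand-ins for what an extraction model might emit from a transcript; a real pipeline would put a human in the loop to correct them before they enter the graph.

```python
# Sketch of accumulating LLM-extracted (subject, predicate, object) triples
# into an in-memory graph that can later be queried and fact-checked against.
from collections import defaultdict

# Hand-written stand-ins for model-extracted triples (hypothetical pipeline).
triples = [
    ("Ben Clinch", "works_at", "BT Group"),
    ("Ben Clinch", "founded", "Graph Guild"),
    ("Graph Guild", "part_of", "BT Group"),
]

graph = defaultdict(list)
for subj, pred, obj in triples:
    graph[subj].append((pred, obj))

def facts_about(entity: str) -> list[str]:
    """Render the stored edges for one entity as readable statements,
    i.e. the corpus of knowledge an LLM's output can be checked against."""
    return [f"{entity} {pred} {obj}" for pred, obj in graph[entity]]

print(facts_about("Ben Clinch"))
```

This is the "corpus of knowledge that can be referred to and checked against" from earlier in the conversation, reduced to its simplest possible form.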

Chris Featherstone: This is where Seth and I started to discuss the notion of dynamically creating something off of, let's say, a speech. And once that graph gets created, you can dissect it and come up with really interesting relationships between what somebody speaks publicly. But what they're also doing - because of their knowledge of the domain - is outlining all of their intellectual property associated with it. And so we got into the question: who owns that? If somebody graphs a TED Talk, who owns that information, especially if you can disseminate it into structure out of non-structure? How do you protect it?

Ben Clinch: That's a really interesting ethical piece. The thing is, anybody can invest time and listen to a podcast and manually map out the entities. It's just too slow to do by hand. But when you can deploy graphs, you can hoover this stuff up pretty quickly. Personally, I wouldn't object to somebody mapping what I'm saying publicly. I'm speaking openly - somebody's going to benefit from that, which is great. I wouldn't speak openly about something if I didn't want people to benefit from it.

Seth Earley: Quick question: is there anything that keeps you up at night in your role these days? Things you're worried about in terms of trends, implementation, or buy-in?

Ben Clinch: Oh, so many things, because as with any role, it's always about managing risk and reward. Some of the things I find really interesting in the AI space: ethics, not dissimilar to the question Chris just posed. I don't think there are very clear answers or guidance across the industry or from regulators on a lot of this stuff - things like who owns the intellectual property of AI-generated images that are derivative by nature from other sources, cross-border regulatory discrepancies, and so on. I think there are some really great position papers from the UK government, among others, that we're very interested in and inputting into as an organization. Making sure we always hold ourselves to the highest ethical, regulatory, and legal standard.

Seth Earley: That's such a concern of the C-suite - the risks, the audit trails. And some of the things we talked about can really mitigate that. If you can say, "Here's our data, here's our data source, we're only using the LLM to process a query and then retrieving from our own data" - you have control, you have an audit trail, you vastly eliminate hallucinations with certain parameters. That's a really important way for organizations to think about deployment.

Can you talk about a problem your team has recently solved - an interesting business or technical challenge you've applied these principles to?

Ben Clinch: Fantastic. One of the things we're super excited about is ontology-based work. We've done a Cloud Data Management Capabilities (CDMC) assessment - CDMC is a framework from the EDM Council, the nonprofit, which I helped contribute to writing. It brought together about 300 professionals from about 100 of the world's largest financial services organizations, as well as the hyperscalers. What we created is what we call the CDMC Information Model - an ontology model of all the metadata necessary to complete CDMC certification.

It provides a large ontology for metadata based on standards like DC (a cataloging standard), DQV (the W3C Data Quality Vocabulary), PROV-O (the W3C provenance and lineage ontology), and others. We're in the process of utilizing that to structure our internal metadata in a consistent standard. It allows us to automate some of our data management activities and it perfectly aligns with the AI-powered enterprise approach.

Seth Earley: Before we continue, you talked about things that are inherently valuable but don't necessarily have a direct or measurable ROI. Does this fall into that classification, or has it been tied directly to business outcomes?

Ben Clinch: That's a really fantastic question. Building a digital twin in this sense can be used for multiple use cases, each of which has an ROI. But I'd recommend for anyone who wants to do that kind of thing: choose a couple of really big-hitting initiatives that are going to benefit from it. Look at the efficiency gains you can make. From our perspective, we're on a simplification exercise across the organization, and we can make that easier by tracking lineage from a graph perspective and tracking metadata associated with system dependencies. That's a very healthy use of a knowledge graph.

I like to call the business capability the "excuse case" - the use case that's the excuse for doing something that has applicability across lots of other domains. Focus on one particular area like customer 360, customer experience, compliance, or security - something with a measurable outcome - and say, "This is what enabled that." Because data supports process, process supports an outcome, and that outcome supports the enterprise strategy.
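The chain Ben describes - data supports process, process supports an outcome, the outcome supports strategy - can be modeled as a tiny lineage walk over "supports" edges. The node names here are hypothetical.

```python
# Sketch of tracing lineage up the chain "data supports process, process
# supports an outcome, and that outcome supports the enterprise strategy".
# The edge map below is invented for illustration.
supports = {
    "customer_table": "billing_process",
    "billing_process": "revenue_assurance_outcome",
    "revenue_assurance_outcome": "enterprise_strategy",
}

def lineage(node: str) -> list[str]:
    """Walk the supports chain from a data asset up to the strategy it enables."""
    chain = [node]
    while chain[-1] in supports:
        chain.append(supports[chain[-1]])
    return chain

print(lineage("customer_table"))
```

With this in place, "what's the ROI of this dataset?" becomes answerable: follow the chain to the outcome it supports and attribute the benefit there.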

Seth Earley: When you have that connection and lineage, it's much easier to say, "This was enabled by that."

Ben Clinch: And the Graph Guild is very much how we're pulling together lots of different use cases from across the organization. We have a large number of graphs already, and we've created a community where we can leverage those to build out domains and an enterprise-wide knowledge graph - with a meta-model structure we call an upper ontology. As people are delivering value for their particular customer-facing or corporate unit, we can leverage that for cumulative benefit across the organization.

Libraries of use cases that you maintain over time allow the testing of your capabilities. Use cases need to be unambiguous - testable and verifiable. If you're building use cases for those knowledge graphs, you just build them up over time and keep them as corporate assets alongside your knowledge graphs and data models.

Chris Featherstone: For everybody listening - it's more along the lines of how do you put the governance and guardrails around the evolution of a graph and make sure that what is there is relevant?

Ben Clinch: One of the interesting things about semantic modeling and ontologies is that it's so flexible you can model anything - and the interconnectedness of everything and anything. The real challenge is to ensure that what you're modeling is relevant to the use case and relevant to the organization.

The analogy I always use is one my dad used to say: it takes two people to paint a great painting - one to paint it, and the other to tell them when to stop. One of the challenges is that ontologists love to model. That's why they do it. And they can model and model and model.

Seth Earley: There are two types of taxonomists in the world: splitters and lumpers. If you put an ontologist with a splitter, you're bound to have a very long project.

Ben Clinch: You're so right. If you over-normalize or over-generalize, it's not as useful. If you make it too granular, it becomes a beast. One of the interesting angles is to actually analyze usage and determine whether certain parts are redundant between nodes. And I think that could be something AI would be really powerful at - a self-correcting ontology over time, understanding the weighting of nodes as an AI use case.
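One way to read Ben's self-correcting-ontology idea as code: weight each ontology node by observed query usage and surface the rarely used ones for human review rather than deleting them automatically. The counts and threshold below are invented for illustration.

```python
# Sketch of usage-weighted ontology review: flag low-usage nodes so a
# human in the loop can decide whether to merge, archive, or keep them.
# Counts are illustrative, e.g. queries touching each node last quarter.
usage_counts = {
    "Customer": 1400,
    "Invoice": 900,
    "LegacyProductCode": 3,
    "FaxNumber": 0,
}

def flag_for_review(counts: dict[str, int], threshold: int = 10) -> list[str]:
    """Return ontology nodes whose usage falls below the threshold,
    sorted for stable reporting. Deletion stays a human decision."""
    return sorted(node for node, n in counts.items() if n < threshold)

print(flag_for_review(usage_counts))
```

Keeping the final decision with a person matches the human-in-the-loop point made earlier: low usage may mean a redundant node, or simply a use case that hasn't arrived yet.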

Seth Earley: Chris hit the nail on the head around governance - we need metrics, feedback mechanisms. Are people using this? Are we satisfying use cases? Are we breaking something when we make changes? And as you're saying, Ben, you can have self-correcting mechanisms by tying it to metrics and usage.

Chris Featherstone: And being able to tie that to the business case and use case - going from the very granular to the macro scale. Organizations need to get their minds around this, especially in terms of governance, ongoing resource management, decision making, course corrections, and ROI.

There could almost be a semantic relevance score for your model - a notion of explainability that tells you what makes up the relevancy scores for these models. Usage is king, and if you start to see just a few synapses firing and that's it, you've missed something.

Ben Clinch: I totally agree. I'm also interested in the fact that a lot of graph packages originated in academic backgrounds, so they don't always have versioning built in. Obviously, there are ways to supplement with things like GitHub versioning, which I think is really important - because the ontology is going to evolve just like the org chart, and you need to keep track of that.

The other thing that's really interesting is making sure security aspects work well - applying zero-trust data entitlements thinking to an ontology, and actually breaking it up into many mini-ontologies as one of the ways to curate graphs that some people are allowed to see and some are not.

Seth Earley: You know, you talk about mini-ontologies, and I think that's fine as long as those are connected and not independent. I've seen some companies break off pieces of their ontology and then go crazy with it - and it's like, no, you're fragmenting things instead of integrating them. That comes back to governance and change management.

Ben Clinch: I agree entirely. It's a dangerous path. Coordination is critical, and a common vocabulary and common standards is one of the ways to ensure that kind of interoperability. That's actually one of the reasons I generally favor RDF - it makes things more interoperable than labeled property graphs, although I'm a big fan of labeled property graphs as well. Interoperability and common standards - huge benefit.

Seth Earley: Let's switch gears a little bit. Tell us about yourself - where are you from, and how did you get to what you're doing? The world according to Ben. Were you the kind of kid who lined up all his M&Ms by color?

Chris Featherstone: I would say I was always like that... I'm not convinced I was.

Ben Clinch: So I have a background in engineering. I have a master's degree in mechanical engineering from Bristol University, where I actually specialized in artificial intelligence - back-propagation neural networks. I graduated back in 2000 and was quite convinced that was what I wanted to do with my life.

But I hadn't found data management yet. What I ended up doing was working in financial services for about 20-plus years. I didn't touch artificial intelligence as much during that period. I did a lot of architecture around data and the benefits of relational databases and business architecture. But it's really interesting - I've come full circle most recently, because of the interest in AI and its dependency on well-curated data. The two go very, very well together.

In between, I did some fascinating roles. I did a lot around financial crime investigations and intelligence in financial services - incredibly data-intensive. All sorts of ontologies and AI being applied in that space. More recently, I was the Global Head of Business Data Architecture for HSBC's Investment Bank for a few years - a fantastic role. I got very involved in the EDM Council through that role, helping build common standards and promoting them.

Seth Earley: We only have a few minutes left, so I'm going to ask you an interesting question. If you could go back to when you were in college and give yourself advice - what would you tell yourself?

Ben Clinch: I would say: follow your bliss. As Joseph Campbell would say - the person who discovered the hero's journey. And my bliss is following challenges. It's not hedonistic - it's look for the things that really excite you, push you out of your comfort zone, that give your life purpose and joy, and go for that. And you will always find that life aligns to support you when your passion is aligned with your work.

Seth Earley: You know, when my wife tells me I work too much, I like to say, "You mean I have fun too much?" I do love it. What do you do outside of work for fun?

Ben Clinch: I'm a huge advocate of cycling and sports data in general. I actually have my own little podcast with a dear friend of mine where we collect all sorts of sports, health, and wellness data. We analyze it, talk about it, and help people benefit from trackables and wearables data. And yes, I try to take as much exercise as my schedule will allow.

Seth Earley: That's great. Well, I really want to thank you for being here, Ben. It's been wonderful. I really appreciated your time, insights, and expertise, and we'll absolutely have to have you back.

Ben Clinch: Thank you so much. It's such a pleasure. Keep doing the great work - you're inspiring an industry.

Seth Earley: Thank you. I want to thank our audience. I hope you guys learned something today. Have a terrific week - or weekend, or whenever you're listening to this. And again, Ben, thank you so much for being with us. Okay, bye now.

Ben Clinch: Thank you.

Meet the Author
Earley Information Science Team

We're passionate about managing data, content, and organizational knowledge. For 25 years, we've supported business outcomes by making information findable, usable, and valuable.