Unifying Metadata, Knowledge Graphs, and Generative AI for Smarter Data Engineering
Guest: Alexander Schober, Data & AI Project Owner at Motius
Hosts: Seth Earley, CEO at Earley Information Science
Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce
Published on: January 8, 2024
In this episode, Seth Earley and Chris Featherstone speak with Alexander Schober, Data & AI Project Owner at Motius and former researcher at Siemens Technology specializing in federated learning and anomaly detection. They explore the critical prerequisites for deploying large language models on enterprise data—covering knowledge graphs, unified metadata models, data contracts, and semantic consistency. Alexander shares practical frameworks for assessing organizational data readiness, identifying high-ROI use cases, and building AI-powered data pipelines that reduce technical debt and improve long-term data quality.
Key Takeaways:
- Successful LLM deployment on structured data requires discoverability, trustworthiness, availability, governance, and semantic consistency before any AI layer is added.
- Knowledge graphs significantly enhance LLM capabilities by providing semantic context, resolving data conflicts, and enabling complex multi-hop queries across linked data sources.
- A unified metadata model across data modeling, cataloging, and observability tools is essential to eliminate the costly disjointment that slows data engineering teams.
- LLMs can automatically generate entire data pipelines—including quality checks, governance attributes, and privacy classifications—when provided with rich metadata context.
- Data contracts between producers and consumers prevent upstream changes from silently breaking downstream pipelines, reducing costly debugging cycles from weeks to minutes.
- Model drift is an underappreciated risk: monitoring data distribution changes over time is critical to maintaining predictive accuracy, especially in industrial AI applications.
- AI will not replace skilled workers but will fill critical labor gaps in manufacturing and field services by capturing retiring experts' tacit knowledge and making it accessible at scale.
Insightful Quotes:
"The other thing which is connected to the idea of knowledge graphs or ontologies is the semantics, which is also really, really important." - Alexander Schober
"If you have questions which require a lot of joins within tables and you have a lot of hops, that's when you start seeing that a knowledge graph actually provides additional benefits." - Alexander Schober
"So much of our efforts around AI and machine learning are making up for our past sins in processes." - Seth Earley
Tune in to discover how organizations can build the data foundations—from knowledge graphs to metadata unification—that make enterprise AI scalable, trustworthy, and genuinely transformative.
Links:
- LinkedIn: https://www.linkedin.com/in/alexander-schober/
- Website: https://www.motius.com
Ways to Tune In:
- Earley AI Podcast: https://www.earley.com/earley-ai-podcast-home
- Apple Podcast: https://podcasts.apple.com/podcast/id1586654770
- Spotify: https://open.spotify.com/show/5nkcZvVYjHHj6wtBABqLbE?si=73cd5d5fc89f4781
- iHeart Radio: https://www.iheart.com/podcast/269-earley-ai-podcast-87108370/
- Stitcher: https://www.stitcher.com/show/earley-ai-podcast
- Amazon Music: https://music.amazon.com/podcasts/18524b67-09cf-433f-82db-07b6213ad3ba/earley-ai-podcast
- Buzzsprout: https://earleyai.buzzsprout.com/
Podcast Transcript: Revolutionizing Data Pipelines, Unifying Metadata, Knowledge Graphs, and Generative AI
Transcript introduction
This transcript captures a conversation between Seth Earley, Chris Featherstone, and Alexander Schober about the foundational requirements for deploying large language models on enterprise data. They explore knowledge graphs, unified metadata models, semantic data quality, AI-powered pipeline generation, data contracts, and practical strategies for identifying high-ROI use cases in manufacturing and beyond.
Transcript
Seth Earley: Welcome to the Earley AI Podcast. My name is Seth Earley.
Chris Featherstone: And I'm Chris Featherstone.
Seth Earley: And we're very excited to introduce our guest today. We're going to be discussing using knowledge graphs to improve large language model accuracy. We're going to talk about some of the misconceptions as well as some better practices—or best practices. I always have an issue with best practices; they're practices, who's to say they're best? For using large language models, some accepted practices. We'll talk about metadata, the topic that's near and dear to my heart, around data engineering for AI, and then talk a little bit about the data infrastructure. Our guest today is the Data and AI Project Owner at Motius. He manages a team of tech experts focusing on machine learning, knowledge graphs, and data analysis. He has prior experience at Siemens Technology, which involved pioneering research in federated learning and self-supervised methods for anomaly detection. He has extensive experience in AI and data analysis. Alexander Schober, welcome to our show.
Alexander Schober: Thank you so much for the kind introduction. Happy to be here.
Seth Earley: Absolutely. What I like to start with—we talked about this before during our intro call—is what are the big misconceptions that you're seeing in the space in terms of large language models, in terms of knowledge graphs, in terms of data preparation? There's a lot of noise and confusion, fear and uncertainty and doubt, and all of the nonsense that vendor hype generates. But what do you see? What do you think are some of the bigger misconceptions?
Alexander Schober: So yeah, maybe I can start a little bit with the story of how I got into all these topics. When GPT came out, I think everybody started thinking about having this one chat interface that would answer all their questions. And then you quickly figure out—okay, what is missing to actually get there? One use case you see quite often is that you're able to do some document search. There's plenty of approaches there where you retrieve some text, and I think there's great value in those use cases as well. We work a lot with companies in the mechanical engineering sector, and they have these huge manuals which they need to search through. Those are quite interesting use cases. But then, if you think about wanting to answer all the questions you have, you also get to topics where you need to look at data—more structured forms of data. And then you find approaches where you do text to SQL, text to Cypher, and so forth. That's where knowledge graphs also come in.
But if you ask how well this works—it works well when you already have the data prepared. As long as I have data in a table where an LLM can just write the SQL query, that actually works quite well. The challenge comes when you don't have all your data in a state ready for that kind of querying. And yeah, that's the question that's been bugging me for the last year: how do we actually get data into a state where we can use large language models to interface with it?
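[Editor's note: the case Alexander describes—text-to-SQL working well once data sits in a well-prepared table—can be sketched as follows. The schema, sample rows, prompt wording, and "generated" query are all invented for illustration; in practice the SQL would come back from a model call rather than being hard-coded.]

```python
import sqlite3

# Toy warehouse: a single well-prepared table, the situation where
# text-to-SQL works well.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 50.0)])

question = "What is the total order amount per customer?"

# The prompt an LLM would receive: the schema plus the user's question.
prompt = (
    "Schema: orders(id INTEGER, customer TEXT, amount REAL)\n"
    f"Question: {question}\n"
    "Answer with a single SQL query."
)

# A query a capable model would plausibly return for this prompt.
generated_sql = (
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY customer"
)

rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('acme', 200.0), ('globex', 50.0)]
```

The point of the sketch is the precondition, not the query: the table, its columns, and their meanings must already be discoverable before the prompt can even be assembled.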
Let's imagine I'm in a factory and I want to answer questions like: which parts should I order now? Or which should I produce so that I don't have any delays in manufacturing? Which parts might be delayed in the next three months—even do some predictions. In order to get there, there are some things we need to put in place. And knowledge graphs are one thing I think is important, especially when we think about the semantics of the data.
I did a lot of interviews with other experts in the field, and I broke it down into six points I think are quite interesting. First, we need data discoverability—we need to be able to find and access the relevant data. If your data is on some local Linux server under some table, it's really hard to access it with LLMs; even data engineering projects will take weeks until you have access.
The second one: if the data is not trustworthy, you can't trust the output the LLM produces. Data trustworthiness is really important. Data lineage is critical—you want to be able to understand all the lineage and where the data is coming from, which transformations have been applied, so that as a consumer of the information you can trust where it's coming from.
Going further, another important one is data availability and timeliness. Data insights are most valuable if they're served hot—if you want insights right now. If you're a retail company wanting to predict sales for next week and you get that information in two weeks, it's completely useless. We want it as real-time as possible.
Another thing that also goes into the topic of knowledge graphs is data governance—which data am I actually allowed to see and use, and for what purpose? In the context of LLMs, that's especially important. Once you put an LLM on all my data, I need to be able to tell it: this user is able to look at this data for these use cases.
Seth Earley: There was an interesting point you were making about product manuals—if you have the product manual, you can access that. I was talking with one of the large technology vendors we built a component offering environment for. They use their components for lots of different purposes, but one of the things we were discussing is that those components still need the context of the overall document. If you say "I have an error code"—well, for what product? What model? What module? Are you seeing the need to ingest components with overarching metadata that would provide context for those individual components?
Alexander Schober: Okay, that's going to be a consultant answer—it depends on the use case. But I see how, for a large set of documents, depending on how the user wants to interact with the system, you might want to do some filtering of which documents you're actually looking at. For example, let's imagine I'm a service engineer and I have some heater in front of me. I want to ask an LLM: I have error code 9324 for a heater—can you please tell me how to resolve it? If I had all the manuals for that company in the system, it would be really hard to infer from that context which manual or which heater is being referenced.
That's where it becomes relevant to combine more structured forms of information with the retrieval capabilities of LLMs—where you say: I know the context in which the person is operating. I know the customer. I know they have these features. I can already filter it down. That's how you can make the task easier by providing structured context.
Seth Earley: That makes a lot of sense. And I think that's a piece people are missing—they don't understand how you need to retain context for information in large documents or manuals. Just pointing the LLM at it, the LLM will not know how to disambiguate or differentiate between one error code for one product and another error code for another product unless you tell it what product it's for. So yeah, that's certainly interesting in our world.
When you talk about data preparation for large language models—what are the key considerations? What are the practices you recommend when an organization needs to evaluate their data readiness for an LLM application? What do you walk in with?
Alexander Schober: Yeah, absolutely. I would split it into two parts: one is text or unstructured information which we want to search through, and on the other hand, structured information.
For unstructured information, I would look at which use cases can we most easily get benefits from. Some things are harder and some are easier to retrieve. I would focus on the low-hanging fruit—for example, in manuals you often have tables or visualizations which are definitely harder to retrieve accurately than plain text. So decide based on that what the lower-hanging fruits are and go for the easier ones.
For structured information, I already mentioned the four aspects: we need discoverability, trustworthiness, availability, and governance. But the other thing—connected to the idea of knowledge graphs or ontologies—is semantics, which is also really, really important.
There's a great example: I talked to an airline, and they wanted to calculate their average flight duration. They thought it was an easy question—just go to the airport database, get the data for when flights departed and when they arrived, and take the average. But what they found is that Frankfurt considers a plane taking off as soon as the wheels lift off the ground, while Munich might say a plane departs when it moves from the gate. Now you have a discrepancy, and if you lose this kind of semantic metadata at the source, how should an LLM ever figure out how to calculate this accurately?
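[Editor's note: the airline example boils down to two different operational definitions hiding behind one column name. A minimal sketch, with invented timestamps, shows how much the "same" average can diverge depending on which semantics the source applied.]

```python
from datetime import datetime

def minutes(a, b):
    return (datetime.fromisoformat(b) - datetime.fromisoformat(a)).total_seconds() / 60

# One flight, two notions of "departure": gate pushback (Munich-style)
# vs. wheels-off (Frankfurt-style). Times are invented for illustration.
flight = {
    "pushback":   "2024-01-08T10:00",
    "wheels_off": "2024-01-08T10:25",
    "arrival":    "2024-01-08T12:00",
}

duration_from_pushback = minutes(flight["pushback"], flight["arrival"])
duration_from_wheels_off = minutes(flight["wheels_off"], flight["arrival"])

print(duration_from_pushback)    # 120.0
print(duration_from_wheels_off)  # 95.0 -> 25 minutes of pure semantic discrepancy
```

Without metadata recording which definition each airport used, no downstream consumer—human or LLM—can reconcile the two numbers.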
Having the semantics in your data is really, really important if you want to interface it with an LLM. From what I've learned, there are two main approaches. One is to define a business ontology at the metadata level, which then maps to specific implementations for different business units, and interface an LLM on top of that. The other is end-to-end value streams for one particular user—companies in the manufacturing space, for example, where you have one interface where a specific persona can ask all their questions, and underneath there's an ontology providing the semantics. Building this more general capability across a whole enterprise becomes really difficult, because then you need to think about how to create and maintain the semantics for the entire enterprise.
Chris Featherstone: Would you agree with the statement that—while a graph environment is not strictly required for an AI model, whether discriminative or generative—it absolutely enhances the capabilities of AI if you do have a graph? Not required, but enhances the ability. Is that fair?
Alexander Schober: My background is more in AI than data engineering. I need to make a distinction—AI is a broad term. If I'm talking about LLMs and the capability of LLMs to answer questions based on structured data, then I would say probably yes. For other AI use cases, I might come to a different answer.
Chris Featherstone: That's the key delineation, because we see—especially when people are getting their data prepped—that you arguably have to have a good knowledge infrastructure and information architecture. That's one of the really key areas Seth pushes hard on: you can't have good AI without good information architecture. And we're getting into scenarios now where utilizing a discriminative model to create pattern recognition and be a data generator is absolutely essential before actually interacting with a large model. I'm also curious—have you had experience utilizing generative models and LLMs to actually generate graphs and RDF files? We're getting into those scenarios now too.
Alexander Schober: That's something we're experimenting with right now. Customers usually want quick wins, so we haven't yet gotten to the really big use cases that require such a system. But personally, if you gave me 100 million dollars, I would try to build a self-constructing knowledge graph that also does entity disambiguation and conflict resolution within the knowledge world, and then build an LLM on top of that.
The reason I think that's relevant: let's say I have an LLM interface on top of my Confluence, and in Confluence I have two pages with conflicting information. If I ask "when was Napoleon born?"—or something really specific and company-internal, like which product has which features—how should it know which page is right? It would be interesting to bring both pieces of information into a knowledge graph and have some kind of transparent, explainable conflict resolution. So when I ask it, it can say "he was born here" but also tell me: "based on these other facts within my knowledge graph which I know to be true, I'm 70% confident he was born then, but I also have information from these sources which says something else." That would be a cool place to go.
Chris Featherstone: Most of the work I do is in telecommunications and communications service providers. We use this to look at all the communication nodes within a graph environment and then dynamically allocate scenarios around that information to drive optimal usage and generate RDF files—and also utilize generative models to look at architectural approval processes. It's starting to get interesting around how to represent the right information structure for systems that would normally be super difficult to query. Now it's giving everyone the ability to interact with and understand those very complex systems.
Alexander Schober: I totally agree. We have a bunch of knowledge graph projects, and what we used to do is go to the user, ask which questions they actually want to answer, and provide template queries where they could insert given parameters. But now it's become way easier: you can just use LLMs to generate the Cypher queries. The utility of a knowledge graph gets way, way bigger when it can answer more and more questions for people who don't know how to write those queries against the data that's already there. It's a real enabler.
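[Editor's note: the "many joins, many hops" questions that knowledge graphs handle well (see Alexander's quote above) are multi-hop traversals. Here is a toy in-memory stand-in for what a generated Cypher query would express; the entities and edges are invented, and a real deployment would run this against a graph database.]

```python
from collections import deque

# A toy knowledge graph as adjacency lists; in production this would live
# in a graph database and the LLM would emit Cypher instead of Python.
edges = {
    "supplier:Acme": ["part:bolt"],
    "part:bolt": ["assembly:frame"],
    "assembly:frame": ["product:bike"],
}

def reachable(graph, start):
    """All nodes reachable from `start` — a multi-hop traversal."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# "Which products depend on anything from supplier Acme?" — three hops,
# which in a relational store would be three joins.
products = {n for n in reachable(edges, "supplier:Acme") if n.startswith("product:")}
print(products)  # {'product:bike'}
```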
Seth Earley: Let's go back to building out a data pipeline. We talked a little bit about creating an enterprise-scale pipeline using metadata. Do you want to expand a little bit more on how that works? What is the role of metadata, and as you construct that enterprise-scale pipeline, how do you think about it?
Alexander Schober: Sure. Actually, that's something I recently tried out with some colleagues. We started from the point where I said: we see that data catalogs and pipelines are actually disjointed. Can we maybe generate dbt integrity checks based on the metadata in my data catalog? That worked—it's actually not a very difficult task. So then we thought: okay, can we go even further? How far can we go using metadata to generate pipeline code? And it turns out it works surprisingly well.
Large language models actually have the capacity to draft whole data pipelines. You just tell them: these are my sources, these are the semantics behind them—what do they mean? For example, take customer lifetime value: I have a raw customer table with these columns, and a raw sales table, and I want to create a customer lifetime value from this. The LLM can first propose a DAG—a directed acyclic graph—for how to come from your source to your target. And then, once you break it down far enough, it can propose transformations. If you interface well with your metadata catalog, it can also propose: this should be the data steward, these should be the governance attributes, this should be the privacy classification for the data in the steps in between.
This is actually quite interesting because in my talks over the last year, I heard over and over that data engineers often feel they don't have enough time to build data pipelines properly. What gets cut is the data quality checks, the metadata definition, the documentation. If we are able to generate the things that are usually cut, my hope would be that it will make sure people actually do it—and over time, this will increase metadata quality, data quality, and pipeline robustness within a company.
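[Editor's note: the customer-lifetime-value pipeline Alexander describes can be sketched as a small DAG of transformations plus a quality check derived from catalog metadata. All table names, columns, and the metadata entry are invented for illustration—this is the shape of what an LLM would draft, not any particular tool's output.]

```python
# Two raw source tables, as they might arrive from upstream systems.
raw_customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
raw_sales = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 1, "amount": 150.0},
    {"customer_id": 2, "amount": 40.0},
]

# A catalog entry the LLM could turn into a generated quality check.
metadata = {"sales.amount": {"check": "non_negative", "pii": False}}

def check_non_negative(rows, column):
    # Quality check generated from the catalog metadata above.
    assert all(r[column] >= 0 for r in rows), f"{column} must be non-negative"

def total_sales(sales):                           # DAG node 1: aggregate
    totals = {}
    for row in sales:
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0) + row["amount"]
    return totals

def customer_lifetime_value(customers, totals):   # DAG node 2: join to target
    return {c["name"]: totals.get(c["id"], 0.0) for c in customers}

check_non_negative(raw_sales, "amount")
clv = customer_lifetime_value(raw_customers, total_sales(raw_sales))
print(clv)  # {'Ada': 250.0, 'Lin': 40.0}
```

The check and the metadata entry are exactly the pieces Alexander says get cut under time pressure—which is why generating them is the interesting part.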
Chris Featherstone: What tools do you typically use—commercial, open source, what kinds of technologies?
Alexander Schober: It depends on where we are. Usually we go with whatever our customers already use. We have a structure we call fluid structures where we can select freelancers based on given projects, so we can adapt quite well. But for transformations, for example, dbt has now become the gold standard.
Seth Earley: One of the things we talked about was the importance of data contracts. Do you want to talk about what those entail, how they're used, and what's unique about governance in generative AI?
Alexander Schober: Sure. That's again something I came across on my journey to answer: what is holding us back from putting LLMs on top of our data? I already mentioned data availability and timeliness—and I think that's where data contracts come in.
One problem you find if you talk to people: if I make a change upstream without considering where my data is used downstream, I might break something. Combined with suboptimal documentation, it becomes a real pain. You break something, and it takes maybe two or three weeks to figure out where it actually broke. The idea of data contracts is: I have an agreement between producers and consumers of data about what my data looks like, what the quality attributes are, what the SLA is, and so on. That agreement is then enforced, so I can't make upstream changes that violate my data contract. I can prevent data pipelines from breaking. Chad Sanderson is the one pushing for these kinds of topics.
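[Editor's note: a data contract as Alexander describes it is an agreement that can be enforced mechanically. A minimal sketch, with an invented table and schema—real implementations typically hang this check into CI so a breaking upstream change fails before deployment rather than weeks later downstream.]

```python
# The agreed contract between the data producer and its consumers.
contract = {
    "table": "orders",
    "columns": {"id": "int", "customer": "str", "amount": "float"},
}

def violations(contract, proposed_columns):
    """Columns the proposed schema drops or retypes versus the contract."""
    problems = []
    for name, dtype in contract["columns"].items():
        if name not in proposed_columns:
            problems.append(f"dropped column: {name}")
        elif proposed_columns[name] != dtype:
            problems.append(f"retyped column: {name}")
    return problems

# Upstream wants to rename `customer` to `cust` — a silently breaking change.
proposed = {"id": "int", "cust": "str", "amount": "float"}
print(violations(contract, proposed))  # ['dropped column: customer']
```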
Chris Featherstone: You have an opinion on model drift and when things get stale. What should our listeners know and understand about that—what to look for, and the key aspects of making sure your models stay relevant and evolve over time?
Alexander Schober: Definitely. If we look at machine learning models, they are trained on a given distribution—a set of data. What happens if data changes over time? It might become a different distribution and the model won't perform that well. A heavily biased industrial example: you have some anomaly detection or predictive maintenance model which uses temperature as a feature. You train it in the summer and it works brilliantly. But then over the year it becomes colder in your factory, and in the winter the temperature drops significantly. Now the data going into your model is something different, and your model likely doesn't perform that well anymore because the input situation in which it's living has changed.
It's really important that you monitor how your data drifts, how your data changes over time, so that you understand at the right time that this is happening and can retrain your models. If you don't, they will just degrade over time and the predictions will suffer.
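[Editor's note: Alexander's summer/winter temperature example maps onto a simple drift monitor. This sketch uses invented readings and a basic mean-shift test measured in reference standard deviations; production monitors typically use richer statistics, but the principle is the same.]

```python
import statistics

# Hypothetical temperature feature: model trained in summer, served in winter.
training = [24.0, 25.5, 23.8, 26.1, 24.9]   # summer readings (invented)
production = [8.2, 7.5, 9.1, 6.8, 8.8]      # winter readings (invented)

def drifted(reference, current, threshold=3.0):
    """Flag drift when the current mean sits far outside the reference
    distribution, measured in reference standard deviations."""
    mu = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    shift = abs(statistics.mean(current) - mu) / sigma
    return shift > threshold

print(drifted(training, training))    # False — same distribution
print(drifted(training, production))  # True  — retrain before winter
```

Crucially, this check needs only the incoming feature stream, not labels—matching Alexander's later point that ground-truth accuracy is often unavailable in production.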
Chris Featherstone: You need to do a kind of predictive maintenance for your models in that sense as well. I've been talking about it as the health checks—the care and feeding of the intelligence of these things to stay relevant, whether it's accuracy you're looking for or even catching model poisoning from adverse data you didn't plan for.
Alexander Schober: Yeah, absolutely. And the challenge in some applications is that you don't actually get the real accuracy. In a production line where you want to sort out whether products are good or bad, you would need manual labeling in the end to figure out how well your model works, which kind of defeats the purpose. In those cases, monitoring the data stream coming in and figuring out how the distribution changes is a way to approach it.
Seth Earley: Can you talk about a recent challenge you solved for a customer—walk through what they were facing, what the steps were, the quick wins?
Alexander Schober: Okay, I'll pick one where people will think: I have that problem, I need help. It's ongoing and we haven't figured out all the pieces yet. But one thing I see, especially in German SMEs in the manufacturing industry, is the state of their master data for parts and materials. One challenge that resonated a lot: there was a new regulation, and a certain coating was no longer allowed. They needed to figure out which parts had that coating. For one of our customers, this sparked a whole data science project because this information was only in their technical drawings—not even in some kind of database. And even when it was in a database, you usually have massive data quality problems: people put in the wildest stuff for coatings, materials, weight of parts.
This seems to be a huge challenge across multiple companies. We looked at whether we could use machine learning to resolve some of these challenges—at least provide assistance with cleaning it up. And what we want to ultimately build is a single point of truth for all your part metadata and material metadata. That could be a really nice knowledge graph use case.
The extraction from technical drawings was the first problem—extracting information from technical drawings is not easy, there are so many edge cases. What we did initially was: hey, we need to extract this information right now. But then the next question becomes: how can we make sure that in the future we don't need these kinds of projects anymore—computer vision specialists trying to extract this information with OCR and pattern matching?
That's where it starts making sense to connect the systems where the data originates. If you have a CAD system where you actually produce this kind of data, you should put it directly into a knowledge graph instead of first into a PDF. You want to, on one side, clean up what you need to clean up from the past, and then come up with a system that ensures in the future everything will be accessible and in good quality.
Seth Earley: So much of our efforts around AI and machine learning are making up for our past sins in processes.
Chris Featherstone: You know, those scenarios you're discussing are very mutual across a lot of different industries. We see this in telecommunications—taking data and information, giving it some understanding with an LLM to derive what we're talking about and create more human-readable information for those who don't necessarily know where to go. Then also going into scenarios where they can update that data in a more cohesive way—creating a quality-based flywheel of bettering the knowledge base and updating service documents and technical documentation.
What AI is going to do is help fill critical labor gaps—it's not so much replacing jobs as filling roles where we don't have the people to do them, especially in manufacturing. With aging populations and labor shortages, we need the throughput but just don't have the people.
Alexander Schober: That makes so much sense. One of the biggest challenges we hear all the time is the skilled labor shortage. It's only getting worse. And then you have all these people retiring whose implicit knowledge is so valuable—you need to capture it.
Seth Earley: By the way, you're based in Europe. What do you do outside of work? What's your world like outside of knowledge graphs?
Alexander Schober: I really love to play basketball. That's my hobby.
Alexander Schober: I'm in Munich.
Chris Featherstone: Dirk Nowitzki, right? Come on, best of all time—one of your fellow countrymen.
Alexander Schober: Yeah! My trainer hated it because it never went in that well. But yeah, Munich is so close to the Alps and it's just beautiful.
Seth Earley: That's great. And what's on tap for you next year? Any new initiatives or continuing to refine what you're working on?
Alexander Schober: I'm really interested in these topics—especially how the different tools in the data space are disjointed. I think there's still some opportunities there. And cleaning up this mess of data for SMEs is also something I feel is really valuable. I want to understand better what the actual problems are and the ways we can tackle them in a scalable fashion. That's what's driving me—and that's why I'm talking to other people in the industry, to learn about these things.
Seth Earley: Your company is Motius—M-O-T-I-U-S—so we can find out more there. And you're on LinkedIn as Alexander-Schober. Those will be in the show notes as well. Thank you so much for your time today. It's been great to have you.
Alexander Schober: Thank you so much for having me. It was a pleasure. Really interesting conversation.
Seth Earley: Great, good to meet you. Thank you. Thanks to the audience—hope you learned something today and found something of interest. Carolyn and Liam in the background doing the production—appreciate it. And this has been another episode of the Earley AI Podcast. We'll see you all next time.
Chris Featherstone: Awesome, thanks Seth. Always a pleasure. See ya.
Alexander Schober: Bye bye.
