Why 30 Years of Data Problems Persist - and How Knowledge Graphs, Metadata, and Business Alignment Finally Fix Them
Guest: Juan Sequeda, Principal Scientist at data.world and Co-Host of the Catalog & Cocktails Podcast
Hosts: Seth Earley, CEO at Earley Information Science
Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce
Published on: February 13, 2023
In this episode, Seth Earley and Chris Featherstone speak with Juan Sequeda, Principal Scientist at data.world and Co-Host of the Catalog & Cocktails Podcast, whose career has been dedicated to making sense of inscrutable enterprise data through knowledge graphs and semantic technologies. Juan argues that the data management industry has failed to solve the same problems for 30 years because it has focused entirely on technology while ignoring people, process, and incentives. He makes the case for a shift from a data-first to a knowledge-first world, explains why your data catalog should be built on a knowledge graph, and shares a powerful real-world example from Prologis showing how tying data quality directly to employee bonuses drives genuine organizational change.
Key Takeaways:
- The data management industry has focused on technology for 30 years while ignoring people and process, which is why the same problems keep repeating generation after generation.
- Understanding who you report to - CFO, COO, or CEO - is the fastest way to align data work with the right business language and success metrics that actually matter to decision-makers.
- Knowledge graphs scale organizations by satisfying both known use cases today and unknown use cases tomorrow, unlike siloed query-based approaches that must be rebuilt each time.
- A data catalog should be built on a knowledge graph, starting with metadata first, because metadata provides the context that makes data meaningful before tackling richer ontologies.
- General AI models trained on public data cannot substitute for proprietary structured knowledge - when humans in the room disagree on meaning, no machine will produce a trustworthy answer.
- The Prologis case study shows that tying 20% of every employee's bonus to data quality created a cultural shift that pure technology investment never could have achieved.
- Moving from a data-first to a knowledge-first world means putting people, context, and relationships before raw data - and connecting all of it directly to how the organization makes and saves money.
Insightful Quotes:
"The problems that we've been trying to solve 30 years ago continue to be the same problems we're trying to solve today. Something is wrong, and if we don't acknowledge something is wrong - I'm sorry, you're part of the problem." - Juan Sequeda
"Figure out where you report to and speak that language. If you report to the CFO you're a cost center. If you report to the COO it's about productivity. If you report to the CEO it's about strategy and innovation. Know your audience." - Juan Sequeda
"The limits of my language are the limits of my world. If everything in your world is tables and SQL, you believe you can deal with everything using tables and SQL." - Juan Sequeda (quoting Wittgenstein)
Tune in to learn how Juan Sequeda's journey from the semantic web to data.world reveals a practical path for organizations ready to move beyond technology-first thinking - using knowledge graphs, metadata-driven data catalogs, and incentive design to finally solve the data problems that have persisted for three decades.
Links:
- Twitter: https://twitter.com/juansequeda
- LinkedIn: https://www.linkedin.com/in/juansequeda/
- Website: https://data.world/home/
- Juan’s Podcast: https://podcasts.apple.com/us/podcast/catalog-cocktails/id1524652737
- Juan’s Portfolio: https://www.juansequeda.com/
- Book: https://www.amazon.com/Integrating-Relational-Databases-Semantic-Studies/dp/1614996288
Ways to Tune In:
- Website: https://www.earley.com/earley-ai-podcast-home
- Apple Podcast: https://podcasts.apple.com/podcast/id1586654770
- Spotify: https://open.spotify.com/show/5nkcZvVYjHHj6wtBABqLbE?si=73cd5d5fc89f4781
- iHeart Radio: https://www.iheart.com/podcast/269-earley-ai-podcast-87108370/
- Stitcher: https://www.stitcher.com/show/earley-ai-podcast
- Amazon Music: https://music.amazon.com/podcasts/18524b67-09cf-433f-82db-07b6213ad3ba/earley-ai-podcast
- Buzzsprout: https://earleyai.buzzsprout.com/
Podcast Transcript: Knowledge Graphs, Business Alignment, and Why Incentives Are the Real Key to Data Quality
Transcript introduction
This transcript captures a conversation between Seth Earley, Chris Featherstone, and Juan Sequeda about the persistent failure of data management to solve the same problems decade after decade - and why the answer lies not in better technology but in aligning incentives, understanding business context, and making the shift from a data-first to a knowledge-first organizational mindset. Juan shares his journey from the semantic web to founding Capsenta to joining data.world, walks through his "pay as you go" methodology for building knowledge graphs, and closes with the remarkable Prologis case study where data quality was baked directly into employee bonuses.
Transcript
Seth Earley: Good morning, good afternoon, good evening, depending on your time zone. Welcome to today's podcast. I'm Seth Earley.
Chris Featherstone: And I'm Chris Featherstone.
Seth Earley: Our guest today has dedicated himself to reliably creating knowledge from inscrutable data. His research and industry work have focused on designing and building knowledge graphs for enterprise data and metadata management. When he's not writing books, papers, or serving on boards and committees, he's on the salsa dance floor. Please welcome the Principal Scientist at data.world, Juan Sequeda.
Juan Sequeda: Hello, Seth! Hello, Chris! Thank you very much. How are you guys doing?
Seth Earley: We're doing great. First of all, I want to congratulate you on the new baby.
Juan Sequeda: Thank you very much. Yeah, it's a new adventure - some new knowledge being produced. She's a girl, Sophia. Right now she's 2 months old. So much fun. The interesting part, though, is that babies don't come with knowledge graphs, so you can't do any governance or discovery. What's interesting is my wife and I are both scientists, so we are very diligent about all the data we're collecting on everything. We have a friend who gave us a book on experiments you can do on your baby. We're just having so much fun seeing how that knowledge is being created by our little one.
Seth Earley: Give us some background. Tell us about the world according to Juan - your background, how you got into the space, where you're from, and how your education evolved.
Juan Sequeda: If I go back to the start - I did my undergrad and then my PhD in computer science. I got interested early on, around 2004-2005, when I was exposed to this thing called the Semantic Web. I was living in Colombia - my parents are from Colombia, and I lived there from the ages of 10 to 20. A professor came and gave a seminar, and she was a professor in Madrid, introducing this whole concept of the semantic web. The example she used - you say "Paris Hilton," and what do you mean? Do you mean the person, or the hotel in the city of Paris? And which city? That was an eye-opener for me. I was 18 or 19 years old, and it was like - oh, there's this thing about semantics and meaning.
Very early on I got into describing knowledge and learning about ontologies using tools like Protégé back in 2004-2005. I ended up at UT Austin around 2006 and met a professor who became my adviser, Dan Miranker. He was learning about all the semantic web standards that were coming out, but he was a database person, and he was asking: what is the relationship between relational databases and all this semantic web stuff - the RDF, the graphs? That question - what's the relationship between relational databases and semantic web technologies - changed my life. That has been basically the quest I've been on.
The first talk I ever gave was in 2007, presenting our thoughts about integrating relational databases with semantic web technologies, at a W3C meeting at MIT. Tim Berners-Lee was in the front row. I got to meet him and many of the early pioneers in the semantic web space. I decided to stay and do my PhD at UT - my dissertation was literally titled "Integrating Relational Databases with the Semantic Web." A lot of the work we did was defining mappings of how to map relational data to RDF, and the standards that came out in 2012 are based heavily on work we did.
All of this was addressing data that I call inscrutable - incredibly complicated, and you can't tell what it means. If you look at a real enterprise database like an SAP or Oracle EBS system, you're talking about database schemas with thousands of tables and tens of thousands of columns with names that are completely inscrutable. There is meaning in there - but where is it, and how do I expose it?
We took our research and said - if all this semantic web stuff is going to take off, people will say "my data is in Oracle or SQL Server and it's really complicated, how do I put this together?" We had our approach and our technologies. We hypothesized that people would start knocking on our door, and around 2012 they really did. We started a company called Capsenta around 2014 to commercialize all of this - creating what we now call semantic layers, what is now being called the knowledge graph space. We were doing that work in oil and gas, pharma, finance, e-commerce - understanding this vision of creating your ontologies, creating your knowledge, and mapping that to inscrutable enterprise data. One of my customers was data.world. They were using our IP, and we realized we were on the same mission and vision. We joined forces, I joined data.world about 3 years ago, and I have kept working on the same vision. Here we are.
Chris Featherstone: Fast forwarding to now - now that you're in the middle of all these scenarios with customers, what do you find is the biggest barrier to entry?
Juan Sequeda: First, one of the things is incentives. It goes back to behaviors - what do people really want? To be very honest - which is the brand I handle - I think the market is pretty immature when it comes to knowing what they want. What does the market want? They need to go find data. That's why a data catalog exists. The majority of use cases are: I don't know what data I have, I need to go find my data.
Second, from the perspective of the data management world, we have focused so much on just technology. The problems that we've been trying to solve 30 years ago continue to be the same problems we're trying to solve today. 30 years have passed and the same problems remain. Something is wrong, and if we don't acknowledge something is wrong - I'm sorry, you're part of the problem. My observation is that the focus has always been just on technology. Our success metrics are just technology. We incentivize just for technology. What does success look like? I created a data warehouse, a data lake, whatever. That's how we've defined success, and that's how we get paid. But the real meaning of success is - was somebody in the business able to make money and save money because of the technology work you did? If you did all this work for technology's sake, but they still weren't able to solve their question to make money and save money, you were not successful. But we're not celebrating or incentivizing that - we're only celebrating the technology.
What has happened over the last 30-plus years is that we've only focused on technology. The shift that needs to occur is a sociotechnical paradigm shift - we need to look at data management not just from a technical perspective, but from a social perspective. People and process. As technologists, we don't like to talk about people and processes because it's complicated. Just give me my technology. And we're incentivized to focus on technology. That needs to change.
Chris Featherstone: You're allowing your tactical operational direction to dictate your strategic business objectives.
Juan Sequeda: It's going completely bottom up. The answer to most questions is - it depends. It's hybrid, bottom up and top down. For a lot of these things it's always technology-centric in terms of the thought process, as opposed to the business process and the business objective.
Seth Earley: That's the nature of emerging technologies. Executives don't understand it, so they pass it to their technical team, who look at it through a technology lens. We were recently with a client building ontologies and knowledge graphs, and I asked: what are the use cases? They said "we really don't care about the use cases." That's a red flag. What's your business case? "To use the ontology." That's not a business case. We've found one organization that spent half a million dollars on knowledge graphs with no outcome - it was a science experiment. I'd like to see real business value. Where are people seeing the low-hanging fruit in terms of business value?
Juan Sequeda: I think a good tactic I've seen is: understand where you report to. If you're on the tech side, and you report up to the CFO, you're considered a cost center. So your value comes from reducing cost. If you report up to the COO, that's about efficiency and productivity - speak that language. If you're in a digital transformation office, it's about migrating to the cloud as fast as possible. If you want to do data democratization and you report up to the CEO chain, then you're about strategy and innovation - how are we making more money with this data? So the answer to your question is: figure out where you report to and speak that language.
And another thing I want to add is something I'm now calling "business literacy." We've been talking about data literacy for so long - telling business people they need to be more data literate, which honestly sounds offensive. Those people know so much stuff that you as a data person don't know. How about you learn about those things? Learn how the marketing works. You need to understand how the organization that you work in makes money. Where do you pour money in? What does that flow through and how does it generate more money? What are the operational goals and strategic goals of the company? What are the KPIs of the department you report to? How do you know the work you're doing is contributing to that? Do you even know what the strategic goals of your company are? If that's unclear - go figure it out.
Seth Earley: That's a great point. A lot of times we do metrics programs, we talk about data at the lowest level, but data supports some process, that process has KPIs, that process supports a strategic initiative or business outcome, and that outcome supports the strategic direction of the organization. People doing day-to-day jobs need to understand that linkage just as much as executives. I do want to hear about your book - what caused you to write it, how long did it take, and who is it for?
Juan Sequeda: I wrote the book with Ora Lassila - one of the original authors of the Semantic Web article in Scientific American, and one of the authors of the original RDF specification in 1998. The book is about how you actually build knowledge graphs from relational databases - but it can be abstracted for any type of structured source. What I've always struggled with is seeing people get excited about knowledge graphs and then thinking, "I'll just turn that into a graph and it's all there." No - there's this whole issue of ontological knowledge, of understanding what the stuff means.
We also make very clear in the book that this is for any type of graph, not specific to RDF. The notation we use works for property graphs or RDF or whatever. And let me say quickly - the discussion of RDF versus property graphs is a complete waste of time. They are just technology discrepancies that are inevitably going to be merged together. It's already happening. In a couple of years that discussion will be moot. Stop wasting time on it.
My clear definition of a knowledge graph is: representing real-world concepts and the relationships between those real-world concepts, which happen to form a graph - nodes and edges. By the way, a really well-structured third normal form relational database is also a graph - every table is a concept and you have foreign keys as relationships. The reason the graph model is really valuable is because it's nimble, and you can now integrate data coming from so many diverse sources - relational tables, CSVs, spreadsheets, XML, JSON, unstructured text through NLP. A graph is just a great integration medium.
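The point that a well-normalized relational database is already a graph can be sketched in a few lines. This is a minimal illustration, not the mapping method from Juan's book or the W3C standards: the two tables, the column names, and the `rows_to_triples` helper are all hypothetical, and the node identifiers are plain strings rather than real IRIs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical rows from two tables in third normal form.
customers = [{"id": 1, "name": "Acme"}]
orders = [{"id": 10, "customer_id": 1, "total": 250.0}]


def rows_to_triples(table, rows, key, foreign_keys):
    """Turn each row into a node; foreign keys become edges to other nodes."""
    triples = []
    for row in rows:
        node = f"{table}/{row[key]}"
        # Every table is a concept, so each row is an instance of it.
        triples.append(Triple(node, "type", table))
        for column, value in row.items():
            if column == key:
                continue
            if column in foreign_keys:
                # A foreign key is a relationship to a node of another concept.
                target = f"{foreign_keys[column]}/{value}"
                triples.append(Triple(node, column, target))
            else:
                # Any other column is a plain attribute of the node.
                triples.append(Triple(node, column, str(value)))
    return triples


graph = rows_to_triples("customer", customers, "id", {})
graph += rows_to_triples("order", orders, "id", {"customer_id": "customer"})

for triple in graph:
    print(triple)
```

The edge from `order/10` to `customer/1` is the foreign key made explicit. What this sketch leaves out is exactly the part Juan stresses: deciding what the concepts and relationships should be called and what they mean.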
One of the issues is that people believe we're going to put tables away and move to graphs. No - the graph is just an intermediate layer. You'll have a graph view over your data, a tabular view, an API view, a GraphQL view. Whatever you want. The graph is an intermediate layer for integration. When you go talk to end users and ask them to describe their business, they naturally draw nodes and relationships on a whiteboard. That's a graph.
The book is all about how to design that semantic layer - the ontology - and how to design the mappings between that layer and your source data. We go through a bunch of mapping patterns because these are very repeatable. But importantly, it's also about the process - how to avoid boiling the ocean. We explain a "pay as you go" methodology where you start with the business questions you want to answer. You go through what we call the Knowledge Capture phase - understanding what words mean in those questions, who uses those terms, where that data lives. You generate a knowledge report. Then you implement it, create the mappings, generate the data in the form of a knowledge graph, deliver it in whatever form the consumer needs - table, API, whatever - and let them verify whether it actually answers the question. Success if yes, learn from it if no, and iterate. You start small and keep growing.
The reason this scales is not technical scale - it's social scale. I want to be able to deliver a body of data knowledge that satisfies the known use cases I'm given right now and also the unknown use cases coming tomorrow. That's the value. That's how you get resilience as well as efficiency.
Chris Featherstone: This is critical for an AI podcast - you can't have good AI without good information architecture. I'd love your perspective on how you advise organizations around data governance and discovery, especially as the knowledge graph evolves over time as new questions generate new answers.
Juan Sequeda: I came to a key realization when I joined data.world. Bryon Jacob, the CTO and co-founder, told me something that was an aha moment: the secret is that we're going to get the world into knowledge graphs through a data catalog. A data catalog is just a way to inventory and manage your data resources - it's fancy words for metadata management. And your first knowledge graph should be of your metadata.
I had been working on creating knowledge graphs at large scale - customer 360, patient 360, drug 360. And you run into a lot of friction because people say, "we've been doing 360 with MDM tools." Then the argument goes back to technology. But what we realized was that metadata is that first entry point to your data management.
If we go back over the history of data management, it has been focused just on data - ETL, put it in a warehouse. The missing element has always been metadata. And metadata gives you context. Understanding the semantics of metadata is actually easier - if you get people in the room, everybody is going to agree that a database consists of tables, tables have columns. The ontologies behind this are very straightforward and already exist as open standards like DCAT, Dublin Core, and PROV. So you start building your knowledge graph about your metadata, and people use that to search and find the data they have.
You also need to understand what people are searching for, what the most important data is, who agrees on what. Then you add a business glossary - just a list of terms - and you say, this table actually represents this concept, let me create that relationship. You start getting that muscle of understanding what data you have and what concepts people are using. Then the next thing that comes into play is connecting the people. Your catalog should not just be of tables and columns and dashboards and glossary terms - it should also include the people and team structures. Who stewards this database? If that person leaves, what department are they in so someone can pick it up? You start building the knowledge of your metadata, which is also a smaller amount of data. People get it. They say, "can you connect this thing too?" And suddenly your knowledge graph of metadata just starts growing.
That's the crawl phase - focus first on your metadata. That's why you start with the data catalog, and that's why your data catalog should be built on a knowledge graph.
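A rough sketch of what that metadata-first graph might look like, assuming a tiny in-memory list of triples. The predicate names (`hasTable`, `represents`, `stewardedBy`, and so on) and every identifier here are made up for illustration; a real catalog would more likely reuse vocabularies such as DCAT or PROV.

```python
# Hypothetical metadata graph: databases, tables, columns, glossary terms, people.
metadata = [
    ("db:sales", "hasTable", "table:cust_mstr"),
    ("table:cust_mstr", "hasColumn", "column:cust_mstr.c_nm"),
    ("table:cust_mstr", "represents", "term:Customer"),
    ("db:sales", "stewardedBy", "person:alice"),
    ("person:alice", "memberOf", "team:sales-ops"),
]


def objects(graph, subject, predicate):
    """Everything reachable from `subject` along `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def subjects(graph, predicate, obj):
    """Everything that points at `obj` along `predicate`."""
    return [s for s, p, o in graph if p == predicate and o == obj]


def find_tables_for_term(graph, term):
    """Search by business concept instead of by cryptic table name."""
    return subjects(graph, "represents", term)


def steward_contacts(graph, table):
    """Walk table -> database -> steward -> team."""
    contacts = []
    for database in subjects(graph, "hasTable", table):
        for person in objects(graph, database, "stewardedBy"):
            for team in objects(graph, person, "memberOf"):
                contacts.append((person, team))
    return contacts


print(find_tables_for_term(metadata, "term:Customer"))
print(steward_contacts(metadata, "table:cust_mstr"))
```

Adding a new kind of thing to the catalog, a dashboard for example, only means adding more triples. The two query helpers do not change, which is one way to read the claim that the metadata graph "just starts growing."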
Seth Earley: What do you say to organizations who are hearing from vendors that they don't need a knowledge graph - that their AI system does it all? I always find that a red flag.
Juan Sequeda: So let me self-promote something here. I host the Catalog & Cocktails podcast - the honest, no-BS, non-salesy data podcast. In the last season we had two really good AI episodes, one with Andrew Eye, CEO of ClosedLoop.ai, and one with Patrick Bangert, VP of AI at Samsung. What I'm going to say is informed a lot by what I learned from them.
Patrick said something that surprised me: autonomous driving is actually an easy problem, because it mostly just needs a lot of data - a lot of unstructured data, images and video. The knowledge you need to learn to drive a car is pretty small - the driver's manual is a thin book. So autonomous vehicles are essentially a high-skill, low-explicit-knowledge problem, and you can build that skill by looking at lots of images. That's fine. Those are AI applications where you don't necessarily need a rich knowledge graph.
But then you go to other areas - narrow problems in healthcare like radiology - where you've had humans doing enormous amounts of labeling and annotation. AI for detecting certain conditions in chest X-rays is now more accurate than doctors, precisely because it's so narrow. That's also fine. But once you start dealing with a lot of structured enterprise data where the knowledge surface is broad and humans in the room don't agree on what things mean - that's where you need a knowledge graph. If even the humans disagree, why would you believe a machine is going to come up with the right answer? And even if it gives you an answer, why would you trust it?
Everybody is having fun with ChatGPT and GPT right now. But once we translate that to real business use cases, we're going to realize we need governance behind that. Where does this stuff come from? That's where the knowledge comes in. You can have all these great AI models, but you haven't trained them on your data. The devil is in the details, and you need your own details - your own proprietary data. So just using whatever random open generic AI model is not the way to go for enterprise use cases where the knowledge is complex, contested, and specific.
Seth Earley: Even pure machine learning applications like vision systems detecting defective parts on an assembly line still need knowledge context - what does "defective" mean, what manufacturing line, what product, what suppliers, what processes, who is responsible for remediation? That is the knowledge context that needs to surround the AI observation. Any final big-picture takeaways?
Juan Sequeda: I want to share a story from one of our customers, Prologis - one of the world's largest warehouse logistics companies. If you buy anything on Amazon it'll probably go through a Prologis warehouse. They realized that when they would visit their warehouses, sometimes it was kind of empty - they could have been stacking more pallets in there and making more money. Sometimes it was packed and stuff was sitting outside. What's going on?
When they looked at this problem, somebody traced it all the way to a data quality issue. They didn't have accurate data around the height of their buildings. In a data-first world, I just see a table with some numbers in it. The column is called "length," the next column has units of feet. If you give that to an AI model, it has some number, but what does that number mean? You need someone to tell you: that column is the height of a warehouse. Once you know that, you can immediately infer that warehouse heights are between 20 and 50 feet. Numbers outside that range? Somebody probably typed something wrong. You can get even more precise - before the 1990s regulations, buildings could only go up to 30 feet, after that they could go to 50. So you can validate against that. A null value? Well, a building obviously has a height - let's go track that number down.
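Once the column is known to mean "warehouse height in feet," the checks Juan describes can be written down directly. The sketch below is only an illustration, not Prologis's actual rules: the function name and the status labels are invented, the 20, 30, and 50 foot figures come from the conversation, and the cutoff year of 1990 is an assumption standing in for "before the 1990s regulations."

```python
from typing import Optional

MIN_HEIGHT_FT = 20.0      # plausible lower bound mentioned in the conversation
MAX_HEIGHT_FT = 50.0      # plausible upper bound mentioned in the conversation
OLD_LIMIT_FT = 30.0       # cap for older buildings, per the conversation
REGULATION_YEAR = 1990    # assumption: exact cutoff year is not given


def check_height(height_ft: Optional[float], year_built: int) -> str:
    """Classify a recorded warehouse height using domain knowledge."""
    if height_ft is None:
        # A building obviously has a height, so a null is a gap to chase down.
        return "missing"
    if not MIN_HEIGHT_FT <= height_ft <= MAX_HEIGHT_FT:
        # Outside the plausible range: probably a typo or a unit mix-up.
        return "out_of_range"
    if year_built < REGULATION_YEAR and height_ft > OLD_LIMIT_FT:
        # In range overall, but too tall for a building of that age.
        return "exceeds_pre_regulation_limit"
    return "ok"


# Hypothetical records: (recorded height in feet, year built).
records = [(None, 2001), (320.0, 2005), (45.0, 1985), (28.0, 1985), (45.0, 2005)]
for height, year in records:
    print(height, year, check_height(height, year))
```

None of these rules can be derived from a bare column of numbers labelled "length." They only become possible after someone supplies the meaning, which is the data-first versus knowledge-first distinction in miniature.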
Why does this matter? Directly to business value: if you have the wrong number, you might leave money on the table because you miscalculate how much you can stack in a warehouse. That is the knowledge-first world - putting people, context, and relationships first. Context connects the data directly to the business.
But here's the real lesson. Their goal wasn't just to solve this one problem and make more money from correct height data. What they wanted was a cultural change. So you know what they did? They said 20% of everybody's bonus depends on data quality. Everyone - people working the warehouses with clipboards, all the way up. Twenty percent of your bonus money depends on the company improving its data quality as a whole. Their ultimate goal was not just to improve specific business processes. Their goal was to create a culture and an incentive of treating your data with care. True leaders are people who think about data that way.
This all goes back to Skinner and incentives. Are you being incentivized to be efficient right now, or to build something that's going to stand the test of time? Be honest with yourself as a technologist - people like to job-hop, so you're often not even incentivized to put in the extra effort for something that will survive beyond your tenure. That needs to change.
For 2023 and beyond - I will keep pushing on moving from a data-first world to a knowledge-first world. I don't even like the term "data catalog" anymore, because it should be about cataloging data and knowledge together. We need to see more examples of knowledge work connecting directly to the bottom line. Ask your executives and leadership team what keeps them up at night - catalog that, and realize that people are often thinking about the same things in different words. Is your team doing anything about those concerns? Is it actually aligned with your company's goals? That's understanding the business. That's understanding the knowledge. That's how we're going to tie all our data work directly to making money and saving money. Because we live in a capitalist world - and that's what we've got to do.
Seth Earley: That's fantastic. Well, that's all of our time today. It's really been a pleasure talking with you. I always enjoy it - great to catch up with you, and I especially love hearing your ideas about the integration of knowledge and data. Thank you again for today. Good luck with the new baby as well.
Juan Sequeda: Cheers everybody. Great - thanks!
Chris Featherstone: Thanks, Juan. Appreciate it.
