Guest: Kirk Marple, Technical Founder and CEO at Unstruk Data
Hosts: Seth Earley, CEO at Earley Information Science
Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce
Published on: May 22, 2023
In this episode, Seth Earley and Chris Featherstone speak with Kirk Marple, Technical Founder and CEO at Unstruk Data, about the foundational role of data organization in successful AI implementation. Kirk shares how organizations can move beyond treating unstructured data as a simple hard drive and start thinking of it as a structured, queryable dataset. They explore knowledge graphs, long-term memory for large language models, fine-tuning versus retrieval approaches, and the critical importance of data lineage and governance as AI adoption accelerates.
Insightful Quotes:
"It's a dataset, not just a hard drive. You have to think about it in a different way - the ability to organize that in a way that you could pull out the entities that are interesting, the people, places, and things." - Kirk Marple
"The real takeaway is it's about the data. Organize your data in a way that you can leverage the new models that come out and abstract yourself from that. Be pragmatic." - Kirk Marple
"Don't overcommit to any one thing. Be pragmatic, but also focus a bit on where this is heading, not where we are today." - Kirk Marple
Tune in to discover how treating your unstructured data as a structured dataset - not just a hard drive - can future-proof your organization's AI strategy and unlock the full value of large language models.
Podcast Transcript: Knowledge Graphs, Unstructured Data, and Long-Term Memory for AI
Transcript introduction
This transcript captures a conversation between Seth Earley, Chris Featherstone, and Kirk Marple about why data organization is the true foundation of enterprise AI success. Kirk explains how knowledge graphs apply to business applications beyond the semantic web, why long-term memory for LLMs depends on well-structured historical data, and why organizations should abstract their data pipelines from any single model to stay agile in a rapidly evolving landscape.
Transcript
Seth Earley: Good morning, good afternoon, good evening. Welcome to today's podcast. My name is Seth Earley.
Chris Featherstone: And I'm Chris Featherstone.
Seth Earley: And I am really excited about today's guest. He is a customer-focused technology leader with extensive expertise in cloud-based microservices, multimedia data ingestion, and computer vision integrations. He has a great deal of expertise in knowledge graphs and AI in general, with extensive experience in AWS and Azure services. Chief Executive Officer and Technical Founder at Unstruk Data, Kirk Marple. Welcome to the show.
Kirk Marple: Thank you so much. Glad to be here. Looking forward to the conversation.
Seth Earley: So what I wanted to start off with is what are the misconceptions that you run across in the industry around knowledge graphs and around AI? You can take that as two separate questions or you can combine them.
Kirk Marple: I think a lot of people, if they hear the word knowledge graph, they think of the semantic web - kind of what Google has proposed with their classic "Things Not Strings" article back in the day. And people think of it as maybe just a way to organize web data. But there really are applications for business and how to organize business-specific data for different verticals, in a knowledge management kind of approach. And I think that's one of the misconceptions - thinking it's just about the web.
And then for misconceptions about AI, there's classically so much emphasis on the idea that you have to build your own models, you have to have data scientists. There's been so much commoditization that a lot of people underestimate just how fast things are moving. Leverage AI for your business, but don't lean in so far that you overcommit to one specific ML model or one specific API, because commoditization happens very, very quickly in this market.
Seth Earley: Interesting. And what would you say the message to executives would be around AI, especially generative AI, which is all the rage these days?
Kirk Marple: Everything's changed so fast in the last six to nine months. There are capabilities - I've used different text analytics models and video analytics and audio analytics over the last several years, and the step change that's happened with OpenAI and large language models - it's easy to get overwhelmed. Like, can I even commit to anything, because it's going to change in three months?
But I think the real takeaway is it's about the data. Organize your data in a way that you can leverage the new models that come out and abstract yourself from that. Be pragmatic - that's really the big takeaway.
Chris Featherstone: Double-click on that a bit - in terms of how people miss with their data models. Where do they miss?
Kirk Marple: We had worked in some areas of the built world - construction engineering, railways, and things like that. And there's this classic tendency to treat unstructured data in the classic IT sense. Like, hey, I throw it in a SharePoint folder, and thinking about file naming and folder conventions.
The miss is really that you have to think about it in a different way. It's a dataset, not just a hard drive. The ability to organize that in a way that you could pull out the entities that are interesting - the people, places, and things. It's a bit of a phase shift for people to really think that way.
The thing I've really seen the last six or nine months is with large language models - being able to talk to your data, what we all think of as named entity recognition or entity extraction - people are now talking about this more. There's now this phase shift where people have kind of caught up to things that were behind the scenes the last couple of years - the last ten years, probably.
Seth Earley: So when you come back to organizing unstructured information and knowledge, a lot of folks think, well, the AI is going to do it. But obviously generative AI is a mechanism to generate content based on patterns that it has learned. So how does it fit in with knowledge management and organizing that knowledge?
Kirk Marple: What you're starting to see is there's a pattern that's developed over the last three to six months of long-term memory for large language models. The large language models - let's just say GPT-style models - are very useful, but because you can only give them so much context to answer from, it's very limited. So you have to have some structure to pull from - point it at a SharePoint directory or scrape a website and then be able to ask questions and have it generate from that.
OpenAI released their open source Retrieval plugin, and it's starting to be a pattern that everybody's following because it's a way to say, look, I can answer this. The models are very powerful, but where's my sort of history? So what I really see is the power is in organizing that historical data, organizing that long-term memory.
In a business sense - what if I have a lot of historical data sitting in SharePoint or SAP or something like that? How do I format that and organize that into a structure that the model can answer from?
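The long-term-memory pattern Kirk describes can be sketched in a few lines: chunk the historical data, retrieve the most relevant pieces, and pack them into the model's limited context window. This is an editorial illustration, not Unstruk's implementation - naive word overlap stands in for real embeddings, and all names are made up.

```python
# Sketch of the retrieval / "long-term memory" pattern:
# store document chunks, retrieve the best matches for a
# question, and assemble them into a prompt for an LLM.

def chunk(text: str, size: int = 50) -> list[str]:
    """Split text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query: str, passage: str) -> int:
    """Naive relevance: count of shared words (embeddings in real life)."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Put the top-scoring chunks into the context window."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return f"Answer from the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```

In practice the same shape holds whether the source is SharePoint, SAP, or a scraped website - only the ingestion and the relevance scoring change.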
Seth Earley: So if it's just crappy, messy, unstructured data with poor naming conventions and lack of tagging or inconsistency, you're saying that AI magic is not going to work on it?
Kirk Marple: Plus more. It's a great point. The things that folks like yourself have done for years in terms of canonicalization of the data, intent, providing taxonomies - it's only going to help. Right now the model is only as good as what it's been trained on. But can it differentiate a company name versus a street name? If they happen to be the same thing, that semantic value of organizing that data is super important.
Seth Earley: Yeah, I would agree. Go ahead, Chris.
Chris Featherstone: I was just going to ask about your perspective on data evolutions that happen - governance not only from the people side, but from the system and process side.
Kirk Marple: Data lineage is super important. You're going to have to track - hey, I'm pulling in data from an S3 bucket, I'm pulling it in from Dropbox or SharePoint - and be able to track that history. Especially now when you see some of these semantic search engines providing attribution of, hey, I found it in this chunk of data in this article. If you could go to the lineage and be like, okay, it's this article, but where did I grab it from? I haven't seen that commonplace - that a lot of people are thinking in that direction.
Chris Featherstone: I think you're spot on. Who's going to document the lineage? And put together all the attribution for it - so you come to it and you know exactly what the data is, where it's being used, who it's being used by. That lineage that contains all the audit trail, but also the data sources that make it up. We don't see that as a first-class citizen yet in designing up front, as opposed to, oh wait, we'll do that after the fact.
In fact, this week I was working with a customer and they actually utilized some generative AI techniques to drive the lineage of their stuff and generate all the metadata so that it was almost in natural language query form, but for the metadata.
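The lineage tracking Kirk and Chris are describing can be sketched as a provenance record attached to every ingested chunk, so a search hit can be traced back from article to source system. The field names here are invented for illustration, not any particular product's schema.

```python
# Sketch of first-class data lineage: every chunk carries a
# record of where it came from and when it was ingested.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    source_system: str   # e.g. "s3", "sharepoint", "dropbox"
    source_uri: str      # the original location of the content
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class Chunk:
    text: str
    lineage: LineageRecord  # attribution travels with the data
```

With this in place, the attribution Kirk mentions ("I found it in this chunk of this article") can be followed one step further, back to the bucket or folder it was pulled from.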
Kirk Marple: That's really interesting. Microsoft Purview kind of does that, but from the structured world. And that's another thing - there's a lot of tooling and data ops platforms that are so focused on the SQL world. My thesis is there's just a lack of that tooling for unstructured data. Where's the Purview for unstructured data? I think that's really where the next couple years are going to become pretty interesting - taking some of those patterns and applying them more to unstructured data.
Seth Earley: Talk a little bit about the role of a knowledge graph in some of these applications. The way I look at it is those tags are important for the embeddings because they're signals about the content. For example, if you need troubleshooting from an installation manual for some router equipment, you need to know the model, the make, the version, and what error codes - all of that is metadata that content needs to be tagged with so you can pick up those hints from what people are asking. Talk a little more about that and how knowledge graphs fit in.
Kirk Marple: There's the data extraction side - pulling out the entities, finding the people, places, things, products, and so on. But the other part that I don't think is talked about enough is data enrichment - the correlation of that data to the common data you have in other places, in SAP, in a Snowflake. Where are my SKUs for all my widgets and screws and wrenches in my organization?
You're not really giving the model enough context to know that this is actually a company and here's its website - because you may want to join on different properties. You may have it spelled differently, you may have somebody reference the URL versus the name. So putting all that metadata together - where it's a company, but that company is a logical entity that has a name, an address, a URL - you have to correlate all that data from different places.
I often say it's like the customer data platform world where you're trying to create a 360 view of a customer. It's a similar kind of concept, but for any entity.
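The correlation Kirk describes - merging mentions of the same entity that arrive with different properties, a name in one source and a URL in another - can be sketched as a simple join on whichever identity field matches. A hypothetical illustration; real entity resolution would add fuzzy matching and many more identity signals.

```python
# Sketch of entity correlation: fold raw mentions into logical
# entities by matching on name OR url, then merge properties.

def correlate(mentions: list[dict]) -> list[dict]:
    entities: list[dict] = []
    for m in mentions:
        match = next(
            (e for e in entities
             if (m.get("name") and m["name"] == e.get("name"))
             or (m.get("url") and m["url"] == e.get("url"))),
            None,
        )
        if match:
            # Merge in any non-empty properties from the new mention.
            match.update({k: v for k, v in m.items() if v})
        else:
            entities.append(dict(m))
    return entities
```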
Seth Earley: That's great. Can you talk a little bit about a specific solution or problem you solved for a customer recently?
Kirk Marple: One of the interesting use cases is actually in the podcast domain. There's a concept of being able to correlate all the deep information - even in this podcast, we're talking about companies, we're talking about topics. And the ability to correlate from different sources, not just the transcripts, but other web content, social media, and create that 360 view of a podcast.
That's super interesting to me because in a podcast, there's so much information that gets dropped on the floor. You can search transcripts, but how do you follow links? How do you kind of do it as a research tool? That's a passion of mine - how do you learn better from content and from that information? It's an area we're looking at that I think is super interesting.
Seth Earley: Well, when you get that figured out and deployed, let me know.
Kirk Marple: Yeah, over the next couple months we'll have that released and I think it's super exciting.
Chris Featherstone: I was looking at some different models and thinking about where these things break down. One thing that stood out to me was that if you're going to utilize a large language model to do something generative, you're not going to get much historical information past 2021 or so. That becomes a really rich opportunity for organizations that have deep historical information - it's more of an art, not a science, because there's not a one-size-fits-all model. So maybe you use a large language model from one of the core vendors, but you also take your own historical records and generate your own model off of that. What's your take on the balance?
Kirk Marple: The pattern right now is fine-tuning versus retrieval - or memory retrieval. Fine-tuning is expensive today. The idea would be you take a model and kind of wrap another layer of the onion skin around it of knowledge and fine-tune an existing model. I think that's going to get cheaper.
But the only really cost-effective model right now is to take an existing model and give it memory. Then there's the idea of verticalized models - we're seeing this now with things like taxGPT, a model specifically fine-tuned around a bunch of data for a specific vertical use case. I really do think those verticalized models will proliferate.
But the good thing is it's all just an API call at the end of the day. As long as whatever solutions you build can say, look, I can start with GPT-3.5 Turbo, use GPT-4 when I get access, maybe there's a new model like the Bloomberg model - you just want to abstract your data pipeline from that. AWS just announced Bedrock this week. If you can abstract the pattern of enrichment, the data storage, and the search side of it, it just gives you much more future-proofing.
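The abstraction Kirk recommends amounts to coding against a small interface so the underlying model is swappable. A minimal sketch - FakeModel is a deterministic stand-in where a real vendor adapter (OpenAI, Bedrock, etc.) would go; none of this reflects a specific SDK.

```python
# Sketch of future-proofing the pipeline: application code
# depends on a tiny interface, never on a specific vendor.

from typing import Protocol

class CompletionModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class FakeModel:
    """Stand-in for a vendor adapter (GPT-3.5, GPT-4, Bedrock...)."""
    def __init__(self, name: str):
        self.name = name
    def complete(self, prompt: str) -> str:
        return f"[{self.name}] answer to: {prompt}"

def answer(model: CompletionModel, prompt: str) -> str:
    # Enrichment, storage, and search would live around this call;
    # swapping models means swapping the adapter, nothing else.
    return model.complete(prompt)
```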
Seth Earley: I just wanted to remind our listeners we are speaking with Kirk Marple, Chief Executive Officer and Technical Founder at Unstruk Data. Do you want to talk a little bit more about what Unstruk Data does?
Kirk Marple: We call ourselves an unstructured data platform. We've been focused on taking anything from documents, images, video, 3D, and creating useful knowledge from that. We had been focused in the built world community where organizations have CAD drawings and 3D files as well as documents just sitting in SharePoint - creating essentially a semantic search and visualization tool for that.
But we've actually seen there's so much capability to build verticalized applications on a platform like this. We have a new product releasing later this quarter called Graphlit - a platform for anybody to build tools on top of this essentially graph data pipeline. It's super exciting - using large language models for data extraction, pulling entities out of the data, as well as semantic search and building prompted queries on that. Kind of like a Hasura or a Supabase where it's a backend as a service. We're super excited to open this up to anybody to build these capable apps on top of.
Seth Earley: Wonderful. And talk a little bit about the role of company-specific reference models or ontologies. Large language models have term relationships and concept relationships that are generalized, but when you look at an organization, maybe there's specialized terminology that is part of their IP or competitive advantage. How does that need to be considered and integrated?
Kirk Marple: We've started off with a layer following schema.org - their JSON-LD model where you get this wide range of common object model, essentially the taxonomy that covers a pretty large swath of the data sources we see - people, products, organizations, and things like that.
In terms of very bespoke areas like healthcare where there might be other object models, we haven't done a lot there yet. But you could essentially - I come from an object-oriented background - subclass different entity types and add your own properties. For us it's really about extracting to the few dozen common entities that are most common and then extending from there.
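Kirk's subclassing idea - start from a schema.org-style common entity and extend it with vertical-specific properties - might look like this. The Hospital type and its fields are hypothetical, purely to show the extension mechanism.

```python
# Sketch of extending a common entity taxonomy by subclassing.

from dataclasses import dataclass

@dataclass
class Organization:
    """Loosely mirrors schema.org/Organization."""
    name: str
    url: str = ""

@dataclass
class Hospital(Organization):
    """Hypothetical healthcare extension with bespoke properties."""
    bed_count: int = 0
```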
Chris Featherstone: What about these models and approaches that make you nervous - data poisoning, hallucination, those kinds of things?
Kirk Marple: I think one is just that it's a bit of a closed loop. If you think about what it's trained on - a dataset up to 2021 - it's not being fed a lot of new information. One concern I have is if a lot of new content is generated via these language models and gets fed back into the top of the funnel, you're not injecting a lot of new knowledge into the corpus of the dataset. Do we lose value in the models over time of not having fresh data?
The other concern is really around security. An insurance company isn't really going to be happy putting their data up on OpenAI servers. There's going to be a big push - almost a backlash - back to on-premise, or at least private cloud. That's something we've been anticipating with the idea of a backend as a service, making sure we can partition the data. The multi-tenancy side of it is a day-one thing for us. I don't see that in a lot of the tools out there today.
Chris Featherstone: Yeah, I love the perspective that if we don't feed it, we're going to reach the limits of what that data canon can look like. I wonder if it's going to end up like when you create a duplicate of a duplicate of a duplicate - you get something far from what it started as but super limited in its capabilities.
Kirk Marple: Exactly. If you think about generating marketing copy, generating web copy, and then it just gets fed back in - what's that going to end up being? I hope that new people continue to write new written words.
Seth Earley: Right. That's where human creativity comes in. And right now we're mining the treasure troves of human creativity. What else should organizations be concerned with besides data security?
Kirk Marple: I think cost is an important one. Because we're a usage-based platform, it's really easy to let these things go off the rails and spend a ton of money. Like any cloud service, you need to give visibility to how many tokens are being used, how much data you have. Hashing is a big thing - you don't have to call OpenAI for everything. Some of this is just Software Engineering 101 that people are catching up with.
Giving visibility as people build more complicated applications - it's really easy to spend a bunch of money on this stuff. Cloud costs at a granular level for application developers. You make a code change, how does this affect your costs? As we get more API-centric, people are going to have to be more aware.
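The hashing Kirk mentions is essentially response caching: key each request by a hash of the prompt so identical calls never hit the billable API twice. A minimal sketch with a stand-in for the real model call; the class and field names are illustrative.

```python
# Sketch of cost control via hashing: cache LLM responses keyed
# by a hash of the prompt, and count actual calls for visibility.

import hashlib

class CachedClient:
    def __init__(self, call_model):
        self.call_model = call_model   # stand-in for a billable API call
        self.cache: dict[str, str] = {}
        self.calls = 0                 # spend visibility, per Kirk's point

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.call_model(prompt)
        return self.cache[key]
```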
Chris Featherstone: Where do you often see pitfalls that folks can avoid? Everybody rushes to the new thing and thinks they've got to build every single thing in their organization on it, and then it just doesn't work out.
Kirk Marple: I think - be patient. The last couple months have been moving so fast that I've actually been doing less building to move as fast. I've been more doing architecting and thinking about where's the puck going to go?
I feel like we've seen this step change in speed to execution, but people are kind of losing the architectural vision a bit. The right direction is just to take a step back - where is this really going? What's the arc? That made me want to abstract more and say, look, a new model pops up every week, a new capability pops up every week. Don't overcommit to any one thing, but anticipate the concepts that are coming up - data retrieval, data organization, data lineage. Be pragmatic, but also focus a bit on where this is heading, not where we are today.
Chris Featherstone: There's definitely wisdom in that. It's how we've had to deal with the new new - always come out with it and then apply old patterns to make sense of it, centered around governance and guardrails and where it's going to make the most sense to build and generate the right pipelines to give access to those people, systems, and processes that need it, in a very systematic and auditable way.
It's an art form, not a science. It's not that only one or two models may work - test them all. Test variations. Some are fine-tunable, some are not.
Kirk Marple: That's true. We want to lean in more to the tooling layer around it - leave this in the hands of the app developer, not the end user. Is GPT-3.5 better than 4 in this situation? I was doing some stuff with addresses and realized how much better GPT is than other NER solutions - miles better - but that needs testing. You need testing tools and validation tools on a test dataset. It's not building a model - it's having a test set and being able to run prompts like a playground and then publish that to be used in a workflow. Prompt engineering from a software development side of things.
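The testing workflow Kirk describes - run a candidate prompt over a labeled test set before publishing it to a workflow - can be sketched as a tiny evaluation loop. The digit-checking lambda in the usage is a stand-in for an actual model call, and the template is made up.

```python
# Sketch of prompt evaluation: score a prompt/model combination
# against a labeled test set, like a playground with assertions.

def evaluate(model, prompt_template: str, test_set: list[tuple[str, str]]) -> float:
    """Fraction of test cases where the model's answer matches the label."""
    hits = sum(
        model(prompt_template.format(text=text)).strip() == expected
        for text, expected in test_set
    )
    return hits / len(test_set)
```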
Chris Featherstone: We still always get into the scenario where unfortunately people try to solution the problem before they actually figure out what the end results and outcomes need to be, and then work backwards. What semantic measure of accuracy do you need for that thing? Then go back and look at whether you have the foundations to support it. If not, you know where to start.
Kirk Marple: Yeah, in the address use case, we had an oil rights company doing legal documents for oil rights. You then need to do data validation - show it on a map. Did it extract things in Europe that you shouldn't be seeing? Visual testing and visualization - validating the interconnections of what you extracted are really super interesting too.
Seth Earley: This morning I read an article about how a large language model was given a problem around chemical synthesis and was connected to a cloud lab - physical equipment - and did the research, learned what the chemical synthesis should be, and then performed the steps. The researchers had looked at what happens if they ask it to synthesize illicit substances like methamphetamine - it refused. But there were ways of getting around it by referring to chemical derivatives or compound A. It really does open a very scary chapter when you start thinking about how these things can interact with the physical world.
Kirk Marple: I've seen that come up recently a lot. I'm a bit of a skeptic on it because there still has to be a human in that loop. You're not going to get AI access to the power grid, nuclear weapons, or chemical weapons without some physical element in the real world. But I get that they can run off the rails without proper guardrails.
There was an interesting project - this Baby AGI project - it's essentially GPT in a loop, a planning execution thing. It reminds me of Conway's Game of Life or The Sims. There was actually a really cool paper from Stanford - almost like a Sims world driven by large language models. They're injecting this concept into simulation technology we've had for years. Maybe I'm being too skeptical, but it's interesting to think where this goes.
Chris Featherstone: So maybe the warning sign should be if you hear somebody call their project WOPR and you hear it asking if you want to play a game, run like hell.
Kirk Marple: I know. It's like we've all been seeing the bad movies around these things. But somebody still has to connect the dots for these things to go off the rails, and hopefully we're smart enough not to do some of those things.
Seth Earley: Yeah, well, it is interesting, especially when people talk about a hint of AGI - artificial general intelligence. These things are incredibly powerful. When I use ChatGPT to do some of my research, it's just astounding - the level of detail, even when it explains itself, it's mind-boggling.
Chris Featherstone: Part of this is we have a very short memory. When e-commerce sites came out and got hacked, that was a good lesson - we need secure sockets. Now we have some nefarious stuff going on with prompt injections and phishing. Those guardrails - we're probably going to have to see some failures first before we see them get put into place. Any thoughts on those pieces?
Kirk Marple: It hasn't come up specifically yet for us. It's more of a data quality question. But there are some open source projects now, like one called Guardrails, that kind of does a prompt and then checks the response against a conformance pattern. Because the LLMs are non-deterministic - you could ask it the same thing twice wanting to return JSON, and one time it'll do it, and one time it'll give you a little paragraph after it.
I think some of these are kind of hacks to get around the weirdness of language models today. That'll just get cleaned up when OpenAI releases one that's a little more predictable for developers. But from a conceptual rule-based engine - hey, go double-check what the answer is - that'll happen and it's going to be needed. The biggest thing for me is that people are used to programming things where you put in A, you get B, and the next time you always get B. With these models that's no longer guaranteed - it's a different programming paradigm, a fuzzy logic kind of thing that people are just starting to get their heads around.
Chris Featherstone: Yeah, where did binary go?
Seth Earley: I just want to ask you a more personal question. Tell us more about you. What do you do for fun and where are you located?
Kirk Marple: I have kids now - they're all in their 20s, so a lot more time on my hands. One of them's in grad school in the UK and a couple are in Seattle. I always wanted to be a chef actually - I was thinking I was going to go to culinary school. So cooking is my big hobby other than watching Seattle sports. Happy that the Kraken got in the playoffs for the first time. Hockey and baseball are my two big things. Other than just working on this and always trying to learn and research - those are kind of my passions.
Seth Earley: And where can people find you? Your website is unstruk.com?
Kirk Marple: Yep. And LinkedIn's great. On Twitter a little bit, but LinkedIn - anybody wants to hit me up, that's usually the easiest place to get me.
Seth Earley: Great. Well, listen, thank you so much for your time today.
Kirk Marple: Oh, thank you. This was a fun conversation.
Seth Earley: Thank you folks for listening in. If you learned something today and found this interesting, let someone know about the podcast. And again, thank you, Kirk. I really appreciate it.
Kirk Marple: Yeah, I appreciate the opportunity. Exciting stuff.
Chris Featherstone: Good luck to you, my friend, in the future. Hopefully our paths cross again soon.
Kirk Marple: Yeah, definitely.
Seth Earley: This has been another episode of the Earley AI Podcast, and we will see you all next time.
Chris Featherstone: Thanks, Seth. Thanks, guys. Take care.