Earley AI Podcast - Episode 9: Practical and Scalable Knowledge Graphs with Sean Martin

From Olympic Scoring Systems to Enterprise Data Fabrics - Why Knowledge Graphs Are Finally Ready for the Mainstream

Guest: Sean Martin, Founder and CTO, Cambridge Semantics; Co-Author, "The Rise of the Knowledge Graph"

Hosts: Seth Earley, CEO at Earley Information Science; Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce

Published on: February 22, 2022


In this episode, Seth Earley and Chris Featherstone speak with Sean Martin, founder and CTO of Cambridge Semantics and co-author of the O'Reilly e-book "The Rise of the Knowledge Graph." Sean traces a career that began building IBM's first online sports scoring systems for Wimbledon and the 1996 Atlanta Olympics, moved through early semantic web research at MIT, and led to founding Cambridge Semantics in 2007 to solve the one problem that had frustrated him for two decades: making knowledge graph technology actually scale. He explains what an ontology is and why knowledge graphs cannot work without one, walks through real production deployments in financial services surveillance, FDA drug data integration, and pharma adverse event processing, and makes the case that the technology - long restricted to the most sophisticated Fortune 500 and government environments - is finally democratizing to reach a much wider audience.

 

Key Takeaways:

  • Knowledge graphs solve the data integration problem that relational models cannot - when you have too many entity types, complex subclass relationships, and constantly evolving data, forcing everything into rigid relational schemas produces dumb, limited models that cannot represent real-world complexity.
  • An ontology is an abstract, standards-based schema for knowledge representation - the difference between a relational schema and an ontology is that an ontology operates at the abstraction level of business knowledge rather than storage artifacts, making data self-describing and portable across systems and communities without transformation.
  • The full knowledge graph stack had to be built from scratch because no production-ready software existed - Cambridge Semantics built their own database, middleware, BI tools, and ETL tools over eight major iterations because research-grade tools available as recently as 2015 could not handle enterprise-scale data volumes.
  • Large language models and related word models are rapidly approaching the point where they can populate ontological structures directly from text, which will dramatically reduce the cost of extracting structured knowledge from the roughly 80 percent of enterprise data that remains unstructured.
  • Natural language query and automated ontology generation from text are the two most exciting near-term frontiers - both draw on similar underlying techniques, and both will make knowledge graphs significantly more accessible to non-technical business users who do not want to write SPARQL.
  • The data fabric vision is becoming a practical reality - Cambridge Semantics now runs clusters handling tens to hundreds of billions of facts in production, ingesting data at three to four million triples per second per node, enabling organizations to get a connected view across all enterprise data for the first time.
  • The right way to get started with knowledge graphs is to pick a small, high-value use case relevant to the business, demonstrate the principles cleanly, and let success breed success - trying to build the complete data fabric from the start is how projects fail to deliver anything.

 

Insightful Quotes:

"I've actually had to build the entire stack - me and my friends, not just me. We built the database, the middleware, the BI tools, the ETL tools - every element of the stack - because there wasn't anything out there that actually anticipated flexible knowledge graph data structures and knowledge representations at every layer." - Sean Martin

"The knowledge graph is the ontology with the data. To make a decent knowledge graph you need an ontology to describe the business data you're going to find in there - and ideally that ontology speaks to the domain in ways that the people who will consume the data actually understand, in plain business English." - Sean Martin

"Data is the new oil - and the big speed hump in the way of machine learning is getting integrated data into training. Once you solve those issues and make it significantly cheaper, and the tools get that much better, we're talking about a complete change in the game for sophisticated data applications." - Sean Martin

Tune in to hear Sean Martin explain why IBM told him JavaScript had no future (while 25,000 internal users were running his JavaScript productivity apps), why the 1996 Atlanta Olympics online scoring system still gives him nightmares, how a neurosurgeon at Mass General Hospital looking for help with a cancer computing model repository launched him into RDF and the semantic web, and why his advice to any organization wanting to get started with knowledge graphs is simply: start now, start small, and let your first success pay for everything that comes next.

 

Contact Sean:

Get the book: The Rise of the Knowledge Graph

Thanks to our sponsors: CMSWire, the Marketing AI Institute, and Earley Information Science.

 

Podcast Transcript: Practical and Scalable Knowledge Graphs - From the Olympic Scoring System to the Enterprise Data Fabric

Transcript introduction

This transcript captures a conversation between Seth Earley, Chris Featherstone, and Sean Martin about the twenty-year journey to make knowledge graph technology production-ready at enterprise scale. Sean traces his path from IBM's first online sports scoring systems through early semantic web research to founding Cambridge Semantics, explains the technical and commercial realities of where knowledge graphs stand today, and offers concrete guidance on how organizations can start deriving value without trying to boil the ocean.

Transcript

Seth Earley: Welcome to today's podcast. I'm Seth Earley.

Chris Featherstone: And I'm Chris Featherstone.

Seth Earley: Before we get started I do want to thank our sponsors - CMSWire, the Marketing AI Institute, and of course Earley Information Science, the company that I founded. Today our guest is an award-winning technologist who has been on the leading edge of Internet technology innovation since the early 90s. He's a super smart guy who has written a number of publications, holds patents, and is co-author of an e-book called "The Rise of the Knowledge Graph," which we'll be talking about today. Please welcome Cambridge Semantics founder and CTO Sean Martin.

Sean Martin: Hi everybody. Very happy to be here and thank you both for having me. Really good to be talking to you.

Seth Earley: Terrific. Sean, why don't you give us a little bit about your background. I noticed you're based in Boston but you don't have a Boston accent.

Sean Martin: I have a pretty mixed-up accent. I was born in Johannesburg, so I'm originally from South Africa. I grew up on the eastern side - if you think about Africa, the bottom right-hand side on the east coast - in a city called Port Elizabeth, which is really my hometown. But I left there when I was 17 and moved to the UK, where I went to university up in Edinburgh in Scotland. That's where part of the accent comes from.

While studying I had an internship with IBM, and after completing my degree I joined IBM and ended up working for them for about 15 years. The first couple of years were in southern England, supporting what was a brand new operating system and PowerPC chip technology. I learned to be a comms engineer. But I kind of got bored of that after a year or so and moved to London, where I spent a year working with a very interesting group called the Early Support Group. What made it interesting is that it had people from all the research labs on assignments in one office, essentially supporting all the new products coming out of the labs. So I got exposed to all of IBM's labs and products in one place, which was really unusual.

Then came the dawn of the Internet. I built some early Internet applications for IBM, was really on the leading edge of that, and that got me invited to go to the US. I was kind of grabbed - you might say kidnapped - by a guy who had figured out that Sun was stealing a march on IBM for the Olympics. He asked me to build one of the first web applications and content management systems for IBM Europe, and they thought: why not put the Olympics on the Internet for Atlanta, coming up in 1996? So this was in 1995.

An agreement was reached where I would go to White Plains, New York, where they had a data center with some SP/1 parallel computers I was pretty familiar with. The deal with IBM UK was that if we did Wimbledon in 1995, I could stay for the Olympics in 1996. And so not only did we do Wimbledon that year - that was the first online sports scoring system ever - but then we did the US Open. In fact we did the entire Grand Slam in both golf and tennis. We'd send people with two trailers to each stadium, hook into the scoring systems, and put it all online. All of that was a warm-up to the 1996 Atlanta Olympics.

Chris Featherstone: And after the Olympics?

Sean Martin: It was a hell of an experience - six months with literally not a day off. My sleep really suffered. After that I was done with big-scale scoring systems and websites. I moved to Boston. That team that had put the Olympics and all those systems together really was the beginning of IBM e-business.

I continued to build another team. IBM had bought Lotus, so someone gave me an office and drafted me into the Lotus payroll with a zero or one dollar salary so I looked like a Lotus employee, and I built a new skunk works team there. That lasted about ten or twelve years.

One big thing we did was start the IBM Extreme Blue intern program, which I was amazed to find is still running 22 years later. It was kicked off in Lotus Cambridge around 1999 or 2000.

We had also been working on technology for building applications using JavaScript that could run on Windows, Linux, and very early smartphones - a bit like Adobe AIR. We'd built a really elaborate infrastructure that had something like 25,000 users internally within IBM for productivity apps. But IBM, I was told, didn't think JavaScript had much of a future, wasn't really a serious language. So eventually I had to give that up. And that's around the time I got into semantics, around 2000.

Seth Earley: When you say you got into semantics, what did that mean back then?

Sean Martin: We had been working on this JavaScript runtime infrastructure, and what we really wanted to do was allow different apps to send messages to each other without having to hard-code them to work together. At that time, the way you exchanged data was XML schema - but that was too brittle. We were looking for a way to bring new apps online that could hook into a data bus and pick out messages that made sense to them without having to pre-agree on everything.

Around the same time I got dragged into a really interesting project over at Mass General Hospital, where a neurosurgeon involved with the National Cancer Institute was looking for help building a cancer computing model repository. As we began to understand his problem it was the same problem: how do we share and integrate data, particularly when things are moving very fast? There were maybe fifteen centers around the world all doing different kinds of cancer modeling but in total isolation.

One of the guys working with me headed back to MIT to finish his PhD - he was at the AI Lab, right next to CSAIL, just down the road from where Lotus was. He came back for a visit and said: "Hey, I found this interchange format called RDF, it's a W3C standard, could we use that?" And that was the beginning of my semantic journey, which I'm still on 20-something years later.

Chris Featherstone: I remember those days of putting in a semantic layer when the computing power wasn't able to handle it. We had to have a read layer or a write layer that apps could read from fast enough, because the systems just weren't architected to handle that kind of workload.

Sean Martin: You're describing my last 20 years. That's exactly it. We adopted a graph model, and back then - we're talking 2001, when we were looking at Jena hooked up to IBM DB2, graph on top of relational - it was absolutely awful. You'd get to a million facts and you were done. That doesn't get you very far in terms of practical use of data. For the last two decades we've been figuring out how to do this at scales where it's actually meaningful.

Moore's Law, parallel computing, and very fast networking - the whole arc of underlying technology - have been working in our favor. But of course there also had to be software. It's not just hardware - you need software that is far more flexible than the rigid ways we were doing things previously, with APIs and very inflexible structures.

You just don't know what data is going to look like. If you did, it would be easy to hard-code. But if it keeps changing on you, how do you build IT systems flexible enough to accommodate that without breaking? You're adding at least one layer of indirection, probably more if you're doing knowledge representation, and suddenly you're paying a terrible price in terms of performance.

Sean Martin: Anyone who tried to use knowledge graph technology and semantics circa 2001 to 2015 probably has the facial tics to show for it. A ridiculous number of projects failed, very expensively. It seemed like a really good idea, but when you actually tackled it you found the software was research-level quality and fundamentally didn't scale, and in every part of it you ended up doing it yourself. That's my 20 years: I actually had to build the entire stack - me and my friends, not just me. We built the database, the middleware, the BI tools, the ETL tools - every element of the stack - because there wasn't anything out there that actually anticipated flexible knowledge graph data structures at every layer. So we had to start from scratch. It's taken many iterations - I think we're at major iteration eight.

Seth Earley: When did Cambridge Semantics come about?

Sean Martin: Around 2007, IBM was doing a lot of cutting, and nobody could really understand what our group was doing. It looked like we were going to be absorbed back into the product groups. A bunch of us decided we were too interested in the technology and its potential, so we left and I founded Cambridge Semantics. That's been the vehicle for this technology, which we now believe has solved all the significant scaling issues and which we're selling successfully.

Seth Earley: Tell us a little bit about the business problems that Cambridge Semantics solves.

Sean Martin: We're a little unique in that we use knowledge graphs to solve the nuts and bolts of data integration. We're not just an endpoint for data that's already been integrated and then loaded into a graph to run graph algorithms - we actually use the graph itself to integrate the data. We have a data engine we call AnzoGraph. It's an MPP data warehouse. The same people who wrote it wrote ParAccel, which is the same code base that's in Amazon Redshift - but that's relational, of course. An MPP system is very different in terms of performance characteristics from regular transactional databases - it's designed for analytical workloads.

So the business problems we're solving are all around ingesting data as rapidly as possible into a big parallel system that runs as a single logical database across many servers, all doing the same thing simultaneously. We're doing something like three to four million triples per second per server node, which is really quite extraordinary.

We then allow users to run transformational queries - just as you do with PL/SQL in traditional data warehouse integration - but written as graph queries, to clean up and knit together data into models usable for higher-level use cases.
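To make that pattern concrete, here is a minimal sketch of a graph-to-graph transformational query - a SPARQL CONSTRUCT filling the role PL/SQL plays in a relational warehouse. It uses the open-source rdflib Python library rather than Cambridge Semantics' own stack, and every URI and property name is illustrative:

```python
from rdflib import Graph

# Two raw feeds describing the same drug with different vocabularies
# (all URIs are hypothetical).
raw = Graph()
raw.parse(data="""
@prefix feedA: <http://example.com/feedA/> .
@prefix feedB: <http://example.com/feedB/> .

feedA:drug42 feedA:label "Examplinib" .
feedB:drug42 feedB:adverseEventCount 25 .
""", format="turtle")

# The CONSTRUCT query knits the raw records into one canonical shape,
# the way PL/SQL transformations knit staging tables into a warehouse.
results = raw.query("""
PREFIX feedA: <http://example.com/feedA/>
PREFIX feedB: <http://example.com/feedB/>
PREFIX ex:    <http://example.com/canonical/>

CONSTRUCT {
  ?a a ex:Drug ;
     ex:name ?name ;
     ex:adverseEventCount ?count .
}
WHERE {
  ?a feedA:label ?name .
  ?b feedB:adverseEventCount ?count .
  # Toy join condition: the feeds share a local identifier.
  FILTER(STRAFTER(STR(?a), "feedA/") = STRAFTER(STR(?b), "feedB/"))
}
""")

canonical = Graph()
for triple in results:
    canonical.add(triple)
print(canonical.serialize(format="turtle"))
```

The output graph carries the knitted-together, canonical view that the higher-level use cases then query.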

Some examples: monitoring for insider trading, where you've got watch lists, seating data, call logs, emails, and all sorts of disparate data, and you're building a comprehensive view an analyst can use to make sure nothing naughty is happening. Or at the FDA, where they have a very large data lake for clinical trial data coming in from many departments that are really very siloed - creating a single view of all aspects of a drug means pulling in data from those different departments. Another example is in pharma - compound adverse event reports. When something goes wrong with a drug on the market, big pharma companies are on the hook to report to the FDA very quickly or face huge fines. If you have a popular drug you could have a quarter of a million of these reports coming in per year that you have to sift through and triage - understanding what's medically relevant, looking at what other drugs were taken at the same time, the patient's medical history. That's really quite complex data. And then there's automotive - keeping track of which production lines can make which parts across many factories, or integrating data from multiple PLM systems.

Really it's any situation where you have data integration problems that won't fit into a nice simple relational model, where you have too many entity types - what we call an entity explosion - and complex relationships between entities, including subclass relationships. Knowledge graphs handle that very well. And the more you can model without limit, the more you can work at an abstraction level that takes you away from the low-level artifacts of indices and keys that just get in the way of human understanding of data.
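As a small illustration of why subclass relationships are natural in a graph but awkward in relational schemas, here is a sketch (again using rdflib, with made-up classes): a single SPARQL property path walks an entire class hierarchy that would otherwise need self-joins or type columns.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.com/")
g = Graph()

# A small class hierarchy - the subclass structure that forces
# awkward table designs in a relational model.
g.add((EX.Bond, RDFS.subClassOf, EX.FinancialInstrument))
g.add((EX.Equity, RDFS.subClassOf, EX.FinancialInstrument))
g.add((EX.ConvertibleBond, RDFS.subClassOf, EX.Bond))

g.add((EX.trade1, RDF.type, EX.ConvertibleBond))
g.add((EX.trade2, RDF.type, EX.Equity))

# One query covers the whole hierarchy: rdfs:subClassOf* walks
# the subclass chain to any depth.
rows = g.query("""
PREFIX ex:   <http://example.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing WHERE {
  ?thing a ?cls .
  ?cls rdfs:subClassOf* ex:FinancialInstrument .
}
""")
for (thing,) in rows:
    print(thing)  # both trade1 and trade2 match
```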

Chris Featherstone: Something like 80 percent of data is unstructured. How do you handle that side of the equation?

Sean Martin: That's exactly the problem. What's really exciting is that on the forefront of machine learning, the big large language models and word models are starting to get close to the point where they can populate ontological structures - knowledge representation structures - directly from text. That's one side of the coin. The other side is natural language query. There are many similarities between those two problems, and both are getting much better.

Currently we use what we call annotators, which can be as crude as a regular expression, a statistical model, or a dictionary lookup. Annotators tend to be relatively inflexible and of varying degrees of accuracy, and they're very specialized. Previously building these things right was very expensive. But what people like NVIDIA, Google, and AWS are putting out in terms of large models will completely change the game in terms of making it possible to decide what you want to extract from text - whether that's company reports, drug labels, anything - and populate those structures fairly directly.

On the analytics side for unstructured data, currently we can do things like sentiment, finding occurrences of products in the literature, phone numbers, email addresses - relatively crude ways of identifying people and things. You can create nice knowledge graph meshes that tie documents to data sets so you can see the relationships, but it's not quite the same as literally extracting structured data from text. I think we're going there faster and faster.
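Here is a toy version of the "crude annotator" idea Sean describes - a regular expression whose hits are lifted into graph triples. The property names are invented, and a production annotator would more likely be a statistical model or dictionary lookup than a regex:

```python
import re
from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.com/annotations/")
g = Graph()

# A deliberately crude annotator: a regular expression that spots
# email addresses in free text.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def annotate(doc_uri: str, text: str) -> None:
    """Attach each regex hit to the source document as a triple."""
    doc = URIRef(doc_uri)
    for match in EMAIL.finditer(text):
        g.add((doc, EX.mentionsEmail, Literal(match.group())))

annotate("http://example.com/docs/report-7",
         "Contact jane.doe@example.com or sales@example.org.")

print(g.serialize(format="turtle"))
```

This is the "knowledge graph mesh" pattern in miniature: documents tied to data so the relationships become visible, short of true structured extraction.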

Seth Earley: Sean, tell us about the book.

Sean Martin: It's very short - 79 or 80 pages. It's a good read if you're interested in getting into knowledge graphs. It breaks down what a graph looks like, what's nice about graphs, and what knowledge is and how we represent it - we tried to do that as quickly as possible in something short and sweet. It's designed to give people who are knowledge-graph curious a chance to get their feet wet with something very accessible - lots of diagrams and simple examples, and then use cases - so they can not only educate themselves but talk to their colleagues. That's the intention.

Seth Earley: Let's talk about ontology - give your definition and explain how it relates to knowledge graphs for our audience.

Sean Martin: An ontology is a way of describing knowledge. If a knowledge graph is a graph of connected data - business data, enterprise data, data that's used by some community - you need a way of describing what's in there.

In the relational world you'd have a schema, but a schema describes the storage artifacts as well as the model. An ontology is a more abstract schema for representing knowledge. Ideally it uses a standard so you can pass this data representation around to other parts of your community - or other communities - and establish a shared understanding of what you expect to find in the data. It's really a template for defining what's in your data, described at an abstraction level that is genuinely knowledge.

To make a decent knowledge graph you need an ontology to describe the business data you're going to find there. Ideally that ontology speaks to the domain in ways that the people who will consume the data understand - in plain business English. The advantage of using a standard is you get to use lots of other people's software. Because they can all read the same standard, you can move this data easily out of wherever it started to lots of different systems without transformation.

One wonderful thing about this is that the ontology can accompany the data and describe it, which kind of future-proofs your data for when the application that created it no longer exists. It makes the data self-describing so software can come along later, read that data in, and start doing useful things with it.

I'd put it simply: the knowledge graph is the ontology with the data.
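A compact sketch of what "the ontology with the data" can look like, assuming the standard RDF/OWL serializations Sean refers to (all URIs are examples): the class and property definitions travel in the same graph as the instance data, so software that arrives later can ask the data what it contains.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex:   <http://example.com/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# --- the ontology: business terms in plain English ---
ex:Drug a owl:Class ;
    rdfs:label "Drug" ;
    rdfs:comment "A pharmaceutical product on the market." .
ex:hasAdverseEvent a owl:ObjectProperty ;
    rdfs:label "has adverse event" .

# --- the data, described by the ontology above ---
ex:examplinib a ex:Drug ;
    ex:hasAdverseEvent ex:event123 .
""", format="turtle")

# Software that has never seen this data before can still ask it
# what it contains, because the description travels with it.
rows = g.query("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?term ?label WHERE { ?term rdfs:label ?label . }
""")
for term, label in rows:
    print(f"{term} is called '{label}'")
```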

Seth Earley: I think of the ontology as the knowledge scaffolding - the organizing principle within which you place your knowledge.

Sean Martin: Exactly. And what's also cool is how you can stack ontologies together and connect them. Unlike schemas, which tend to be really fixed and static and tied to a particular database instance, ontologies are free. Parts of them can be reused in other ontologies. You can dip in and say: someone over there has defined the concept of a drug or a person really well - I just want to use that definition here. That lends itself to the notion of the data fabric, where you start describing certain important business concepts - products, customers, whatever - and as you expand your fabric you add additional concepts and create linkages and relationships. That's why graph is so important: it's holding these connections between different things, just like real life.
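Reusing someone else's well-made definition might look like this in a sketch - borrowing the widely published schema.org Person class rather than redefining "person" locally (the local HR terms are invented):

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS

SCHEMA = Namespace("https://schema.org/")  # somebody else's ontology
EX = Namespace("http://example.com/hr/")   # invented local terms

g = Graph()
# Reuse the shared definition of a person rather than writing our own...
g.add((EX.employee42, RDF.type, SCHEMA.Person))
g.add((EX.employee42, SCHEMA.name, Literal("Jane Doe")))
# ...and hang local concepts off it, extending the fabric.
g.add((EX.Analyst, RDFS.subClassOf, SCHEMA.Person))
g.add((EX.employee42, RDF.type, EX.Analyst))

print(g.serialize(format="turtle"))
```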

Another thing we've observed is that people treat ontology like schema and try to create the ideal ontology upfront - and then when they actually try to apply data to it, they find either the data doesn't conform or they've missed the point. Data conformance turns out to be really important. We find it's very important to maintain flexibility and essentially model the data you actually have, then use an upper ontology as a kind of highway map to the concepts. You get these very low-level representations bubbling up out of the data, and then you abstract them as much as you can and connect them to that upper ontology to make them navigable and help people find data.

Seth Earley: Where do you see things going - what's practical today versus aspirational?

Sean Martin: Where we are today is that we can handle really large knowledge graphs. We can build out clusters on our engine that let you handle tens to hundreds of billions of facts. That's now a practical reality - we have that stuff in production at lots of places. What people are doing with it, essentially, is integrating data through ELT-style queries, with some of those transformations going from raw schema-oriented representations to what we call canonical ontologies - business-user-facing ontologies. Creating data products is absolutely a reality. Blending in unstructured data, to the extent you have good annotators for it, is also a reality. It's a much better way of integrating data than anything we've seen before.

Where it's going is: as the annotators get better we'll get much better with unstructured data, natural language query is something people really want and is coming, and I also think we're going to start seeing ML integrated directly with knowledge graphs. A lot of the ML community is using graph structures - neural networks and so on - and I suspect those worlds are going to converge.

The analytics you'll be able to do on integrated data - not pulling it out, but working on it in place - is going to be very interesting. Our graph engine is essentially schema-less - you can ingest dirty data, turn it into a graph on the fly, and manipulate it. Once you get to a fully integrated knowledge graph, you're going to want to infer links automatically, cluster things automatically, predict parts of the graph automatically. But you're also going to want to keep track of what was established by a human versus what was inferred, and which models and training data were used. The graph lends itself to that kind of provenance tracking because, unlike in a warehouse world where metadata and data are kept in separate systems, metadata is part of the graph. We often lift what we call technical metadata into the graph - file system folder names, file owners - because that may be relevant to your queries. The provenance of everything is huge.
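A sketch of that provenance idea, with invented property names - a production system would more likely use the W3C PROV-O vocabulary - but the point is that the metadata lives in the same graph as the facts:

```python
from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.com/prov/")
g = Graph()

# The fact itself...
source = URIRef("http://example.com/files/trades_2022.csv")
g.add((EX.trade1, EX.notional, Literal(1000000)))

# ...and its provenance, in the same graph rather than a separate
# metadata store: technical metadata like folder names and file
# owners, plus who (or what) asserted the fact.
g.add((source, EX.folderName, Literal("/data/trading/2022")))
g.add((source, EX.fileOwner, Literal("surveillance-team")))
g.add((EX.trade1, EX.derivedFrom, source))
g.add((EX.trade1, EX.assertedBy, Literal("human")))

print(g.serialize(format="turtle"))
```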

The other thing that has my real attention right now is democratization. This technology currently is only available to the most forward-thinking, sophisticated environments. It's been very expensive to develop - our customers have self-selected to be very large companies with very interesting use cases willing to be on the leading edge. But I think we've now reached a point, literally in the last year or so, where this technology is ripe to go out in a much more democratic way, available to many more organizations and individuals, once the cost drops and there's more automation. Everyone says data is the new oil - but the big speed bump for machine learning is getting integrated data into training. Once we can solve those issues and make it significantly cheaper, the game changes completely. We're going to see an explosion of much more sophisticated data-based applications.

Seth Earley: Any final words of advice - what's the killer app for someone going back to their organization to make a business case?

Sean Martin: The e-book ends with advice I'm going to repeat here: get started now, and pick something that is relevant to your business, that drives business value, and keep it small. Show off the principles of knowledge graph and knowledge engineering in that small thing. The success of that will allow you to grow it, do adjacent applications, or extend that one. Success breeds success. Don't try to boil the ocean - don't try to build the entire data fabric in one go. Do it, as one of my co-authors Dean Allemang says, a stitch in time. What your killer app is I don't know - that's going to be relevant to the people who pay you and valuable to them. But pick something that you can use as a vehicle for this. That has worked for me, which is why I'm suggesting it.

Chris Featherstone: It all goes back to KISS - do something simple, so you can build something more complex out of that simplicity, as opposed to making it super complex at the beginning and never finishing.

Sean Martin: That's exactly right - you never finish if you start that way. This has really been a pleasure. Thank you so much, Seth and Chris.

Seth Earley: Thank you Sean. This has been great. Looking forward to talking again soon. Thanks everyone.

Chris Featherstone: Keep your test scores down and your spirits up, Sean. Safe travels.

Sean Martin: You too. Cheers.
