Earley AI Podcast – Episode 60: Optimizing Search and Security with John Lenker

Evolving Enterprise Search with AI, Data Governance, and Granular Security

 

Guest: John Lenker, Enterprise Search and Data Governance Expert at BigID

Hosts: Seth Earley, CEO at Earley Information Science

Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce

Published on: October 7, 2024

 

 

In this episode, hosts Seth Earley and Chris Featherstone are joined by John Lenker, an accomplished expert in enterprise search and data governance. John, who has presented to technology leaders like the CTO of NASA and the CIO of the Marshall Space Center, shares his extensive knowledge and experience from working with leading companies. Currently engaged with BigID, John delves deep into the evolving landscape of search technologies, data security, and enterprise-specific solutions.

 

Key Takeaways:

  • Enterprise search is evolving from keyword-based systems to question-based interactions that provide synthesized answers rather than just document lists, fundamentally changing user expectations.

  • Granular security is critical for enterprise search—systems must respect document-level, section-level, and even sentence-level access controls to prevent unauthorized information exposure.
  • Organizations can trust search vendors to enhance data findability while retaining existing systems, but must balance common needs with unique requirements shaped by their specific environments.
  • Legacy system integration remains a major challenge as organizations modernize—accessing and securing data from outdated systems requires careful architectural planning and migration strategies.
  • Retrieval Augmented Generation (RAG) represents the future of search by combining large language models with enterprise knowledge bases to deliver contextually relevant, interactive search experiences.
  • Cloud migration complicates security models as data moves from on-premise to distributed cloud environments, requiring new approaches to access control and data governance.
  • Effective information architecture—including metadata, taxonomies, and content organization—is essential for AI-powered search to deliver accurate, secure, and relevant results at enterprise scale.

Insightful Quotes:

"Effective information architecture is the backbone of superior search experiences. It’s not just about finding information—it’s about finding the right information securely and efficiently." – John Lenker

"The challenge isn't just making search work with AI. The challenge is making sure that when someone searches, they only see what they're authorized to see—down to the paragraph or even the sentence level." - John Lenker

"We're moving from 'here are 10,000 documents that match your keywords' to 'here's the answer to your question, synthesized from authorized sources.' That's a fundamental shift in how people interact with enterprise information." - John Lenker

Tune in to discover how enterprise search is evolving with AI and why security, governance, and information architecture are more critical than ever for organizations implementing generative AI solutions.


Links:
LinkedIn: https://www.linkedin.com/in/jlenker/

Website: https://bigid.com

X: https://x.com/LenkerITPro


Ways to Tune In:
Earley AI Podcast: https://www.earley.com/earley-ai-podcast-home
Apple Podcast: https://podcasts.apple.com/podcast/id1586654770
Spotify: https://open.spotify.com/show/5nkcZvVYjHHj6wtBABqLbE?si=73cd5d5fc89f4781
iHeart Radio: https://www.iheart.com/podcast/269-earley-ai-podcast-87108370/
Stitcher: https://www.stitcher.com/show/earley-ai-podcast
Amazon Music: https://music.amazon.com/podcasts/18524b67-09cf-433f-82db-07b6213ad3ba/earley-ai-podcast
Buzzsprout: https://earleyai.buzzsprout.com/ 

 

 

Podcast Transcript: Enterprise Search, Security, and the Future of AI-Powered Information Discovery

Transcript introduction

This transcript captures a conversation between Seth Earley, Chris Featherstone, and John Lenker about the evolution of enterprise search with AI, exploring granular security challenges, legacy system integration, retrieval augmented generation, and why information architecture remains essential for effective AI-powered search in organizations.

Transcript

Seth Earley: Welcome to the Earley AI Podcast. My name is Seth Earley. And I'm Chris Featherstone. And today we're really excited to introduce our guest, who has been around the search industry for many, many years. He's really understood the core tenets of search, he's done lots of search implementations, and he's at the forefront of two things that I think are very, very important: large language models in search, and search security. Enterprise search security. Those are two critical, critical areas, because at the end of the day, you know, we're using large language models and generative AI to access information, the source of truth in the enterprise. And of course, without security considerations, that doesn't work very well. Right. And we've learned that in implementations, and we've also heard apocryphal stories about that. So our guest today is known for his extensive experience in enterprise search and enterprise product development. Regarding search, he's presented to some very key industry figures, such as the CTO of NASA and the CIO of the Marshall Space Center. Currently, he's very involved in the space of internal search adoption and data governance at BigID. John Lenker, welcome to the show.

John Lenker: Thank you, guys. Thank you, Seth.

Chris Featherstone: It's great to have the applauding, you know, like right in the in-between. We need that applause.

Seth Earley: That added in as the effects. So next to the laugh track button, too.

John Lenker: You need the laugh track. Exactly, exactly. Exactly.

Seth Earley: At all of our corny jokes. Anyway, so, Oliver, search jokes. We're looking for search jokes from you today, so if you have any search humor, you can certainly insert that. So we'd love to start the show by kind of looking at where people have misconceptions about AI, specifically about search, about security, about all of those things. And what I really wanted to do is have you kind of talk about your experience in enterprise search and data security. What are the things that are most significant in terms of misconceptions that you've encountered in search, in security, in artificial intelligence and information architecture? I know we're both big fans of IA, and so, John, why don't you give us your thoughts about some of those misconceptions that you run into? Sure. I think there are two main

John Lenker: categories that I've seen. One is on the sales front, if you're trying to sell search solutions and you're a company evaluating a search solution. The other is on the opposite of the sales front. It's more of the implementation and use front. From the sales front, a lot of misconceptions I see is that, for example, I work in a tech company and this tech company has specific skills and they make a product that's good at one particular thing that happens to not be search. But because there is technical expertise and there's ingenuity and their desire inside the company, the overall mentality is, well, it's better for us to just build this solution. So from the search vendor standpoint or the search platform company standpoint, I'm always thinking about the big picture, which is what I've tried to communicate internally at my current company, which is if we're selling to someone in the data governance and privacy space and they're a tech company and they believe that we have in house expertise that can do this, well, how hard can it be? Is the attitude we can do all this other cool tech stuff, why can't we just build data governance and privacy platform? Well, is that really your expertise? And do you want to be maintaining that? Do you want to have to keep up with all the vendors in the space that you would have to deal with, or would you rather purchase a solution and have that a lot of that work offloaded from you onto what you're good at doing? And so I see that as a huge misconception in the search industry. So if you're a company evaluating search, especially if you have technical expertise, do you have it in search, which is not as simple as typing something into a bar, hitting enter, getting 10 blue links from Google. It's not, as you know, it's way more complicated than that now. It's way more capable and thus you shouldn't treat it as simply as what I'm seeing in terms of those misconceptions on the sales category.

Seth Earley: Yeah. And the question really is, when you start looking at those areas, are those core competencies? Is building a search platform, or a security platform, or a data governance platform your core competence? And if it is not, exactly, outsource it to somebody who does have that as a core competence. But even if you take the search platform and capability and set that aside, there are still a lot of issues around do-it-yourself, buy versus build, configuration versus, you know, pre-configured. And then all sorts of issues around what's going on in this gen AI space that lead to some common problems and misconceptions. So that's on the sales side, where people kind of run into that a lot. And then the answer to that is really, do you really understand the depth and breadth of this problem, in this problem space? And do you really want to get into all the endless iterations of building that? So that's a really great point. If you wanted to say more about that, feel free. But did you want to touch on the other implementation piece? Yes. And this is, I think one

John Lenker: case that exists in between both of those categories. It could be labeled as either both sales and use. And that is our company is unique. We have unique needs, we have a higher standard to adhere to because we, for example, we secure data. We're in the business of privacy or in the business of security, whether that be data security, whether it be you're a single sign on provider, like an identity provider, whatever it may be. If you're in the tech space, I've seen this a lot, you think you're unique. And thus when you're looking at search solutions, you say, well, we can't do what everybody else does because we have special needs. And the message that I keep trying to send to the prospects and anybody else that will listen frankly is you're not special from the standpoint of search. If it's good enough for NASA, if, if it's secure enough for NASA, it's secure enough for you. Tech company that has $100 million in revenue, that's good enough for you. And so from the, from the aspect of sales that this meets the category of sales because you're trying to convince the person that you don't have unique needs. You actually need to consider yourself just like everybody else. You have data that you want secure, you want things to be findable, yet not exposed to danger. And you want people to be satisfied and efficient doing what they're, they're paid to do, which is find information. And then on the use and implementation side, you're not unique. As in it is possible for you to trust a search vendor with your information because they're not going to try to replace your system of record. So let's say they, they want you to, they want to crawl your confluence in your salesforce information, your Google Drive, whatever it may be, they're not going to replace that as a system of record. So again, you're not unique in the sense that everything has to be hoarded and clutched into your hands and your, in your grasp. You are just like any other company that needs their information findable and there are people that are going to do that for you. But you're not special from the standpoint of search. And that's something that I have to communicate often to just keep people realistic about this. Yeah, there's a Common

Seth Earley: set of capabilities that people require, that organizations require, and of course, the way they execute that. And this is where you start thinking about, you know, efficiency versus competitive advantage. You need efficiency. But the way you are implementing this specifically in your environment is what's going to be different about your information profiles, your information needs, your use cases, differentiating competitive IP and all those things. Right. So those, so there's a certain set of common functionality that yes, you're, you're, it completely applies across industries. And then there's the unique differentiation where you are an individual little snowflake that is based on how you use your information, what you name things, how you need to go to market, how you differentiate how you deal with your customers, how you deal with employees, and all, all of those things and all that becomes kind of an interesting set of requirements when you start looking at the execution piece. And that's where I think a lot of the tech companies that try to implement these things internally kind of miss a little bit of the boat because they do have the tech chops to do technical implementations. But it's really on the business side when you start to get into some of those unique intricacies and nuances. Yeah, they,

John Lenker: that is very true. You almost get into the special needs though of

Chris Featherstone: this when it's legacy systems that haven't been updated and they don't want to touch them, Seth, to your point, in terms of getting access to that information. But you get this scenario where what they believe makes them unique is their approach to the market, whatever, and what made them unique is not what makes them unique now. Most of that is because they had systems that were, we'll say, contemporary at the time, and they haven't touched them or updated them. And now the special need becomes trying to get access to that information, which is hard. Right. So now you get into all the app modernization stuff that goes on as well, and they don't believe, maybe, that that's the sticking point or the linchpin in all of it. Do you ever run into those types of scenarios as well? Because that's what I find: what they believe used to make them unique now doesn't make them unique anymore because of... To your point,

John Lenker: I've seen that when it comes to migration from a traditionally on prem data source, and you probably know this extremely well, doing what you do, going to some cloud data source, maybe there's A corporate initiative to move everything to the cloud. You hear people say cloud first and that's wonderful. But what does that actually mean when it comes to findability of information? There's a whole separate conversation we could have about security and governance and removing the rot, the redundant, old and trivial information. Do you really want to keep that? When you put everything in something like aws, there's that part of it. But then how do you find it? The stuff that you want to migrate, how do you find it? So the point you're making about the modernization piece, I see it a lot with where does this stuff live? And when I go to find it, how do I connect to where it lives now versus what I did before? And the benefit and the advantage that people see is it'll be better if it's in the cloud because it's quote unquote modern and so it should be easier to connect to. But now you've introduced another layer of security that is required that wasn't there before when you had it in your data center on your corporate property. So there's another side to it. I still think it's a great way to go, but I think you're making a good point about modernization, not just of the application, but where does this stuff live, the data that you want to find? Yeah, all of that infrastructure.

Seth Earley: Did you have a follow on comment or question, Chris? No, I think I, I

Chris Featherstone: was just gonna, at, at some point want to double click into like you said, that other discussion about security, because it does raise a bunch of other security risks. Right. I think that's part of it is, yeah, it's helping folks understand like, like. To your point,

John Lenker: being on prem is not a bad thing. Being in the cloud is, is not

Chris Featherstone: the end-all, be-all either. Right. So it's a fine balance depending on the use case and whatever the customer's focus is. At the same time, the searchability of content is super, super important, but not everything needs to be searchable to everybody. So locking that down, putting the right mechanisms in place, the governance and guardrails, all that kind of stuff. Talk a little bit about that in terms of what you've seen, the evolution of where it is, and maybe even where you would like to see it go. What does that look like to you in the work you do? I think this

John Lenker: is a good introduction to this whole concept of RAG and what RAG is doing for findability of information. And anyone listening that doesn't know this term, it simply stands for retrieval, augmented generation. And I think when you look to the future. All the analysts say this. I know, Seth, you and I have talked about this before. All the analysts that research cognitive search engines and analyze all of these sorts of operations say the interaction that people expect to have with things like chat GPT is the future of all technical interactions. Yes. And so it used to be you had 10 blue links coming from Google and you type something in, you found the 10 blue links. That was groundbreaking. I remember that was so cool. Even talking about things like Alta Vista and Yahoo back in the day. And then we moved to Google and Google had the 10 blue links and then there was this giant push into enterprise search and you had companies like Fast who now made things so much easier in terms of if you're a company, let's say like Autotrader and you want all of these vehicles across all these different dealers to be findable. Things like FAST made that so easy because of the technology at the time, allowed you to customize your customers needs. Meaning how am I going to search for stuff? How am I going to filter all that stuff? It kind of packaged it and then we moved into more of a corporate internal sort of search where you had SharePoint really go boom in terms of success and all. Everybody wanted everything in SharePoint. Okay, we've got SharePoint as our, our company intranet. We've got to get everything findable in there, put it all in there. Entire company has existed for the purpose of connecting data systems or sources to SharePoint and crawling it into Microsoft search. And now we're at this new forefront of it's no longer good enough. What did all those things share, the things I just mentioned, what did they all share From a user interaction standpoint? They shared the I must type some keywords into a box. And we were limited to statistical or keyword based search. And now this new forefront is I'm no longer going to type keywords, I'm going to ask questions and I'm going to phrase them as questions. Even like with my voice, if I were to interact with Siri or Alexa or something, I'm going to start those sorts of interactions that way. And instead of getting links and documents and results, I'm going to get synthesized answers that include references to where the generative technology found the material. That's the whole essence of RAG is can I find the material and then can I package it up nicely for the user to be more efficient with what they find, the results and that I think Chris brings up a great point about. You brought up a great point about the security that's inherently risky in that sort of endeavor. Just like it was when the whole corporate search thing went boom. SharePoint. The sudden fear was oh no, the HR documents are exposed and the engineers don't have permission of the HR documents. What if they find them? Got to make sure that that's secured down to the document level. All that is still true. It's just the manner of interaction to finding it is now different and it's more conversational. So the element is still there. We just have to go to a few more lengths to make sure the security is still intact.
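
For listeners who want to see the shape of what John is describing, here is a minimal retrieval augmented generation sketch in Python. It is purely illustrative: the documents, sources, and toy bag-of-words "embedding" are invented stand-ins rather than any vendor's API, and a real system would add the security trimming discussed later in the episode.

# Illustrative sketch of the RAG flow: retrieve relevant passages, then hand them
# to a generator along with citations. The "embedding" is a toy term-frequency
# vector so the example runs without any external service.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a simple term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

DOCUMENTS = [  # hypothetical enterprise content, each with a source reference
    {"id": "hr-001", "source": "SharePoint/HR", "text": "Employees accrue vacation at 1.5 days per month."},
    {"id": "eng-042", "source": "Confluence/Engineering", "text": "The search cluster reindexes nightly at 02:00 UTC."},
]

def retrieve(question: str, k: int = 2):
    q = embed(question)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]

def answer(question: str) -> str:
    passages = retrieve(question)
    # A real system would call an LLM here; we only assemble the grounded prompt
    # to show that every passage carries a citation back to its origin.
    context = "\n".join(f'[{p["id"]} from {p["source"]}] {p["text"]}' for p in passages)
    return f"Question: {question}\nAnswer using only these cited passages:\n{context}"

print(answer("When does the search cluster reindex?"))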

Seth Earley: Such a great point. I think we talked a little bit about the story of a pharmaceutical company we were working with where their portfolio review application was an isolated application that had such a strict security rules that even certain executives on the executive leadership Council were not allowed to see certain documents and certain information because it had to do. It was such market moving information companies, they're looking at acquiring phase 3 clinical trial results, new drug development, new molecular entities and so on. And then when they, when. And we'd laid overlaid a large language model on top of that application to have conversations with it. Then when Microsoft Copilot was deployed, it was deployed across the enterprise and everything was visible to it. And suddenly they started freaking out because you know, everybody could see what was in this portfolio review application. And it became you know, a crisis that I had to deal with over the weekend by, by driving to my developers home to find him. Luckily he lived in the same town and you know, we were able to solve it but it became a very, very urgent, pressing issue. We just, just had to lock it all down. But that is something people are not thinking about. And you know you had, you had told a story about a vendor you were talking to and handled document level security and the answer that you received was quite surprising. You don't have to name names. Was, was that the search vendor or was that. Yeah, I think it's the search vendor or maybe it was a rag vendor. But when they said no, we don't have that it was like what, you. Know, it's like yeah, I won't, I won't name names. But this is a,

John Lenker: this is part of a group of companies and I, I don't say this with any arrogance or any judgment. I, I understand companies go through evolutions and you start somewhere and you build. But this particular vendor wanted to be a player on the enterprise level for search solutions. And my company was interviewing them in a sales context for, for a, as a potential vendor. And so I was called on with my expertise in this, in this area. And I just started asking questions. The first question I asked was, do you offer document level security? First of all, they needed clarification on what that meant, which was alarming to me. And then when I clarified it, the answer was we don't offer that. The best we can do is put people in a list and say these people belong to team X and team X has access to source A, B and C, but not D. And I thought, okay, well what about the documents within source A? If team X should not view some of those documents, you're telling me it's just all or not? Yeah. Yes. And I, and I, and they said, well, currently that's all that we can do is, is what I just described. And I went. So I thanked them. And then internally when we discuss it, I said this is, this has to stop here. This discussion was non starter. Yeah. Because now your, your goal is to get a generative technology in front of this. And Seth, you brought up the copilot example, which the things that copilot has exposed thus far unintentionally are frightening. But this is an example of you put a generative technology in front of something like this without any kind of document level restriction. You are asking for so much trouble. And I'm not talking about just the copilot incident where the CEO's salary is revealed to the whole company. Anybody that asks, I mean much more serious things like for example, intellectual property. Right. Things that, things that could, to your point about the pharmaceutical, if that information even in the, in the right person's hands were to somehow escape that person's hands. You're talking about a completely monumental change in the stock price, perhaps because you're talking about a clinic. So there are examples where this is cataclysmic in nature and other examples where it's fairly innocent. But the same thing to me is always true if you can't do that as a search product, meaning if you can't secure an item at the document level, then you should not be in it in the enterprise business. Yeah.
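
To make John's distinction concrete, here is a small, hypothetical Python sketch contrasting source-level access (all or nothing per repository) with document-level access (each item checked against its own ACL). The groups, documents, and permissions are invented for illustration.

# Sketch of source-level versus document-level security. Names and ACLs are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    source: str
    text: str
    allowed_groups: set = field(default_factory=set)  # document-level ACL

DOCS = [
    Document("fin-001", "SharePoint", "Quarterly forecast...", {"finance"}),
    Document("hr-007", "SharePoint", "Compensation bands...", {"hr"}),
    Document("kb-101", "Confluence", "VPN setup guide...", {"all-employees"}),
]

def source_level_filter(user_groups, docs):
    # The weaker model: if your team can see the source, you see every document in it.
    visible_sources = {"SharePoint", "Confluence"} if user_groups else set()
    return [d for d in docs if d.source in visible_sources]

def document_level_filter(user_groups, docs):
    # The model John argues is table stakes: each item is checked against its own ACL.
    return [d for d in docs if d.allowed_groups & user_groups]

engineer = {"all-employees", "engineering"}
print([d.doc_id for d in source_level_filter(engineer, DOCS)])    # exposes fin-001 and hr-007
print([d.doc_id for d in document_level_filter(engineer, DOCS)])  # only kb-101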

Seth Earley: And so, and that, that is regardless of the mechanism of access. Right. Whether what, whatever that access mechanism. Right now we're talking about generative search and generative AI and generative interactions, which are more true conversations, natural language conversations, which is what everything is evolving to. But it really kind of doesn't matter what that accent, what that mechanism is. That that's a fundamental piece that if you don't start with that, you're you're going to have problems. And, and so that. So, so there's also the other fear about generative AI and that is releasing information to the public or to a model. Do you want to talk a little bit about that and what some of the safeguards can be around ensuring that your IP is secure even with a generative AI, even if you have document level security. What about ip? The right people are getting access to this information, but what about exposing that IP to a model, a training of a model? Do you want to talk a little bit about that? Yes. And one follow on point to the

John Lenker: previous discussion before we get into that. We talked about how things change and evolve for search and we went through the sort of the two or three decades worth of that. The one thing, the huge leap I see in the security standpoint is with, sorry, when you're talking about generative AI, the big leap I see in security meaning now it's going to be tougher. The leap to, in terms of difficulty is before it was fine for you to secure documents at the document level to a user or group and you had grant lists and deny lists and these people can explicitly see this, these people can explicitly not see this, and so on and so forth. That was all fine and adequate. But now when you talk about generative AI, you introduce the possibility that if it doesn't have sufficient information, it will just make it up. So not only are we are dealing with the risk of exposing the wrong information or sorry, exposing factual information to the wrong person, now we're dealing with the risk of exposing something potentially hallucinated to the wrong person. So that's an extra layer of difficulty. I see. But, or even to the right person. If it's the wrong information. Right. With the stock price pharmaceutical example, but then with the, the question you just asked. So in 2023, I built rag applications for a number of prospects and customers and almost always probably 90 plus percent of the time the first question, 100% of the time it was a question at some point, but 90% of the time the first question was I don't want to call chat GPT and give it even information you retrieve through your search engine. Because who's to say OpenAI is not going to just use that to augment their model? And so there were a couple ways that we had to mitigate this risk. The first was Microsoft, since they were controlling the OpenAI calls that we were making would allow you to enter an NDA such that there would be no augmentation of the public model for chat GPT. So this has now evolved into chat GPT enterprise. So now this is OpenAI is kind of taking care of this by themselves. But prior to Chat GPT Enterprise being released, there was this NDA that you could sign with Microsoft as a proxy for OpenAI saying, none of the information you send me in an API payload will be used to augment our public model. And there's all these legal consequences if in fact that occurs. That was one way. The other way was to employ governance and security tools ahead of or in front of search that would redact certain information. So let's say you were dealing with PII and you were dealing with information in a corpus like Social Security numbers or email addresses, addresses, medical records. You could still use that information retrieved to the right person. Let's say the user had the ability to see it from a security standpoint, but you didn't necessarily need that Social Security number exposed. You just needed the fact that this person's personal information exists. We just don't need to know exactly what it is. We had ways of being able to redact information to the point where the generative AI could still function and say, yes, so, and so is a customer. They are a customer for this period of time. Their personal information exists, but it's not visible to me, quote unquote, the chatgpt saying me, because it's been redacted, things of that nature. So that was another way to mitigate the risk. That's about as far as I know we can go right now with retrieval. 
ChatGPT Enterprise has helped a lot with the first point, which was giving information to OpenAI in general. So now there's a nice dividing line: just going to chatgpt.com and messing around is totally public, but now there's another layer, ChatGPT Enterprise.
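
Here is a rough sketch of the redaction pattern John describes: masking sensitive fields in retrieved passages before they reach the generative model, while still signaling that the information exists. The patterns below are deliberately simplified, and a production deployment would rely on a governance tool with much broader detection.

# Sketch of redacting PII from retrieved passages before prompt assembly.
import re

REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED-EMAIL]"),
]

def redact(passage: str) -> str:
    # Replace each detected field with a placeholder so the model knows it exists
    # but never sees the value.
    for pattern, placeholder in REDACTIONS:
        passage = pattern.sub(placeholder, passage)
    return passage

retrieved = "Customer Jane Doe (jane.doe@example.com, SSN 123-45-6789) has been active since 2019."
safe_passage = redact(retrieved)
prompt = (
    "Answer from the passage below. Redacted fields exist but are not visible to you.\n"
    + safe_passage
)
print(prompt)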

Seth Earley: And on the individual level, there are some options. It's a little bit buried, it's a little bit hard to find, but you can opt out at the individual level, I recently learned. So you can go in there and execute an opt-out request, and then you will get a written confirmation saying that the data you're putting into your generative AI requests will not be used to train the public model. Most people don't know about that and it's pretty hidden, but I think those are really good points. And then of course, the other way of thinking about it is using a private model behind your firewall where you have more control. Do you want to talk about that a little bit? Yeah. So that was the third

John Lenker: sort of backup way for our prospects when dealing with RAG applications. And the reason I say it was a backup method is usually it was a much more expensive method because you would have to procure a custom, so to speak, model for your purpose. But that would act like ChatGPT. And since 2023, I know it sounds like that's a long time in the past, but since then, you know how fast this stuff moves. We've seen the advent of companies like Harvey, who produces models for legal practices, law firms. We have a company out there called Cohere that is in the business of creating a private model that acts just like the best LLM on the market in terms of function, but is specifically within, like you said, the firewall of your company. And so that was, that was a solution I would point towards as the third option only because they would ask for a lot of money for that. At least they did in 2023. So those, those were solutions for the die hard. Absolutely not. I'll never engage Chat GPT or, or even something like Claude. I'll never use it for my corporate information, even if it is controlled through rag. And it's. We're so confident in the security. No, some people just drew a hard line and that's where you'd have to say, well, if you still want to do this, because remember that the whole conflict ultimately is I want this cool functionality and I want all the positives this thing brings, but I don't want to deal with any of the negatives. So everybody wants their cake and eat it too, but they realize you can't have the cake unless you're willing to give up a little bit of the ingredients so that we can actually work for you. You have to do something or else you're never going to accomplish anything. So then meeting that person that had that attitude in the middle was, could you look at something like Cohere or Harvey? I tried to do that with

Chris Featherstone: my kids. I want only the good. None of the bad. Never works. It's like these. Is there an offline model? Is there a model

Seth Earley: for that? So, John, you're talking about, you

Chris Featherstone: know, really, it's... I love the fact that you're outlining really a maturity model here, right? Because we're going from RAG, and then we talk about, you know, just off the shelf. I mean, McKinsey talked about it with their taker, shaper, maker perspective, right? Where you're going to, you know, just take a model off the shelf and use it, right, or you're going to fine-tune a model that maps directly into the data needs and stuff you have, or you're going to build your own, right? So do you ever get into maturity models, or talk about it at that level with the organizations? Because everybody believes they're a builder. Everybody believes, to your point, that they're the most innovative, and they're really not. And so now it's just a matter of, you know, hey, you want time to value, you have multiple use cases. Some may use this model that's just off the shelf, and that's great. But you may want to fine-tune this model, or to get away from prompt injection and some of those other, you know, techniques that people try to hack stuff with. Right. How do you articulate that to your customers? Or is that something interesting that you find, that you talk about?

John Lenker: I love that question, because it is interesting when the company that you're dealing with has this propensity for data science. So say you're dealing with a company like Nvidia. They're not an actual customer of mine; I'm just thinking about who's really well known that has a tremendously huge and capable data science team. If you approached Nvidia and said, I know you have a problem with findability, and search is obviously not your core competency, to Seth's point, but we could still use you. We can use the fact that you have in-house expertise in data science. So my recommendation, to have the best of both worlds, the cake and eat it too, is you build a model or augment a model, like take a version of Claude or Llama or something, and augment it to your purposes. And what are your purposes? Often it's company-specific acronyms, that sort of language. Because in traditional search you can do that with synonyms and phrasing and all that, but the generative models are so good at that sort of thing already if you just give them the right foundational training sets. So yeah, augment a model that exists so that you can leverage the horsepower of something like Claude, but do it in such a way that it is domain-specific. So yes, if you have a company that you're dealing with that is really into data science, they don't have to be tinkerers like the earlier example I was giving, where you're a tech company, you like to build stuff, we think we can do this. No, I mean specifically this love or pursuit of data science and an in-house data science team. Then yes, that would be a great candidate for something like that, I think.
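
As a lightweight illustration of tuning a general model to company-specific language, the sketch below injects a domain glossary of acronyms into both the query and the prompt. This is a simpler stand-in for the model augmentation John describes, and the glossary entries here are invented.

# Sketch: ground a general model with a hypothetical company glossary so that
# company-specific acronyms are expanded in the query and spelled out in the prompt.
GLOSSARY = {
    "MSC": "Marshall Space Center",
    "NDA": "non-disclosure agreement",
    "ESP": "enterprise search platform",
}

def expand_query(query: str) -> str:
    # Expand known acronyms inline so retrieval and generation see both forms.
    words = []
    for token in query.split():
        key = token.strip("?.,").upper()
        words.append(f"{token} ({GLOSSARY[key]})" if key in GLOSSARY else token)
    return " ".join(words)

def build_prompt(query: str, passages: list[str]) -> str:
    glossary_block = "\n".join(f"{k} = {v}" for k, v in GLOSSARY.items())
    return (
        "Company glossary:\n" + glossary_block + "\n\n"
        "Context:\n" + "\n".join(passages) + "\n\n"
        "Question: " + expand_query(query)
    )

print(build_prompt("Who owns the ESP roadmap?", ["The platform team maintains the search roadmap."]))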

Seth Earley: And when you think about it, there's this new kind of trend around smaller language models, and I've even heard small action models, agentic approaches. You know, there's a lot of new acronyms and new terminology, but essentially I look at a small language model as something that's fine-tuned for particular purposes, a little lighter weight, but has the capability to process very specific information, usually tuned with a corporate ontology. As you say, your own acronyms, your terminology, your IP, your procedures, your product names, your troubleshooting, your engineering drawings, all of those types of things, and all of the organizing principles behind them. Do you want to talk about this trend to look at smaller language models? Maybe they're a little less compute-intensive, maybe a little less costly. How are you seeing that in the search space? And talk a little bit about what some of the vendors are doing, if you could, the Coveos, the Sinequas, those types of organizations, because they do usually come packaged with a language model, a language model for doing things like entity extraction or auto-classification. Sometimes those are industry-specific, sometimes those are broader. There are industry-specific language models like CYTE for life sciences. I still find those to be too big and too cumbersome and really containing a lot of irrelevant terminology. But what are your thoughts around some of those trends? It's a much more complex and varied landscape than people would imagine.

John Lenker: I agree. I think there are a lot of points in that question. One of them is what I referenced earlier, something like Harvey. Even though it depends on who you ask, and opinions vary, Harvey could be considered a large language model because it acts like one, but in reality it's more like a small language model, or even an action model, that is geared toward quite a specific purpose. And that is, if you ask a legal sort of question, and by legal question I mean if you're an attorney and you're going through depositions and you need to summarize certain depositions from multiple witnesses or multiple targets of deposition, then a generic large language model like ChatGPT would do that, but not in a way that would be as meaningful to an attorney who's looking for a specific flow or summary. Something like Harvey is geared towards that. So it's almost small in the fact that, yes, it works in a particular industry with a specialty, but it's also behaving like a large language model. So there is that. The other thing that you asked about is how the search vendors are using this. Most of the vendors, the biggest players in the space, and by that I mean not just revenue success but the people in the leader space on the Gartner Magic Quadrant and the Forrester Wave, are offering plug-and-play capabilities. So yes, they'll usually ship with an inference model that will process user queries, but they will always give you the option of plugging in a model for the response, or the synthesis part on the other side, the G in the RAG acronym. They're going to offer you the ability to use your own model, and they've often set up ahead of time connections to multiple different options like Llama, ChatGPT, a number of different mainstream options. But then, for example, if you have a company like Nvidia that says, we like to build our own models for all these other purposes, we're going to build an LLM, quote unquote, for retrieval, or sorry, for a search process like RAG, then you could put that into the equation. And that's where the best search companies in terms of flexibility are offering that plug and play, right? Choose your own language model.
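
The plug-and-play pattern John attributes to the leading vendors can be pictured as a swappable generator behind a fixed retrieval pipeline. The sketch below is illustrative only; the interface and the placeholder generator are made up and are not any product's API.

# Sketch of a swappable generator (the "G" in RAG) behind a fixed retrieval step.
from typing import Protocol

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

class EchoGenerator:
    # Placeholder generator so the example runs offline; a real deployment would
    # wrap a hosted or in-house LLM behind the same interface.
    def generate(self, prompt: str) -> str:
        return "Draft answer based on: " + prompt[:80]

def answer_with(generator: Generator, question: str, passages: list[str]) -> str:
    prompt = "Context:\n" + "\n".join(passages) + "\nQuestion: " + question
    return generator.generate(prompt)

print(answer_with(EchoGenerator(), "What changed in the indexing pipeline?",
                  ["The crawler now honors source ACLs at index time."]))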

Seth Earley: And also incorporate your terminology. Now, this is where it gets a little bit complex or convoluted, because one of the things we talked about in the prep call was, at what point are you doing security trimming? Because when you think about doing the retrieval itself, you know, the concept is you're building a vector index and then you're querying that vector index. And the question becomes, are you importing content and enriching those embeddings with your content along with security details? Is your vector similarity search working against that sort of, I don't want to say security-trimmed, but security-enhanced vector space? Or are you doing a different type of query against a more standard index and then doing a vector similarity? Like, where does the vector space come in? Where does the true RAG piece of this come in, you know, the whole idea of where are you querying and where are you trimming? Right, because that starts to get into some pretty technical nuance.

John Lenker: Yeah, great question. I think the main concern with when you did security trimming was around performance, at least historically. The main reason you wanted an early-binding sort of security method was that it's going to be so costly, especially to the user experience, if I wait to get the result set and then dynamically go call these sources for their permissions and then start dropping stuff. If you did it on that end, it would result in a poor experience. Now, with hardware being so cheap and so much better, it's not as much of a concern for performance reasons. But I think you're bringing up a great point, which is if I'm going to semantically evaluate somebody's query and try to retrieve the most semantically relevant information to that query, I obviously have to vectorize that. And it depends on who you ask. Some companies like Sinequa do it in a proprietary way where they blend a semantic and a keyword search into one, and with their model they look at the best relevancy scores from both sets of retrievals. But under the covers, it's unified in terms of security identity. So even though your identity is not singular in nature, as in you have a particular login for SharePoint and a different login for some other system, and by different I mean, let's say with one system you're known by your email address, in another system you're known by first initial, last name, whatever it may be, companies like Sinequa that operate in this more mature model understand you as a user in one form. They look at you ahead of time based on all your different identities across these systems, they unify them into one, and therefore it's much easier to then filter out or trim items in a result set. Now, when they're semantically evaluating your query and getting information both from a keyword standpoint and a semantic or vectorized standpoint, they already know where the keyword information and the vector information is coming from, because they know the origin document. Let's say the document came from SharePoint. SharePoint is almost 100% of the time going to have Active Directory permissions imposed on it. Sometimes SharePoint has its own permissions, but most people use Active Directory, and therefore we know who can and can't see this item in SharePoint. Companies at the enterprise level for search are going to honor that permission. And when they build vectors of information out of that document that came from SharePoint, they still know where the document came from, because that's fundamental to the RAG process. You must be able to cite where you got the information to give the user confidence that they're not getting a hallucination back. And because we have the reference, we know the permissions on the document, which makes the trimming process pretty straightforward when you're using it in this RAG context versus the traditional context. That's why I don't see a huge departure there, only because fundamentally it's the same. We still need to know who can see the origin content, and it came from a document, wherever that lived. And as long as that's secured, we can impose it just like we could on the document itself. We can impose it on the vectors also.
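
Here is a simplified sketch of the security trimming John walks through: each chunk keeps a pointer to its origin document and that document's permissions, and a user's identities across systems are resolved to a single principal before filtering. All identities, groups, and ACLs in the example are invented.

# Sketch: unify identities, then trim vector chunks by the ACL inherited from the
# chunk's origin document.
IDENTITY_MAP = {  # unify "who you are" across systems into one principal
    "jdoe@example.com": "user:jdoe",
    "j.doe": "user:jdoe",
}

GROUPS = {"user:jdoe": {"engineering", "all-employees"}}

CHUNKS = [
    {"text": "Reindexing runs nightly.", "origin": "Confluence/kb-101", "acl": {"all-employees"}},
    {"text": "Planned acquisition targets...", "origin": "SharePoint/fin-001", "acl": {"finance"}},
]

def principal_for(identity: str) -> str:
    return IDENTITY_MAP.get(identity, identity)

def trim(identity: str, candidate_chunks):
    groups = GROUPS.get(principal_for(identity), set())
    # Each chunk inherits the ACL of the document it came from, so trimming vectors
    # is the same check as trimming documents.
    return [c for c in candidate_chunks if c["acl"] & groups]

print([c["origin"] for c in trim("j.doe", CHUNKS)])  # only the Confluence chunk survives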

Chris Featherstone: Now you're just exposing the chain of thought, which goes into the explainability that a lot of organizations don't think about it from. Because what you're outlining to me is, hey, listen, this is not anything new. There's not anything so revolutionary here. All we're doing is saying, this is the data, and the data had these permissions on top of it. Now we're just applying natural language understanding on top of that, and it still takes into consideration the inheritance of what you already had in place. So reuse what you have in place. Don't throw it all out. Just remember that you had this in place and it mapped to your policies and procedures anyway, and now let's honor that going forward. And then, I will add, let's put in some specific things that may, you know, enable more contemporary technologies to think about that. But you're still utilizing your old inheritance models of what you can and can't do with identity and access management. That's really what it boils down to. So,

John Lenker: yeah, you're. Yes, you're touching on the most important reason, I think, to consider buying these solutions versus building them, because I hear all this talk in the market about how I'm gonna, I'm gonna create a rag solution piecemeal. I'm gonna do it myself. Well, how am I gonna do that? Well, first I'm gonna find something that'll connect to my data sources and extract information or crawl it. Then I'm gonna go build or buy a vectorizer that'll break it all up into vectors and chunk it. Then I'm going to go buy Pinecone or MongoDB Atlas and I'm going to put all this stuff in one of those. So I'm going to maintain that separately. But the thing that people always forget in this process, at least the people I've talked to is what about the security of that? So you pulled all this stuff out because you wanted to get it out of there? Of course. Did you get it out with the security on it? Are you planning to do that at runtime? Because if you do it at query time, yes. We talked about how performance isn't as big a factor as it used to be, but it's still not the best way to do it. It's not efficient and you could expose potentially a miss in the query time process that would have not been missed had it been done at crawl time or at the time of vectorization. So that is, to me, one of the main reasons you want to get a solution that already does this. Well, and it's their core competency. Because now if you do it piecemeal, you're gonna, if something goes wrong, someone's gonna get blamed in your company because you built it. And not to mention, you're gonna have to maintain it. What if Mongo changes their APIs? And so now there's a. You have to maintain all of your interactions with all of these different pieces, whereas these mature vendors, they have everything packaged for you. Even if you can plug and play a model to do whatever you need to do, it's. It's built for you, right? And it just takes so much of the work and worry off of an organization. This is not your competency. Don't. My advice is not to do it. Right. Don't do this home. So, so

Seth Earley: we were talking now about how does the information architecture fit in? Because, you know, we know that search is all about content quality and ia, right? We know that the more, the better your metadata is, the better your search results are. When we, when people say make it like Google, I always say, you know, put as much time, energy and resources into optimizing your content as people do who want to get good search rankings and it will be like Google. So when you think about applying information architecture, metadata, taxonomies, ontologies, all of that stuff, you know, what we're trying to do is trying to define the isness and aboutness of a piece of content so that it's more retrievable in the right context for the right individual. So how does this fit in? When you can either do that by importing your IA with the chunks so you can semantically enrich that. Those embeddings, right? You can enrich those embeddings so that in the vector space it contains additional signals on that content, on that chunk content, which is what we did with that Life sciences firm. Or you can say well, I'm going to maintain a separate index that's going to be more traditional index that I can do faceted retrieval on and, and so on. And then I'm going to try to integrate these or have higher, you know, combine the relevancy scores or something along those lines. Tell me your how to do that. What are the, what are the different implications around different search engines? And then how do you make the best use of an information architecture, you know, corporate ontology language model that is specific to the enterprise and to their processes and ip? Is that a fair question?

John Lenker: Yes, I just had to gather my thoughts on it for a second there. I think there are two main ways to do this. One is to do it ahead of time, similar to the way somebody might want to do OCR, optical character recognition, for anyone listening who's not familiar. Let's say you're an oil and gas company, and you have land leases in Texas that are decades old, maybe even over a century old. It's all going to be either handwritten or typed, and you've scanned it as a PDF. It's not readable by a search engine, among other things. So OCR is one example of this. But what you just mentioned, meaning classifying the content and naming entities within the content, there are things out there that will do this for you in place. There are APIs you can call from Google and Microsoft Azure that will go and do these sorts of one-off operations and label content. Microsoft has an action where you can go back and, through MIP, label content in SharePoint in a reverse direction. So there are ways to do this while it sits in place. And then the other way to do it, the second way, is to have a search product or a search technology do it for you when it crawls the content and examines it. The best products on the market, yes, as they put the content into the index, will do these things in the process. Now, when I've consulted customers on the best practices here, yes, there are lots of variables, but take the example of OCR. There are tools out there for OCR that are better than what search engines will do, and a lot of people use the same libraries from the same open source vendors to do this stuff. But again, they're just like any other of these services. OCR is a service you can pay for on a per-document basis through something like Google. Same thing for naming entities or for classification. You can pay for these services, or you can have a search product do it. The best search products on the market will work with something that's been classified, been entity-recognized, and so on and so forth in place, and they will also offer you a solution to do it while you index the content. So earlier you mentioned CYTE for a scientific-specific purpose. There are plenty of open source things like spaCy for naming entities. I don't love spaCy; I think there's a lot of noise in a lot of these open source things, but all that stuff exists. The best products out there are the ones that will allow you to choose the option that's best for you, and if you want to do it outside the search product, then they're happy to work with that. I think you also made a point about whether we keep a separate index that's full of keyword or statistical-based searching capability as well as our vectors. There are companies out there, and one I'm very familiar with, that do just that in a proprietary way, and they blend those two things upon query into a result set to get the best of both worlds. I don't see a huge advantage in one over the other, whichever way you choose to do it. I think there are cases to be made for both, but I think it's good to talk about what is involved in doing both and what to look for. And I think what you should look for is a product, like I said, that will allow you to do it on your own. They won't force you to say, okay, well, if you want named entities to show up in your search results, you must use our named entity recognition product or mechanism.
I don't like that limitation, because not everybody has the need for your particular idea of what named entity recognition should look like. So I like the products that will allow you to both use them for more of a hands-off, turnkey solution, but also, back to the earlier example about data scientists, will allow you to use that too, should you have the desire and the expertise.
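
To illustrate the enrichment step being discussed, the sketch below attaches taxonomy topics and named entities to a chunk before it would be embedded or indexed. The tiny rule-based tagger stands in for a real NER model or classification service, and the taxonomy is made up.

# Sketch: enrich a chunk with taxonomy facets and named entities so those signals
# ride along with the content into the index or the embedding step.
TAXONOMY = {
    "lease": "Land Agreements",
    "deposition": "Legal Discovery",
    "clinical trial": "Drug Development",
}

KNOWN_ENTITIES = {"Texas", "NASA", "SharePoint"}

def enrich(chunk_text: str) -> dict:
    lowered = chunk_text.lower()
    topics = sorted({label for term, label in TAXONOMY.items() if term in lowered})
    entities = sorted({e for e in KNOWN_ENTITIES if e in chunk_text})
    return {
        "text": chunk_text,
        "topics": topics,        # taxonomy facets for filtering and faceted search
        "entities": entities,    # named entities to boost relevance and grouping
    }

print(enrich("Scanned lease agreements for parcels in Texas, stored in SharePoint."))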

Seth Earley: Right, right. Great. So we wanted to switch gears a little bit and learn a little bit more about you. How did you get into this space? Tell us a little bit about John the person, and, you know, what do you do outside of work? Just give us a little bit of the history.

John Lenker: I don't realize. I didn't even realize we were coming up on the hours. I know it's fascinating. The topic is, and these have been fantastic questions guys, I appreciate it. How did I get into this space. I, in 2008, I was a. A software developer. I have a computer science degree and I was just a programmer. And I got a call from a recruiter in 2008 and he said, I'm interviewing for a company in Atlanta. They want somebody who's willing to learn enterprise search. And I said, what is enterprise search? And. And he said, they specifically want someone that has a development background but is willing to learn something new. And they need someone that understands the development side of what is involved with search. And what I'm referring to specifically is how you tinker with fast, ESP. When it used to exist. Yeah. And so I interviewed with someone who became my very close friend. In fact, in the latter stages of 2023, I exchanged more text messages about employment opportunities as a friend with this person than anybody else. More than my wife, more text messages than anybody else. So this guy that hired me in 2008 as a contractor to work on a fast project for a big company here in Atlanta, he became the person who introduced me to search and taught me about it. And then I sort of followed him everywhere he went after that because I got hooked on search and he made all these opportunities materialize for me.

Seth Earley: Nice. And we were together at Sinequa for a while. We're not. We're now

John Lenker: apart again, which has been the case on and off for 16 years. But that's how I got into the space. As far as me as a person, I have a six-year-old son and he is a baseball player. He's playing soccer, and I'm coaching him in baseball. I'm really into weightlifting and snowboarding, so I'm on a five-day-a-week weightlifting program and I take one or two snowboarding trips a year. I'm totally addicted to that, but I'm also totally handcuffed because I live in Atlanta, so if you want to go snowboarding, you need to get on an airplane. If you're listening and you have any ideas on going snowboarding in North Carolina, I wouldn't. No, don't go to North Carolina. I'm in Salt Lake. You're more than welcome

Chris Featherstone: to always come out to the best snow on earth. Oh, Utah. I love

John Lenker: Utah. Yeah, definitely come. Yeah, I'll take you to some spots that'll,

Chris Featherstone: that'll make you probably want to move here. So just be careful.

John Lenker: I would. That's a whole nother conversation. I would, but I can't. Did

Seth Earley: you say you had a pilot's license or you're Working on a pilot's license. Oh,

John Lenker: yes. That's right, we did talk about that. I forgot about that. Yes, I have a private pilot's license. A long time ago, I actually wanted to be a military aviator. My father was an A-10 pilot and I was going to go that same route, but I aged out. I turned into a pumpkin. When you're 27 and you decide at that point you want to fly in the military, it's kind of too late. Yeah. So that's a really long story, but as a result of that pursuit, I have a private pilot's license.

Seth Earley: Nice. But you. But you. You went past your expiration date in the military. Yeah. Yes. It was

John Lenker: not the path for me. It wasn't my choice, but here I am. Yeah.

Seth Earley: Well, this has been a real pleasure. I really enjoyed it. The time has just flown by and it's really been great to have you. So thank you so much for your time and, and your thoughts and your expertise. It's. It's really been terrific, John. I love it. Thank you. And Chris, also

John Lenker: thank you to. Thank you to you for the invitation and the great questions today. I really enjoyed it, guys. Thank you so much. Absolutely.

Seth Earley: And thank you to our audience for tuning in and listening. This has been another episode of the Earley AI Podcast. And thank you to Carolyn, our producer, for her work behind the scenes, and we will see you all next time. So again, thank you, John. Thank you, Chris. Great to see you, and thanks for your time today.

John Lenker: Cool. Thanks, guys. See you guys. Thank you. Bye. Now visit earley.com to find links to the full podcast on all audio platforms, to listen on the go. Thank you for watching.

 

Meet the Author
Earley Information Science Team

We're passionate about managing data, content, and organizational knowledge. For 25 years, we've supported business outcomes by making information findable, usable, and valuable.