Why Ontologies Are the Missing Foundation in Every Enterprise AI Project
Guest: Linda Andersson, Founder and CEO, Artificial Researcher
Hosts: Seth Earley, CEO at Earley Information Science
Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce
Published on: January 12, 2022
In this episode, Seth Earley and Chris Featherstone speak with Linda Andersson, Founder and CEO of Artificial Researcher, a Vienna-based startup building AI-powered semantic search tools for scientific and patent research. Linda's path to this work is one of the most distinctive in the field: she has dyslexia, which forced her to learn how language works the same way a computer does - through explicit rules, pattern recognition, and tree structures rather than by ear. She started programming in BASIC at age nine, was unable to read until twelve, and then spent the next three decades building expertise that crosses computational linguistics, library and information science, computer science, mathematics, and journalism. In this conversation she explains why domain ontologies are the non-negotiable foundation for any AI system that needs to understand meaning, why 70 percent of any AI project is data curation before the first algorithm runs, and exactly what organizations should do in the first steps of an AI journey - before touching a model.
Key Takeaways:
- Domain knowledge encoded in ontologies is the key to any successful AI or machine learning system - without it, algorithms learn statistical associations from the data they happen to see, not the meanings organizations actually need them to understand.
- The word disambiguation problem is more serious than most organizations realize: the word "bus" means a vehicle to one person, a computer data pathway to another, and training a model on an imbalanced corpus will make it associate the word exclusively with whichever meaning appeared most often, losing the other entirely.
- A balanced training corpus is essential - training only on a specific domain produces a model that fails to understand general language, but training on general data without domain augmentation produces a model that misses domain-specific meanings; both must be combined deliberately.
- Developing any AI system - chatbot, search, classifier - requires spending approximately 70 percent of the project timeline on curating, cleaning, and structuring data before the algorithm ever runs; skipping this step and sending unstructured data directly to a model produces results that simply do not work.
- The supervised-to-semi-supervised progression is the practical path for most domain AI projects: start with rule-based supervised learning to build a seed of well-labeled examples, use that seed to train a BERT-style model, then gradually transition to semi-supervised learning as the model becomes capable of predicting with enough accuracy to bootstrap further labeling.
- Academia needs to promote genuinely interdisciplinary education for NLP and AI practitioners - being able to read a language is not the same as being able to do computational linguistics on it, and the best practitioners understand both the mathematical algorithms and the linguistic structures of the problem they are solving.
- The right starting point for any enterprise AI project is not the model or the use case - it is an honest assessment of the current condition of the data, followed by the business vision, so that the gap between the two can be realistically scoped before any investment in tooling or talent.
Insightful Quotes:
"My philosophy when it comes to this is that domain knowledge is the key - domain knowledge in an existing ontology, some kind of structured knowledge. If you want to start text mining a new area, you need that structure first. If you just send in a lot of unstructured data with no guidance on what the algorithm should be looking for, you won't get anything that can actually predict what you want." - Linda Andersson
"When I look at a sentence I usually see a tree structure - where the noun phrases are, where the verb phrases are. I see the structure. I know that the noun at the beginning and the noun at the end have a connection, and the verb tells me what that connection is. My dyslexia actually gave me an edge, because I look at text the way a computer looks at text: as a lot of symbols joined together that you need to extract in order to see what information you can get out." - Linda Andersson
"A lot of the applications for AI and machine learning and cognitive semantic search are trying to make up for past sins in information curation. If organizations had done things right in the first place, they wouldn't have some of the problems they're facing now. The knowledge sources themselves need to be curated and structured appropriately for retrieval - and that work has to come before the algorithm, not after." - Seth Earley
Tune in to hear Linda Andersson explain why she describes herself as "as stupid as a computer" - and why that turned out to be one of the most powerful advantages a computational linguist can have; how she spent three years on a Master's thesis on Swedish patent retrieval before getting the algorithm right; why ontology is the knowledge scaffolding that tells an AI system what matters and how terminology relates to other terminology; and what to do when your data is in such poor condition that your first task is not building a model but simulating a smaller clean dataset to prove the concept while curation of the real data proceeds in parallel.
Links
Information about Artificial Researcher
Demo pages for the index and the ontologies generated by the Artificial Researcher data pipeline solution:
Contact Linda:
Thanks to our sponsors:
Earley Information Science
CMSWire
Marketing AI Institute
Podcast Transcript: AI and Semantic Search - Why Ontology Is the Foundation, Not the Afterthought
Transcript introduction
This transcript captures a conversation between Seth Earley, Chris Featherstone, and Linda Andersson about the foundational role of domain ontologies and structured knowledge in making AI systems work. Drawing on her background in computational linguistics, library science, and computer science - and on a unique personal experience with dyslexia that taught her to process language the same way a computer does - Linda explains what organizations consistently get wrong when starting AI projects, and what the right starting sequence actually looks like.
Transcript
Seth Earley: Good morning, good afternoon, good evening, depending upon your timezone - welcome to today's podcast. I'm Seth Earley.
Chris Featherstone: And as always, I'm Chris Featherstone.
Seth Earley: Before we jump in, I just want to thank our sponsors: Earley Information Science, CMSWire, and the Marketing AI Institute. They have some awesome courses you should definitely check out, and a terrific conference as well.
We have a wonderful guest today. She is a true polymath with expertise in library and information science, programming, machine learning, AI, and information architecture - all near and dear to my heart. She has a background in research and academia, holds multiple patents for her work in natural language processing, and has developed innovative AI-powered semantic search tools for scientific and patent research. Her work has been adopted by a range of institutions, and she is at the bleeding edge of cognitive AI. She is the Founder and CEO of Artificial Researcher, a startup based in Vienna, Austria. Please welcome Linda Andersson.
Linda Andersson: Thank you. I'm very happy to be here, and honored to be the first guest on your podcast this year.
Chris Featherstone: Linda, your background is so interesting - it seems like the different degrees and areas of focus have been building blocks, one after another. I'd love to get a sense of why that path, and how it's been helping in the work you've been doing.
Linda Andersson: I've always been interested in information and knowledge. It could have to do with being a late starter with reading. When I finally cracked that encryption of how to read, my idea was that if I could stay in a library my entire life and just read books, it would be lovely. To structure books in different ways to extract knowledge - that is my true interest. There's so much you can learn from extracting knowledge from different books or different mediums, joining them together, and coming up with new solutions that you had not thought of before. That has always been my driver, regardless of whether I was doing legal work at twelve or my first programming, which I did when I was nine - back in 1984, programming in BASIC.
Chris Featherstone: Double-click on that - you mentioned cracking the code on reading, and you've told us in the past that was a real difficulty for you in the beginning.
Linda Andersson: Yes. When we started reading in Sweden, around age seven, everyone could read after just a few months. I couldn't even understand how these different symbols - I didn't know they were letters - could be joined together and come out as a word, and then you could connect all those words into a sentence. For me that was a puzzle. And I really struggled with that for a long time because of my dyslexia.
When my teacher said here is how you should pronounce this, I didn't hear how those sounds connected. I had perfectly fine hearing, but I couldn't hear the connection. So I started trying to recognize how different sequences of letters are connected to a specific sound, and how that sequence is semantically connected to a meaning. I see it in patterns - instead of putting letters together to form a word, I know that if it is this particular sequence of letters, then this is that word. I had to jump directly to that.
And everything I learned about how to join a sentence together grammatically correctly was explicit rules I had to learn - exactly as a computer works. One thing I usually joke about when I talk about my dyslexia is that I am as stupid as a computer, because I need everything explicit.
What I find fascinating is that if you can decode language and make all these things explicit - things that people know are correct because they hear it, because they feel it - for a computer, you cannot rely on that. You need to tell the computer why it is correct. You need to give it the rules. And that is how I work too. I need all the rules, all the exceptions, and then I can produce grammatically correct sentences. This has actually given me an edge, because I look at a text the way a computer looks at a text - as a lot of symbols joined together that you need to extract in order to see what type of information you can get out.
When I see a sentence I usually see a tree structure - where the noun phrases are, where the verb phrases are. I see the structure and I know that the noun at the beginning and the noun at the end have a connection, and the verb tells me what that connection is. So if I give Seth a book, that means I am the person doing something, and the book is what I am giving to Seth.
Chris Featherstone: At what age were you able to see the forest for the trees - the context and the narrative?
Linda Andersson: Around twelve. Before that it was more or less me struggling with putting one letter before another. I was actually doing programming before I could read. Math was much easier for me because it was numbers and logic. Meanwhile, language - why should you use an "s" on a third person singular in English present tense? Why? And so around twelve I started reading a lot of books - probably one book per week, sometimes more. Every type of book. Right now I am reading crime novels in German because that is how I practice my German. When I started at twelve it was everything I could come across, including Shakespeare.
From that point forward, text technologies were always interesting to me. And when you read at twelve you also need to start doing exams. You need to extract information from books, and I never had time to read all the big books to get all the information. So I needed to learn how to extract information quickly. If it was physics or chemistry I just needed to identify what was the most important part, what would be on the exam. So I started actually doing information extraction - manually - because I needed to.
Seth Earley: The framework, the mindset, the approach - all very analogous to what you are doing now. Let's talk about Artificial Researcher and the kinds of semantic work combined with AI. What I liked about our first conversations is that you are really at the intersection of ontology, information architecture, and artificial intelligence. There is a misconception that you just point the AI at all the data and it figures it out. Our philosophy is that you can do that to some degree, but you really do need to think about that core architecture and the semantics. Want to talk about your philosophy at that intersection?
Linda Andersson: My philosophy is that domain knowledge is the key, and domain knowledge exists in ontologies - some kind of structured knowledge. If you want to start text mining a new area, you need that structure first.
Seth Earley: Do you want to provide your definition? I have a quick one as well.
Linda Andersson: Yes, I will give the linguistic definition. If you have WordNet, for example, you have phrases - domestic pet is a phrase which is a parent to dogs, cats, and so forth. You have some kind of structure so that you know how different phrases are connected and related, and that is also how you explain certain concepts. What is the synonym to cat - is feline a synonym? It is partly a synonym, but it is actually a broader term. All felines. So those parent and sibling relationships are key.
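The broader-and-narrower relationships Linda describes can be sketched as a minimal ontology fragment in code. This is an illustrative toy, not WordNet itself - the terms and the `BROADER` mapping are hypothetical:

```python
# A minimal ontology fragment: each term maps to its broader (parent) term.
# As Linda notes, "feline" is broader than "cat" - a partial synonym only.
BROADER = {
    "cat": "feline",
    "dog": "canine",
    "feline": "domestic pet",
    "canine": "domestic pet",
    "domestic pet": "animal",
}

def ancestors(term):
    """Walk up the broader-term chain to the root."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain

def related(a, b):
    """Two terms are related if one is an ancestor of the other,
    or if they share a common broader term (siblings)."""
    return (a in ancestors(b) or b in ancestors(a)
            or bool(set(ancestors(a)) & set(ancestors(b))))
```

Here `related` treats ancestors and siblings alike; a real ontology would distinguish relation types (hypernym, meronym, and so on).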
Seth Earley: Taxonomy - parent, child, whole-part relationships - and then we have multiple taxonomies that describe any knowledge domain, and then we have the relationships between them. When I think of ontology, it is the knowledge scaffolding of the organization - the framework in which you can hang your knowledge. Taxonomy of source structures, with equivalent terms, related terms, and all those relationships. If I have products and services, here are the services that serve this product; if I have a problem and a solution, here are the solutions for those problems. So did you want to continue on and talk about how that relates to AI and knowledge extraction?
Linda Andersson: For knowledge extraction, we use those broad and specific terms. We take a large amount of data from a specific domain, because different words have different meanings in different domains. Take the word "bus." All three of us think of something different. If you think about it in computing, it would be a bus - as in a data bus, a memory pathway. If you are thinking generally, it would be a vehicle - a double-decker bus, transportation.
But if you train your algorithm, for example on patent data, and you forget to achieve a good balance in your training corpus, it could be that the word "bus" becomes associated only with memory and registers, because that is the context where it occurs most - and that is what the computer learned from its teacher. That is where we come back to the need for ontology. Ontology is some kind of manually curated or semi-manually curated structure that can help the deep learning algorithm understand that there are different types of "bus" out there. If you let it go by itself, it will only learn what occurs most frequently. That is why you need to give it guidance.
Chris Featherstone: How do you instruct organizations about this bias they don't know they have? When they get into it and they're looking for "bus," they don't know they should be looking for computer bus versus transportation bus.
Linda Andersson: I can take the example from my PhD, working with patent engineers and patent examiners. They have exactly this problem. The primary approach is: you need to use the taxonomy of the technical fields - which field are you in? What does this word mean in that particular field? So the first thing you do is apply structured data. If you have the International Patent Classification system, for instance - it has sections, classes, subclasses, main groups, subgroups, going down to potentially millions of subcategories in the end - you use that as a way to say: when this word occurs in this context, it belongs to transportation; when it occurs in this context, it refers to some kind of memory or microchip. That is the first step.
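The first step Linda describes - letting a field taxonomy decide what a word means - can be sketched roughly like this. The sense inventory and cue words are hypothetical; a real system would derive them from documents already tagged with IPC codes:

```python
# Hypothetical sense inventory: the meaning of an ambiguous term depends
# on which technical field (an IPC-style class) the document belongs to.
SENSES = {
    "bus": {"transportation": "road vehicle", "computing": "data pathway"},
}

# Toy field models: words that signal each field. In practice these would
# come from corpora already labeled with classification codes.
FIELD_CUES = {
    "transportation": {"passenger", "road", "driver", "route"},
    "computing": {"memory", "register", "microchip", "processor"},
}

def disambiguate(term, context_words):
    """Pick the sense whose field shares the most cue words with the context."""
    if term not in SENSES:
        return None
    best_field = max(
        SENSES[term],
        key=lambda f: len(FIELD_CUES[f] & set(context_words)),
    )
    return SENSES[term][best_field]
```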
You cannot chunk everything into one pot and hope it comes out good. There are cases where generalized models work, but you need to be aware of the context in which you are going to use that model. If you want to build a search engine on transportation, pizza, and computing all at once, you need to make sure that the training of the algorithm is on all those domains, balanced. And then people say - but doesn't that mean I only train on that domain? No, that is not good either, because you also need a general understanding of language. You need a balanced corpus that covers the basic grammatical and semantic information from all types of areas, and on top of that you put the target domain you want to go to. Training only on a specific domain means the model will not understand what the word "dog" is if it never appeared in your domain corpus - in patent data, it would probably just be treated as an acronym for something.
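The balanced-corpus idea can be sketched as a deliberate mixing step: a general-language base plus a controlled share of target-domain text. The `domain_share` ratio and the sampling strategy are illustrative assumptions, not a prescription:

```python
import random

def build_balanced_corpus(general_docs, domain_docs, domain_share=0.3, seed=0):
    """Combine a general-language base with a deliberate share of domain text.

    The general corpus supplies basic grammar and everyday vocabulary
    (so "dog" stays an animal); domain documents are sampled, with
    replacement if the domain set is small, until they make up
    domain_share of the final mix.
    """
    rng = random.Random(seed)
    n_domain = round(len(general_docs) * domain_share / (1 - domain_share))
    mix = list(general_docs) + [rng.choice(domain_docs) for _ in range(n_domain)]
    rng.shuffle(mix)
    return mix
```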
Seth Earley: That makes a lot of sense. Begin with the structure, the architecture, the concept space - which has to relate to the domain you are trying to retrieve information from. That is the reference data. It tells the system what is important, what terminology means, and how terminology relates to other terminology. Let's also talk about the content itself - the knowledge base, the actual assets we are going after. A lot of people wonder whether they really need to think about structuring and curating that, or whether they can just point algorithms at it. I have my philosophy and I think it is aligned with yours.
Linda Andersson: I would go back and talk about pre-processing - not just the high-level architecture, but actually the importance of pre-processing data. It is really the key thing. Usually when I say developing an AI system - if you want to develop a chatbot - 70 percent of your time is spent on curating, cleaning, and structuring the data. And then hopefully you can send it to the algorithm to do the first learning. If you have not structured the data first and you just send in whatever you have into the algorithm, then you come out and say it does not work. It does not work if you just send in a lot of unstructured data with no guidance on what the algorithm should be looking for. You won't get any good results that can actually predict what you want.
For example, I did a project for a company working with food allergies. In Sweden, around 10 percent of us have some kind of food allergy. So the need was: I want to make this recipe, but I am allergic to all of these ingredients - what can I use instead? In order to do that, you first need to collect a lot of recipes and see how they differ. Then you need to collect information about all the ingredients and see how they can be substituted. And you need to know which ingredients are common allergens. And then you need to know what people actually use instead - if you are allergic to eggs in a pancake, banana is a common substitute. That is the knowledge you need to start collecting, and it needs to be structured. What are the different types of data? How are the recipes structured in the files? How can you extract where and when things are cooked and how they should be cooked? So yes, 70 percent of the work is that pre-processing curation before anything algorithmic happens.
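The recipe example boils down to structured lookups over curated knowledge. A toy sketch, with hypothetical `ALLERGENS` and `SUBSTITUTES` tables standing in for the curated data Linda describes:

```python
# Hypothetical structured knowledge: common allergens and known substitutions.
ALLERGENS = {"egg", "milk", "wheat flour"}

# Substitutions are context-dependent: banana replaces egg in a pancake,
# but not in, say, a meringue - so the dish is part of the key.
SUBSTITUTES = {
    ("egg", "pancake"): "banana",
    ("milk", "pancake"): "oat milk",
}

def adapt_recipe(dish, ingredients, user_allergies):
    """Replace each ingredient the user is allergic to, where a
    substitution is known for this kind of dish."""
    adapted, unresolved = [], []
    for ing in ingredients:
        if ing in user_allergies:
            sub = SUBSTITUTES.get((ing, dish))
            if sub:
                adapted.append(sub)
            else:
                unresolved.append(ing)
        else:
            adapted.append(ing)
    return adapted, unresolved
```

The `unresolved` list is the useful output for curation: it shows exactly where the structured knowledge still has gaps.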
Seth Earley: Great example. So we are curating the relationships between concepts in order for that type of application to work correctly. And as I mentioned, a lot of organizations are not clear on the idea that they need to spend a lot of time and energy on the knowledge sources themselves before retrieval will work.
Chris Featherstone: I'd love your take on where academia is missing the mark in terms of educating people to get started in the right direction. We get these bright, brilliant people in the professional world who are somehow unequipped to deal with these scenarios. Where could academia do a better job?
Linda Andersson: I think it is very important to promote interdisciplinarity. In natural language processing especially, you can come from two different directions: you can be a computer scientist and be an expert on algorithms and math, or you can come from computational linguistics. I did both. I did computer science as well as computational linguistics - for my Master's work I focused on computational linguistics, and for my PhD at the Technical University of Graz I was in computer science. That combination lets you really understand that these are not independent from each other. You need to learn what a noun is in order to be able to do text mining and identify technical terms, because technical terms in English are at least 80 percent nouns, and they are usually multi-word compound nouns. These are things you need to be aware of.
Just because you speak a language and can read a language does not mean you can do computational linguistics or text mining on that language, even if you are a native speaker. You need to understand the structure of the language. And that is where I was fortunate - because of my dyslexia, I needed to get everything explicit in order to teach myself the language. So teaching the computer that same thing was actually easy for me. I would say: promote interdisciplinary education, and make sure students understand that there is not one solution that will solve the problem. You will probably have to go left, then right, then take a roundabout, because there are layers in language and speech technology that you will not identify if you only have one perspective. Broaden it - not just STEM, but also social sciences, linguistics, library science, journalism - all of these things are important disciplines for understanding how knowledge actually works.
Seth Earley: So getting the right interdisciplinary talent in place. Most organizations are still getting their minds around cognitive AI - virtual assistants, offloading call center tasks, allowing customers to self-serve. Where would you advise them to start, regardless of the end application?
Linda Andersson: If you are starting and you know you have data - let's say it is machine-readable - the first thing is to actually know what the condition of your data is. Can you even use it? In Swedish, for example, we have characters like ä (a with an umlaut) and å (a with a ring). In the early days of computing, these were often reduced to just "a" or other substitutions, and you could have millions of words in your corpus that are actually corrupted forms of perfectly normal words. You need to know the condition of your data before you do anything else.
Then I ask: what is your vision? And once I know both the data condition and the vision, I can see whether it is even possible to match them, or not. If it is not possible as-is, I will say: given the condition of your data, cleaning it will take two years. Let us instead take a smaller set, potentially simulate it or collect it from somewhere else, and try that first while the curation of your real data proceeds in parallel. Right now, for example, distributional semantic models are actually very good for OCR correction - because OCR errors have certain characteristics, and the context around a misspelled word is usually still correct, so you can detect and reduce those errors using semantic similarity. In patent data we have millions of corrupted words because of poor historical OCR, and distributional semantic models can help normalize those.
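As a very rough stand-in for the distributional approach Linda mentions, corrupted tokens can be mapped back to a known vocabulary by similarity. This sketch uses plain string similarity from the standard library rather than semantic context, so it is only a simplified illustration; the vocabulary is hypothetical:

```python
import difflib

# Known-good vocabulary - in practice drawn from a curated corpus or a
# distributional semantic model, here just a tiny hypothetical list.
VOCAB = ["transmission", "transportation", "semiconductor", "memory"]

def normalize_ocr(token, vocab=VOCAB, cutoff=0.8):
    """Map a possibly OCR-corrupted token to its closest known word.

    Linda's point stands: real pipelines exploit the (usually intact)
    context around the corrupted word via distributional similarity;
    string similarity here is only a stand-in for that signal.
    """
    if token in vocab:
        return token
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```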
So the sequence is: start with a supervised, rule-based approach to curate and build a seed. Once you have curated enough data, introduce it and start doing semi-supervised learning, moving gradually toward a fully trained model that can predict with enough accuracy to continue bootstrapping. But the first step is always: what is the state of the data, and what is the vision, and are those two things compatible in the timeframe you have?
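That sequence can be sketched as a self-training loop: hand-written rules create the seed labels, a model trains on them, and its confident predictions grow the labeled set. Everything here - the rules, the toy word-count model, the margin threshold - is a hypothetical simplification of what would in practice be a BERT-style model:

```python
def seed_label(doc):
    """Rule-based seed labeling: explicit hand-written rules, the
    supervised starting point. Returns a label or None."""
    if "register" in doc or "memory" in doc:
        return "computing"
    if "passenger" in doc or "road" in doc:
        return "transportation"
    return None

def train(labeled):
    """'Train' a toy model: per-label word counts."""
    model = {}
    for doc, label in labeled:
        counts = model.setdefault(label, {})
        for w in doc.split():
            counts[w] = counts.get(w, 0) + 1
    return model

def predict(model, doc, min_margin=1):
    """Predict the best label; return None when not confident enough."""
    scores = {
        label: sum(counts.get(w, 0) for w in doc.split())
        for label, counts in model.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None
    return ranked[0][0] if ranked[0][1] > 0 else None

def bootstrap(docs, rounds=3):
    """Semi-supervised loop: seed with rules, then let the model label
    what it is confident about and retrain on the growing set."""
    labeled = [(d, seed_label(d)) for d in docs if seed_label(d)]
    unlabeled = [d for d in docs if seed_label(d) is None]
    for _ in range(rounds):
        model = train(labeled)
        still = []
        for d in unlabeled:
            lab = predict(model, d)
            if lab:
                labeled.append((d, lab))
            else:
                still.append(d)
        unlabeled = still
    return dict(labeled)
```

The confidence margin is what keeps the bootstrap from polluting its own training data - documents the model cannot separate cleanly stay unlabeled for the next round, or for a human.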
Seth Earley: To summarize: knowing your business objective, understanding your data - the state, the structure, the quality - and then working from that reality toward the vision. It has really been a pleasure talking with Linda Andersson, Founder and CEO of Artificial Researcher. Information about her company and services is in the show notes. And again, thank you to our sponsors - CMSWire and Simpler Media, the Marketing AI Institute, and Earley Information Science. Thank you Sharon for all the production work behind the scenes. Chris, thank you as always. It has been a pleasure.
Chris Featherstone: Linda, thank you so much. It is always a pleasure to have really intelligent colleagues on. You are definitely smarter than a computer.
Linda Andersson: I will agree - I am definitely smarter than a computer.
Seth Earley: Thanks everyone, and we will see you next time.
