Beyond Aggregate Metrics: How Scenario-Based Testing Reveals the Hidden Failures That Sink AI Products
Guest: Gordon Hart, Co-Founder and Head of Product at Kolena
Hosts: Seth Earley, CEO at Earley Information Science
Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce
Published on: March 27, 2023
In this episode, Seth Earley and Chris Featherstone speak with Gordon Hart, Co-Founder and Head of Product at Kolena, a machine learning testing platform built to expose the hidden behavioral failures that aggregate accuracy scores routinely miss. Gordon draws on seven years of developing computer vision models for defense and security applications to explain why even a 99.5% recall score can mask catastrophic gaps - including a real incident where a state-of-the-art gun detection model failed to identify a prop gun sitting alone in an empty bin. The conversation covers scenario-based testing, the counterintuitive primacy of test data over training data, full pipeline evaluation, saliency maps for explainability, and how to build a regression testing framework that gives ML teams genuine confidence before deployment.
Key Takeaways:
- Aggregate benchmark metrics hide silent failures - a single recall score can improve overall while masking severe regression in a critical and common scenario.
- Testing data is vastly more important than training data because it drives every deploy/no-deploy decision about whether a model has the required behaviors.
- Scenario-based unit testing - breaking benchmarks into well-scoped subsets by use case - is the only reliable way to know precisely where a model succeeds and where it fails.
- Testing the full product pipeline end-to-end often produces counterintuitive results, where a technically weaker individual model yields better real-world outcomes than a stronger one.
- Evaluation metrics must align with how the system will be judged in the field, not just internal accuracy scores that engineers are incentivized to optimize against.
- Saliency maps reveal which input features a model is actually relying on, enabling teams to find and fix brittle shortcuts before they surface as production failures.
- A regression test suite - continuously expanded with new failure cases discovered in the field - is the foundation for sustainable, trustworthy ML product development.
Insightful Quotes:
"When you have all this data, if you're just looking at that one number - that aggregate metric computed across your entire benchmark - that is abstracting away all of the different ways in which you succeed and the ways in which you fail." - Gordon Hart
"Your testing data is vastly more important because it is what you're using to actually decide if your new model has the behaviors it needs to have. Your training data is in some ways just an implementation detail." - Gordon Hart
"Having your evaluation metrics align with the way that your system is actually going to be evaluated in the field is a key thing that you can do to get a better understanding of - is this model better for what I set out to do?" - Gordon Hart
Tune in to learn how scenario-based testing, saliency maps, and regression test suites help ML teams eliminate silent failures, build genuine explainability, and ship AI products with confidence they will behave as intended when it matters most.
Links:
- Twitter: https://twitter.com/kolenaIO
- LinkedIn: https://www.linkedin.com/in/gordon-hart/
- Website: https://www.kolena.io/
Ways to Tune In:
- Website: https://www.earley.com/earley-ai-podcast-home
- Apple Podcast: https://podcasts.apple.com/podcast/id1586654770
- Spotify: https://open.spotify.com/show/5nkcZvVYjHHj6wtBABqLbE?si=73cd5d5fc89f4781
- iHeart Radio: https://www.iheart.com/podcast/269-earley-ai-podcast-87108370/
- Stitcher: https://www.stitcher.com/show/earley-ai-podcast
- Amazon Music: https://music.amazon.com/podcasts/18524b67-09cf-433f-82db-07b6213ad3ba/earley-ai-podcast
- Buzzsprout: https://earleyai.buzzsprout.com/
Podcast Transcript: Silent Failures, Scenario Testing, and Building Trustworthy ML Systems
Transcript introduction
This transcript captures a conversation between Seth Earley, Chris Featherstone, and Gordon Hart about the fundamental gap between how ML models are typically evaluated and how they actually need to perform in production - covering the real-world "gun in an empty bin" failure story, why testing data beats training data, how product-level pipeline testing differs from model-level testing, and the practical steps organizations can take to build a regression testing framework that scales with their products.
Transcript
Seth Earley: Welcome to today's podcast. Our guest today is co-founder and head of product at Kolena, which is a machine learning testing platform that allows machine learning and AI teams to test their model's behavior and effectiveness. He's currently living in Washington, DC. Please welcome Gordon Hart. Nice to have you.
Gordon Hart: Thank you, Seth. Thank you, Chris, for having me on today. As the founder of an ML testing platform, over the last 2 years since founding I've done little else other than think about ML testing, and it's my favorite thing to talk about. So I'm excited to be here on your show today and chat about it with you.
Seth Earley: Machine learning algorithms many times are challenging from different perspectives. The performance of these algorithms sometimes varies from what we expect, and many times there's not a lot of transparency and visibility into how neural networks work. So let's talk a little bit about why it's challenging to test machine learning algorithms. Give us your background from that perspective.
Gordon Hart: Absolutely. Over the past 7 years I've been the product owner for various machine learning-based computer vision products, mostly in the defense and security space, across different companies. First I was the head of product and first engineer at a company called Synapse, a startup based out in the Bay Area in California. Our mission was to automate the detection of all sorts of prohibited items in security X-ray scans - guns, knives, improvised explosive devices in an airport context, or narcotics and other illicit substances in a border or international mail facility context. The job that these screeners have is extremely challenging. They're looking at image after image for long periods of time, looking for things that really don't occur all that frequently. It's something that humans aren't that well suited to - it's a thankless task. So it seems like an obvious candidate for automation given the recent advances in detection algorithms.
That company ended up getting acquired by a larger organization working on similar detection systems across many different sensor modalities, from security cameras installed on buildings all the way up to satellites flying in low earth orbit - capturing, processing, and performing prediction on images right there at the edge.
There's really been one consistent thread through all of that. As we're developing these algorithms internally or buying them from other model vendors, the one thing that has remained constant is that unexpected model behavior when you actually go to deploy these into the field. You really can't trust models to behave in ways that you think are sensible as a human. And so I didn't set out with the intention of founding an ML testing platform company. I was kind of forced into it because time and time again we're running into this - something completely unexpected and out of the blue blindsides us. And you think, there has to be a better way to develop these models and validate that they're going to do the things you want them to do.
Seth Earley: You walked through a couple of really interesting examples. Do you want to talk about the one where you had an image that was perfectly perceivable by a human that your algorithm missed?
Gordon Hart: Yeah, absolutely. At Synapse with this detection system we had some extremely advanced detection capabilities - to the point where you could take a handful of small caliber handgun ammunition, like .22 bullets, and throw them in a backpack that has a laptop and a camera and chargers and all the other things that make it really hard to detect tiny pieces of ammunition inside a bag - and we'd detect every single one of them, almost every single time. This is the kind of superhuman performance that you often expect from these ML systems.
We had our latest and greatest model, gotten our customers really excited, and we installed it at a new airport somewhere in the United States. The Security Director at this airport is, of course, very excited to give his fancy new technology a spin. He starts putting it through the paces - starting with what he thinks is the easiest example. He takes a prop gun and puts it in the center of an empty bin and runs it through the scanner.
Seth Earley: And this is a clearly recognizable gun from the side view - and it's all there was in the bin.
Gordon Hart: All there was. The bin itself is pretty much completely transparent under X-ray. You just see this gun. It's very bold - it's unmistakable, even if you've never seen an X-ray before. And we pushed it through. We missed it.
Seth Earley: It missed it. As obvious as it can get.
Gordon Hart: Absolutely. We're telling him this is our best system yet. We've been building this for years. We had 99.5% recall for guns and had shown him examples of extremely difficult cases where a human almost certainly wouldn't have found the gun. And of course, 99.5% recall - you'd think you're going to miss one out of every 200 guns that goes through your scanner. As a human, you'd say okay, that one in every 200 is probably going to be the hard case. So the 199 easy cases should be fine.
Seth Earley: So what went wrong? And how should that have been validated or tested?
Gordon Hart: The core problem was the approach that we were using to validate these models and decide they're good enough for deployment. We had a large benchmark data set - hundreds of thousands of images, all different sorts of varieties and presentations and orientations and occlusion levels. A vast and varied database of images. We would run a new model against that benchmark and compute some high-level metrics. At this threshold we had 99.5% recall across hundreds of thousands of handgun images in X-ray scans. And that aggregate approach is what led us into this scenario where our model has severe underperformance in an important scenario, but we're completely unaware of it.
This came to be because we had some problems with false positives in similar scenarios. So we worked to improve the false positives there, and that had come at the cost of now missing this extremely common presentation - the gun in an empty bin, as easy as it gets. Our evaluation didn't give us visibility into what we were and weren't doing.
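The failure mode Gordon describes - an aggregate recall that looks excellent while one common scenario silently breaks - can be made concrete with a small sketch. Everything here (the scenario names, the counts, the `recall_by_scenario` helper) is illustrative, not from the episode or from Kolena's platform:

```python
from collections import defaultdict

def recall_by_scenario(results):
    """results: (scenario, was_detected) pairs for ground-truth positives.
    Returns the single aggregate recall plus a per-scenario breakdown."""
    hits, totals = defaultdict(int), defaultdict(int)
    for scenario, detected in results:
        totals[scenario] += 1
        hits[scenario] += int(detected)
    overall = sum(hits.values()) / sum(totals.values())
    return overall, {s: hits[s] / totals[s] for s in totals}

# 990/995 on cluttered bags, but 0/5 on the "easy" empty-bin presentation:
results = ([("cluttered_bag", True)] * 990
           + [("cluttered_bag", False)] * 5
           + [("empty_bin", False)] * 5)
overall, per_scenario = recall_by_scenario(results)
# overall comes out to 0.99 - while recall on "empty_bin" is 0.0
```

The aggregate number alone would clear most deployment gates; only the per-scenario breakdown exposes the empty-bin regression.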
Chris Featherstone: This is such a fascinating and overlooked topic. Because what I feel like we do in this world is get super hung up on accuracy scores. But accuracy scores are only as good as the outcome you're driving towards. Who cares if it's accurate if it doesn't meet the business objective? Codifying the outcome - what the stakeholders actually need - is what we should be aiming for. So are training data and testing data mutually exclusive? Is one more important than the other?
Gordon Hart: At risk of going out on a limb, I'll say that your testing data is vastly more important. Your testing data is what you're actually using to decide if your new model has the behaviors it needs to have. You're using this testing data to inform: can I deploy this? Does it do what I set out to do? Is it going to be better than what I have previously deployed? All of these business-critical decisions are informed by your testing data. Your training data is in some ways just an implementation detail. If you can have some magical architecture that only requires a couple of training examples, that's just as good from a business perspective as an architecture where you feed in tens of millions of examples. What matters are the imparted behaviors. And if you don't have a meaningful way to evaluate whether or not those behaviors have been imparted on your model, you're kind of shooting in the dark.
Seth Earley: So if you have a small training set but your testing data is showing you the performance and the granularity - who cares? Tell me about the sandbox-versus-production gap. When you're developing these things, maybe the model performs really well in a controlled environment. But then when you get into production deployment, what variables are changing?
Gordon Hart: Yeah, this is typically where a lot of the surprise comes into play. You think you've rigorously validated something in the proof-of-concept environment, and then you deploy it into production and you see things coming totally out of left field. There's a ton of reasons for this. The data itself might be different. In many cases it's not feasible to use the same data you have access to in production for training and testing during your development cycle.
But there are other ways your testing can reduce uncertainty about how you're going to perform in the field without requiring a perfect one-to-one match. One of those is that your model is probably part of a larger system. There's something capturing data, something doing pre-processing on that data, which is then fed into the model. Then there are probably some post-processing or business logic rules applied on top of what the model puts out. Only then are those predictions finally surfaced to the user or used to make decisions. Testing that full pipeline from pre-processing through post-processing, rather than just testing the model component, can often dramatically improve your visibility into how this thing is actually going to work when you put it out there.
But probably the most effective thing you can do is make sure that your metrics in the lab are actually meaningful to the domain you're deploying in. For detection in security X-ray scans, what we actually learned is there's a threshold at which - when you have too many false positives without actually detecting prohibited items - operators completely lose trust. For an assistive technology like that, you need to be very conscious of that ratio. Otherwise there's a threshold where your system becomes worse than useless - it's just a nuisance. A model might have a better recall score, but on actual data from an airport security terminal it could produce a worse operational metric - too many false positives about things that annoy operators while underperforming at detecting actual prohibited items. Having your evaluation metrics align with the way that your system is actually going to be evaluated in the field is a key thing you can do to get a better understanding of whether this model is better for what you set out to do.
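One way to picture the "worse than useless" threshold Gordon mentions is a deployment score that discounts recall as the false-alarm rate climbs toward the point where operators tune the system out. This scoring function and its numbers are a hypothetical sketch, not a metric from the episode:

```python
def operational_score(recall, fps_per_hour, nuisance_threshold=20.0):
    """Illustrative assistive-screening heuristic: past a false-positive
    rate threshold, operators stop trusting alarms and recall stops
    mattering; below it, recall is discounted as the alarm rate rises."""
    if fps_per_hour >= nuisance_threshold:
        return 0.0
    return recall * (1.0 - fps_per_hour / nuisance_threshold)

# A model with better recall can still lose on the operational metric:
model_a = operational_score(recall=0.995, fps_per_hour=18.0)
model_b = operational_score(recall=0.97, fps_per_hour=4.0)
```

Under this field-aligned metric, `model_b` wins despite its lower lab recall - the kind of reversal Gordon describes.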
Chris Featherstone: What about when we get into generative AI models? Neural nets are pretty straightforward - data in, data out. But these new generative models are kind of off the wall in terms of what they're looking at and how they're bringing results back. Where does testing break down there?
Gordon Hart: It's a very good question. The jury is still way out on how exactly to do that. One thing that we've seen work kind of effectively for a lot of teams is: if you have certain behaviors from these generative models that you really don't want to get out there - you really don't want it to spew hate speech, you really don't want it to produce NSFW imagery - you can use a model to test your other model. Your evaluation results are only as good as the model that's trying to detect hate speech or NSFW imagery. But at least that gives you a way to probe these generative models and say, are there any inputs we can provide for which it spits out inappropriate outputs? That's one way we can at least put some guardrails on this process.
Chris Featherstone: That's kind of what I was thinking - you use a generative model to generate, then use other models to validate. One generative output becomes the input for other models, to make sure what you're creating is safe and is the outcome you're trying to drive.
Gordon Hart: Exactly. And you can evaluate those classification models on manually annotated, known high-quality large data sets and get a relatively good idea of what their strengths and weaknesses are. As long as you always maintain that awareness in the back of your head when you're looking at how that model was used to evaluate the generative model, you can extract meaningful signal from that evaluation.
Seth Earley: One of the things you touched on earlier was product testing versus model testing. Can you walk through that distinction?
Gordon Hart: So the pre-processing and post-processing algorithms running in production are one part of the system. But often in production you actually have pipelines of models where one model is producing predictions that get fed into another model, which feeds it down the chain. Your overall product is not any of those individual models - it's that whole pipeline.
A good example that we saw at Kolena: we were evaluating a pedestrian detection system - the kind of collision avoidance system that runs on the edge on vehicles. Almost every new car you can buy these days has some form of this. The system was set up with a per-frame object detector looking for people in each frame, feeding those detections into a tracking model that tried to resolve detections across frames - this detection and this detection at different times are all really the same person. Then those tracks went into an action classification model trying to decide: is this person crossing or about to cross? This is useful so you can alert the driver to avoid collisions, particularly useful for pedestrians who are an imminent collision risk.
What we saw when evaluating this was that we could improve the object detector - the first model in the pipeline - so that, evaluated in isolation on a large data set, it scored better. But plugged into the actual pipeline, that stronger model could actually yield worse performance at the end goal. We could improve the object detector and be better at detecting people far off or on the edges of frames. But when it came to detecting the people who mattered - the ones crossing or about to cross, close to your vehicle - the technically worse model produced much better results when fed into the tracking model and the downstream action classifier. This is why it's always important to keep that product-level visibility in mind.
Seth Earley: So you had two models. One was performing more effectively at pedestrian detection in isolation. But combined with the tracking model, it performed worse than the weaker individual model. That's very counterintuitive.
Gordon Hart: Exactly. And if at all possible, you should have at least some tests that run your whole system. It's usually much easier and more feasible to test models in isolation, and maybe the bulk of your testing should be done in isolation so that teams can operate independently. But you also need to have your whole product modeled as something you can test, with a wide array of test data pushed through that entire pipeline. And one of the things you can do in full end-to-end testing is look not just at accuracy metrics but at scenarios that actually matter. Of the pedestrians who are actually going to be a collision risk - those who are close to your vehicle, those crossing into your lane - how does this new pipeline perform on that subset? The relationship between meaningful pedestrians and all pedestrians in the frame is not linear. If you perform better on all pedestrians, you might not perform better on the ones that matter. That's what you need to keep your eye on.
Seth Earley: So when you think about setting up these programs - defining use cases, defining outcomes, looking at product-level metrics - what are the initial steps to setting up a comprehensive testing program?
Gordon Hart: I really feel that you should think critically about your tests upfront, prior to kicking off the project. Think about what are the actual behaviors we need in order for this to be a success. I've seen too many proofs of concept fail because there wasn't a clear thesis about what behaviors they need to have - and what behaviors they don't need to have - in order to prove out that concept and turn it into a real product.
Try to have the people who are planning and devising this product think critically upfront. If I'm doing a person detector - a person is not just a person. There are many different ways people can present. You can have vastly different camera resolutions, people who are very close, people who are very far away, people from all different orientations. You should understand upfront which of these scenarios are important, and model those in tests - well-scoped subsets of your test benchmark targeting each specific scenario you need to care about. For people familiar with test-driven development in software engineering, this will sound pretty familiar. But I think it's much more necessary for machine learning, where systems are a lot harder to control.
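In code, the test-driven framing might look like named scenario filters plus per-scenario pass thresholds, declared before any model exists. All scenario names, metadata fields, and thresholds here are invented for illustration:

```python
# Scenario "unit tests" declared upfront, test-driven-development style:
SCENARIOS = {
    "close_range": lambda s: s["distance_m"] < 5,
    "far_range":   lambda s: s["distance_m"] >= 50,
    "low_res":     lambda s: s["height_px"] < 32,
}
THRESHOLDS = {"close_range": 0.99, "far_range": 0.80, "low_res": 0.70}

def run_scenario_tests(samples):
    """samples: dicts with metadata plus a boolean `detected` result.
    Returns (recall, passed) per scenario."""
    report = {}
    for name, selects in SCENARIOS.items():
        subset = [s for s in samples if selects(s)]
        recall = sum(s["detected"] for s in subset) / len(subset)
        report[name] = (recall, recall >= THRESHOLDS[name])
    return report

samples = [
    {"distance_m": 3,  "height_px": 120, "detected": True},
    {"distance_m": 4,  "height_px": 110, "detected": True},
    {"distance_m": 60, "height_px": 40,  "detected": True},
    {"distance_m": 70, "height_px": 20,  "detected": False},
]
report = run_scenario_tests(samples)
```

A model then ships only when every scenario it must handle passes its threshold, rather than when one aggregate number looks good.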
Chris Featherstone: Do you feel like this is actually a catalyst for explainable AI? I believe this testing framework is the most explainability you can even have - here's my testing framework, here's what I know my model does and doesn't do. What are your thoughts there?
Gordon Hart: Yes. Having all of the different behaviors that you want your model to have laid out in test cases - that behavioral report card doesn't tell you the why, but it is the best starting point towards explainable AI. Rather than just having that top-level metric, you understand your model's behaviors at the actual behavioral level, at the sub-class level or the scenario level. Not only does this model succeed in 90% of all circumstances we threw at it - but what are the circumstances where it doesn't?
And there's a lot further you can go from there. If you have a proper automated testing system, there are powerful techniques you can use to inspect the actual why. In computer vision, you can render saliency maps that visualize which pixels in the input most contributed to the output prediction. You can render this for a given image and see, okay, for this prediction my model was looking primarily at this region.
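Gradient-based saliency needs an autograd framework, but the same question - which inputs drive the score - can be asked of any black-box model via occlusion: cover a patch, re-score, record the drop. A minimal dependency-free sketch with a toy model whose score depends on a single "salient" pixel:

```python
def occlusion_saliency(model, image, patch=4, fill=0.0):
    """Slide a fill-valued patch over the image and record how much the
    model's score drops; big drops mark the regions the model relies on.
    `model` is any callable image -> score; `image` is a 2D list of floats."""
    h, w = len(image), len(image[0])
    base = model(image)
    heat = [[0.0] * w for _ in range(h)]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            occluded = [row[:] for row in image]
            for yy in range(y, min(y + patch, h)):
                for xx in range(x, min(x + patch, w)):
                    occluded[yy][xx] = fill
            drop = base - model(occluded)
            for yy in range(y, min(y + patch, h)):
                for xx in range(x, min(x + patch, w)):
                    heat[yy][xx] = drop
    return heat

# Toy model that only looks at pixel (2, 2) - the occlusion map finds it:
image = [[0.0] * 8 for _ in range(8)]
image[2][2] = 1.0
heat = occlusion_saliency(lambda img: img[2][2], image)
```

The patch covering the salient pixel produces a large score drop; everywhere else the drop is zero, exposing exactly which feature the model leans on.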
We used this really successfully at Synapse. We rendered saliency maps for all sorts of predictions and learned that our gun detectors were relying very heavily on the presence of the trigger - the specific shape and size of the trigger was one of the key features our models leaned on to identify whether a gun was present. So we went and collected a bunch of data where we either removed the trigger assembly or occluded it. Sure enough, our models really underperformed in that scenario once we removed those salient features. But when we fed those examples back into training, we developed models that were more robust to that particular case without sacrificing behavior elsewhere. Saliency mapping can tell you the why and help you make better, more informed decisions about what to actually do to improve robustness.
Another thing we saw with saliency maps was that this explainability showed us things we never would have known otherwise. We had a filing cabinet of thousands of different knives - all different shapes, sizes, form factors. We thought we had pretty good coverage. Then we started doing saliency mapping and saw that for fixed-blade knives, cheap knives typically have what's called a partial tang - the blade only extends partway back down into the handle. Our models were relying heavily on that partial tang to predict a fixed-blade knife. So when we saw full-tang knives - where the metal extends all the way through the handle - we were really underperforming. And we only discovered this in the field because even though we thought we had a lot of diversity in our testing data, it was all cheap knives sourced from overseas. None of it had the full tang. If we had had that in our testing data, we would have identified that shortcoming a lot earlier.
Seth Earley: Coming back to that comprehensive testing program - once you've defined the use cases, the outcomes, the product-level metrics, and how a model has to perform under certain scenarios - what are the concrete first steps?
Gordon Hart: Most domains have some metadata that is meaningful for them - metadata that they already have associated with all their data - that they can use to create test cases showing how behavior breaks down in specific scenarios. For a robotics company, you'll have data collected from a lot of different locations, and those deployment locations are very meaningful for stratifying model performance. That's a good way to bootstrap a good set of test cases that you can build over time.
For an autonomous vehicle company, when you collected your data you know what time of day it was, you know where it was, and it's very easy to annotate with how the weather was. Time of day and weather stratification - how do I perform in the morning? In high glare? At night? When it's raining or snowing? You can almost certainly immediately generate these test cases with metadata you already have, and use that to bootstrap a fine-grain testing approach.
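Bootstrapping test cases from metadata you already have can be as simple as bucketing samples by a few fields. A sketch, with illustrative field names:

```python
from collections import defaultdict

def stratify(samples, keys=("time_of_day", "weather")):
    """One candidate test case per combination of the chosen metadata
    fields - e.g. ("morning", "rain"), ("night", "clear")."""
    cases = defaultdict(list)
    for s in samples:
        cases[tuple(s[k] for k in keys)].append(s)
    return dict(cases)

samples = [
    {"time_of_day": "morning", "weather": "rain",  "id": 1},
    {"time_of_day": "night",   "weather": "clear", "id": 2},
    {"time_of_day": "morning", "weather": "rain",  "id": 3},
]
cases = stratify(samples)
```

Each non-trivial bucket becomes a scoped test case, and metrics computed per bucket answer "how do I perform at night?" instead of "how do I perform on average?".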
Once you have the framework in place for creating and evaluating models at the test case level, you'll start identifying new bugs and failure modes from your model in the field. Then you start this process we call regression testing - if you notice a lapse in behavior from your model in the field, you create a test case, maybe a few hundred to a few thousand images, that verifies: yes, this bug exists. In the gun-in-an-empty-bin scenario, that regression test would be maybe 500 examples of guns alone in a bin. Once you have that test case, you add it to a regression test suite - so now whenever you're developing any new models, you're always testing them on the full set of regression tests you've put together over time.
That gives you assurance that yes, this bug that I squashed last quarter - the underperformance in the gun-in-empty-bin scenario - we're still performing well there. You can change your focus, go heads down on one area and improve there, then change your focus elsewhere, still having the peace of mind that your improvements in this one area are not going to wipe out all of the improvements you fought so hard for in another part.
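The regression-suite workflow Gordon outlines - every field-discovered bug becomes a permanent named test case that all future models must clear - might be sketched like this (class, case names, and thresholds are illustrative):

```python
class RegressionSuite:
    """Accumulates field-discovered failure cases; every candidate model
    is checked against every past case before deployment."""

    def __init__(self):
        self.cases = {}  # name -> (samples, required_recall)

    def add_case(self, name, samples, required_recall):
        self.cases[name] = (list(samples), required_recall)

    def run(self, model):
        """model: callable sample -> bool (detected). Pass/fail per case."""
        report = {}
        for name, (samples, required) in self.cases.items():
            recall = sum(model(s) for s in samples) / len(samples)
            report[name] = recall >= required
        return report

suite = RegressionSuite()
suite.add_case("gun_in_empty_bin", [{"hard": False}] * 500, 0.98)
suite.add_case("occluded_trigger", [{"hard": True}] * 200, 0.90)
report = suite.run(lambda s: not s["hard"])  # toy model that misses hard cases
```

The report shows the empty-bin fix still holds while flagging that the new model regresses on the occluded-trigger case - exactly the peace of mind the suite exists to provide.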
Seth Earley: That's great. Some really valuable insights. I just want to remind people that you are at Kolena - K-O-L-E-N-A - dot IO. Gordon, this has been terrific. Any final thoughts?
Gordon Hart: I would like to say that model testing and ensuring that you get the performance you need out of your models is something that a ton of people within an organization care about - from the engineers developing the models, to their technical managers, to the product managers, to the business leaders, to the sales and business development team interfacing with customers, to the customers themselves. Everybody along that chain is invested in model performance. They don't necessarily have the visibility or the precise understanding of how these models are performing. But improving the visibility of this process for all of those different stakeholders is always worth the effort for an organization, once you've gotten out of that initial proof of concept phase and are really trying to solve a real world problem.
Chris Featherstone: I really appreciate you coming on and talking about this topic. For the longest time it's felt like such a Wild West area. And the fact that there are now people focused on this - it's part of this maturity curve we're in. No longer is it okay to not have these mechanisms in place. Having a data science team is just a fact now. It's no longer the exception, it's the rule. My hat's off to you. Is there anything we should be looking for from you or from the company this year?
Gordon Hart: Stay tuned. Take a look at our LinkedIn presence and website presence, and there should be some exciting news coming out about Kolena in the next couple of months.
Seth Earley: Wonderful. Thank you again for your time today. It's been a pleasure.
