Guest: Mike Pappas, CEO at Modulate
Host: Seth Earley, CEO at Earley Information Science
Published on: May 22, 2026
In this episode, Seth Earley speaks with Mike Pappas, CEO of Modulate, whose work began in gaming - one of the most demanding environments for real-time voice intelligence - and has since expanded to enterprise applications including fraud detection, customer abuse prevention, AI agent guardrails, and sales coaching. They explore why transcription is not the same as understanding, what gets lost when audio is reduced to text, and why voice is the most powerful tool fraudsters have. Mike shares candid and specific insights on deepfake detection, the fine line between safety and surveillance, and what organizations need to put in place before deploying voice AI at scale.
Key Takeaways:
Insightful Quotes:
"When you hear a voice, you hear the intonation, you hear the emotion, you hear pregnant pauses - there is so much information being carried in that audio that gets lost when you pull down to a transcript. And whenever we talk to someone who professionally works in a contact center, they are always saying, we know these transcripts are losing tons of good value." - Mike Pappas
"If I am actively harassing you and the platform is able to come in and put a stop to it live in the conversation, that feedback actually systematically changes behavior. Getting an email 30 minutes later saying we noticed you did something wrong - that just infuriates people, it does not lead to change." - Mike Pappas
"There is a fine line between safety systems and surveillance systems. How do you design voice AI that improves safety and trust but does not cross that boundary that makes users and employees uncomfortable?" - Seth Earley
Tune in to discover why real-time voice intelligence is one of the most consequential and least understood frontiers in enterprise AI - and what organizations need to get right before they deploy.
Links
LinkedIn: https://www.linkedin.com/in/mike-pappas-9a30a858/
Website: https://www.modulate.ai
Ways to Tune In:
Earley AI Podcast: https://www.earley.com/earley-ai-podcast-home
Apple Podcast: https://podcasts.apple.com/podcast/id1586654770
Spotify: https://open.spotify.com/show/5nkcZvVYjHHj6wtBABqLbE
iHeart Radio: https://www.iheart.com/podcast/269-earley-ai-podcast-87108370/
Stitcher: https://www.stitcher.com/show/earley-ai-podcast
Amazon Music: https://music.amazon.com/podcasts/18524b67-09cf-433f-82db-07b6213ad3ba/earley-ai-podcast
Buzzsprout: https://earleyai.buzzsprout.com/
Podcast Transcript: Real-Time Voice Intelligence, Fraud Detection, and AI Guardrails
Transcript introduction
This transcript captures a conversation between Seth Earley and Mike Pappas about why voice remains one of the hardest and most consequential problems in AI. They cover the gap between transcription and true audio understanding, how real-time intervention in gaming environments translates to enterprise fraud detection and contact center safety, what it takes to supervise AI voice agents that cannot supervise themselves, and how organizations should think about the privacy implications of building safety systems on voice data.
Transcript
Seth Earley: Welcome to the Earley AI Podcast. I'm your host, Seth Earley, and what we do in each episode is examine how AI is changing business from the perspective of creating value, managing information, managing customer experiences, managing employee experiences, and changing how organizations operate. Today, we are going to be talking about a part of AI that is becoming increasingly critical - and that is voice. Not just speech recognition, but real-time voice intelligence, where AI has to interpret intent, emotion, and risk as conversations unfold.
Joining me today is Mike Pappas, CEO of Modulate. His work began in gaming, which is one of the most difficult environments for voice AI. That technology is now being applied to a much broader set of challenges, including customer abuse prevention, fraud detection, and AI guardrails for voice-driven systems. Mike, welcome to the show.
Mike Pappas: Thanks so much for having me, Seth. Excited to chat.
Seth Earley: Mike, one of the things we like to start with is misconceptions. When it comes to AI and voice and real-time understanding, what are the biggest misconceptions you see?
Mike Pappas: The core misconception is that taking voice and transcribing it is the same as understanding it. And even if we are very good at understanding text, transcribed text is not the same as audio. When you hear a voice, you hear the intonation, you hear the emotion, you hear pregnant pauses - you even hear things like the timbre of a voice that tell you things about age and potential gender. There is so much information being carried in that audio that gets lost when you pull down to a transcript.
Whenever we talk to someone who professionally works in a contact center or deals with large amounts of audio content, they are always saying, we know these transcripts are losing tons of good value. We just assumed that was all we had - that the only way for computers to understand audio was to transcribe it.
Seth Earley: A lot of people think of voice as solved because they think of it in one dimension - we have speech-to-text, we have voice assistants. What do people misunderstand about what it takes to make voice AI useful, safe, and reliable in real environments?
Mike Pappas: The classic example is you are late to an event and you get a text saying, "you okay?" Are they genuinely concerned? Are they being passive-aggressive? You cannot tell from the text. You need the audio.
There are a lot of impressive demos of technology that seems to understand this stuff. But they are all designed for pristine environments where recording quality is perfect and everyone is speaking clearly. When you jump into the real world, you have messy cell phone connections, background noise, emotion, and jargon. Building a system that holds up under those circumstances and can still tell the difference between two old friends trading insults and a genuine confrontation is hard, and that is exactly why gaming was such a good starting point. Gaming is filled with contextual, complicated conversations where you need to deal with emotion and noise to tell the difference between something positive and something negative.
Seth Earley: Tell me about those early use cases in gaming and what you were trying to solve.
Mike Pappas: In online games like Call of Duty, studios are trying to encourage players to use voice chat to coordinate and have a more immersive experience. But the danger is that anonymity on the internet brings out hostile behavior in some people. These studios asked us to look for things like bullying, hate speech, and harassment, so they could identify intentional harm and escalate it. They also asked us to look for positive behaviors - identifying players who are particularly good at coaching new players and making them feel welcome, so the platform could pair new players with those people rather than with someone who would take advantage of their inexperience.
Seth Earley: You are not looking for keywords - you are looking for intent and meaning. How do you approach that problem given how humans actually speak?
Mike Pappas: There is no shortcut. You have to look at every single aspect of the audio to see everything a human being can see. The approach is an ensemble model that consists of over 100 different component machine learning models. Some are looking at emotion, some at cadence and prosody, some at the timbre of the voice, some at higher-level characteristics like interruptiveness or whether a conversation feels scripted. And of course many are looking at the words being said as well.
The first innovation is having all of those models looking at every possible lens of the audio. The second is weaving them together. Five years ago, you could have bought a sentiment analysis tool that could tell you a comment was sarcastic, and you could have bought a transcription tool that gave you the words "nice job, genius." But if you tried to pass that into a summary system, it would say, Mike complimented Seth's great idea. All the data was there, but it could not connect the dots. The sarcasm poisons the meaning of the rest of the conversation, and you need to be able to recognize that and reinterpret everything through that lens.
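As a rough illustration of the "connect the dots" step described above, the sketch below lets an audio-derived sarcasm score reinterpret a text-only sentiment score before a summary line is produced. It is a toy example under stated assumptions, not Modulate's system: the `Utterance` fields, the 0.5 threshold, and the idea that upstream models supply both scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    text: str
    sentiment: float  # text-only polarity in [-1, 1] (hypothetical upstream model)
    sarcasm: float    # audio-derived sarcasm probability in [0, 1] (hypothetical)


def fused_polarity(u: Utterance, threshold: float = 0.5) -> float:
    """Flip the text polarity when the audio signal says the line was sarcastic."""
    if u.sarcasm >= threshold:
        return -u.sentiment
    return u.sentiment


def describe(u: Utterance) -> str:
    """Produce a one-line summary that respects the fused polarity."""
    polarity = fused_polarity(u)
    if polarity > 0:
        verb = "complimented"
    elif polarity < 0:
        verb = "criticized"
    else:
        verb = "addressed"
    return f"{u.speaker} {verb} the other speaker"


line = Utterance("Mike", "nice job, genius", sentiment=0.8, sarcasm=0.9)
print(describe(line))  # prints: Mike criticized the other speaker
```

With the sarcasm score ignored, the same words would be summarized as a compliment, which is the failure mode described in the conversation.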
Seth Earley: Many platforms rely on user reporting or after-the-fact review. What changes when you can intervene in real time?
Mike Pappas: In social and gaming spaces, real-time intervention matters for two reasons. First, if something harmful is happening and the platform can put a stop to it immediately, it is protecting the person being targeted. Second, for the person causing harm, getting feedback fast drastically changes how quickly they learn. If you say all this awful stuff and then get an email 30 minutes later saying we noticed something wrong and we are suspending your account, that just infuriates people and does not lead to behavior change. When you get a notification live in the conversation saying that is not okay, this is our code of conduct, it actually systematically changes behavior. In our work with Call of Duty, that kind of intervention led to about an 8 percent drop in repeat offense rate month over month.
Seth Earley: Talk about the shift from gaming to corporate environments. What are the enterprise use cases?
Mike Pappas: The philosophy is the same - if something bad is happening in this conversation, you need to know immediately. But what bad looks like is different. It could be fraud. If you are talking to a deepfaked voice right now and it is in the process of convincing you to hand over access to your accounts, you need to be notified in real time.
Another use case is AI agent guardrails. If you are deploying real-time voice agents, there is a risk those agents go off the rails. We spoke with people in a recruiting context who were using AI interviewers and had prompted the AI to check whether a candidate was flexible. What they found was that the AI would sometimes ask the candidate to perform a yoga pose during the interview. You genuinely cannot make this stuff up. But you need a system that notices what is happening in real time so you can escalate and take action before you lose a customer, lose a candidate, or end up in a lawsuit.
Seth Earley: Talk about the fraud use case in delivery and how voice is being used by fraudsters.
Mike Pappas: One of our first corporate customers was a Fortune 500 food delivery company that came to us after seeing our work in gaming. They started out wanting to know if someone was threatening a delivery driver - unfortunately that is a real thing that happens. But when we deployed with them, we quickly found we were also detecting scams against the drivers.
Those scams usually look like one of two things. The first is order fraud, where someone calls the driver on the way to the restaurant and claims the app only placed part of their order, making enough of a threat that the driver feels obligated to resolve it out of pocket. The second is saving the driver's contact information and calling back days later claiming to be from IT, then using that conversation to steal personal information.
Being able to monitor those calls and alert the platform specifically when someone is actively trying to manipulate a driver - not in a way that captures everything said, but just for those specific risk signals - allows the platform to protect those workers before the damage is done.
Seth Earley: Why is voice so effective as a tool for manipulation, and how are fraudsters evolving their techniques?
Mike Pappas: Voice is how we express and understand emotion, and emotion is a powerful way to move human beings. When you are trying to manipulate someone, you create an emotional context that pushes them to act against their better judgment. We are seeing things like fraudsters playing a recording of a baby crying in the background to create urgency. If you are only looking at the transcript, you will never catch that. But if you are listening to the audio, you can tell that the echo of that sound does not match the acoustic environment of the speaker at all.
Fraudsters are also scaling their operations with AI in ways that were not possible before. A single bad actor can now place automated calls hitting 20 of your agents simultaneously, running the same script, betting they can find the weak link before you shut them down. Being able to detect that the cadence of a conversation feels scripted, or that nine different callers in the last 30 seconds are all claiming to be the same person, becomes extremely valuable even when the words seem legitimate.
People assume fraudsters are unsophisticated because of obvious email scams. But those email scams are actually designed to be obvious - to filter out everyone who would not fall for them, so fraudsters do not waste effort. The real danger is a phone call where someone is using a deepfaked voice of a family member saying they are in trouble and need money wired immediately. That can get very realistic very quickly with the tools available today.
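The burst pattern mentioned in this answer, many separate calls in a short span all claiming to be the same person, can be approximated with a sliding time window. The sketch below is one minimal way to do that, assuming calls arrive in timestamp order and that some upstream step has already extracted a claimed identity for each call. The class name, the 30-second window, and the limit of three calls are illustrative assumptions, not details from the conversation.

```python
from collections import defaultdict, deque


class IdentityBurstDetector:
    """Flag a claimed identity used by too many distinct calls within a time window."""

    def __init__(self, window_seconds=30.0, max_calls=3):
        self.window = window_seconds
        self.max_calls = max_calls
        # claimed identity -> deque of (timestamp, call_id), oldest first
        self._calls = defaultdict(deque)

    def observe(self, claimed_identity, call_id, timestamp):
        """Record a call and return True if the identity now exceeds the limit.

        Assumes timestamps are non-decreasing across calls to observe().
        """
        recent = self._calls[claimed_identity]
        recent.append((timestamp, call_id))
        # Drop entries that have aged out of the window.
        while recent and timestamp - recent[0][0] > self.window:
            recent.popleft()
        distinct_calls = {cid for _, cid in recent}
        return len(distinct_calls) > self.max_calls


detector = IdentityBurstDetector(window_seconds=30.0, max_calls=3)
# Nine calls, three seconds apart, all claiming to be the same customer.
flags = [detector.observe("jane.doe", f"call-{i}", i * 3.0) for i in range(9)]
print(flags.index(True))  # prints: 3 (the fourth call is the first one flagged)
```

Because the check only needs the claimed identity and a timestamp, it can run even when the words on each individual call sound legitimate.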
Seth Earley: Voice agents are going to be handling more and more of our interactions. How should organizations think about guardrails for AI-driven voice systems operating at speed and scale?
Mike Pappas: There are two critical things you need to know about what your AI voice system is doing. The first is when a call is going very poorly in real time so you can jump in - whether by bringing in a human or redirecting the system. What counts as going poorly will vary. For some organizations, a customer implying they are likely to churn is a crisis. For others it is only when there is significant regulatory liability or a high probability of fraud. But you need some way to know in real time when your system is doing something wrong.
The challenge is that most AI systems cannot introspect. They cannot ask themselves whether something is going wrong. They are just doing their best at each moment. So you need something supervising from outside that can say, this is not what your actual goals in this conversation were.
The second thing you need is a way to digest all of those conversations and surface trends. If you are using AI agents at scale, you cannot QA every call. You need a system that can tell you that the number of callers frustrated about billing issues went up 300 percent in the last 24 hours. Something is wrong and you need to investigate.
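The trend-surfacing idea in this answer can be sketched as a comparison of per-topic call counts between a baseline period and the most recent period. The example below is a simplified illustration, assuming topics have already been assigned to calls by some upstream analysis. The function names, the 300 percent threshold, and the minimum-volume cutoff are assumptions chosen for the example. Note that an increase of 300 percent means the current count is four times the baseline.

```python
def percent_increase(baseline, current):
    """Percent change from baseline to current; baseline must be positive."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def spiking_topics(baseline_counts, current_counts, threshold_pct=300.0, min_count=20):
    """Return {topic: percent increase} for topics that rose by at least threshold_pct.

    Topics with no baseline, or with too few current calls to be meaningful,
    are skipped rather than flagged.
    """
    flagged = {}
    for topic, current in current_counts.items():
        baseline = baseline_counts.get(topic, 0)
        if baseline <= 0 or current < min_count:
            continue
        pct = percent_increase(baseline, current)
        if pct >= threshold_pct:
            flagged[topic] = pct
    return flagged


previous_day = {"billing": 25, "shipping": 40}
last_24_hours = {"billing": 100, "shipping": 44, "login": 30}
print(spiking_topics(previous_day, last_24_hours))  # prints: {'billing': 300.0}
```

A production system would likely compare against a longer baseline and account for normal daily variation, but the core check is this ratio.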
Seth Earley: There is a fine line between safety systems and surveillance systems. How do you design voice AI that improves safety without crossing the boundary that makes users and employees uncomfortable?
Mike Pappas: Privacy is a genuinely tricky balance. In order to protect what is happening in a call, you do need some way to listen to it. In general, you should be collecting and storing as little of that information as possible.
The way to approach this is to lay out in advance what you specifically care about - toxicity in a gaming context, fraud in a banking context, AI misbehavior in an interviewing context - and build the system to escalate only those specific signals rather than recording and storing everything. That methodology means collecting much less data overall and only surfacing what the organization was trying to surface in the first place.
For organizations already required by law to record all calls, the question becomes whether it is better to have humans listening to all of it, or to have AI extract only the information that matters and focus human attention there. The latter is actually more privacy-safe - there is less exposure risk of things that were not the focus of the monitoring getting surfaced or misshared.
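The "escalate only those specific signals" approach described in this answer can be illustrated with a small filter: the organization declares up front which signals it cares about, and only those signals, once they cross a threshold, produce an alert. This is a minimal sketch under assumed names. The `Alert` record, the signal labels, and the 0.8 threshold are hypothetical, and the per-call scores are assumed to come from an upstream analysis step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    call_id: str
    signal: str
    score: float


def escalate(call_id, scores, watched, threshold=0.8):
    """Return alerts only for pre-declared signals at or above the threshold.

    Nothing else about the call (audio, transcript, unwatched scores) is
    carried into the alert, which keeps what reaches a human reviewer minimal.
    """
    return [
        Alert(call_id, name, score)
        for name, score in sorted(scores.items())
        if name in watched and score >= threshold
    ]


# A banking deployment that has declared fraud as its only concern.
scores = {"fraud": 0.93, "toxicity": 0.95, "churn_risk": 0.40}
alerts = escalate("call-001", scores, watched={"fraud"})
print(alerts)  # prints: [Alert(call_id='call-001', signal='fraud', score=0.93)]
```

The high toxicity score in this example is deliberately not surfaced, because it was not among the signals the organization set out to monitor.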
Seth Earley: When executives are trying to decide where voice AI fits into their strategy, where should they focus first?
Mike Pappas: Start by being really clear about the difference between voice analysis AI and voice agents. Right now there is a lot of confusion where people use voice AI to mean both, and they have very different purposes and very different limitations. Voice agents can be enormously helpful for supporting customers at scale. But they cannot introspect - they will be fooled by fraudulent and manipulative activity, and they will not notice when a call is going off the rails. That is not their purpose.
Voice analysis AI, on the other hand, cannot talk to you. It is a supervisor. But when it is accurate, it allows you to deploy voice agents with much greater confidence.
The most important thing before deploying any of this is to make sure you know why you are doing it and what your KPIs are. If you have a contact center and you are measuring customer satisfaction, deploying voice AI should not be driving that score down. Build out your analytics first so you can actually validate whether the system is solving the problem you think it is.
Seth Earley: Mike, this has been a fascinating conversation. You have really made clear that voice is not just another input channel - it is a highly complex, high-impact environment for AI. Thank you so much for joining us and for the insights.
Mike Pappas: Thank you very much, Seth. This was a great conversation and I am glad to have had the chance to talk through it all.
Seth Earley: And thank you to our listeners for tuning in. Stay with us for more conversations about how AI is transforming business, technology, and the ways organizations operate in the real world. We will see you next time on the Earley AI Podcast.