Earley AI Podcast - Episode 29: Real-Time Accent Conversion and the Future of Inclusive Voice AI with Maxim Serebryakov

Breaking Accent Barriers in the Enterprise: How Phoneme-Level AI Is Transforming Contact Center Communications

Guest: Maxim Serebryakov, Co-Founder and CEO at Sanas 

Hosts: Seth Earley, CEO at Earley Information Science 

Chris Featherstone, Sr. Director of AI/Data Product/Program Management at Salesforce 

Published on: April 24, 2023


In this episode, Seth Earley and Chris Featherstone speak with Maxim Serebryakov, Co-Founder and CEO of Sanas, a Stanford-born AI company building real-time accent conversion technology for enterprise contact centers. Max shares the personal story that inspired Sanas - a friend's experience with accent-based discrimination despite a Stanford engineering background - and explains why solving accent bias required building a novel phoneme-level algorithm that operates on the client edge with near-zero latency. The conversation explores deployment challenges, bias in speech-to-text engines, ethical considerations, and the broader vision for giving everyone a choice in how their voice sounds. 



Key Takeaways:

  • Sanas was inspired by a Stanford systems engineer whose thick accent caused him to underperform in a contact center despite exceptional technical knowledge.
  • Standard voice conversion research cannot transfer accents because accent is encoded at the phoneme level, not just in pitch or tone modulation.
  • Sanas built a novel edge-based algorithm with minimal latency that existing research had not achieved, solving the cloud round-trip problem that previously made real-time voice conversion impractical.
  • Accent bias in speech-to-text engines disproportionately favors standard American English, and Sanas can normalize audio input to improve accuracy across all downstream transcription tools.
  • Contact centers spend heavily on accent training and reject qualified candidates based solely on voice - Sanas eliminates both of those costly and discriminatory practices.
  • Phoneme structures are similar across languages, making Sanas robust enough to generalize across multiple languages and dialects beyond just English.
  • The long-term vision extends beyond contact centers to gaming, streaming, enterprise communications, and daily interactions - giving every person control over how their voice is perceived.

Insightful Quotes:

"We ended up building an algorithm that really doesn't exist in the research world. It's very innovative, it works on the edge, works on the client, and it's very streamlined and efficient." - Maxim Serebryakov

"You're not just modulating the pitch and tone, you're actually changing the underlying phonemes that are present within. A phoneme is the most granular level of speech that you could get." - Maxim Serebryakov

"We want to build a system where people have a choice to sound however they want to sound, and we want to do something to improve the bias that a lot of people experience on a daily basis and allow people not to feel the need to change the way they speak to fit in." - Maxim Serebryakov

Tune in to hear how Sanas is using phoneme-level AI to eliminate accent bias, transform contact center operations, and give every person on Earth a choice in how their voice is heard.



Podcast Transcript: Real-Time Accent Conversion, Phoneme-Level AI, and Inclusive Voice Technology

Transcript introduction

This transcript captures a conversation between Seth Earley, Chris Featherstone, and Maxim Serebryakov about the origin story behind Sanas, the technical breakthroughs required to build real-time phoneme-level accent conversion, how the technology improves speech-to-text accuracy and eliminates accent bias in enterprise contact centers, and the broader ethical and societal implications of giving people control over how their voice sounds to others.

Transcript

Seth Earley: Welcome to today's podcast. I am Seth Earley.

Chris Featherstone: And I'm Chris Featherstone. It's good to be with you, my friend.

Seth Earley: Yeah, nice to see you. Our guest today is passionate about using AI to give people a choice in how their voice sounds to others. Born in New York, raised in Russia, and educated at Stanford, please welcome co-founder and CEO of Sanas, Max Serebryakov.

Maxim Serebryakov: There's really no way of pronouncing anything correctly. There are so many different deviations and variations. I also go by Max, Maxime, Maxim - I'm okay with anything. But yeah, it's a pleasure to be here, guys. Thank you for inviting me, and looking forward to the chat.

Seth Earley: Yeah, so tell us a little bit about your background. The world according to Max - what's your journey been and how did you end up where you are?

Maxim Serebryakov: I'm Max. I was originally born in New York, didn't live there for quite some time. I basically was born there, lived a few months, and then moved out and lived for most of my life in Russia. I grew up there. My family is in Russia. As you can imagine, I've had quite a few very unusual perspectives when it comes to coming to the US, accents, being an immigrant. And that's really the brewing grounds for the creation of my company, Sanas. I would love to tell you more about Sanas and what we're building.

Seth Earley: So when did you move back to the US and what brought you here?

Maxim Serebryakov: I got into Stanford from high school when I was 18. I came to Stanford and honestly fell in love with the Bay and the United States. It's such a beautiful place. I've done so many different trips across the US with my family and it's incredible out here. I studied at Stanford, and part of my studies I spent a lot of time looking into artificial intelligence, studying artificial intelligence. Through these experiences I ended up building Sanas, and here we are today.

Chris Featherstone: Originally you weren't looking at it from language and speech, because you had dabbled in a bunch of other things. You were looking at quant-style mathematics and all sorts of interesting statistical anomalies. So how did you get to this piece? Because it's a really neat story - you weren't focused on speech and language before.

Maxim Serebryakov: You know what's interesting about Stanford in general - the computer science education and artificial intelligence education is that you basically take a bunch of different classes and you learn really the fundamentals in all of these different categories of machine learning, like computer vision, natural language processing, spoken language processing. You learn techniques - reinforcement learning techniques - and a lot of these state-of-the-art architectures. So you really get a good broad perspective on how things work. You get a good education on probability theory, stochastic processes. But really your specialization starts whenever you leave college. And the areas of focus - what pulled us into building Sanas was really a fascinating incident that happened to one of my good friends, this guy named Raul. Raul was a Stanford systems engineer. At the time he was a very, very smart guy, and it just so happened that a few years back, Raul's home country, Nicaragua, was going through a very tumultuous time period. They were going through a revolution, and the demonstrations left his family in pretty uncomfortable financial situations. He had to sadly take a leave of absence from his college degree to support his family financially, which involved him just trying to find a job. He was an immigrant, and he came back to his home country, Nicaragua, and he started applying to software engineering positions, day in and day out. What was interesting is - we all know Stanford, it's a great school - but in a place like Nicaragua, or places outside of the US, they don't actually know what Stanford is. They don't know it's a great school, that he's technical, he's talented. When they see his resume, all they see is a person that never matriculated, doesn't have their bachelor's degree, only has a high school diploma, and has very high English proficiency skills. So for a high-paying software engineering job, he couldn't get the position. 
So his horizons expanded. He started applying to less conventional positions - call center jobs. Interestingly enough, he ended up getting a call center job answering customer queries associated with broken computers. A Stanford systems engineer - a guy that could disassemble and reassemble a computer and knows the ins and outs of an operating system - basically just day in and day out answering these very repetitive queries. You'd think a profile like Raul's would outperform everyone within the contact center. But the truth is actually quite different. He was underperforming, and the reason wasn't a lack of knowledge - it was how he communicated with his customer base. He had a very thick South American accent, which led to people treating him differently. At one end of the spectrum, there were full-blown racial slurs; at the other end, people questioning his qualifications solely on the basis of how he sounded. And after learning about this, we started exploring this idea a little bit more - what if we were able to transfer Raul's accent so that accent is no longer a limiter at all?

Seth Earley: A limiter. Exactly.

Maxim Serebryakov: And we started building these algorithms. We started off building off some of these voice conversion research articles that were out there. We implemented some papers, thinking that voice conversion could actually probably transfer someone's accent. But quickly we saw that you could convert a person's voice to sound like, say, a different gender or some superhero like Batman. But if the input speech is Indian accented or Russian accented, you're going to have a Russian accented Batman as an output. Which is funny to imagine, but it shows the limitations of modern-day voice conversion research - in that you're not just modulating the pitch and tone, you're actually changing the underlying phonemes that are present within. What is a phoneme? A phoneme is the most granular level of speech that you could get. In images, you're working with RGB pixels - pixels are the most granular. For speech, it's phonemes. And yeah, so we started exploring this. We tried also, a little bit later, implementing some speech-to-text, text-to-speech projects. And our biggest issue - the same as Chris ran into - was latency. There are a lot of limitations that come from prosody. Text is a very lossy form of data. You go from a high amount of data to pretty much a very flat amount of data, and then from that you synthesize. But yeah, we ended up building an algorithm that really doesn't exist in the research world. It's very innovative, it works on the edge, works on the client, minimal latency, and it's very streamlined and efficient.
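Max's pixel analogy can be made concrete with a toy sketch. Everything below is purely illustrative - the dictionary entries are hand-written, ARPAbet-style approximations, not drawn from Sanas or any real pronunciation lexicon. The point is that a word is a sequence of phonemes, and phoneme-level accent conversion operates on units like these rather than on pitch and tone alone.

```python
# Illustration only: a word decomposed into phonemes, the most granular
# unit of speech (analogous to pixels in an image). These entries are
# hand-written ARPAbet-style approximations, not a real lexicon.
PHONEME_DICT = {
    "batman": ["B", "AE", "T", "M", "AE", "N"],
    "voice":  ["V", "OY", "S"],
    "accent": ["AE", "K", "S", "EH", "N", "T"],
}

def to_phonemes(sentence: str) -> list[str]:
    """Flatten a sentence into its phoneme sequence; unknown words map to '?'."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(PHONEME_DICT.get(word, ["?"]))
    return phonemes

print(to_phonemes("Batman voice"))
# ['B', 'AE', 'T', 'M', 'AE', 'N', 'V', 'OY', 'S']
```

A voice converter that only modulates pitch leaves this underlying sequence untouched - which is why a Russian-accented input yields a Russian-accented Batman. Accent conversion has to rewrite the sequence itself.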

Chris Featherstone: The one thing we discussed at length too was that this is not just a problem that can be solved by a general speech model. Doing speech recognition is one thing, but this is so much more complex centered around the sound models that need to go into it. And like you said, how to get to the most basic constructs of what makes up a voice, which is sound. There are people out there saying I'll just do this with some speech rec and some text to speech - but no, you can't do that. When you get into tonality, inflection, speed - there are all those key pieces and also the nonverbal things that are going on as well, and then how to conjugate everything else.

Maxim Serebryakov: Chris, you nailed it. How do you properly map disfluencies? What makes speech natural are the ums and the uhs and the mmms and the elongated phonemes. So it's a really, really fascinating research problem. And for us, the applications for this technology are very broad. We could bring it into education, enterprise communications, medicine, and all of these different spaces where accent could be something that limits the speaker. And really, we chose deployments initially with contact centers and enterprises because we saw that the pain point is most clear with them. And on top of that, speech is very structured there - you're not working with whispering speech or yelling speech or crying speech. You're working largely with monotone or slightly enthusiastic tonalities. The vocabulary is largely consistent too. It made sense to do English first as the baseline.

Chris Featherstone: I mean, that's an easy one. The majority of the world speaks English in business. But you guys have solved for way more than just English - what is Sanas doing now in terms of multi-languages?

Maxim Serebryakov: Funny enough, we've actually done deployments for other languages. And interestingly enough, what we've learned is that because we do our analysis at the phoneme level, and phonemes are pretty similar across different languages, it ends up being a model that's robust enough to account for many different languages.

Seth Earley: And so what kinds of clients do you have at this point?

Maxim Serebryakov: We work with some of the largest contact centers in the world. Of the top call centers, we work with the majority of them. We work with top enterprises as well, for whom customer service interactions are pretty key. We mostly deploy on voice-based call centers - that's our priority given the nature of our tech. Most of our deployments are in India and the Philippines. We've expanded out to deployments in Pakistan and a few places in Europe, so we're very actively growing. And we also have some deployments in South and Central America, spread across enterprises and call centers.

Chris Featherstone: I'd love to get your take on the intended and unintended consequences you've seen from it so far.

Maxim Serebryakov: You know, from what I've seen in our deployments, one of the most heartbreaking things that we've had to encounter is people going through accent training. Say a person comes into the contact center and they get rated a certain level of fluency - they have different levels of fluency and this term called mother tongue influence. And say they want to get upgraded from one level of fluency to another. The way they oftentimes do that is by going through accent training. All sorts of enterprises spend massive amounts of money putting agents through accent training to modify the way they speak to fit in better and communicate with their customers in the desired accent. With us, you actually remove the need for accent training completely. A lot of call centers also reject agents in the recruitment pipeline solely on the basis of accent. That's no longer a factor. You no longer reject candidates solely on how they sound. There are all sorts of these unintended consequences because one thing we realized is this is a really important problem for our customers and they really, really care about it - because at the end of the day they care about their employees and they want to make sure they're happy, and this just removes one thing for anyone to think about. And in general, when taking a more broad perspective - when it comes to vision, there are similar parallels between Sanas and the way things work with, like, Instagram. On Instagram, you take a picture of yourself - you could make yourself look skinny, you could make yourself look fat. You have complete control over the way you represent yourself digitally. You could change your eye color, your hair color - whatever you want. But Instagram is just a picture. It's very one-dimensional. What matters is really your voice - who you are, the content of your character portrayed through that, not just your appearance. 
Why is it that you have a level of control like that when it comes to your digital appearance, but you don't have a similar level of control when it comes to your voice? We're really excited about being that medium - the platform that empowers people to feel comfortable with their voice and control the way they speak without having to change themselves to fit in.

Seth Earley: That's fascinating. What are some of the challenges with doing a deployment in an enterprise? I imagine you need to train on specific types of - I'm sure there are generalized models and specific models you have to fine-tune. What does an organization need to do to be ready to use your technology?

Maxim Serebryakov: From the model perspective, I don't think there are as many issues. Things are mostly fine. The reason is that we built a model that's very generalizable - it's able to generalize across different dialects and accents exceptionally well. We have a very large data team internal to Sanas that prioritizes this and maintains a good distribution of different dialects. But really, for us, one of the challenges is just working with customer tech stacks. Different customers have different operating systems, different OS build versions, different hardware, different chips - some Intel, some AMD. Some have two-core processors, some have four. Accounting for each and every variation of hardware is definitely no easy task, but it's a necessary one when deploying on the client. Some folks run Linux, some Windows, some Google's OS.

Chris Featherstone: I can think about this too in terms of Seth's question around barriers to entry for organizations. Part of it is that if you can figure out what the accent is, you can also generate more accurate output to refine things like search capabilities. Because the information architecture is pulling back all the data being generated from these interactions. Then I get more accurate transcripts, more accurate documentation, and probably a better ability to understand sentiment and scoring.

Maxim Serebryakov: So true. One of the biggest limitations of modern-day speech-to-text engines is how robust they are towards different edge cases. There's a lot of bias in speech-to-text that leads to it recognizing a certain gender's speech better, or a certain pitch distribution better, or a certain accent better. Accents are a really, really big one. And a lot of these massive big tech titans are really fighting for a couple percentage improvements in word error rates for their speech-to-text engines. And if you get a certain level of improvement, you're popping champagne and celebrating. This improvement is really just built off of robustness to those edge cases. One cool thing that Sanas provides is it normalizes a lot of the input audio - normalizes it to sound like whatever your audio distribution is able to understand best. And oftentimes, because of the way speech-to-text engines are trained nowadays, they're mostly trained on standard American English speech. So if you're able to build an algorithm and use Sanas as a front end for that speech-to-text algorithm - that's preprocessing in the pipeline - it also improves some accuracy metrics. It makes something that can be more easily recognized by the speech-to-text algorithm.
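The preprocessing idea Max describes can be sketched as a simple pipeline. Everything here is hypothetical: `normalize_accent` and `transcribe` are placeholder names standing in for an accent-normalization front end and for any downstream speech-to-text engine, not real Sanas or vendor APIs.

```python
# Hypothetical pipeline sketch: accent normalization as a preprocessing
# stage in front of speech-to-text. Function names are placeholders,
# not real Sanas or STT vendor APIs.

def normalize_accent(audio: bytes) -> bytes:
    # Placeholder: a real front end would convert the audio toward the
    # accent distribution the downstream engine was trained on
    # (typically standard American English).
    return audio

def transcribe(audio: bytes) -> str:
    # Placeholder for any speech-to-text engine, cloud or open source.
    return "<transcript>"

def pipeline(raw_audio: bytes) -> str:
    # Normalizing first means the engine sees audio closer to its
    # training distribution, which is where the word-error-rate
    # improvement would come from.
    return transcribe(normalize_accent(raw_audio))
```

The design point is that the normalizer is engine-agnostic: the same front end can sit ahead of a big-tech STT service or a model trained only on open-source data.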

Seth Earley: Yeah, it's an interesting idea.

Maxim Serebryakov: And it also democratizes speech-to-text a little bit. Right now any of us could go on GitHub, download 100,000 hours of speech data, and use that data to train a speech-to-text engine. You get pretty solid results from that, but it won't be robust to different accents because these massive tech companies have their own datasets. You probably won't have access to them. But you can have access to open source ones and get a good enough speech-to-text engine that you can use internally. Speech-to-text can be really expensive for companies too, so this could really be a game changer for companies that are okay with a certain level of accuracy.

Chris Featherstone: Hardest accents - which are they?

Maxim Serebryakov: One of my favorite things about my job is really how much you learn about different cultures, different accents, and the linguistics that come with these different places on Earth. An interesting one is the Japanese accent. I wouldn't say it's hard, but it's very unusual relative to other languages out there. That's largely because the Japanese language sometimes flips phonemes - the L becomes the R when a Japanese-accented individual speaks in English. Another very common thing in the Japanese language is consonant-vowel-consonant-vowel combinations. If you look at the names of cities in Japan, that's usually the pattern. So words that have a lot of consonants together are a little bit difficult for Japanese-accented individuals to pronounce. For example, the word McDonald's - in Japan they add extra vowels next to the consonants. Sometimes the way they say McDonald's is Maku Donarudo. It's a very interesting one. We get a lot of traffic from Japan though - massive - and we're definitely going to expand out there. There are just some interesting research problems to solve first.
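The two patterns Max describes - the L/R substitution and the insertion of vowels to break up consonant clusters (what linguists call epenthesis) - can be sketched as toy string rules. This is a deliberately crude illustration of the phonotactics, not how Sanas models accents:

```python
VOWELS = set("aeiou")

def japanese_accent_toy(word: str) -> str:
    """Toy sketch of two Japanese phonotactic patterns: L -> R
    substitution, and inserting a vowel after any consonant that is
    not already followed by one. Real accent phonology is far richer;
    this is illustration only."""
    word = word.lower().replace("l", "r")
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch.isalpha() and ch not in VOWELS and nxt not in VOWELS:
            out.append("u")  # epenthetic vowel breaks the cluster
    return "".join(out)

print(japanese_accent_toy("milk"))  # -> miruku
```

Even these two rules recover real loanword shapes ("miruku" is how "milk" enters Japanese), which hints at why phoneme-level modeling generalizes across languages.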

Seth Earley: Just wanted to pause for a moment, we're about halfway through, and remind folks that we're speaking with speech conversion expert Max Serebryakov.

Chris Featherstone: What's interesting is I've had to give a speech internally around how speech works and things like that. And it's heartbreaking, because you always want to pull data on how this works. Come to find out, in 2022 in the US, 21% of adults were illiterate. Even if you take out people who aren't native to the US, that still means there's a huge generation of folks who are illiterate, and another 60 or 70% of them read at a 6th grade level or below. I'm always fascinated by these things because we always learn speech before we learn reading - one is in the language processing part of the brain and the other is in the visual processing part of the brain. And so it's been super interesting with these types of technologies now that are coming out.

Maxim Serebryakov: Yeah. And it's interesting that we mentioned movies. These massive streaming companies spend hundreds of millions of dollars on movie dubbing, to make sure that Brad Pitt's mouth moves in the right way when selling movies in China, for instance. That's something our technology could also work on. Imagine Brad Pitt's original voice but in another language with the correct accent - it's a very interesting scenario.

Chris Featherstone: It's almost like a deepfake for voice, right?

Maxim Serebryakov: Right. And by the way, speaking of deepfakes - whenever generative companies appear, there's always this other branch of companies that end up also forming to help detect what's happening. Because there will be a massive market for that too.

Seth Earley: I just did a webinar with a charter school to talk to them about how to use generative AI in their day-to-day work and how to deal with it in the classroom. What do you see as the downsides to this type of technology? People will use things for nefarious purposes. Could you see someone using it to catfish or commit fraud? And do you have safeguards around that?

Maxim Serebryakov: Before getting into that - are you teaching kids how to cheat in school via ChatGPT? Is that what's happening?

Seth Earley: I'm teaching the teachers how to deal with cheating and to teach the critical thinking skills about how they should be asking questions. And also giving them GPTZero, which is the detector. So yeah, they're going to use it, but you need to say what's the best way to use it. It can augment certain things.

Maxim Serebryakov: If I had ChatGPT in college or even in high school or middle school - oh my goodness. Very different person today. So kudos to you. But yeah, we have a very strict KYB process each time we onboard a customer. Our investors emphasize that, our customers emphasize that. We have a bunch of InfoSec certifications. It's really critical for us to make sure that whoever we deploy with and work with are legitimate organizations with no nefarious tasks or actions. That's part of our onboarding process. We analyze our customers carefully.

Seth Earley: So where do you want to go with the tools and the technology? What's on the horizon for you in terms of planning and applications?

Maxim Serebryakov: For me, the importance of this technology really lies at its core in allowing normal people to be able to communicate. There are obviously some limitations when it comes to identifying the synthesis to make sure it's not being used for nefarious purposes. But long term we think the research world and the industry world will account for that. We want to bring this to enterprises, we want to bring this to daily communications between people. We had a conversation with one of the founders of Twitch, and they were telling us that they have a lot of gamers who are exceptionally successful in Asia and have trouble breaking into the US market because of their accent - the way they sound. We want to build a system where people have a choice to sound however they want to sound, and we want to do something to improve the bias that a lot of people experience on a daily basis and allow people not to feel the need to change the way they speak to fit in - permanently.

Maxim Serebryakov: Absolutely. But this is one of those things I personally struggle with a little. Even though there are a lot of nefarious actors online, I also love America for its freedom of speech, the ability to express yourself. And whenever moving into the direction of censorship, you always have to be careful about going too far. Where do you really draw the line? If definitions change, the line shifts. You don't really know where it ends. So it's an interesting scenario, but I don't see Sanas expanding into that market quite yet.

Seth Earley: Did you have anything you can quickly demo for us?

Maxim Serebryakov: Yeah. I wish I had pulled in one of our voice actors to give a proper demo - I didn't think that one through. But let me give you a demo from the website. I'll intermittently enable and disable Sanas so you can hear how it changes.

[DEMO: The demo played a contact center agent's voice with Sanas disabled and then enabled, demonstrating real-time accent conversion on a simulated customer service call. The difference between the original and converted voice was clearly audible to podcast listeners.]

Seth Earley: That was very, very impressive. Thank you for that impromptu demo.

Maxim Serebryakov: Sorry I couldn't make it live. We'll have to have one of our reps call you guys and you could have a proper chat.

Maxim Serebryakov: We've also done some experiments with accents within India - one Indian accent transferring to another Indian accent. And there's actually a lot of bias within India too, where one accent in English is identifiable by another person. You can also modify accents there, and there's definitely a market for that.

Seth Earley: And it starts to, at some point, become a translation of some sort - you're translating accents. Are you also looking at language translation as an extension or part of this?

Maxim Serebryakov: Not quite yet. Speech translation is a great space and it's very interesting and exciting, but there are certain limitations that appear with durations. For example, some languages put nouns and verbs at the beginning of sentences while others put them at the end, and it goes into some of these temporal-based questions. If a company is building speech-to-speech translation, they really have to account for the permutations and combinations of different languages based off of grammatical structure. I think there are some markets right now where the tech is definitely there, and I've actually invested in some of these companies and they're incredible.

Chris Featherstone: Max, what should we be looking for from you personally and from Sanas?

Maxim Serebryakov: We're continuing to grow, continually expanding our clientele - focused on deployments and revenue, and we're scaling out revenue. So this is going to be an incredible year for us on that front. We're a 70-person team. I want to expand our horizons from call centers to even more enterprises and build relationships there. If there are any enterprises listening in, you're always welcome to reach out to me personally. Reach out to Sanas and we'd love to work with you. Our technology has a lot of different applications, so even if you think of an interesting application that we may not have considered, reach out to us. We always welcome feedback as well.

Seth Earley: Well, it's been wonderful to meet you, Max. Where are you based?

Maxim Serebryakov: I fly between the Bay and Bangalore. You could find me in one of those two.

Seth Earley: This has been fascinating. I've really enjoyed it and great to have you. Thank you again for your time. Chris, thanks for the introduction. Appreciate your insights and it's a fascinating area. We'll definitely want to stay connected and hear how things evolve.

Maxim Serebryakov: Seth, Chris, appreciate it guys. Let me know how I could be helpful and I'm here. Thanks, my friend.

Meet the Author
Earley Information Science Team

We're passionate about managing data, content, and organizational knowledge. For 25 years, we've supported business outcomes by making information findable, usable, and valuable.