Deepfake Out: Voice Synthesis with Music.AI

Eric Doades
Feb 1, 2024
33 min read

Seth Goldstein and Michael Nevins of Music.AI give us the lowdown on audio deepfakes, singing clones, and synthetic voices, and what they mean for the music business.

We peel back the layers of AI-generated voices, revealing how this tech transforms one person's voice into another's while maintaining the original performer's emotional core. We delve into the ethical landscape of voice synthesis, distinguishing questionable fakes from commendable uses like aiding singers and crossing language barriers.

We unpack the potential impact of synthetic voices on music business models and opportunities. Discover how estates of late artists can continue their creative endeavors, thanks to voice modeling, and learn about a marketplace that's reshaping artist compensation.

Finally, Seth, Michael, and Tristra forecast voice modeling's expanding role across various sectors. Imagine a multilingual Shaquille O'Neal or educational content delivered in perfectly localized intonations. We analyze how startups could leverage synthetic voices to enhance customer service and user offerings, thanks to high-quality, localized voice models.

Looking for Rock Paper Scanner, the newsletter of music tech news curated by the Rock Paper Scissors PR team? Subscribe here to get it in your inbox every Friday!

Listen to the full episode here on this page, or wherever you pod your favorite casts.

Shownotes from the Episode

https://www.ispot.tv/ad/5jqB/papa-johns-shaq-a-roni-mucho-mas-pepperoni-song-by-young-mc-spanish

Episode Transcript

Machine transcribed

0:00:09 - Tristra

Hey everybody, and welcome back to Music Tectonics, the podcast that goes beneath the surface of music and tech. I'm your host, Tristra Newyear Yeager, Chief Strategy Officer at Rock Paper Scissors, the agency specializing in PR and marketing for music innovators. This episode, we're going to get real about so-called deepfakes and faux Drakes, and we're going to get serious about the so-called clones out there. I'm talking about voice synthesis, when AI models transform the voice of one person into that of another, with all its possibilities, challenges and potential pitfalls. So how does this stuff actually work? What can it do and what can't it do? What business models are appearing that will make sense in the long run for the music business? Joining me to answer these questions today are two AI thinkers coming from very different angles but from the same company: Music AI, also known as Moises. They are Seth Goldstein, the general counsel and vice president of legal and business affairs, and Michael Nevins, vice president of branding and communications. First, let's hear from Michael.

0:01:17 - Michael

I wrote a little song for the Music Tectonics podcast. I wrote a little song for the Music Tectonics podcast. One more time for the Music Tectonics podcast.

0:01:36 - Tristra

Aw, shucks, you shouldn't have. Hey, wait a second, Michael. Can we hear that again?

0:01:41 - Michael (Lauren)

I wrote a little song for the Music Tectonics podcast. I wrote a little song for the Music Tectonics podcast. One more time for the Music Tectonics podcast.

0:02:00 - Tristra

Hmm, that sounded different. What do you say? You try it one more time.

0:02:05 - Michael (George)

I wrote a little song for the Music Tectonics podcast. I wrote a little song for the Music Tectonics podcast. One more time for the Music Tectonics podcast.

0:02:24 - Tristra

Okay, okay, enough with the silliness. In case you haven't guessed, that first voice is the actual voice of one of our guests for this episode, and the next two were synthetic, based on that singing performance, using AI models in Music AI's Voice Studio to render them as different-sounding voices. Note: Michael is not a pro singer. He wanted me to let you know that, and therefore the models can't fix his non-pro sound. So, regrets to everyone: no AI is going to make you sound just like The Weeknd. During our conversation, Michael tells us exactly what inspired him to write a little song for the Music Tectonics podcast and how he did all this, as well as a bunch of other topics, so I'll let him and Seth tell you all about it. Here's the show. So we are here with Seth and Michael of Moises and Music AI. Hello everybody.

0:03:14 - Michael

Hello. Hey, Tristra. Thanks for having us.

0:03:18 - Tristra

Oh, absolutely, this is going to be a blast. So, first off, a lot of people have been throwing around things like voice clones and voice modeling etc. But I would love if you could explain voice synthesis to me, like I'm a fifth grader. How exactly does this stuff work?

0:03:36 - Seth

Where to begin.

0:03:36 - Michael

Seth and I have debated about how to best describe this topic in the past, and which word even to use. Is it voice synthesis? Is it voice cloning? Eek, we don't like that one. Is it timbre transfer? There are a lot of different things to consider here, but I think Seth has a good and concise (because I don't do concise) description of the tech and how it works, so I'm going to defer to him.

0:04:03 - Seth

Yeah, I mean, at a base level, voice synthesis is making something sound like someone using computers. And we use the word "something" because that something can be text, it could be an audio recording, it could be anything, it could be a video. But I hope my fifth grader understands that.

0:04:22 - Tristra

Probably a fifth grader would understand that better than my parents, for example. I think it comes a little bit more intuitively to fifth graders. So, exactly how? Let's walk through the stages. How do you get the model to sound like someone else? What do you need to put into the model? How, vaguely, does the model work, and how do you get the output?

0:04:48 - Michael

Well, I'll start with how we operate, because there are different ways to do this. Certainly, you could, as people are doing already, find audio out in the wild, run it through an AI model that essentially captures the voice characteristics and then move on to create some sort of horrible product that we should all be ashamed of in the end. But we operate a little differently, quite a bit differently really, and the way we have brought voice modeling to market is through a product that we call Voice Studio. And with Voice Studio we have gone into top-quality studios with high-level gear, some of the best gear in the market, microphones, et cetera, and we record professional singers.

The typical process, without wanting to give away too much of our secret sauce, is not what you might think. We don't just take a song from them, run it through a computer and then come out the other side with something usable. We have a specific proprietary process that we follow in the recording process that helps us come away with a very, very high quality impression of that voice, or voice timbre model. There are different ways that we can describe it.

What I think is important is to understand that doing that doesn't mean that we are completely changing somebody, creating something new and unique unto itself. I can sing a song, apply a voice model to it from our Voice Studio, for example, and it will still retain my energy, my emotional content, my inflection, the way I said it, et cetera. But it marries it with the voice characteristics of the voice model. So, for example, when we do things with a celebrity, they may help us build a model of their voice. And then another person... for example, if we were doing a translation into another language, you need somebody who can speak Spanish and do that properly in a localized way, and then we marry the celebrity voice to it so that it sounds natural and sounds accurate.

0:07:16 - Tristra

How can we make these voices sound better? So is the recording input the key element, or are there other steps along the way? How do we judge quality when it comes to voice synthesis, whether we're talking spoken word or song text?

0:07:30 - Seth

Well, I think there are two different ways that you can improve the quality. You can create a voice model using any audio, but to get the best quality on the voice model, you need to be training with a clean vocal, with as many hours of speech as you can, to really get the timbre of that person you're trying to create a model of. It's also based on the underlying model that you're using. Some are better than others. We like to think ours is amazing, but it's really quality in and you get quality out.

0:08:02 - Michael

Right and the algorithms used. Ultimately, the AI models that are used to do the work are constantly improving, so that's true for everybody in the industry. We put particular time and effort into making sure that that happens all the time, so part of it is not only the work you do in the capture of a voice, but it's the improvement of your model over time.

0:08:26 - Tristra

And there's also the quality of the underlying performance, right? So, what you map the model onto. We were just talking about this a bit before we hit the big red record button, Michael. So if you're not the best singer in the world, the voice model can only do so much, right? You have to have a good performance in order to have the voice model give you a good output.

0:08:48 - Michael

I think she just said that I'm not a very good singer.

0:08:50 - Tristra

but you satisfied, you satisfied.

0:08:55 - Michael

I'm not fishing for compliments, but I will be honest, I'm not a great singer. I'm a better drummer than singer, but I do sing sometimes, and I did record some audio for Tristra that she can share later and perhaps we can add it to the program. But in that, while I was able to marry it with better voice models and people who have better sounding underlying voices than I do, if you will, it still retains all of my problems. Yeah, my pitch isn't great and my inflection could be better, and you know, they're very authentic, though.

0:09:31 - Tristra

You sang from the heart, though that's what really matters.

0:09:35 - Michael

Yes, it's true, some people might love it, but it does.

You know, to do this really, really well... for example, I think everybody at this point who might be listening to this is probably aware of the Fake Drake project, I don't know what else to call it, the Fake Drake incident. And in a situation like that, where somebody is doing this kind of work to take a voice model and try to sound like Drake, they may have captured a voice model of Drake. This is not something we had anything to do with, to be clear about that. You still need to try to sing like Drake. You still need to imitate Drake really well for it to work well. So it depends on, you know, humans to be very good at this. So, in a more positive context, when you're doing this for translation and localization, or trying even just to use it in the context of producing a new music product, the underlying singer needs to be able to sing and needs to be able to understand pitch and timing and everything else to make it work.

0:10:42 - Seth

And this is, I mean, this is one of the incorrect assumptions that a lot of people make, and why voice cloning has become such a dirty word over the last few months. I think everyone who's heard "Heart on My Sleeve" by Ghostwriter thinks Ghostwriter typed in some words, put it through Drake's voice model, and out came Drake. That's not the case at all. You know, Ghostwriter or some other singer created a sound recording of them doing their best Drake impression. You know, his flow, the way he speaks, his accent, the words he uses. And they took that sound recording and then put it through a voice model which applied Drake's timbre to that sound recording. And out you get Drake: Drake's timbre, the way he speaks, the way he sings, the way he raps, his flow and everything. So there's a lot that needs to go into it to really create an imitation of someone, and that's why I think voice cloning has become a really dirty word.

0:11:35 - Tristra

Because of that, because that's everyone's first introduction to this technology. Yeah, that's such an important point that you both have made, that without the underlying performance from a human, the voice model can only do so much. So if I were to sing like Robert Goulet but use a voice model that was like The Weeknd, it would not sound like either singer. In certain ways it's kind of an interesting creative gap there.

0:12:05 - Seth

Right, the tech will get there. I mean, tech's getting better every day, but the tech is not there yet. The tech is not there where you can just type in some words and out comes Drake, and it sounds exactly like Drake.

0:12:18 - Michael

Clearly, and I think any time new technology is applied to an art form (I'll stick with music in this example), there's often a little bit of hysteria in the beginning about, oh my God, what could this mean? And I think that's fair in this case. I think Ghostwriter and Fake Drake is, you know, the kind of thing that people should be concerned about, and the industry certainly should be concerned about. So I don't mean to minimize that, but I think in the beginning we tend to focus on the "oh no, what does this mean?" as opposed to "oh wow, what could this enable?" You know, what are the positive possibilities of something like this? And so I imagine we'll touch on some of that today.

0:13:00 - Tristra

Yeah, so we've talked a bit about music so far and singing, but there are other ways you can use synthetic voices. So what are some creative ways you guys have heard about these voices being used? You know, again, people tend to focus a little bit on the doom and gloom. That's a little easier to grasp right now, but there are some other really cool, exciting, inspiring ways to use it. Let's talk about that.

0:13:21 - Michael

I can jump in here.

0:13:22 - Tristra

Please.

0:13:25 - Michael

Yeah, I'm trying not to talk too much.

0:13:30 - Tristra

It is a podcast interview, so you know talking too much is part of the game.

0:13:34 - Michael

There are very smart people on this call, so I don't want to be that kind.

0:13:41 - Seth

But, Tristra, at least we have... you know, I'm the pessimist in the room.

0:13:45 - Speaker 4

Oh, thank goodness, Any time you want some optimism go to Michael, we'll get to you.

0:13:50 - Tristra

We're going to get to you soon, Seth, when we talk about the legal and business implications, because that's where the pessimism is really going to come in. But, Michael, tell me.

0:13:58 - Michael

Certainly there's a joke about an optimist and pessimist and a marketer and a lawyer walking into a bar together.

0:14:05 - Seth

You get along well, but just good.

0:14:07 - Michael

One of the interesting things for me when I first joined the company was to learn about what's happening with, I guess the best way to put it would be, artist estates, and there may be a better name... you know, name, image, likeness, I guess, on the legal front. You know, we have a partnership with a company called HyperReal, and they work with the estates, in some cases, of artists who've passed away, a famous celebrity who's passed away, and in that context they are essentially enabling virtual experiences where the artist's legacy essentially is preserved and, you know, with the permission at least of the family and the estate, they're creating new content. And so, while it can be a little polarizing, I think, when people first heard about, let's say, Tupac and, I forget the word, the... Holograms? Say it again, sorry. The hologram, quote unquote. Hologram, thank you.

I think that over time there will be more acceptance of that kind of thing as sort of, you know, a regular thing that we see happen, and there's already a lot of activity there. So there are projects in the works where we hear about artists who are retiring (I'm not going to go into the details) but who want to be able to continue to earn money and continue to perform in some fashion, and voice modeling is being used in that context essentially to, you know, keep the music going and keep things going. It also is true for spoken word. So there are celebrities who are not musicians whose families or estates are using voice synthesis, or voice modeling, as a way to extend their sort of active lifespan in the public.

0:15:59 - Tristra

Amazing. We're gonna take a really quick break here and we'll be right back to talk about a little, just just a little touch of the doom and gloom and some of the other prospects facing voice models. We'll be right back.

0:16:11 - Shayli

Hey, Shayli here. Join me online February 7th, 2024 for Music Making Innovations, a free online event. Find out where the innovation is happening in music making apps and instruments. We're bringing together some of our favorite finds from the NAMM show and beyond. You'll get sneak peek demos of innovations that are shaking up the creator space and making waves across the music industry and hear from the pioneers who are doing it. That's February 7th at 10 am Pacific, 1 pm Eastern, 6 pm UK time. Register for free and learn about our monthly Seismic Activity online event series at MusicTectonics.com. See you there. Now back to the show.

0:16:58 - Tristra

Okay, we're back, and now we get into it. I gotta say I love the creative stuff. That's really fun, but I also kind of love the nitty-gritty of how all this will affect business. Recently, for example, there was an announcement that there's a GPT store. These are chatbots, right? These aren't voice models. They're not speaking. I don't think they are. Maybe they are. I haven't poked around there that much. There's been a lot of talk about, like, how valuable is this? There have been a lot of jokes among journalists about the number of GPT girlfriends and the stubbornness of a lot of chatbots who just refuse to give you any answer about what they've been told. That's one option, to have, like, a marketplace of voices or something like that. How do you see these models being turned into a business? What is the future total addressable market for something like this?

0:17:54 - Michael

I don't know that I could speak to total addressable market, but I can share a little bit about something that we're doing already, which essentially is a marketplace in the way it operates. We created this Voice Studio product, and I don't want to go on a sales pitch about our product, but what's interesting about it is the business model. Essentially, within the software platform, there are several (I think we're up to 13 so far, roughly 13 voices, with more coming) that are modeled on real singers, singers who wanted to license their voice. And it creates a revenue opportunity, not only for our company, but initially it's a revenue opportunity for the singers, who sometimes don't have a great way to do this. I think that that's a sort of interesting way to look at it. Somebody acquires our software, they get into our platform, they can audition different types of voices, they can decide what they want to use, and the singers are compensated as a result. And Seth can share more with you about how that happened and how we structured the business.

0:19:07 - Tristra

I'm thinking about things like Findaway Voices/Spotify. Audiobooks are becoming more and more a slice of the streaming audio pie, and synthetic voices are going to play a big part in that because they are already permitted to. Books that have been read by... well, not read, but rendered in a synthetic voice are allowed on a lot of platforms. How, ideally, would the providers of these voices, the vocalists or singers or narrators, be compensated? What do you think a good model might be that would be ethical?

0:19:39 - Seth

Well, there are a few different models you could think of, but the model that we've gone to market with, and that we're testing, is to have 100% of the revenue go back to the actual singer who provided their voice for the voice model. And the 11, 12, 13 voices we've launched, they've been purchased, and we've been paying the royalties on to the singers, 100%. We even got some voicemails in the last week where these singers are now getting the checks in and they're ecstatic. They're crying at how much this means to them. Amazing. Yeah, it's really good to be part of that. I come from the music industry, and typically it's "how little can we pay the artist?" Now we're kind of flipping it on its head and saying, well, how much can we pay the artist? And our goal is to get as much money back to the artist as possible and let them use their gift and their voice to generate a new revenue stream for themselves.

0:20:35 - Tristra

Where's the revenue for Music AI or Moises, though?

0:20:39 - Seth

It's all subscription based. It's different access to our platform and you can buy these voice packs.

0:20:45 - Tristra

Amazing. Let's talk a bit about the customer base, though. Like, you know, obviously we don't need to know who exactly is subscribing and buying these voice packs, but, roughly categorized, who is doing this now and who do you see doing this in the future?

0:21:07 - Michael

Right, so today the Voice Studio product lives within, let's say, a pro tier within our software packages. So that's generally people who are producers, so people who are either working from home studios or professional studios. It really runs the gamut, but it's, it appears or manifests in such a way that it's typically in a production environment, let me put it that way. So it's not celebrity voice changer type of activity.

We don't do that and we won't, but it's people, you know... if you think about the use case for a producer, let's assume that I was actually a really good producer, you know, rather than a guy with some software at home. I write a song. I might be able to sing that song adequately, although with a lot of pitch problems, admittedly. I might want to pitch that song to a label or to an artist or to a publishing company in Nashville, and the song really should be sung by a female country singer who's a soprano. I'm none of those things. Using voice modeling, you can essentially sing the song yourself in your own studio, apply the voice of a country singer from Nashville, essentially, and what comes out the other side is now much more appropriate for your pitching of the song into the marketplace. So that's already happening. Where some people don't have access to local singers, they can do it themselves. So that's sort of a real-world scenario that's happening right now.

0:22:50 - Tristra

Okay, I want to ask a legal question, Seth, and I may be forcing us to walk into territory we don't necessarily want to spend too much time in, but this is interesting. I think it's a question on a lot of people's minds. So, say we have this marketplace of licensed voices and they're putting out a lot of stuff. What are the legal risks there for both the original vocal talent and for the producer using it? How is the landscape being laid out just from a legal perspective, and what are some things you think that music industry folks need to be aware of as we sort of move into this uncharted territory?

0:23:32 - Seth

I think the biggest pitfall is if you can take a voice that's known and create something, whether it's obscene, pornographic, or it's a political position that the artist doesn't want to take. I see that as the biggest pitfall. The way we've released Voice Studio right now is it's unknown artists. They're all under pseudonyms. But again, we put in our rules that you can't use this for certain content: obscene content, content that we find just wrong.

I think, right now, the limitation really is by contract, how you contract with the artist, how you contract with the users.

0:24:14 - Tristra

It's just interesting to think about what happens if this does kind of take off, if we move from really professional, specialized producers to synthetic voices being part of plug-ins, say in a DAW like Ableton or FL Studio or BandLab or GarageBand, where people have a lot of access. Do you see any potential ways we could put some guardrails on this process, or are we just going to have to ride it out and see what the court cases say and what the legal precedents are? How do you foresee this unfolding?

0:24:50 - Seth

Well, firstly, contractually, we can always put the guardrails in by contract, but, as we all know, contracts sit in the drawer until something goes wrong. So, you know, one guardrail we can put in place is that we actually have technology that's able to identify synthetic voices in an audio recording. The next step is to be able to add technology on top of that where you can identify which artist or which voice model was used to create that, so you can always track it back. You know, putting guardrails on the front end... I don't think we want to go back to the days where we're DRM-ing audio recordings, where, you know, you can only use this voice model for a specific reason, or the results of this voice model for a specific thing. So it's really about, we have to put in ways to police this on the front end, on the usage end, on where this is being used.

As far as, you know, legally, outside the US there are some really strong right of publicity laws in place. Inside the US it's not so clear. In the US, right of publicity is on a state-by-state basis. There's some case law covering it, but there's no federal right of publicity in the US currently. So, like I said, outside the US you can, not easily, but it's easier to prevent someone from using your voice in a way that you don't authorize. In the US, it's a little bit more difficult currently.

0:26:15 - Tristra

Wow. Okay, good times. Yeah, and this is not something new.

0:26:21 - Seth

This has been happening for a while, like in the 80s, even way before we were able to create synthetic voices. There was, you know, Bette Midler is one example. Ford used a soundalike to sing a song, a jingle, for one of their car commercials, and they actually hired a Bette Midler backup singer and told her to sound like Bette Midler.

0:26:44 - Tristra

Wow, it's important to remember those historical examples because I think they're not really in the conversation right now.

0:26:51 - Seth

No, and ultimately it went to court and Bette Midler won. She won damages for that. But soundalikes... there is a big chunk of the catalog out there, whether it's on Apple or Spotify or any music service, that are soundalikes.

0:27:08 - Tristra

Wow, and has there been any... I mean, in reference to things like that, I guess that's completely legal, right? So, you know, you can get a blanket license. You can just go and record a song that's been released before and put it out, and if you happen to sound exactly like it and you put things in right, you'll come up top of search and people will play your version rather than the original artist's.

0:27:29 - Seth

Right, right. Yeah, the search results are actually a big thing with DSPs, and obviously the record labels want the original versions to come to the top, as opposed to the soundalikes.

0:27:40 - Tristra

And it's just going to get more complicated now with voice clones. If you're like a really talented cover artist... like you were saying, Michael, you have to have the right intonation. You have to sound similar enough to the artist that you'd like to use the synthetic voice of. Oh my gosh, this is going to get crazy.

0:27:59 - Michael

It may, it may.

0:28:01 - Tristra

It may, it may not. Like, okay, optimist, give us another take. How could it get less crazy, or remain somewhat sane and grounded?

0:28:10 - Michael

Let me try to be a realist as opposed to an optimist or pessimist. You know, I was working in studios in the 80s when drum machines were first becoming, you know, popular within studios. You know, the LinnDrum was like a thing suddenly. And also, you know, synthesizers were already pretty well established, but samplers were also becoming pretty common. And at those times, when drum machines came in, all the drummers freaked out and thought they were going to, like, never work again. Didn't happen. Some drummers learned to program the LinnDrum and worked a lot. Different styles of music evolved, and in some ways, if you think about it in historical context, it probably created more than it harmed.

Let me put it that way. It created opportunities. It may have shifted them into other places, but at the same time sequencing came along. Wow, is this going to change how things work? Sampling: well, if we're sampling a horn player, does a horn player not get hired for that session?

Those things were legitimate concerns that people had, and certainly, you know, unions were trying to ensure that, you know, musicians wouldn't be put out of work and that kind of thing. So certainly some of that stuff has harmed some people in some cases, but the music industry did not fall apart. The television industry didn't fall apart. The jingle business, as we called it in those days, did not fall apart as a result. They shifted, and, you know, home studios also were proliferating in the 80s, and suddenly you didn't have to go to a giant studio. Did it hurt the studio industry? Yeah, and it also opened it up to many people who otherwise wouldn't have been able to participate in the music for advertising.

And so I take a long view on this and while I don't want to leave too much room for, you know, some of the, I think some bad things can certainly happen. I worry about what can happen with disinformation, misinformation in the marketplace, and you know we used to be able to. I think we thought we could believe our eyes and our ears when we saw things or heard things, even if it's not as accurate as it should be. People are now questioning that, and I think rightfully, because I don't think we can assume that anything's real anymore when we see it in video or hear it in audio. So I don't want to downplay that too much, but I think that some of that hysteria will pass. We'll develop some tools for fake detection and things like that, and over time we will see this in the context of technology moving forward.

Sorry if that was a long speech. No, no, that's absolutely no, it's great.

0:30:49 - Tristra

I'm also very curious what you guys think of this. You know, for instance, if you look at some of the early computer generated images that were used in movies, for example, you know, I'm old enough to remember, like, seeing them and going, like, oh my gosh, that's so amazing, you know, and you look at it now and you're like, oh, it looks like crap, you know. So there's a sort of developing sensitivity to certain artifacts or elements in a computer generated image or sound. You know, it's very, very difficult to fool someone who's really good at it, you know, when it comes to, like, synth strings versus, like, live recorded strings, you know, and maybe there'll be more and more convergence in that area.

So I'm not ruling out that that's not possible. But I'm wondering will our perception catch up with technology? Like, will my kids and grandkids get really good at detecting AI artifacts, you know? I'm just curious what you think about that.

0:31:50 - Seth

I don't. Actually, I tend to take the pessimistic view on that. I think as technology gets better, they'll be less likely to detect artifacts of AI processing.

0:32:00 - Tristra

Interesting.

0:32:03 - Seth

My hope is that, you know, this technology is used in combination with human artistry and human creation and enhances that as opposed to completely replacing it. I was talking to my kids last week and I was asking them, when you listen to music, do you listen to music because you like the artist or do you listen to music because you like the song and you don't even think twice about the artist? Unfortunately, they told me, you know, I just listen to it because I like the song. I don't really focus on the artist that much. Obviously, Taylor Swift, you know, set her aside because she holds a special place in our home. But, you know, foreseeing the future, I can see the creation of a wholly synthetic pop star, whether it's the audio, it's the visuals, it's everything, and create a, you know, create a whole career for this synthetic pop star. Can go, you know, do hologram concerts. So I hope we don't see that future, because I hope there is human artistry attached to it. But I'm dubious.

0:33:07 - Tristra

Are we gonna see a hard fork? I mean, there's gonna be people who are like you know, they're kind of taking the Amish path and you're like I want music that's made by a string vibrating with somebody in a room and it won't necessarily be like okay, boomer. I mean there's young people who also feel this way. And then there's gonna be people who are like yeah, I'm totally cool with this, totally fantastic, and honestly, I hope it's no longer just like skinny young women, like let's get weird, let's like have like crazy, strange beings that are completely fantastic and maybe some you know some cuties that we can just like most celebrities today, but like, let's get really, really wild with this People if we're gonna imagine other creatures, anyway, that aside, yeah, what do you think?

0:33:48 - Michael

I could see where there's room for both. Yeah, and I think you know you'll see both happening. You know, I remember, because there was a time when I used to think, wow, DJs are artists. Now, you know, I'm old and I remember years and years and years ago, somebody telling me that they were leaving the jingle business to go.

It was actually a story told to me about somebody leaving the jingle business to manage a DJ, and I was managing bands at the time too. So this idea that you could manage a DJ, I said what are they gonna like book weddings? I don't get it. You know what is this about? Like find new radio station shops. And it didn't take long for me to then find out that this DJ was named Moby, and you know right. So we think of Moby as an artist, not somebody who is just spinning records. And so you know somebody who was recontextualizing. You know existing stuff to some degree and you know doing original music at the same time, and a whole category of dance music and creativity has evolved over that period of time that we couldn't imagine in the past.

0:34:50 - Seth

As we speak today I mean vinyl is one of the largest growing segments in the music industry. So you know, maybe there will be that nostalgia for 100% yeah or a new a new style.

0:35:01 - Tristra

A new style where people are like, I wanna feel, I wanna feel the vibes, because the vibes will set me free, or whatever, you know. Also, it's crazy what human thought can come up with around ordinary experiences that take them, make them, transform them. Anyway, cool, we've gotten way off track, but in the best of ways, and now we're gonna take a quick break before we zig back to synthetic voices and their future.

0:35:26 - Eleanor

What's up, beautiful listeners. Now I have a question for you. What do you want to hear next? Let me know at musictectonics.com/podcast. Click the big pink button to fill out a quick survey. Suggest future guests or music innovation topics you wanna hear Dmitri and Tristra cover, or just tell me how we're doing. That's at musictectonics.com/podcast. Now back to the show.

0:35:55 - Tristra

Okay, we're back, so I wanna talk about two things that we haven't quite gotten to, and one is one of the most interesting, I think, you know, non-musical applications of synthetic voices, and we've talked a bit about this. It's taking someone who is speaking one language and rendering their voice in all its beauty and glory and recognizability in another language. And I would love to see some experiments to see how far you could push this. Like, if someone is, you know, a Dutch speaker, can they speak Vietnamese? Like, are there gaps? Like, are there language things that just, you know, when you have a beautiful tonal language, like Punjabi, but then you have something like Finnish, where the intonation just goes down, down, down, down, down, down, down in a sentence. Like, what happens? Anyway, that's probably some speculation for another podcast, but tell me a bit about regionalization and the role of synthetic voices and how you see this playing out in the next couple of years.

0:36:55 - Michael

Okay, I could start on that one. I think you asked a question about different types of languages, which is interesting. Models will improve. Training will get better over time. All of these things will improve. So, without going into the details, I can't speak to a Punjabi versus. You know, Sorry, I know it comes up, it doesn't come up every day when you know the.

0:37:16 - Tristra

Punjabi to Finnish language model question.

0:37:19 - Michael

And I think. Well, it's true it doesn't come up often, but I think when, when we're thinking about localization for, you know, television, film, et cetera, even in music, there's an opportunity there and as the opportunity becomes more obvious, there will be more investment and there will be more reason for those models to improve. So in the music industry, for example, we already know that there are projects happening with people using AI, or even back in the day, the Beatles used to record in you know three or four different languages in order to be able to market their stuff in other countries. They certainly weren't the only ones to do that. That's now happening with the help of voice synthesis and I think, as artists think about ways to connect with audiences in different territories, this will become much more common and it'll be seen as an opportunity and therefore the tech that supports it will get better and better, the methodologies will get better and better.

Television and film, we're already doing projects. We have a project with Shaquille O'Neal that we did for Papa John's pizza and it's for the Shaq-a-Roni pizza. I think Shaq-a-Roni is an English word and certainly not Punjabi, as far as I could tell you, but in any case, the TV commercial was recorded with Shaq speaking English for the most part.

He does speak a little Spanish in this commercial, but it's largely recorded in English and they wanted to localize it for markets all around the world.

And rather than just putting titles on the screen or subtitles in the local languages, they used our voice modeling tech. Essentially, we created a voice model of Shaq, with his approval and his involvement, and then that was essentially married with recordings done by local you know local speakers of, let's say, Spanish. So a Spanish speaker speaks, he does his best to try to match visually what Shaq is doing or actually Shaq was on screen in this case but to essentially match his speech pattern. And then, marrying that voice model with that local speaker, you end up with a very high quality Shaq speaking Spanish but in a particularly locally regionalized way, and that really increases the value of that particular advertisement and its ability to connect with an audience in that particular market. And so I think as that becomes more common, you'll see improvements in all the underlying tech and the methodologies and it will become sort of an expectation that when you see a Shaq commercial in Mexico that he's gonna be speaking Spanish.

0:40:12 - Tristra

Wow, that's really cool and we'll link to the video in the show notes, but I think it's worth checking out because it is so interesting, even though we're still really early days and I can only imagine how much more interesting it'll get. And what really strikes me, and excuse me for being a bit of a nerd here for a second, but I really can see the value of some of this for lesser spoken languages that have had to make do with either terrible overdubs for years or subtitles, which again limits your audience, your viewing ability or your audience age demographics, right? So if you have a kid's show, you can't use subtitles. So it's really, really cool to think about, you know, again, finally being able to hear your favorite show in Mongolian, if you're Mongol. That would be really meaningful and probably make things a lot more fun for the viewer.

0:41:06 - Michael

Yeah, I definitely see the sort of TV film dubbing. You know, automatic dialogue replacement that segment of the industry. There's great opportunity for voice modeling.

0:41:17 - Seth

And currently the best way to get high quality translation and transcription is to have a local speaker speak, because as you speak and when you ask a question, your voice goes up in English. I don't know if other cultures and languages they use that same intonation when I ask a question, but right now the best way to do it is to have a local speaker speak and put it through the voice model so you can get the timbre of the actor applied to that. But as technology gets better, AI will be able to affect the voice and have the voice go up when you're asking a question, so it will get much more interesting.

0:41:53 - Tristra

Amazing.

0:41:54 - Michael

I also see things beyond sort of the categories that we've thought about and I don't know exactly what the applications will be, but I can imagine, you know, just within sort of the education and translation industries, to sort of bounce off of what you just suggested. Assistive tech for people who are differently abled, you know. There are already applications where we see text to speech, but sometimes those voices sound mechanical and don't work very, very well. So some of the same underlying technology would benefit there. And then you think about things like IVR systems, interactive voice response, so call centers, you know, customer service applications. Think about having a voice that sounds local. I'm calling from down south and I get somebody with a Southern accent answering the phone, even though the phone's being picked up in the Northwest somewhere.

Yeah, it's not an idea.

0:42:59 - Tristra

Yeah, or we all learn that you can yell your language of preference into the phone early on and it's like, oh okay, you wanna speak in Spanish, or oh, I see, you wanna speak in Ukrainian, and it'll respond in the language that you are most fluent in and therefore make your experience way smoother and less nerve-wracking. That sounds interesting, all right. So here's a tough one. It's like a stumper stump around here. If I am a startup or a company that would like to incorporate synthetic voices into my existing product or a product I'm developing, how do you see sort of creating the right model, besides, you know, hiring y'all? In general terms, how do I create this so as to sort of guard myself against some of the challenges and dangers, honestly, of synthetic voices and make sure that my product has the best possible chance of doing well and not facing legal challenges or other sort of PR disasters? What do you guys think? What should I keep in mind?

0:44:01 - Michael

I think disaster's for a complete business plan. I'm not sure that's it here.

0:44:06 - Seth

Here which legal box is built in?

0:44:08 - Tristra

Okay, you can build me. That's a good challenge, I mean.

0:44:12 - Michael

I will have to be a little bit self-promotional. You know, we built a backend platform, music.ai. You can go there now. You can explore our models, you can build products essentially on our platform and create APIs to use within your own project, so that infrastructure exists. And I would say that creating AI models from the ground up and creating the infrastructure to run AI models is probably prohibitive for a lot of companies when they're in the concept stage, proof of concept stage. So yeah, you should come to music.ai and then mess around a little first. I won't just, like, you know, hand out, you know, Seth's email or anything like that.

0:44:55 - Tristra

Cut a deal with Seth right away and everything's fine, maybe we can have Shaq say it at the end of this, at the end of the podcast episode.

0:45:03 - Michael

Yeah, but there is that, you know, in terms of you know, but those are tools and so that doesn't cover, you know, sort of the I think maybe two thirds of your question, which I'd rather let Seth attempt.

0:45:16 - Tristra

Yeah, it was kind of aimed towards you, Seth, and I'm sorry to make you answer a question about, like, future legal landscape kind of stuff. But if you were to advise someone in a very general way, what advice would you give them about protecting themselves and their business from, you know, legal challenges or troubles ahead?

0:45:35 - Seth

Right. So a lot of people in the industry are now speaking about the three C's when it comes to voice synthesis: consent, control and compensation. I think, just to start on the consent side, get as many rights, as broad of a rights grant, as you can from the artist you're using. If you are using an artist, if you are using a voice actor to create your model, get as much authorization, as many rights, as broad of rights, as possible. You know, control and compensation, that's up to, I think, that company. You know, we probably don't have enough time to talk about those two pieces of it. But on the compensation side, you know, like we said, what we've done, you know, we like to make sure that 100% of the compensation goes back to the artists and the voice models.

0:46:22 - Tristra

That's great. I love setting the bar high for artist compensation. That's music to a lot of people's ears out there. So all right, and on that high note, I mean, the pessimist being slightly, like, upbeat, we're gonna, I think. Is there anything else that we didn't cover that you guys think is essential to throw into the mix, as people are thinking about synthetic voices and trying to wrap their brains around what this means for the music industry and the media and entertainment industries writ large?

0:46:49 - Michael

Well, the optimists will say don't worry, everything's gonna work out, okay great.

0:46:53 - Tristra

It's gonna be great like right there and then, and then. Seth would add. You're gonna hear Harry Styles sing in Thai. It's gonna be so beautiful. You won't care about all the other stuff, right?

0:47:06 - Seth

Right. Well, I mean, a lot of artists actually are doing that, you know, in Asia right now. For years, you know, if you had a K-pop song released, they'll take that same song, translate it into Japanese and now you release it as a J-pop song and just rinse and repeat as you go through all of Asia. And that means more money back to the songwriters, more money back to the record labels, more money back to the artists. And now, using AI, you could do it faster, easier, cheaper, so hopefully this will generate more money back to the creatives. I mean, that's the hope. That was very optimistic, Seth.

0:47:42 - Michael

Yeah, optimistic, I think, you know, nicely optimistic. I do think that there's more opportunity than threat here. I really do, and that's one of the things when I talk to people about AI, who are creatives or who are, you know, concerned, just people at large, about where things are going. There's a lot of consternation, in some cases hysteria, about what this could all mean. But again, I'm an optimist, and I want to look at the opportunities that are being created out there, and some that we haven't even imagined yet. You know, maybe I'm being Pollyanna, but I think there's some very, very exciting things to come that are gonna create value for people, for companies, artists, for sure.

0:48:27 - Tristra

Awesome. Okay, that's great, I love it. All right, I must press stop. Let's do its little thingy here. Oh, come on, you dumbass. Sorry, I'm not talking to you. That was not addressed to either of you. Our stupid platform is like. We got to pin you down pretty well for a lawyer. I didn't feel like I got any, like, usually it wasn't like, you know, the sort of like, well, yeah, yeah. I don't know what the message says, just says that's great. No, no, that's what you need. You did, you did. You don't need to create an account. But your files have uploaded. So the cloud has captured your fantastic thoughts. I'm gonna press stop here.

0:49:20 - Dmitri

Thanks for listening to Music Tectonics. If you like what you hear, please subscribe on your favorite podcast app. We have new episodes for you every week. Did you know we do free monthly online events that you, our lovely podcast listeners, can join? Find out more at musictectonics.com and, while you're there, look for the latest about our annual conference and sign up for our newsletter to get updates. Everything we do explores the seismic shifts that shake up music and technology, the way the earth's tectonic plates cause quakes and make mountains. Connect with Music Tectonics on Twitter, Instagram and LinkedIn. That's my favorite platform. Connect with me, Dmitri Vietze, if you can spell it. We'll be back again next week, if not sooner.

Let us know what you think! Tweet @MusicTectonics, find us on LinkedIn, Facebook and Instagram, or connect with podcast host Dmitri Vietze on LinkedIn, Twitter, and Facebook.

The Music Tectonics podcast goes beneath the surface of the music industry to explore how technology is changing the way business gets done. Weekly episodes include interviews with music tech movers & shakers, deep dives into seismic shifts, and more.
