Looking for more podcasts? Tune in to the Salesforce Developer podcast to hear short and insightful stories for developers, from developers.
99. The Technical Side of Deep Fakes
Hosted by Julián Duque, with guest Dmytro Bielievtsov.
A "deep fake" is the derisive name given to the rise of manipulated pictures and videos. Will newer forms of computer generated media cause us doubt what we see and hear online? Dmytro Bielievtsov is the CTO and co-founder of Respeecher, a speech-to-speech platform that produces AI-generated audio samples. In this second half of a two-part episode, he'll explain how audio can be faked, why it can be advantageous, and more importantly, how faked audio can be detected.
Julián Duque is a Lead Developer Advocate at Salesforce and Heroku, and he's continuing a previous discussion with some members of Respeecher. Respeecher has created AI software that works within the speech-to-speech domain: it takes one voice and makes it sound exactly like another. Dmytro Bielievtsov, its CTO and co-founder, explains the practical uses of the software, such as re-recording the lines of an actor who is unavailable, or bringing historical figures to life in a museum.
In terms of sophistication, there are quite a few speech ML models already available on the Internet. The best source of audio for duplicating the speech patterns of a famous person is to grab an audiobook and pass it through one of these pre-existing models. But these models produce outputs that are poor in quality, which is one reason speech-to-speech is hard to fake. The variation in our mouths and speech patterns, not to mention the emotive qualities, makes creating duplicate voices extremely difficult to pull off. One safeguard against Respeecher's technology being abused is that the company positions itself as a B2B business, dealing with studios and other large estates that have access to immense amounts of hard-to-acquire sound data, so the likelihood of another entity abusing a well-known voice is close to none. Another is that the audio is watermarked: certain "artifacts" are embedded into the audio which are imperceptible to humans, but easily identifiable by a computer program.
There's a consortium of several companies working on synthesized media who strategize in Slack on ways to keep the tech from being misused. Dmytro also believes there needs to be more investment in education, to let people know that such technology exists, so they can be a bit suspicious of media they encounter online.
Julián: Welcome to Code[ish]. My name is Julian Duque. I am a developer advocate here at Salesforce and Heroku. And today we are going to be continuing our series of episodes on synthesized media. Remember, in our previous episode, we were talking about the ethical side of deepfakes, or synthesized media, and we had Alex from Respeecher speaking with us about it. Today we have the opportunity to be speaking with Dmytro Bielievtsov. He's the CTO and co-founder of Respeecher. Hello, Dmytro. How are you doing?
Dmytro: Hi, Julian. Thanks for having me.
Julián: So can you tell us a little bit more about yourself and what you do at Respeecher? Can you remind our audience a little bit about the company and what exactly you do in the market?
Dmytro: Basically at Respeecher, we build software that lets one actor speak in the voice of another actor. So this is speech-to-speech, voice conversion software that's similar to deepfakes, but for audio. And our main market right now is the film industry, where you can free up an actor's voice, help an actor scale their voice, and use it for preproduction and maybe to help with ADR. And also, of course, for actors who are no longer available for certain reasons, maybe they're sick. Maybe some of them are historical people who died a long time ago. So with consent from their estates or from themselves, we try to use software and machine learning to help them scale their voices.
Julián: Nice. On our previous episode, I learned something very important. The technical term behind it is synthesized media, but the popular name you're going to hear out there is deepfake. So let's explain to our audience: what is a deepfake?
Dmytro: Right. So yeah, there's lots of interpretations, lots of different meanings. Originally deepfakes appeared when people first tried to use GANs, which are generative adversarial networks, just a special kind of neural network. They applied them to generate moving faces of people, to reanimate faces of people and make them say things or appear in videos which the original people never appeared in. So that's why it's called deepfake: because it uses deep learning, deep neural nets, to fake a video and replace a person there. And that term stuck, and now it's used for basically everything that involves synthesizing a human appearance using a neural net. Of course, you could have done that before with CGI and computer graphics, but it usually involves much more resources and much more work, whereas now we can use these pre-trained neural nets to easily replace faces with whatever you want.
Julián: So pretty much instead of having a producer or somebody doing the models and changing things, we have our neural network doing all that work.
Dmytro: Right. Something along these lines, except right now, if you want to do a really high quality deepfake, I'm talking about visuals now. So if you want to make a really high quality deepfake which people will have a very hard time telling apart from a real video, you still need very specialized neural nets and you need to spend a lot of machine learning researchers' time. And I'm pretty sure you need to do a bunch of manual tweaking. So for fun applications, just throwing Elon Musk's face on whatever video you want, a music video or something, that works, but usually other people can tell right away. But if you want to do it for a production grade video, with 4K video or something, that's still pretty challenging. And it's the same for speech. You can easily do a funny, personalized text-to-speech thing which sounds similar to Barack Obama, but to make it into a movie with an isolated high quality voice, that's a much bigger challenge.
Julián: Yeah. I can imagine. Just yesterday I was watching YouTube and I looked at a video of Penn and Teller, this TV show called Fool Us, where they have magicians from all over the world doing crazy magical acts. And there was this French magician who was doing a very simple act, guessing a card. But he was using the voice of Penn, as if it were Penn speaking to him or thinking. And it was pretty much Penn's voice. Penn was very surprised about him using his voice, because it was exactly his voice and they never recorded the voice for the show. So first, how does this work, and what are the different types of synthesized media we can work with for speech? For example, this magician, how was he able to imitate Penn's voice without having Penn working with him to train a model? Can you explain a little bit about the technical side behind this?
Dmytro: Yeah, it depends. There are two major types of synthetic speech. The one that people are most familiar with is TTS, which is text-to-speech, where you take some text, you give it to the model, and you get speech on the output. And there's a whole bunch of TTS models available out there on the internet. You can download them, and with a little bit of skill, you can actually use them. So, I mean, I'm not sure about that specific case, but I would imagine it could have been TTS, and for the TTS to work, you probably could get away with, I don't know, several minutes of speech available somewhere. The best source of speech is usually an audiobook. If some person recorded an audiobook, that's usually a great sample because it's very homogenous. It's recorded in very controlled conditions, the distance from the mic doesn't change, and it's several hours. That's great.
Dmytro: But if you don't need that huge amount of quality, if your converted speech will be playing off of a cell phone in a room with a lot of reverberation, then you don't need that much quality and you can get away with a lower quality synthesis algorithm. And for that, you don't need that much data. You can get away with several minutes of data, which you can scrape somewhere off of YouTube and stuff like that. That said, in reality, if you really want to have high quality sound, you usually ask. We usually ask our clients to provide us with data, because if it's a studio, they have archives of high quality recordings they can just provide us with. And we'll use that. It's way better and way easier to work with than any stuff that we can collect in the public domain.
Julián: So yeah, I imagine with text-to-speech the result can be plain. I imagine it's going to be difficult, I guess, to add emotion and intonation, to be able to imitate the personality of the person behind the voice. Tell us about speech-to-speech, which I think is the most difficult case.
Dmytro: Well, just one side note regarding text-to-speech. There are examples in the research literature of really good text-to-speech, really good emotional text-to-speech. But the problem is you can't precisely control it. It's hard to control text-to-speech because the means of input are text, and maybe you can annotate it somehow. But you can't ask it for a specific question pattern, or a specific intonation, or nonverbal things. That's what makes it a little bit harder to use for creative content. When it comes to speech-to-speech, the difference is that the algorithm, or whatever it is that does the synthesis, takes speech on the input and generates speech on the output. It keeps the content, it keeps the intonations, it keeps the emotion of the source actor, but it tries to replace the identity of the source with the identity of the target speaker.
Dmytro: And there are many levels you can do speech-to-speech voice conversion on. The most perfect speech-to-speech you can imagine would also adapt language, because different people have different linguistic habits. That's the highest possible level of voice conversion, which is really hard, because you need your neural network to not only parse the sounds of the input speech, it should also understand what it's about, take the words, and replace them with words that are more typical of the target speaker. So we don't work on that level. We work on a slightly lower level, where we replace phonemes and things that are peculiar to the target speaker.
Dmytro: So the way you pronounce different sounds, they can have flips in the different durations, say my A in certain contexts might be the longer or shorter than your A in the same context. So these things we try to convert, try to replace, and also the quality of the sound itself, its spectral profile, the performance. So my A sounds different from your A because the resonant frequencies in my mouth are different from the resonant frequencies in your mouth because it's got a slightly different shape. And this is what the network tries to capture and consistently change. So that on the output, it sounds like me and not like you.
Julián: Let's say we want to do a deepfake of my voice, and we train the model and we have enough data and everything. Will this also be able to imitate my accent, for example, how I pronounce English and the strong features of my accent, or is it not there yet?
Dmytro: It really depends. If there were a person with a similar accent on the input, then it would be fine, but you could call that cheating, because we'd be reusing the accent of a different person that's similar to your accent. But if it were a native American English speaker, or a person with a British accent, or whatever other accent, then it would be a mixture on the output. So we're not there yet in terms of converting accents. It's a little bit more difficult than we initially anticipated, because when we started the company, we thought we'd solve it in a year or something. Then it turned out that, oh no, we're here for much longer. But it's exciting anyway.
Julián: So we are in very serious times, and this technology is very powerful. It can be used, as we say, for good, for entertainment, and for preserving history. I mean, we can see a lot of different good uses, but there might also be certain bad actors who are definitely going to use this for bad. Can you tell us how we can even start identifying this synthesized media? How can we tell whether something is the real deal or just fake speech? Are there technologies or tips you can give us so we can train ourselves to identify this type of synthesized media?
Dmytro: When it comes to us, I guess one important thing that we're doing to prevent this is we work exclusively B2B and don't publish B2C apps that anyone can use. By working with studios, first of all, we can afford much more time and resources to focus on a specific voice and make sure that it sounds great. Another reason is that a trusted studio is very unlikely to just all of a sudden start abusing the voice; the likelihood is close to none. Another thing is audio watermarking, which could also be useful for people who are consuming media. The idea behind watermarking is the same as with images: embed a certain code in the audio such that it's imperceptible to humans. You can't tell that there's something there, but a specific program can easily identify it. And of course, no watermark is perfect.
Dmytro: A perfect watermark would be totally imperceptible by a person, or a human, but it will be 100% detectable by a program, no matter what happens to the audio. But there's always a degree of corruption that you can apply to audio. It can be re-encoded a million times. You can cut frequencies, you can play it, lower the quality, add some noise, play with a ton of reverb, and then it will be extremely difficult for the program to still identify that watermark. But the good news is that with video, huge companies like Facebook, Twitter, or these networks where you find information, they're investing a lot of money into deepfake detection algorithms. I believe that even now, I mean it works to some extent, and so Twitter would notify you, if I'm not mistaken, about a potential deepfake in the video.
Dmytro: So that's one thing. So you should be trying to watch for those signs. And another big, helpful thing is there are very typical artifacts for synthetic speech. One of the artifacts that we're actually fighting with is when it becomes slightly, what we call phasey, it almost sounds like there are two people speaking at the same time. So that's one noticeable artifact. Others are maybe some weird mistakes or mispronunciations that are very improbable for humans. If I suddenly replaced an O sound with A sound or if my S is shaky. With these neural nets actually, it's pretty universal.
Dmytro: I hear it in many popular YouTube audio deepfake videos is that you can hear sometimes that is an effect similar as if the person has a little bit of water in their mouth. When they speak, it starts shaking a little bit. So I would probably recommend going on YouTube and listening to a ton of generated synthetic speech and trying to try to listen what it is. Listen very carefully on good headphones, then you can develop a pretty good ear with defects. Right now for us, it's pretty easy usually to detect. But I mean I'm sure it's not going to be like this for much longer.
Julián: Are people building more technology around this, for identifying and protecting this type of content? Is there any standardization committee or group of companies trying to do something together around this topic?
Dmytro: Yeah. There's this group of companies that work on audio, and not just audio, other deepfakes too, like us, Modulate AI, some other companies. And we have this Slack community where we try to see what's going on and find ways of keeping the tech from being misused. There are also companies that work specifically on audio watermarking, and that's basically all they do. We're in the process of trying one company's watermark, just to see how it compares to whatever we have in-house. They've invested much more time in this, so maybe it's better. So one thing is watermarking, and the other part is just educating people about this.
Dmytro: A lot of it comes down to learning, because with Photoshop, if someone shows you a perfectly photoshopped photo, it's really hard. I mean there are some experts that can tell the difference, but for an average person, it's almost impossible to tell that this picture was photoshopped, but it doesn't seem to be that huge of a problem for us in real life. Because everyone knows that it's very easy to photoshop a picture. And then whenever you see something unusual, you're immediately very skeptical about it. And a quick fact check usually shows that it was a fake picture or something like that. So I think as soon as people become more educated, it will be a similar problem.
Julián: Let's talk about the future. Is there any trend or new type of synthesized media that you see might be coming in the future or people are working on?
Dmytro: The main directions for synthesized media right now are text, speech, and video. Maybe we'll see more powerful combinations of those, because right now they go their separate ways. Speech and video are sometimes combined, but when it comes to text, those models are usually separate. Of course, if you want a truly powerful artificial intelligence that sounds and looks like a real person, you want to combine a very powerful language model with speech and video. That's not going to be around anytime soon. I feel like we're really hitting this wall of our models being too dumb to go beyond just being a parrot that repeats everything.
Julián: So do you have any advice for people who want to get into the technical side of synthesized media? Is there any community of developers or resources you can recommend?
Dmytro: One thing is that you need to be good at machine learning. It's not easy, but there's a lot of information out there, deep learning courses and such. Assuming you have some machine learning experience, then when it comes to speech, Google DeepMind and Google's Tacotron team are probably among the most interesting research groups there are. A lot of modern, cool-sounding text-to-speech models are based on the Tacotron architecture that was released a while ago.
Dmytro: So if you just Google Tacotron, they have a page with a whole bunch of papers. And they start with this, with their first paper, which was just this simple TTS model. But yet that was based on actual deep neural nets entirely without any classical things inside. And it was great, but then they kept publishing more and more paper and made it more emotional. They made it more personalized. They made it multi-speaker. They added a whole bunch of interesting things and it's improved it. And you can just read through all these papers and see... Basically have a very good view on the state of the art of synthetic speech AI. It's not really related to speech recognition, but actually the synthetic speech.
Julián: Dmytro, thank you very much for joining us today on this fantastic episode. I'm looking forward to taking a look at what you are building, getting a little deeper into synthesized media, and obviously training ourselves to identify and see the good side of this amazing technology.
A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.