Audio Deepfakes Result in Some Pretty Convincing Mashup Performances

kottke.org. home of fine hypertext products since 1998.

posted Apr 30 @ 12:36 PM by Jason Kottke

Have you ever wanted to hear Jay Z rap the “To Be, Or Not To Be” soliloquy from Hamlet? You are in luck:

What about Bob Dylan singing Britney Spears’ “…Baby One More Time”? Here you go:

Bill Clinton reciting “Baby Got Back” by Sir Mix-A-Lot? Yep:

And I know you’ve always wanted to hear six US Presidents rap NWA’s “Fuck Tha Police”. Voilà:

This version with the backing track is even better. These audio deepfakes were created using AI:

The voices in this video were entirely computer-generated using a text-to-speech model trained on the speech patterns of Barack Obama, Ronald Reagan, John F. Kennedy, Franklin Roosevelt, Bill Clinton, and Donald Trump.

The program is trained on a bunch of recorded speech from someone and then, in theory, you can provide any text you want and the virtual Obama or Jay Z will speak it. Some of these are more convincing than others, but with a bit of manual tinkering, I bet you could clean up the weaker ones enough to make them convincing too.
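
For the curious, here is a rough sketch of the kind of input a text-to-speech model like this needs before any training happens: short audio clips of the target voice, each paired with a transcript of what is said in it. This is only an illustration, not how these particular videos were made. The file names, the `Utterance` class, and both helper functions are made up for the example, and the `path|text` line format is just one common convention for training filelists. No model is actually trained here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    audio_path: str   # hypothetical path to a short clip of the target voice
    transcript: str   # the words spoken in that clip


def build_filelists(utterances, val_fraction=0.1, seed=0):
    """Shuffle clip/transcript pairs and split them into train and validation lists."""
    items = list(utterances)
    random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
    n_val = max(1, int(len(items) * val_fraction)) if len(items) > 1 else 0
    return items[n_val:], items[:n_val]


def to_manifest_line(utt):
    """Format one pair as 'path|text' (an assumed, commonly used filelist layout)."""
    text = " ".join(utt.transcript.split())  # collapse stray whitespace
    return f"{utt.audio_path}|{text}"


# Placeholder data: these clips do not exist, they only show the shape of the input.
clips = [
    Utterance("clips/0001.wav", "To be, or not to be,  that is the question."),
    Utterance("clips/0002.wav", "Whether 'tis nobler in the mind to suffer"),
    Utterance("clips/0003.wav", "The slings and arrows of outrageous fortune,"),
    Utterance("clips/0004.wav", "Or to take arms against a sea of troubles"),
]

train, val = build_filelists(clips, val_fraction=0.25)
manifest = [to_manifest_line(u) for u in train]
```

The model learns the mapping from the text in each line to the sound in the matching clip, which is why it can later be handed text the speaker never said.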

Two of the videos featuring Jay Z’s synthesized voice were forced offline by a copyright claim from his record company but were reinstated. As Andy Baio notes, these deepfakes are legally interesting:

With these takedowns, Roc Nation is making two claims:

1. These videos are an infringing use of Jay-Z’s copyright.
2. The videos “unlawfully uses an AI to impersonate our client’s voice.”

But are either of these true? With a technology this new, we’re in untested legal waters.

The Vocal Synthesis audio clips were created by training a model with a large corpus of audio samples and text transcriptions. In this case, he fed Jay-Z songs and lyrics into Tacotron 2, a neural network architecture developed by Google.

It seems reasonable to assume that a model and audio generated from copyrighted audio recordings would be considered derivative works.

But is it copyright infringement? Like virtually everything in the world of copyright, it depends on how it was used, and for what purpose.

Celebrity impressions by people are allowed, so why not ones by machines? It’ll be interesting to see where this goes as the tech gets better.