How to (pre)process audio?
Fourier transform
-> represent the audio file as a spectrum …
=> overlapping-window-based approach (short-time Fourier transform) …
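A minimal sketch of this windowed spectrum extraction, assuming librosa is installed; the file name and window parameters are illustrative:

```python
import numpy as np
import librosa

# Load audio as a mono waveform (resampled to 22050 Hz)
wav, sr = librosa.load("speech.wav", sr=22050)

# Short-time Fourier transform: overlapping windows,
# each window transformed to a spectrum
stft = librosa.stft(wav, n_fft=1024, hop_length=256, win_length=1024)

# Magnitude spectrogram (frequency bins x time frames)
spectrogram = np.abs(stft)
print(spectrogram.shape)
```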
What is the process of training Tacotron 2 (audio deepfake)?
collect dataset <text, audio>
preprocess (to spectrogram)
train model (such as Tacotron …)
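A minimal sketch of the first two steps, assuming librosa; the metadata format (one `wav_path|transcript` per line) is an assumption modeled on common TTS datasets, not a fixed standard:

```python
import librosa

def preprocess(metadata_path):
    """Turn a <text, audio> dataset into <text, mel spectrogram> pairs."""
    pairs = []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            wav_path, text = line.strip().split("|", 1)
            wav, sr = librosa.load(wav_path, sr=22050)
            # Tacotron 2-style mel spectrogram as the training target
            mel = librosa.feature.melspectrogram(
                y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
            pairs.append((text, mel))
    return pairs
```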
What is a challenge with audio deepfakes training?
mismatch in training data
X: Text (what is said)
Y: information in the target audio waveform:
- Speakers (who says it)
- prosody (pitch, speed, emotions, breaks, …)
- channel information such as background noise, recording artefacts, …
=> What will a data-driven model X -> Y learn?
How can one reuse a model to fit it to specific voices?
fine-tune the model
multi-speaker architecture
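A hedged PyTorch sketch of the fine-tuning idea; the tiny model, the commented checkpoint path, and the dummy batch are hypothetical stand-ins, not Tacotron 2's real architecture:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained text-to-spectrogram model
class TinyAcousticModel(nn.Module):
    def __init__(self, vocab=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, 256)
        self.rnn = nn.GRU(256, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, text_ids):
        h, _ = self.rnn(self.embed(text_ids))
        return self.out(h)

model = TinyAcousticModel()
# model.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint

# Small learning rate: only nudge pretrained weights toward the target voice
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Dummy batch standing in for the small <text, audio> set of the target speaker
text_ids = torch.randint(0, 128, (4, 20))
mel_target = torch.randn(4, 20, 80)

for _ in range(3):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(text_ids), mel_target)
    loss.backward()
    optimizer.step()
```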
How can one get audio data?
YouTube?
e.g. interviews, podcasts, …
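A sketch of grabbing such audio with the yt-dlp Python API; the URL is a placeholder, and consent/copyright must be respected:

```python
import yt_dlp

opts = {
    "format": "bestaudio/best",
    "outtmpl": "data/%(title)s.%(ext)s",
    # Convert the downloaded stream to WAV for preprocessing
    "postprocessors": [
        {"key": "FFmpegExtractAudio", "preferredcodec": "wav"}
    ],
}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=EXAMPLE"])
```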
What are the advantages and disadvantages of using interview files for training audio deepfakes?
+ plenty of good data
- several speakers
- noisy (background …)
- hard to transcribe ('ähm' …)
- bad prosody (information entanglement)
What are the advantages and disadvantages of using podcast files for training audio deepfakes?
+ easier to transcribe
+ single speaker, clean audio, even prosody
- little data -> need many podcasts -> heterogeneous data …
What are multi-speaker TTS models?
embed the speaker identity in the input so that a single model can synthesize several voices
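A minimal PyTorch sketch of the speaker-embedding idea; dimensions and names are illustrative, not a specific model's API:

```python
import torch
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    """Concatenate a learned speaker embedding to the text encoding,
    so one model can be conditioned on different voices."""
    def __init__(self, vocab=128, n_speakers=10, text_dim=256, spk_dim=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, text_dim)
        self.speaker_embed = nn.Embedding(n_speakers, spk_dim)

    def forward(self, text_ids, speaker_id):
        t = self.text_embed(text_ids)                 # (batch, time, text_dim)
        s = self.speaker_embed(speaker_id)            # (batch, spk_dim)
        s = s.unsqueeze(1).expand(-1, t.size(1), -1)  # broadcast over time
        return torch.cat([t, s], dim=-1)              # speaker-conditioned encoding

enc = MultiSpeakerEncoder()
out = enc(torch.randint(0, 128, (2, 15)), torch.tensor([3, 7]))
print(out.shape)  # torch.Size([2, 15, 320])
```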