Brand Voice: Deep Learning for Speech Synthesis

Speech Synthesis


The Data

my_voice_000001.wav|I am a wonderful being.
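Each metadata row pairs a clip filename with its transcript, separated by a pipe, one row per recording. As a minimal sketch of producing such a file (the filenames and sentences here are made up; in practice they come from your own recordings):

```python
from pathlib import Path

# Hypothetical transcripts keyed by clip filename.
transcripts = {
    'my_voice_000001.wav': 'I am a wonderful being.',
    'my_voice_000002.wav': 'Speech synthesis is fun.',
}

def write_metadata(transcripts: dict, out_path: str, sep: str = '|') -> None:
    """Write one `filename|text` row per clip, LJSpeech style."""
    rows = [f'{name}{sep}{text}' for name, text in transcripts.items()]
    Path(out_path).write_text('\n'.join(rows) + '\n', encoding='utf-8')
```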

The Code for Speech Synthesis

git clone

Preprocessing: the what

Preprocessing: the how

wav_directory: '/data/datasets/MyVoice/'
metadata_path: '/data/datasets/MyVoice/metadata.csv'
log_directory: '/data/logs/tts/'
train_data_directory: 'transformer_tts_data'
data_config: './data_config.yaml'
aligner_config: './aligner_config.yaml'
tts_config: './tts_config.yaml'

data_name: ljspeech # raw data naming for default data reader (select function from data/
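Since the paths file is flat key/value YAML, a quick sanity check of the wiring doesn't even need a YAML library. A minimal sketch (the parser below is a hypothetical stand-in for `yaml.safe_load`, good enough only for this file's shape):

```python
from pathlib import Path

def read_flat_yaml(path: str) -> dict:
    """Parse flat `key: 'value'` pairs, dropping `#` comments."""
    pairs = {}
    for raw in Path(path).read_text(encoding='utf-8').splitlines():
        line = raw.split('#', 1)[0].strip()
        if ':' not in line:
            continue
        key, _, value = line.partition(':')
        pairs[key.strip()] = value.strip().strip("'\"")
    return pairs
```

Checking that the directory values actually exist before kicking off preprocessing saves a failed run later.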
def my_voice(metadata_path: str, column_sep='|') -> dict:
    text_dict = {}
    with open(metadata_path, 'r', encoding='utf-8') as f:
        for l in f.readlines():
            l_split = l.split(column_sep)
            # rows look like `my_voice_000001.wav|I am a wonderful being.`
            filename, text = l_split[0], l_split[1]
            if filename.endswith('.wav'):
                filename = filename.split('.')[0]
            text = text.replace('\n', '')
            text_dict.update({filename: text})
    return text_dict
python create_training_data.py --config configs/session_paths.yaml


Aligning the data: the what

Aligning the data: the how

python train_aligner.py --config configs/session_paths.yaml

Get them durations

python extract_durations.py --config configs/session_paths.yaml
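Duration extraction turns the aligner's attention into an integer number of mel frames per input symbol. The repo's actual procedure is more involved, but the core idea can be sketched as counting, for each mel frame, which symbol it attends to most strongly:

```python
import numpy as np

def durations_from_attention(attention):
    """Given an (mel_frames x symbols) attention matrix, return how many
    frames attend most strongly to each symbol. Conceptual sketch only."""
    best = attention.argmax(axis=1)  # winning symbol index per mel frame
    return np.bincount(best, minlength=attention.shape[1])

# toy example: 6 mel frames over 3 symbols
att = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
])
durations_from_attention(att)  # → array([2, 2, 2])
```

By construction the durations sum to the number of mel frames, which is exactly what the TTS model needs to expand symbols to frame length.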

Train the TTS model

python train_tts.py --config configs/session_paths.yaml
from utils.config_manager import Config
from utils.audio import Audio  # module path assumed; adjust to your checkout

config_loader = Config(config_path='configs/session_paths.yaml')
audio = Audio(config_loader.config)
model = config_loader.load_model()
out = model.predict("I don't speak often, but when I do, it's for ODSC East.")
# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)



ODSC - Open Data Science


Our passion is bringing thousands of the best and brightest data scientists together under one roof for an incredible learning and networking experience.