Text-to-speech synthesis

By 2026, conversational AI within contact centres is projected to automate one in 10 agent interactions and save $80 billion in labour costs. Text-to-speech synthesis is one of the key enabling technologies, demonstrated here in a Python project that converts text to voice-over audio.

Heiko Joerg Schick

Sep 7, 2022 • 1 min read

A little Python project

By 2026, conversational artificial intelligence deployments within contact centres will reduce agent labour costs by $80 billion, according to Gartner. As digital transformation accelerates, many jobs at call centres and in the customer service industry are affected. Gartner projects that one in 10 agent interactions will be automated by 2026. The advantage of conversational artificial intelligence is that it can automate customer interactions across both voice and digital channels.^[1]

A key technology for conversational agents is text-to-speech (TTS) synthesis, a technology that artificially synthesises the human voice.

I started using such technologies to combine the receptive skills of reading and listening, which helps people learn a language better and study more efficiently. For this reason, I wrote a little Python project that converts an input text file into a spoken audio output file, allowing me to generate voice-overs for my blog and op-ed postings.
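As a rough illustration of the text-file-to-audio flow described above, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the names `split_sentences`, `placeholder_synthesise` and `text_file_to_wav` are hypothetical, the 22,050 Hz sample rate is an assumption, and the synthesis step is a stand-in that emits silence where a real TTS model would be called.

```python
import re
import sys
import wave
from array import array
from pathlib import Path
from typing import Callable, List

SAMPLE_RATE = 22050  # assumed sample rate in Hz; adjust to match the real model


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def placeholder_synthesise(sentence: str) -> List[int]:
    """Hypothetical stand-in for a TTS model: 16-bit silence, 1/20 s per character."""
    return [0] * (len(sentence) * SAMPLE_RATE // 20)


def text_file_to_wav(
    src: str,
    dst: str,
    synthesise: Callable[[str], List[int]] = placeholder_synthesise,
) -> int:
    """Read a text file, synthesise each sentence, write one mono 16-bit WAV.

    Returns the number of sentences processed.
    """
    sentences = split_sentences(Path(src).read_text(encoding="utf-8"))

    samples = array("h")  # signed 16-bit integers
    for sentence in sentences:
        samples.extend(synthesise(sentence))

    # WAV files store PCM samples in little-endian byte order.
    if sys.byteorder == "big":
        samples.byteswap()

    with wave.open(str(dst), "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(SAMPLE_RATE)
        out.writeframes(samples.tobytes())

    return len(sentences)
```

Processing sentence by sentence keeps each model input short. To produce actual speech, `placeholder_synthesise` would be replaced by a function that wraps a real TTS model and returns 16-bit PCM samples at the same sample rate.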

The project was inspired by Dr Tristan Behrens's^[2] arxiv-reader^[3], which converts arXiv papers to audio. The project also uses the FastSpeech2 model from fairseq S^2.

The voice-over for this posting was also generated with this project.

To execute the Python script, run the following commands (please note that I use pyenv).

make env

make tts

The project is available via the following repository: https://gitlab.h3132.de/schihei/tts