Text-To-Speech (TTS) -- Frequently Asked Questions
Big Changes in 2014 and 2015
In July 2015, support for this TTS web demo will end, following a long run that
began in 2001. The site may continue to operate for some time, but problems are
unlikely to be resolved and email to "tts-feedback" will probably go unread.
This site runs on AT&T servers, and the Interactions team will lose access
to those AT&T machines mid-year.
Work continues on a new generation of TTS, closely integrated with our "Watson"
speech and language platform, which was developed at AT&T.
In December 2014, AT&T spun off the AT&T Watson speech research group,
which is now part of Interactions LLC.
Earlier in 2014, AT&T Labs discontinued support for certain TTS languages
and voices that were available on this web site. While there is still an active
R&D program in speech technology, any plans or schedules for new releases
Table of Contents
- About the website.
- About the product.
- About the research group.
- About the technology.
How does the demo work?
This section addresses the mechanics of using the interactive demo page. For more information
on the Text-To-Speech technology behind the demo please see
How does TTS work?
The Text-To-Speech demo has three easy steps which are numbered on the page.
First, choose one of the available voices. This also selects the language.
Each voice was created from recordings of a native speaker of that language. NOTE:
the web demo limits the length of the text.
Second, enter some text. NOTE: the text should match the chosen language. Please
note that no language translation is done. If you enter English text for
a Spanish voice the results will be unpredictable at best.
Third, click "SPEAK" or "DOWNLOAD". The "SPEAK" button should start
downloading audio within a few seconds and play it with no further action (unless
your browser is set up to download it instead). The "DOWNLOAD" button displays
a new page containing a link to the audio file. In most browsers you can left-click
that link to play the file or right-click for a menu. The menu probably has an
option named "Save Link Target As..." or something to that effect which you can
use to save the file with a name and location of your choosing.
Please note that there are restrictions on the use of this site.
The demo will generally accept text in either UTF-8 or ISO-8859-1 ("Latin-1") encoding.
If accented characters do not behave properly it is very likely an
issue with communication between the website and your browser rather
than the synthesis software itself. Likewise, any restrictions on
length or content are imposed by the website, not by Natural Voices.
TTS accepts diacritics appropriate to the language being synthesized,
e.g. "ñ" in Spanish. Some browsers provide a way to input
special characters, or you may click the small "special characters"
link just above the text box on the right-hand side.
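The encoding issue described above can be illustrated with a short Python sketch. It is not part of the demo; it only shows how the same accented character is represented under the two encodings the demo accepts, and what happens when one is mistaken for the other. The sample word is an arbitrary choice.

```python
# Encode the same Spanish word under the two encodings named above.
text = "mañana"

utf8_bytes = text.encode("utf-8")          # "ñ" becomes two bytes: 0xC3 0xB1
latin1_bytes = text.encode("iso-8859-1")   # "ñ" becomes one byte: 0xF1

print(utf8_bytes)     # b'ma\xc3\xb1ana'
print(latin1_bytes)   # b'ma\xf1ana'

# If UTF-8 bytes are misread as Latin-1, each byte becomes its own character.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)        # maÃ±ana
```

If accented characters come out as pairs of odd symbols like this, the browser and the website most likely disagree about the encoding.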
What if the demo does not work properly?
The demo page was designed to be as simple as possible. You submit
an HTML form containing text and voice selection. We return either an HTTP header redirection
to the completed audio file ("SPEAK" button) or a new page displaying the audio URL ("DOWNLOAD" button).
There are two categories of problems that sometimes arise, and it is helpful if
they are reported separately, as follows:
If a word or name is mispronounced or garbled, use the "SEND FEEDBACK" button
to let us know. This sort of feedback is very helpful for improving the
dictionaries and finding errors in the voice databases. Please be sure that
the text and voice being submitted actually created the problem so that we
can replicate it. Any details you can provide are most appreciated, for example
"My name should sound like ...", or "should rhyme with ...".
Please, please, please check the spelling before you report a mispronunciation.
Less typing for you, less time reading email for us.
And no, our software does not include spelling correction. That can be done
by an application before submitting text to the TTS engine, if appropriate.
Spelling correction can sometimes make outrageous changes to names and acronyms,
and so is unsafe without human supervision.
If you get no audio at all or only part of it, if it is clearly not speech,
or if it stops and starts in pulses, then something is wrong with the speech
creation, delivery, or playback. Please continue reading.
If something is wrong, there may be a problem with our web servers, with the internet itself
or with the hardware or software on your computer. This checklist should help you determine
where the problem is.
If you don't solve the problem directly, you will at least have a better
idea what is wrong. If you believe that there is a problem with our website or TTS
software, please pass along as much detail as you have (scenario, error messages, etc.)
Please note: if the problem is with your computer or browser, we cannot help you.
We're not qualified and it's not our job.
Step 1. Try again.
Maybe there was a glitch in the network. Consistent problems are real problems.
Step 2. Error message?
Was there an actual error message? From the web? From the audio player?
Please send the exact text of any error messages.
Step 3. How far did you get?
Did you see the website demo page? Select a speaker and type some text?
Click the "SPEAK" or "DOWNLOAD" button? Did it appear to download any audio?
Was there a download progress indicator? Was an audio player launched or attempted?
If you get no audio download, there may be something wrong with our TTS server.
If you download audio but cannot play it, the problem is likely on your side
(but keep reading for hints).
The website's basic operation is as simple as possible. You submit voice name
and text. If all is well, our site will redirect your browser to an ordinary WAV file
containing the speech. That file will be available for about 5 minutes, after
which it will be deleted to recover disk space. You only need a standard browser
and the ability to play a WAV file.
Please note that one of the most common problems is caused by packages that block
pop-up windows. These packages also frequently block indirect audio downloads.
Test by clicking "DOWNLOAD" rather than "SPEAK".
Click the resulting audio link to listen. This should work even if HTTP redirections
to audio files are blocked. You'll have to reconfigure or disable your blocker to
use the "SPEAK" button.
Internet Explorer on Windows Vista may claim that the WAV file is corrupted. The
following solution was found on the web. It worked for me, but your mileage may vary.
Go to Control Panel, double-click "Default programs".
Click "Associate a file type or protocol with a program" -
takes a few seconds to load the list.
Double-click the file-type under the "name" column (such as ".mp3").
It is probably showing something already, like "Windows Media Player", but don't believe it.
Now you'll have a list of "Recommended Programs". Double click the one you want.
I chose "Windows Media Player" even though it was already "supposedly" set to that.
Again, it will show "loading" at the bottom of the screen for a few seconds.
We have also seen other issues on Windows over the years, including stopping audio during
the last word and playing older clips instead of the most recent.
Step 4. Sound, but something is wrong?
Maybe you hear sound, or even speech, but it sounds wrong somehow. There are
several possibilities. First be sure that you can play audio files.
Try a sample audio file or test your browser's
audio capability on some multimedia test page.
Did everything sound fine but then stop early? There is currently a 300 character
limit on the website to reduce server load. The CGI script deletes the extra characters,
even in the middle of a word, which can yield odd results for the final word.
The limit may change without notice as server load varies.
Note that the TTS software itself will synthesize any length text.
Did it play at the wrong speed? We currently deliver only 16 kHz audio. You
may have to convert the sample rate if your audio card cannot handle this, but
that would be rare these days.
Did the speech sound choppy or stop too early? Some Microsoft audio players
seem to have problems on the initial download. If the player can replay, e.g.
with VCR-type buttons, it should sound OK on the second playing.
This is a common complaint but we're not sure how to avoid this.
What audio format is used?
The speech output audio format is a simple WAV (a.k.a. RIFF) file. The audio
is 16 kHz, 16-bit linear, i.e. 16,000 samples per second, each sample a 16-bit integer,
mono (not stereo). The website uses these wideband voices for best quality.
We also ship 8 kHz versions of the voices (8,000 samples per second, one 8-bit mu-law value per sample)
for a 4-times reduction in voice database size.
The 8 kHz voices are useful for telephony applications (where the phone line limits quality anyway)
and for platforms with storage limitations.
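As a rough check of the numbers above, the following Python sketch writes one second of silence in the wideband format, reads the header back with the standard `wave` module, and compares the raw data rates of the two formats. It does not touch the demo site. The in-memory buffer stands in for a downloaded file, and the comparison covers raw audio data only, which is merely consistent with the 4-times figure quoted for the voice databases.

```python
import io
import wave

# Write one second of silence in the demo's format: 16 kHz, 16-bit, mono.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)        # mono
    out.setsampwidth(2)        # 2 bytes per sample = 16-bit linear
    out.setframerate(16000)    # 16,000 samples per second
    out.writeframes(b"\x00\x00" * 16000)

# Read the header back, as you might for a file saved from the demo.
buffer.seek(0)
with wave.open(buffer, "rb") as wav:
    channels = wav.getnchannels()
    sample_width = wav.getsampwidth()
    rate = wav.getframerate()
    seconds = wav.getnframes() / rate

# Raw data rates of the two formats described above.
wideband_bytes_per_second = 16000 * 2    # 16-bit linear
narrowband_bytes_per_second = 8000 * 1   # 8-bit mu-law
ratio = wideband_bytes_per_second / narrowband_bytes_per_second

print(channels, sample_width, rate, seconds, ratio)   # 1 2 16000 1.0 4.0
```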
There is no option on the demo page for MP3 or similar encodings.
If you need a different sample rate or audio format you can probably find free software to
convert what we deliver. But before you use the audio for something more than private listening
please check the website usage policy.
Are there restrictions on the use of this site?
Yes, both the site itself and any downloaded audio files have restrictions.
The goal of this site, after all, is for as many people as
possible to become familiar with AT&T Natural Voices™ text to speech.
But, the website is for demonstration purposes only, and is not a free service.
More details can be found below, but here are quick highlights:
-- No broadcast, distribution or publication of audio clips without licensing. (#2)
-- Limited private use is allowed. See examples below. (#3)
-- All synthesis must go through this web page, i.e. don't submit text from your website or apps. (#4)
-- Certain content may be rejected by the website, e.g. if it seems commercial. (#1)
-- Input is logged and treated as private customer data. (#5)
-- AT&T will cooperate with law enforcement, including providing relevant log entries. (#5)
-- Text length and number of accesses may be restricted. (#6)
1. AT&T Natural Voices™ is commercially available.
AT&T Natural Voices™ is
commercially available as a cloud API
and as desktop or server software. Please consider
purchasing if these limitations affect you.
Natural Voices™ is available in desktop, server, and SDK editions.
The desktop voices are an inexpensive add-on to several PC packages
which read documents, convert to MP3 files, etc.
We reserve the right to reject the submission if it seems intended for commercial use. If this happens,
please choose text of a different type or topic to try out the voices.
If this is a problem, please see the
How to Buy section.
And, just to clarify, any restrictions on length, content, etc. are imposed by the website,
not by the synthesis software.
2. No Broadcast, Publication, or Distribution of Audio.
This website is intended only for demonstration purposes.
Audio clips created here may not be broadcast, published, or distributed.
It doesn't matter if the use is non-commercial. Any such use requires
licensing, so please contact
Wizzard Speech LLC.
3. Limited Use of Audio Clips.
Audio files produced on our site are intended only for private, non-commercial use.
This is not legal advice (hey, we're researchers) but here are some scenarios.
Speaking events on your own computer is probably OK. Sharing the audio clips on
the internet is not. A class project is probably OK. Bugging your friends is probably
OK (with us).
Any use that involves wide distribution or long lifetime is probably not OK,
whether or not it is commercial. Audio clips used in songs, videos or game levels
cannot be made publicly available on the internet. Building or prototyping
a software package using our audio rather than recording your own prompts is
definitely not OK.
The common thread here is a very limited audience, and minimal impact on
potential commercial sales.
If you aren't sure, please ask.
If you need the software or a license to use the audio, you can find a link to Wizzard Speech LLC in the
How to Buy section of this FAQ.
4. Only Through This Demo Site.
Website resources are limited and are needed to support this site. Direct access
to the CGI scripts is not permitted. You may refer people to this site
but you may not have users enter text on your site and use our servers to provide
audio, even if you give us credit. To run such a site you must install your own TTS server
with licenses from Wizzard Speech LLC.
5. Input Text Will Be Logged.
Text will be logged. This information is handled just as any other private customer
data and is kept confidential. However, AT&T will cooperate with law enforcement
when there is legal authorization to do so.
6. Text Length and Number of Submissions Per Computer Are Limited.
The number of submissions is limited. Access may be temporarily blocked if
there are too many submissions from one computer. This, together with the length limit,
allows more users to try the demo and shares the limited resources more fairly.
Limits on submissions per day will be adjusted as needed to keep the website functioning normally.
Length of the input text is limited, typically to 300 characters. Anything longer
is chopped off, and may result in partial words or single letters at the end of
the speech. The length limit is a website feature and helps regulate server load.
If you need to synthesize longer text, the product can handle input of any length.
Limits on text length will be adjusted as needed to keep the website functioning normally.
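The effect of the length limit can be sketched in a few lines of Python. This is not the website's actual script. Both helper names are hypothetical, and the second function is only a suggestion for trimming your own text at a word boundary before you submit it.

```python
MAX_CHARS = 300  # typical limit quoted in this FAQ; the real value may change

def truncate_like_demo(text, limit=MAX_CHARS):
    """Drop everything past the limit, even in the middle of a word."""
    return text[:limit]

def truncate_at_word(text, limit=MAX_CHARS):
    """Trim to the last complete word that fits within the limit."""
    if len(text) <= limit:
        return text
    # Look one character past the limit so a word ending exactly there is kept.
    cut = text[:limit + 1].rfind(" ")
    if cut == -1:
        return text[:limit]   # a single very long word: nothing better to do
    return text[:cut].rstrip()

sample = "The quick brown fox jumps over the lazy dog"
print(truncate_like_demo(sample, 22))   # The quick brown fox ju
print(truncate_at_word(sample, 22))     # The quick brown fox
```

The first result ends in a fragment ("ju"), which is the kind of input that produces odd sounds at the end of the speech.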
Can the synthesis be modified?
It is possible to change the way the speech sounds by altering the input text.
Liberal use of commas is the easiest way to get better phrasing,
especially in long complex sentences. Overall speed can be controlled
using XML-style tags from the SSML standard, e.g.
<prosody rate="slow"> this is speaking slowly </prosody>.
<prosody rate="fast"> this is speaking fast </prosody>.
<prosody rate="-50%"> this is 50% slower </prosody>.
Rate control does not change the pitch of the speech output.
Precise pauses can also be inserted using the <break/> tag, e.g.
Break for 100 milliseconds <break time="100ms"/> Okay, keep going.
Break for 3 seconds <break time="3s"/> Okay, keep going.
Voices and languages can be intermixed using the <voice> tag, e.g.
<voice name="crystal">Crystal, 1 2 3.</voice>
<voice name="rosa">Rosa, 1 2 3.</voice>
<voice name="mike">Back to Mike.</voice>
The Speech Synthesis Markup Language,
or SSML, is defined by the W3C organization. Note that not all tags are supported. See the documentation
for specific product releases for more details.
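If you generate tagged text from a program, it helps to escape the text and check that the tags are balanced before pasting the result anywhere. The Python sketch below builds the three tags described above and checks them with a standard XML parser. The helper names are hypothetical. The `<speak>` wrapper is the root element defined by the SSML standard and is added here only so the parser sees a single root; whether a given product release requires or accepts it is not covered in this FAQ.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def prosody(text, rate):
    """Wrap text in a <prosody> tag with the given speaking rate."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"

def pause(time):
    """Return a <break/> tag, e.g. pause("100ms") or pause("3s")."""
    return f"<break time={quoteattr(time)}/>"

def voice(name, text):
    """Wrap text in a <voice> tag naming the speaker."""
    return f"<voice name={quoteattr(name)}>{escape(text)}</voice>"

fragment = " ".join([
    prosody("this is speaking slowly", "slow"),
    pause("100ms"),
    voice("crystal", "Crystal, 1 2 3."),
    voice("rosa", "Rosa & friends, 1 2 3."),   # "&" is escaped to "&amp;"
])

# Parsing raises ET.ParseError if any tag is unbalanced or malformed.
root = ET.fromstring(f"<speak>{fragment}</speak>")
tags = [child.tag for child in root]

print(fragment)
print(tags)   # ['prosody', 'break', 'voice', 'voice']
```

Note that XML tag names are case-sensitive, so `<break/>` and `<Break/>` are different tags.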
How to get AT&T Natural Voices™?
NV is available for use as follows:
A more thorough description can be found here.
The AT&T Developer Program offers cloud services through its
Wizzard Speech LLC handles
licensing and support of released versions of AT&T Natural Voices™ for US English and US Spanish.
Please be advised that the version of AT&T Natural Voices used in the online Research demo
may change without warning and may differ in quality and voice selection from commercially available
Who supports AT&T Natural Voices™?
supports all issues related to the web API or the speech output.
Wizzard Speech LLC supports desktop and server SDK
editions for US English and US Spanish. Please contact Wizzard Speech for all questions about sales, licensing,
updates, supported platforms, etc.
If you have an installed version of AT&T Natural Voices™, your vendor
is the first line of support. They know their own products and common problems involving
installation and interaction with TTS, and so are most likely to be of immediate help.
Our TTS research group at AT&T Labs cannot handle direct customer support. As a small
research group focused on the underlying science and technology, we just don't have sufficient
resources for field support. In particular, we haven't seen half the applications that use NV
and have no idea what their error messages mean. Problems may be escalated to us by support staff,
in which case we can handle any serious issues just once, allowing us time to continue our research
to improve TTS quality.
Whose synthesizer is this?
Our Research group at AT&T Labs produced AT&T Natural Voices™.
The website demo runs a recent Research version of the synthesizer (and you
may note differences from the released product).
This TTS system was originally known as "Next-Generation TTS" or "Next-Gen"
and some published technical papers refer to it by that name. The "Natural Voices"
name came about when our system was introduced as a commercial product circa 2001.
Who works on TTS?
AT&T has a very long history in speech synthesis, beginning in Bell Labs and continuing
in the newly formed AT&T Labs following the Lucent spin-off in 1996. Initially at Murray Hill,
we are now located in Florham Park, NJ but are moving to Bedminster, NJ in October 2013.
Though our research
team is relatively small, we have leveraged available resources to create a ground-breaking
TTS system in AT&T Natural Voices™. And although TTS has advanced considerably
in the past few years there is still much room for improvement, and research continues.
You can learn more about our work on the Publications page.
How can I contact the research group?
You can send questions, suggestions and problems to the research group at
We cannot promise to respond to every email
-- there are just too many --
but we do our best.
What is Text-To-Speech?
Text-To-Speech, or TTS for short, is computer software that converts text into audible
speech. You can try it yourself on our demo page. See our
Home page for more information.
TTS is separate from speech recognition. You can think of TTS as "talking" and speech
recognition as "listening". There is some shared technology, but neither is just the
reverse of the other. And the talking/listening analogy is limited too. Neither
technology really involves much language understanding.
TTS is also distinct from language translation, though voice to voice
translation would employ both speech recognition and TTS. Again, translation requires
significant understanding of the meaning.
People new to the idea of TTS often underestimate the difficulty of the task. After all,
humans can typically learn this stuff in early childhood. They talk, listen, understand, and even
translate without much apparent effort. Humans do all this work without even being
aware of it in most cases, but that doesn't make it easy.
If programmers could create software that really understands human language we could avoid most
of the guesswork in TTS, but that hasn't happened yet. Until then, TTS is more like
learning to read a foreign language aloud without ever understanding the words.
With a good dictionary, grammar rules, etc. you can get better and better but will still make
mistakes occasionally that are obvious to native speakers.
How does TTS work?
TTS is often described as two conceptual stages. In the first stage, it decides how
the text should be spoken, that is, how each word should be pronounced, what
length and pitch each phoneme should have, etc. In the second stage, the system does
its best to create audio that matches the specifications produced by stage one.
TTS software has little or no understanding of the text being read. It uses
rules, lists, dictionaries, etc. to make very sophisticated guesses about how a piece
of text should be read. While general performance can be quite good, some decisions
are intrinsically hard to make without some level of understanding.
For example, the word "bass" is pronounced differently in the phrases "bass drum" and "bass boat".
Intonation depends in many cases on the writer's intention, which often cannot be inferred in
short texts even by human readers. As a result, TTS systems will occasionally make mistakes
and can be fooled by carefully constructed texts.
These are challenging problems for all TTS systems, and we continue to improve ours as we are able.
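To make the "bass" example concrete, here is a toy Python sketch of a rule-based guess. It is not how Natural Voices works. The cue words, the default choice, and the informal ARPAbet-style phoneme strings are all assumptions made up for illustration, and a real system uses far richer rules and dictionaries.

```python
# Hypothetical cue words that hint at each sense of "bass".
BASS_CUES = {
    "music": {"drum", "guitar", "clef", "player"},
    "fish": {"boat", "fishing", "lake", "lure"},
}

# Informal ARPAbet-style pronunciations for the two senses.
PRONUNCIATION = {"music": "B EY S", "fish": "B AE S"}

def pronounce_bass(sentence, default="music"):
    """Guess a pronunciation of "bass" from other words in the sentence."""
    words = [w.strip('.,!?"').lower() for w in sentence.split()]
    for sense, cues in BASS_CUES.items():
        if cues.intersection(words):
            return PRONUNCIATION[sense]
    # No cue found: fall back to a fixed guess, which may well be wrong.
    return PRONUNCIATION[default]

print(pronounce_bass("She plays the bass drum."))   # B EY S
print(pronounce_bass("He bought a bass boat."))     # B AE S
print(pronounce_bass("I caught a big bass."))       # B EY S (a wrong guess)
```

The last line shows the failure mode described above: without a cue word, or without real understanding, the software can only guess.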
The type of TTS we do is called a "concatenative" system, meaning that we record a human
speaker to make a voice database. We re-use small chunks of the recordings to create
new sentences containing words that were never recorded. Further, we do "unit selection"
synthesis. This means that we use large voice databases and do clever searches on-the-fly
to find chunks in the voice database that best match the requested sentences.
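The search can be sketched as a small dynamic program in Python. This is a common textbook formulation of unit selection (a "target cost" for how well a chunk matches the request plus a "join cost" for how well neighboring chunks fit together), not a description of the actual Natural Voices search. The cost functions, the pitch values, and the tiny database are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    phoneme: str
    pitch: float   # average pitch of the chunk, in Hz
    index: int     # position of the chunk in the original recording

def target_cost(unit, wanted_pitch):
    """How far the chunk is from what was requested."""
    return abs(unit.pitch - wanted_pitch)

def join_cost(left, right):
    """How audible the seam between two chunks is likely to be."""
    if right.index == left.index + 1:
        return 0.0   # adjacent in the recording: a seamless join
    return 10.0 + abs(left.pitch - right.pitch)

def select_units(targets, database):
    """targets: list of (phoneme, wanted_pitch). Returns (cost, units)."""
    if not targets:
        return 0.0, []
    best = {}   # unit -> (cost of cheapest path ending in that unit, path)
    for position, (phoneme, wanted) in enumerate(targets):
        candidates = [u for u in database if u.phoneme == phoneme]
        if not candidates:
            raise ValueError(f"no recorded unit for {phoneme!r}")
        new_best = {}
        for unit in candidates:
            here = target_cost(unit, wanted)
            if position == 0:
                new_best[unit] = (here, [unit])
                continue
            # Cheapest way to arrive at this unit from any previous unit.
            cost, path = min(
                ((c + join_cost(p[-1], unit), p) for c, p in best.values()),
                key=lambda item: item[0],
            )
            new_best[unit] = (cost + here, path + [unit])
        best = new_best
    return min(best.values(), key=lambda item: item[0])

# Recordings of "cat" (units 0-2) and "bat" (units 3-5).
database = [
    Unit("k", 110.0, 0), Unit("ae", 112.0, 1), Unit("t", 108.0, 2),
    Unit("b", 120.0, 3), Unit("ae", 100.0, 4), Unit("t", 101.0, 5),
]

# Request "cab", a word that was never recorded.
cost, units = select_units([("k", 110.0), ("ae", 105.0), ("b", 115.0)], database)
print(cost, [u.index for u in units])   # 30.0 [0, 1, 3]
```

Here the search keeps the "k" and "ae" chunks that were recorded together, because their join costs nothing, even though the other "ae" is closer to the requested pitch.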
Who uses Text-To-Speech?
TTS is used in a wide variety of services and applications. Commercially, help desks
and voice response systems are probably the most important. On nights and weekends, when
people are scarce, customers can still get some basic information from computers, for
example, an account balance. The computer listens either to speech or touch tones and
responds with TTS. These apps are typically over the telephone but might also be at
kiosks or automated tellers.
TTS is also used on personal devices, e.g. on a PC to proof-read a document or to learn
a new language. This category also includes assistive technologies such as screen-readers
for the visually impaired and as a substitute voice for those who cannot speak.
As the quality of synthetic voices continues to improve, barriers to new applications
drop. Some applications, to guarantee high quality, record all the things that need to
be said. This can be expensive, impractical, or impossible depending on the task.
TTS is often a better option if the voices are good enough.
Can I use my own voice?
Not in AT&T Natural Voices™.
The reasons for wanting customized voices are varied. Some people just think it
would be cool. Some are losing their voices due to a medical condition or upcoming surgery
and would like to have their own synthetic voice rather than a generic one. Some people have
audio tapes of a late loved one.
(See the reference to ModelTalker
in the section on Assistive Technologies
below that may be useful for people soon to lose their voices.)
Creating high-quality voices
requires a good voice talent, a sound-proof room, professional audio equipment,
hours of written material with thorough coverage of phoneme combinations in the
language, and the time and expertise to turn those recordings into a decent synthetic
voice. Because of the expense involved, custom voice builds are usually done for
corporations that want to computerize an existing actor's voice, for example to
continue a brand image.
Since even professional actors reading well-chosen material don't always synthesize well,
another possibility is to get the highest quality recordings possible, and as much of
it as possible. Keep the recordings in a safe place until the technology improves for
transforming one voice to sound like another.
It may take far less material to build a transformation model than it does to
build a TTS voice from scratch. Eventually it may be possible to take a good TTS voice that
is roughly similar (e.g. mid-pitch-range male, same accent) and morph it to sound like the
What about assistive technologies?
TTS systems have a long history in assistive technologies.
First, there are two basic classes of TTS software:
- General-purpose TTS software on a desktop or laptop PC.
Small, fast general-purpose TTS systems are available, and many come with simple
applications which allow text typed on the screen to be read.
There are multiple systems on the market (ours included), varying widely in voice
quality, hardware requirements, and price.
- Specialized packages for various disabilities.
These vary greatly in price and capability, and are sometimes customized
for particular users. These products are needed when general-purpose
hardware and software do not suit the user's needs.
Disability-related systems can be grouped into two basic types:
Screen readers for the sight-impaired are an example of the first case.
The user listens to the voice long enough to adapt and reach high
intelligibility even for low quality voices. A convenient user
interface with good responsiveness is perhaps more important than
voice quality. The ability to vary the speaking rate, and particularly
to choose extremely fast speaking rates, is crucial.
Stephen Hawking is the classic example of the second case, typing on a PC
so that other people can listen. Here, the listeners come and go
frequently and so most have less opportunity to adapt to low-quality voices.
Fast speaking rates are not needed. If keyboard input is difficult
or impossible, touch screens or physical buttons are customized to
represent words or phrases. Buttons may be selected in a variety of ways,
depending on the user's abilities.
These systems range from a budget PC with inexpensive TTS software
to highly customized button boards controlled with exotic control
Here are some links that may help. I'm sure that many other
relevant sites can be found by searching on the web.
A "beta" (free preview version) of a program called ModelTalker is
available now (early 2008). This package allows a person
(e.g. someone soon to lose their voice) to record themselves
and create a synthetic voice that sounds more or less similar.
The examples from the website sound pretty good and there are
some good researchers behind it. We don't know what the
commercial version will cost or when that will come out,
but this is the first such program that we're aware of. It
may also be available with pre-existing voices.
The misc.handicap newsgroup sometimes
has discussions of commercial TTS products. This is especially useful
because you can ask questions of people who face similar challenges.
Info on many specialized solutions.
Links to many TTS systems, some commercial, some research.
How can I learn more about TTS?
A good starting point is "An Introduction to Text-to-Speech Synthesis"
by Thierry Dutoit, published by Kluwer Academic Press. For hands-on
experience, take a look at free TTS software called Festival from the
University of Edinburgh in Scotland. Downloads are available, and it
includes instructions for creating new voices and languages. Many
universities around the world use this software as a framework for their
own research. There are also some technical papers on our website that
might serve as entry points into the research literature, e.g. by also
reading some of the referenced papers.