AT&T Labs - Research

Text-To-Speech (TTS) -- Frequently Asked Questions


   Big Changes in 2014 and 2015
In July 2015, support for this TTS web demo will end, following a long run that began in 2001. The site may continue to operate for some time, but problems are unlikely to be resolved and email to "tts-feedback" will probably go unread. This site runs on AT&T servers, and the Interactions team will lose access to those AT&T machines mid-year.

Work continues on a new generation of TTS, closely integrated with our "Watson" speech and language platform, which was developed at AT&T.

In December 2014, AT&T spun off the AT&T Watson speech research group, which is now part of Interactions LLC.

Earlier in 2014, AT&T Labs discontinued support for certain TTS languages and voices that were available on this web site. While there is still an active R&D program in speech technology, any plans or schedules for new releases are proprietary.


   How does the demo work?

This section addresses the mechanics of using the interactive demo page. For more information on the Text-To-Speech technology behind the demo please see How does TTS work?

The Text-To-Speech demo has three easy steps which are numbered on the page.

  • First, choose one of the available voices. This also selects the language. Each voice was created from recordings of a native speaker of that language. NOTE: the web demo limits the length of the text.

  • Second, enter some text. NOTE: the text should match the chosen language, because no language translation is done. If you enter English text for a Spanish voice, the results will be unpredictable at best.

  • Third, click "SPEAK" or "DOWNLOAD". The "SPEAK" button should start downloading audio within a few seconds and play it with no further action (unless your browser is set up to download it instead). The "DOWNLOAD" button displays a new page containing a link to the audio file. In most browsers you can left-click that link to play the file or right-click for a menu. The menu probably has an option named "Save Link Target As..." or something to that effect which you can use to save the file with a name and location of your choosing. Please note that there are restrictions on the use of downloaded audio.

The demo will generally accept either UTF-8 or ISO-8859-1 "Latin-1" character input. If accented characters do not behave properly it is very likely an issue with communication between the website and your browser rather than the synthesis software itself. Likewise, any restrictions on length or content are imposed by the website, not by Natural Voices. TTS accepts diacritics appropriate to the language being synthesized, e.g. "ñ" in Spanish. Some browsers provide a way to input special characters, or you may click the small "special characters" link just above the text box on the right-hand side.
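To illustrate why accented characters can go wrong between browser and website, here is a minimal Python sketch. It is not the website's code; the sample word and the helper name decode_either are assumptions made for illustration. It shows that "ñ" is two bytes in UTF-8 but one byte in Latin-1, and what happens when one is read as the other.

```python
# Illustrative only: the same accented character is encoded differently
# in UTF-8 and in ISO-8859-1 (Latin-1).
text = "mañana"

utf8_bytes = text.encode("utf-8")      # ñ becomes two bytes: 0xC3 0xB1
latin1_bytes = text.encode("latin-1")  # ñ becomes one byte:  0xF1

# Reading UTF-8 bytes as if they were Latin-1 yields garbled text.
misread = utf8_bytes.decode("latin-1")


def decode_either(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to Latin-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


print(misread)                      # garbled form of the word
print(decode_either(latin1_bytes))  # recovered text
```

Trying UTF-8 first is a common heuristic because Latin-1 text containing accented letters is rarely valid UTF-8, but it is only a heuristic.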


   What if the demo does not work properly?
The demo page was designed to be as simple as possible. You submit an HTML form containing text and voice selection. We return either an HTTP header redirection to the completed audio file ("SPEAK" button) or a new page displaying the audio URL ("DOWNLOAD" button). There is some minimal JavaScript, but it does not affect website operations.

There are two categories of problems that sometimes arise, and it is helpful if they are reported separately, as follows:

  • If a word or name is mispronounced or garbled, use the "SEND FEEDBACK" button to let us know. This sort of feedback is very helpful for improving the dictionaries and finding errors in the voice databases. Please be sure that the text and voice being submitted actually created the problem so that we can replicate it. Any details you can provide are most appreciated, for example "My name should sound like ...", or "should rhyme with ...".

    Please, please, please check the spelling before you report a mispronunciation. Less typing for you, less time reading email for us. And no, our software does not include spelling correction. That can be done by an application before submitting text to the TTS engine, if appropriate. Spelling correction can sometimes make outrageous changes to names and acronyms, and so is unsafe without human supervision.

  • If you get no audio at all or only part of it, if it is clearly not speech, or if it stops and starts in pulses, then something is wrong with the speech creation, delivery, or playback. Please continue reading.

If something is wrong, there may be a problem with our web servers, with the internet itself or with the hardware or software on your computer. This checklist should help you determine where the problem is.

If you don't solve the problem directly, you will at least have a better idea what is wrong. If you believe that there is a problem with our website or TTS software, please pass along as much detail as you have (scenario, error messages, etc.) to tts-feedback.

Please note: if the problem is with your computer or browser, we cannot help you. We're not qualified and it's not our job.

  • Step 1.   Try again.
    Maybe there was a glitch in the network. Consistent problems are real problems.

  • Step 2.   Error message?
    Was there an actual error message? From the web? From the audio player? Please send the exact text of any error messages.

  • Step 3.   How far did you get?
    Did you see the website demo page? Select a speaker and type some text? Click the "SPEAK" or "DOWNLOAD" button? Did it appear to download any audio? Was there a download progress indicator? Was an audio player launched or attempted? If you get no audio download, there may be something wrong with our TTS server. If you download audio but cannot play it, the problem is likely on your side (but keep reading for hints).

    The website's basic operation is as simple as possible. You submit voice name and text. If all is well, our site will redirect your browser to an ordinary WAV file containing the speech. That file will be available for about 5 minutes, after which it will be deleted to recover disk space. You only need a standard browser and the ability to play a WAV file.

    Please note that one of the most common problems is caused by packages that block pop-up windows. These packages also frequently block indirect audio downloads. Test by clicking "DOWNLOAD" rather than "SPEAK". Click the resulting audio link to listen. This should work even if HTTP redirections to audio files are blocked. You'll have to reconfigure or disable your blocker to use the "SPEAK" button.

    Internet Explorer on Windows Vista may claim that the WAV file is corrupted. The following solution was found on the web. It worked for me, but your mileage may vary.

    • Go to Control Panel, double-click "Default programs"

    • Click "Associate a file type or protocol with a program" - takes a few seconds to load the list.

    • Double-click the file-type under the "name" column (such as ".mp3"). It is probably showing something already, like "Windows Media Player", but don't believe it.

    • Now you'll have a list of "Recommended Programs". Double click the one you want. I chose "Windows Media Player" even though it was already "supposedly" set to that. Again, it will show "loading" at the bottom of the screen for a few seconds.

    We have also seen other issues on Windows over the years, including stopping audio during the last word and playing older clips instead of the most recent.

  • Step 4.   Sound, but something is wrong?
    Maybe you hear sound, or even speech, but it sounds wrong somehow. There are several possibilities. First be sure that you can play audio files. Try a sample audio file or test your browser's audio capability on some multimedia test page.

    Did everything sound fine but then stop early? There is currently a 300-character limit on the website to reduce server load. The CGI script deletes the extra characters, even in the middle of a word, which can yield odd results for the final word. The limit may change without notice as server load varies. Note that the TTS software itself will synthesize text of any length.

    Did it play at the wrong speed? We currently deliver only 16 kHz audio. You may have to convert the sample rate if your audio card cannot handle this, but that would be rare these days.

    Did the speech sound choppy or stop too early? Some Microsoft audio players seem to have problems on the initial download. If the player can replay, e.g. with VCR-type buttons, it should sound OK on the second playing. This is a common complaint but we're not sure how to avoid this.


   What audio format is used?

The speech output audio format is a simple WAV (a.k.a. RIFF) file. The audio is 16 kHz, 16-bit linear, i.e. 16,000 samples per second, each sample a 16-bit integer, mono (not stereo). The website uses these wideband voices for best quality.

We also ship 8 kHz versions of the voices (8,000 samples per second, one 8-bit mu-law value per sample) for a 4-times reduction in voice database size. The 8 kHz voices are useful for telephony applications (where the phone line limits quality anyway) and for platforms with storage limitations. There is no option on the demo page for MP3 or similar encodings.
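If you want to confirm what format a WAV file uses, a short Python sketch with the standard-library wave module is shown here. The function names describe_wav and make_silence are ours, and the silent clip stands in for a real download. The last lines work through the 4-times size difference between the two formats.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Return the basic format fields of a WAV file held in memory."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "seconds": wav.getnframes() / wav.getframerate(),
        }


def make_silence(seconds: float = 1.0) -> bytes:
    """Build a silent clip in the wideband format described above."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit linear samples
        wav.setframerate(16000)  # 16,000 samples per second
        wav.writeframes(b"\x00\x00" * int(16000 * seconds))
    return buf.getvalue()


print(describe_wav(make_silence(0.5)))

# Data rate arithmetic behind the 4-times size reduction:
wideband_bytes_per_sec = 16000 * 2   # 16 kHz, 2 bytes per sample
telephony_bytes_per_sec = 8000 * 1   # 8 kHz, 1 mu-law byte per sample
ratio = wideband_bytes_per_sec / telephony_bytes_per_sec
```

Note that the wave module reads linear PCM only, so this sketch applies to the 16-bit files and not to mu-law encoded audio.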

If you need a different sample rate or audio format you can probably find free software to convert what we deliver. But before you use the audio for something more than private listening please check the website usage policy.


   Are there restrictions on the use of this site?

Yes, both the site itself and any downloaded audio files have restrictions. The goal of this site, after all, is for as many people as possible to become familiar with AT&T Natural Voices™ text to speech. However, the website is for demonstration purposes only, and is not a free service. More details can be found below, but here are quick highlights:

-- No broadcast, distribution or publication of audio clips without licensing. (#2)
-- Limited private use is allowed. See examples below. (#3)
-- All synthesis must be done from this web page, i.e. don't submit text from your website or apps. (#4)
-- Certain content may be rejected by the website, e.g. if it seems commercial. (#1)
-- Input is logged and treated as private customer data. (#5)
-- AT&T will cooperate with law enforcement, including providing relevant log entries. (#5)
-- Text length and number of accesses may be restricted. (#6)

1. AT&T Natural Voices™ is commercially available.
AT&T Natural Voices™ is commercially available as a cloud API and in desktop, server, and SDK editions. Please consider purchasing if these limitations affect you. The desktop voices are an inexpensive add-on to several PC packages which read documents, convert to MP3 files, etc.

We reserve the right to reject the submission if it seems intended for commercial use. If this happens, please choose text of a different type or topic to try out the voices. If this is a problem, please see the How to Buy section.

And, just to clarify, any restrictions on length, content, etc. are imposed by the website, not by the synthesis software.

2. No Broadcast, Publication, or Distribution of Audio.
This website is intended only for demonstration purposes. Audio clips created here may not be broadcast, published, or distributed. It doesn't matter if the use is non-commercial. Any such use requires licensing, so please contact Wizzard Speech LLC.

3. Limited Use of Audio Clips.
Audio files produced on our site are intended only for private, non-commercial use. This is not legal advice (hey, we're researchers) but here are some scenarios. Speaking events on your own computer is probably OK. Sharing the audio clips on the internet is not. A class project is probably OK. Bugging your friends is probably OK (with us).

Any use that involves wide distribution or long lifetime is probably not OK, whether or not it is commercial. Audio clips used in songs, videos or game levels cannot be made publicly available on the internet. Building or prototyping a software package using our audio rather than recording your own prompts is definitely not OK.

The common thread here is a very limited audience, and minimal impact on potential commercial sales. If you aren't sure, please ask. If you need the software or a license to use the audio, you can find a link to Wizzard Speech LLC in the How to Buy section of this FAQ.

4. Only Through This Demo Site.
Website resources are limited and are needed to support this site. Direct access to the CGI scripts is not permitted. You may refer people to this site but you may not have users enter text on your site and use our servers to provide audio, even if you give us credit. To run such a site you must install your own TTS server with licenses from Wizzard Speech LLC.

5. Input Text Will Be Logged.

Text will be logged. This information is handled just as any other private customer data and is kept confidential. However, AT&T will cooperate with law enforcement when there is legal authorization to do so. See the AT&T Privacy Policy.

6. Text Length and Number of Submissions Per Computer is Limited.
The number of submissions is limited. Access may be temporarily blocked if there are too many submissions from one computer. This, together with the length limit, allows more users to try the demo and shares the limited resources more fairly. Limits on submissions per day will be adjusted as needed to keep the website functioning normally.

Length of the input text is limited, typically to 300 characters. Anything longer is chopped off, and may result in partial words or single letters at the end of the speech. The length limit is a website feature and helps regulate server load. If you need to synthesize longer text, the product can handle input of any length. Limits on text length will be adjusted as needed to keep the website functioning normally.
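The effect of the length limit can be sketched in a few lines of Python. This is not the website's CGI script. hard_truncate only mimics the chopping described above, and truncate_at_word is a hypothetical alternative that someone preparing text might use to avoid a partial final word. The value 300 is the typical limit mentioned above and may differ in practice.

```python
LIMIT = 300  # typical limit according to the text above; it may change


def hard_truncate(text: str, limit: int = LIMIT) -> str:
    """Mimic the described behavior: cut at the limit, even mid-word."""
    return text[:limit]


def truncate_at_word(text: str, limit: int = LIMIT) -> str:
    """Hypothetical alternative: trim back to the last whole word."""
    if len(text) <= limit:
        return text
    # If the cut lands exactly between two words, nothing is split.
    if text[limit].isspace():
        return text[:limit].rstrip()
    head = text[:limit]
    cut = head.rfind(" ")
    # With no space to fall back on, a single long word is simply cut.
    return head[:cut].rstrip() if cut > 0 else head


sample = "hello wonderful world"
print(hard_truncate(sample, 9))     # ends in a partial word
print(truncate_at_word(sample, 9))  # ends on a whole word
```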


   Can the synthesis be modified?
It is possible to change the way the speech sounds by altering the input text. Liberal use of commas is the easiest way to get better phrasing, especially in long complex sentences. Overall speed can be controlled using XML-style tags from the SSML standard, e.g.
	  <prosody rate="slow"> this is speaking slowly </prosody>.
	  <prosody rate="fast"> this is speaking fast </prosody>.
	  <prosody rate="-50%"> this is 50% slower </prosody>.
Rate control does not change the pitch of the speech output. Precise pauses can also be inserted using the <break/> tag, e.g.
	  Break for 100 milliseconds <break time="100ms"/> Okay, keep going.
	  Break for 3 seconds <break time="3s"/> Okay, keep going.
Voices and languages can be intermixed using the <voice> tag, e.g.
	  <voice name="crystal">Crystal, 1 2 3.</voice>
	  <voice name="mike">Mike.
	      <voice name="rosa">Rosa, 1 2 3.</voice>
	  Back to Mike.</voice>
The Speech Synthesis Markup Language, or SSML, is defined by the W3C organization. Note that not all tags are supported. See the documentation for specific product releases for more details.
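When text is assembled by a program, it is easy to leave a tag unclosed or to forget that characters such as "&" must be escaped inside XML. The Python sketch below builds the three kinds of tags shown above and checks that a fragment is well formed. The helper names (prosody, pause, voice, is_well_formed) are ours, and the check covers only XML nesting, not which tags or attribute values a given TTS release supports.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def prosody(text: str, rate: str) -> str:
    """Wrap plain text in a <prosody> tag; rate is e.g. "slow" or "-50%"."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def pause(time: str) -> str:
    """Return a <break/> tag; time is e.g. "100ms" or "3s"."""
    return f"<break time={quoteattr(time)}/>"


def voice(name: str, inner: str) -> str:
    """Wrap already-escaped markup in a <voice> tag."""
    return f"<voice name={quoteattr(name)}>{inner}</voice>"


def is_well_formed(fragment: str) -> bool:
    """Check tag nesting by parsing the fragment inside a dummy root."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


nested = voice("mike", "Mike. " + voice("rosa", "Rosa, 1 2 3.") + " Back to Mike.")
print(nested)
print(is_well_formed(nested))
print(is_well_formed('<voice name="mike">Mike.'))  # closing tag is missing
```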


   How to get AT&T Natural Voices™?
NV is available for use as follows:
  • The AT&T Developer Program offers cloud services through its Speech API.

  • Wizzard Speech LLC handles licensing and support of released versions of AT&T Natural Voices™ for US English and US Spanish.
A more thorough description can be found here.

Please be advised that the version of AT&T Natural Voices used in the online Research demo may change without warning and may differ in quality and voice selection from commercially available product versions.


   Who supports AT&T Natural Voices™?
The AT&T Developer Program supports all issues related to the web API or the speech output.

Wizzard Speech LLC supports desktop and server SDK editions for US English and US Spanish. Please contact Wizzard Speech for all questions about sales, licensing, updates, supported platforms, etc.

If you have an installed version of AT&T Natural Voices™, your vendor is the first line of support. They know their own products and common problems involving installation and interaction with TTS, and so are most likely to be of immediate help.

Our TTS research group at AT&T Labs cannot handle direct customer support. As a small research group focused on the underlying science and technology, we just don't have sufficient resources for field support. In particular, we haven't seen half the applications that use NV and have no idea what their error messages mean. Problems may be escalated to us by support staff, in which case we can handle any serious issues just once, allowing us time to continue our research to improve TTS quality.


   Whose synthesizer is this?

Our Research group at AT&T Labs produced AT&T Natural Voices™. The website demo runs a recent Research version of the synthesizer (and you may note differences from the released product).

This TTS system was originally known as "Next-Generation TTS" or "Next-Gen" and some published technical papers refer to it by that name. The "Natural Voices" name came about when our system was introduced as a commercial product circa 2001.


   Who works on TTS?
AT&T has a very long history in speech synthesis, beginning in Bell Labs and continuing in the newly formed AT&T Labs following the Lucent spin-off in 1996. Initially at Murray Hill, we are now located in Florham Park, NJ but are moving to Bedminster, NJ in October 2013.

Though our research team is relatively small, we have leveraged available resources to create a ground-breaking TTS system in AT&T Natural Voices™. And although TTS has advanced considerably in the past few years there is still much room for improvement, and research continues.

You can learn more about our work on the Publications page.


   How can I contact the research group?
You can send questions, suggestions and problems to the research group at tts-feedback. We cannot promise to respond to every email -- there are just too many -- but we do our best.


   What is Text-To-Speech?
Text-To-Speech, or TTS for short, is computer software that converts text into audible speech. You can try it yourself on our demo page. See our Home page for more information.

TTS is separate from speech recognition. You can think of TTS as "talking" and speech recognition as "listening". There is some shared technology, but neither is just the reverse of the other. And the talking/listening analogy is limited too. Neither technology really involves much language understanding.

TTS is also distinct from language translation, though voice to voice translation would employ both speech recognition and TTS. Again, translation requires significant understanding of the meaning.

People new to the idea of TTS often underestimate the difficulty of the task. After all, humans can typically learn this stuff in early childhood. They talk, listen, understand, and even translate without much apparent effort. Humans do all this work without even being aware of it in most cases, but that doesn't make it easy.

If programmers could create software that really understands human language we could avoid most of the guesswork in TTS, but that hasn't happened yet. Until then, TTS is more like learning to read a foreign language aloud without ever understanding the words. With a good dictionary, grammar rules, etc. you can get better and better but will still make mistakes occasionally that are obvious to native speakers.


   How does TTS work?
TTS is often described as two conceptual stages. In the first stage, it decides how the text should be spoken, that is, how each word should be pronounced, what length and pitch each phoneme should have, etc. In the second stage, the system does its best to create audio that matches the specifications produced by stage one.

TTS software has little or no understanding of the text being read. It uses rules, lists, dictionaries, etc. to make very sophisticated guesses about how a piece of text should be read. While general performance can be quite good, some decisions are intrinsically hard to make without some level of understanding. For example, the word "bass" is pronounced differently in the phrases "bass drum" and "bass boat". Intonation depends in many cases on the writer's intention, which often cannot be inferred in short texts even by human readers. As a result, TTS systems will occasionally make mistakes and can be fooled by carefully constructed texts. These are challenging problems for all TTS systems, and we continue to improve ours as we are able.
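The kind of guessing involved can be shown with a toy Python rule for the word "bass". This is purely illustrative and is not how Natural Voices decides. The word lists and the pronunciation labels "BEYS" and "BAS" are made up for this sketch, and a real system relies on far richer rules and dictionaries.

```python
# Toy context lists, invented for this example.
MUSIC_WORDS = {"drum", "guitar", "clef", "player", "line"}
FISH_WORDS = {"boat", "fishing", "lake", "largemouth"}


def pronounce_bass(sentence: str) -> str:
    """Guess a pronunciation for "bass" from the surrounding words."""
    words = {w.strip(".,;:!?\"'").lower() for w in sentence.split()}
    if words & MUSIC_WORDS:
        return "BEYS"     # made-up label: rhymes with "face"
    if words & FISH_WORDS:
        return "BAS"      # made-up label: rhymes with "pass"
    return "UNKNOWN"      # no cue in the text, so any choice is a guess


print(pronounce_bass("He bought a bass drum."))
print(pronounce_bass("They took the bass boat out."))
print(pronounce_bass("I like bass."))
```

The last sentence shows the underlying difficulty: without some understanding of the writer's meaning, no rule can settle the question.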

The type of TTS we do is called a "concatenative" system, meaning that we record a human speaker to make a voice database. We re-use small chunks of the recordings to create new sentences containing words that were never recorded. Further, we do "unit selection" synthesis. This means that we use large voice databases and do clever searches on-the-fly to find chunks in the voice database that best match the requested sentences.
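For readers who like to see an idea in code, here is a toy Python sketch of a unit selection search. It is a generic textbook-style illustration, not the Natural Voices implementation. The Unit fields, the cost functions, and the penalty of 10 for a non-adjacent join are all assumptions chosen to keep the example small. Each requested sound has several recorded candidates, and a dynamic programming (Viterbi) search picks the sequence with the lowest total of "how well does this chunk match the request" and "how smoothly does it join the previous chunk".

```python
from typing import NamedTuple


class Unit(NamedTuple):
    name: str     # phoneme-like label, e.g. "k"
    index: int    # position of the chunk in the original recording
    pitch: float  # average pitch in Hz


def target_cost(unit: Unit, want_pitch: float) -> float:
    """How far the recorded chunk is from the requested pitch."""
    return abs(unit.pitch - want_pitch)


def join_cost(prev: Unit, cur: Unit) -> float:
    """Chunks that were adjacent in the recording join seamlessly."""
    if cur.index == prev.index + 1:
        return 0.0
    return 10.0 + abs(prev.pitch - cur.pitch)


def select_units(database: list[Unit], targets: list[tuple[str, float]]) -> list[Unit]:
    """Pick one recorded unit per target, minimizing the total cost."""
    best: list[tuple[float, list[Unit]]] = []  # (cost so far, path) per candidate
    for step, (name, want_pitch) in enumerate(targets):
        candidates = [u for u in database if u.name == name]
        if not candidates:
            raise ValueError(f"no recorded unit for {name!r}")
        layer = []
        for cand in candidates:
            cost = target_cost(cand, want_pitch)
            if step == 0:
                layer.append((cost, [cand]))
            else:
                # Cheapest way to reach this candidate from the previous layer.
                prev_cost, prev_path = min(
                    ((c + join_cost(p[-1], cand), p) for c, p in best),
                    key=lambda item: item[0],
                )
                layer.append((prev_cost + cost, prev_path + [cand]))
        best = layer
    return min(best, key=lambda item: item[0])[1]


database = [
    Unit("k", 0, 120), Unit("a", 1, 118), Unit("t", 2, 110),
    Unit("b", 3, 130), Unit("a", 4, 100), Unit("t", 5, 101),
]
chosen = select_units(database, [("b", 130), ("a", 100), ("t", 110)])
print([u.index for u in chosen])
```

In this example the last target asks for a pitch of 110, which the "t" at index 2 matches exactly, yet the search prefers the "t" at index 5 because it follows the chosen "a" directly in the recording and so joins without a seam.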


   Who uses Text-To-Speech?
TTS is used in a wide variety of services and applications. Commercially, help desks and voice response systems are probably the most important. On nights and weekends, when people are scarce, customers can still get some basic information from computers, for example, an account balance. The computer listens either to speech or touch tones and responds with TTS. These apps are typically over the telephone but might also be at kiosks or automated tellers.

TTS is also used on personal devices, e.g. on a PC to proof-read a document or to learn a new language. This category also includes assistive technologies such as screen-readers for the visually impaired and as a substitute voice for those who cannot speak.

As the quality of synthetic voices continues to improve, barriers to new applications drop. Some applications, to guarantee high quality, record all the things that need to be said. This can be expensive, impractical, or impossible depending on the task. TTS is often a better option if the voices are good enough.


   Can I use my own voice?
Not in AT&T Natural Voices™.

The reasons for wanting customized voices are varied. Some people just think it would be cool. Some are losing their voices due to a medical condition or upcoming surgery and would like to have their own synthetic voice rather than a generic one. Some people have audio tapes of a late loved one. (See the reference to ModelTalker in the section on Assistive Technologies below that may be useful for people soon to lose their voices.)

Creating high-quality voices requires a good voice talent, a sound-proof room, professional audio equipment, hours of written material with thorough coverage of phoneme combinations in the language, and the time and expertise to turn those recordings into a decent synthetic voice. Because of the expense involved, custom voice builds are usually done for corporations that want to computerize an existing actor's voice, for example to continue a brand image.

Since even professional actors reading well-chosen material don't always synthesize well, another possibility is to get the highest quality recordings possible, and as many of them as possible. Keep the recordings in a safe place until the technology improves for transforming one voice to sound like another. It may take far less material to build a transformation model than it does to build a TTS voice from scratch. Eventually it may be possible to take a good TTS voice that is roughly similar (e.g. mid-pitch-range male, same accent) and morph it to sound like the desired person.


   What about assistive technologies?
TTS systems have a long history in assistive technologies. First, there are two basic classes of TTS software:
  • General-purpose TTS software on a desktop or laptop PC, and

  • Specialized packages for various disabilities.

Small, fast general-purpose TTS systems are available, and many come with simple applications which allow text typed on the screen to be read. There are multiple systems on the market (ours included), varying widely in voice quality, hardware requirements, and price.

Specialized packages vary greatly in price and capability, and are sometimes customized for particular users. These products are needed when general-purpose hardware and software do not suit the user's needs.

Disability-related systems can be grouped into two basic types:

  • User is the listener.

  • User types, others listen.

Screen readers for the sight-impaired are an example of the first case. The user listens to the voice long enough to adapt and reach high intelligibility even for low quality voices. A convenient user interface with good responsiveness is perhaps more important than voice quality. The ability to vary the speaking rate, and particularly to choose extremely fast speaking rates, is crucial.

Stephen Hawking is the classic example of the second case, typing on a PC so that other people can listen. Here, the listeners come and go frequently and so most have less opportunity to adapt to low-quality voices. Fast speaking rates are not needed. If keyboard input is difficult or impossible, touch screens or physical buttons are customized to represent words or phrases. Buttons may be selected in a variety of ways, depending on the user's abilities.

These systems range from a budget PC with inexpensive TTS software to highly customized button boards controlled with exotic control hardware.

Here are some links that may help. I'm sure that many other relevant sites can be found by searching on the web.

  • A "beta" (free preview version) of a program called ModelTalker is available now (early 2008). This package allows a person (e.g. someone soon to lose their voice) to record themselves and create a synthetic voice that sounds more or less similar. The examples from the website sound pretty good and there are some good researchers behind it. We don't know what the commercial version will cost or when that will come out, but this is the first such program that we're aware of. It may also be available with pre-existing voices.

  • The misc.handicap newsgroup sometimes has discussions of commercial TTS products. This is especially useful because you can ask questions of people who face similar challenges.

  • Info on many specialized solutions.

  • Links to many TTS systems, some commercial, some research.


   How can I learn more about TTS?
A good starting point is "An Introduction to Text-to-Speech Synthesis" by Thierry Dutoit, published by Kluwer Academic Publishers. For hands-on experience, take a look at free TTS software called Festival from the University of Edinburgh in Scotland. Downloads are available, and it includes instructions for creating new voices and languages. Many universities around the world use this software as a framework for their own research. There are also some technical papers on our website that might serve as entry points into the research literature, for example by following the papers they reference.