Speech synthesis '''Speech synthesis''' is the artificial production of human [[Speech communication|speech]]. A computer system used for this purpose is called a '''speech synthesizer''', and can be implemented in [[software]] or [[Computer hardware|hardware]]. A '''text-to-speech (TTS)''' system converts normal language text into speech; other systems render [[symbolic linguistic representation]]s like [[phonetic transcription]]s into speech.Jonathan Allen, M. Sharon Hunnicutt, Dennis Klatt, ''From Text to Speech: The MITalk system''. Cambridge University Press: 1987. ISBN 0521306418 Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a [[database]]. Systems differ in the size of the stored speech units; a system that stores [[phone]]s or [[diphone]]s provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the [[vocal tract]] and other human voice characteristics to create a completely "synthetic" voice output.Rubin, P., Baer, T., & Mermelstein, P. (1981). An articulatory synthesizer for perceptual research. ''Journal of the Acoustical Society of America'', 70, 321-328. The quality of a speech synthesizer is judged by its similarity to the human voice, and by its ability to be understood. An intelligible text-to-speech program allows people with [[visual impairment]]s or [[reading disability|reading disabilities]] to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s. == Overview of text processing == [[Image:Festival TTS Telugu.jpg|Overview of a typical TTS system]] {{sample box start variation 2|Audio sample:}} {{multi-listen start|Audio sample of:}} {{multi-listen item|filename=MS Sam.ogg|title=Sample of Microsoft Sam|description=[[Windows XP]]’s default speech synthesizer voice saying, “The quick brown fox jumps over the lazy dog 1,234,567,890 times. Soif.”|format=[[Ogg]]}} {{multi-listen end}} {{sample box end}} A text-to-speech system (or "engine") is composed of two parts: a [[front-end]] and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called ''text normalization'', ''pre-processing'', or ''[[tokenization]]''. The front-end then assigns [[phonetic transcription]]s to each word, and divides and marks the text into [[prosody (linguistics)|prosodic units]], like [[phrase]]s, [[clause]]s, and [[sentence (linguistics)|sentence]]s. The process of assigning phonetic transcriptions to words is called ''text-to-phoneme'' or ''[[grapheme]]-to-phoneme'' conversion.P. H. Van Santen, Richard William Sproat, Joseph P. Olive, and Julia Hirschberg, ''Progress in Speech Synthesis''. Springer: 1997. ISBN 0387947019 Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the ''synthesizer''—then converts the symbolic linguistic representation into sound. == History == Long before [[electronics|electronic]] [[signal processing]] was invented, there were those who tried to build machines to create human speech. Some early legends of the existence of [[Brazen Head|"speaking heads"]] involved [[Pope Silvester II|Gerbert of Aurillac]] (d. 1003 AD), [[Albertus Magnus]] (1198–1280), and [[Roger Bacon]] (1214–1294). In 1779, the [[Denmark|Danish]] scientist Christian Kratzenstein, working at the [[Russian Academy of Sciences]], built models of the human [[vocal tract]] that could produce the five long [[vowel]] sounds (in [[help:IPA|International Phonetic Alphabet]] notation, they are {{IPA|[aː]}}, {{IPA|[eː]}}, {{IPA|[iː]}}, {{IPA|[oː]}} and {{IPA|[uː]}}).[http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/chap2.html History and Development of Speech Synthesis], Helsinki University of Technology, Retrieved on [[November 4]], [[2006]] This was followed by the [[bellows]]-operated "acoustic-mechanical speech machine" by [[Wolfgang von Kempelen]] of [[Vienna]], [[Austria]], described in a 1791 paper.''Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine'' ("Mechanism of the human speech with description of its speaking machine," J. B. Degen, Wien). This machine added models of the tongue and lips, enabling it to produce [[consonant]]s as well as vowels. In 1837, [[Charles Wheatstone]] produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget.Mattingly, Ignatius G. Speech synthesis for phonetic and phonological models. In Thomas A. Sebeok (Ed.), ''Current Trends in Linguistics, Volume 12, Mouton'', The Hague, pp. 2451-2487, 1974. In the 1930s, [[Bell Labs]] developed the [[Vocoder|VOCODER]], a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. [[Homer Dudley]] refined this device into the VODER, which he exhibited at the [[1939 New York World's Fair]]. The [[Pattern playback]] was built by [[Franklin S. Cooper|Dr. Franklin S. Cooper]] and his colleagues at [[Haskins Laboratories]] in the late 1940s and completed in 1950. There were several different versions of this hardware device but only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, [[Alvin Liberman]] and colleagues were able to discover acoustic cues for the perception of [[phonetic]] segments (consonants and vowels). Early electronic speech synthesizers sounded robotic and were often barely intelligible. However, the quality of synthesized speech has steadily improved, and output from contemporary speech synthesis systems is sometimes indistinguishable from actual human speech. === Electronic devices === The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist [[John Larry Kelly, Jr]] and colleague Louis Gerstmanhttp://query.nytimes.com/search/query?ppds=per&v1=GERSTMAN%2C%20LOUIS&sort=newest NY Times obituary for Louis Gerstman. used an [[IBM 704]] computer to synthesize speech, an event among the most prominent in the history of [[Bell Labs]]. Kelly's voice recorder synthesizer (vocoder) recreated the song "[[Daisy Bell]]", with musical accompaniment from [[Max Mathews]]. Coincidentally, [[Arthur C. Clarke]] was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel ''[[2001: A Space Odyssey (novel)|2001: A Space Odyssey]]'',[http://www.lsi.usp.br/~rbianchi/clarke/ACC.Biography.html Arthur C. Clarke online Biography] where the [[HAL 9000]] computer sings the same song as it is being put to sleep by astronaut [[Dave Bowman]].[http://www.bell-labs.com/news/1997/march/5/2.html Bell Labs: Where "HAL" First Spoke (Bell Labs Speech Synthesis website)] Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.[http://www.takanishi.mech.waseda.ac.jp/research/voice/ Anthropomorphic Talking Robot Waseda-Talker Series] == Synthesizer technologies == The most important qualities of a speech synthesis system are ''naturalness'' and ''[[Intelligibility]]''. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics. The two primary technologies for generating synthetic speech waveforms are ''concatenative synthesis'' and ''[[formant]] synthesis''. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used. === Concatenative synthesis === Concatenative synthesis is based on the [[concatenation]] (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.
==== Unit selection synthesis ==== Unit selection synthesis uses large [[database]]s of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual [[phone]]s, [[diphone]]s, half-phones, [[syllable]]s, [[morpheme]]s, [[word]]s, [[phrase]]s, and [[Sentence (linguistics)|sentence]]s. Typically, the division into segments is done using a specially modified [[speech recognition|speech recognizer]] set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the [[waveform]] and [[spectrogram]].Alan W. Black, Perfect synthesis for all of the people all of the time. IEEE TTS Workshop 2002. (http://www.cs.cmu.edu/~awb/papers/IEEE2002/allthetime/allthetime.html) An [[index (database)|index]] of the units in the speech database is then created based on the segmentation and acoustic parameters like the [[fundamental frequency]] ([[pitch (music)|pitch]]), duration, position in the syllable, and neighboring phones. At [[runtime]], the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted [[decision tree]]. Unit selection provides the greatest naturalness, because it applies only a small amount of [[digital signal processing]] (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically require unit-selection speech databases to be very large, in some systems ranging into the [[gigabyte]]s of recorded data, representing dozens of hours of speech.John Kominek and Alan W. Black. (2003). CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177. Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.Julia Zhang. Language Generation and Speech Synthesis in Dialogues for Language Learning, masters thesis, http://groups.csail.mit.edu/sls/publications/2004/zhang_thesis.pdf Section 5.6 on page 54.
==== Diphone synthesis ==== Diphone synthesis uses a minimal speech database containing all the [[diphone]]s (sound-to-sound transitions) occurring in a language. The number of diphones depends on the [[phonotactics]] of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target [[prosody]] of a sentence is superimposed on these minimal units by means of [[digital signal processing]] techniques such as [[linear predictive coding]], [[PSOLA]][http://www.fon.hum.uva.nl/praat/manual/PSOLA.html PSOLA Synthesis] or [[MBROLA]].T. Dutoit, V. Pagel, N. Pierret, F. Bataiile, O. van der Vrecken. The MBROLA Project: Towards a set of high quality speech synthesizers of use for non commercial purposes. ''ICSLP Proceedings'', 1996. The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
==== Domain-specific synthesis ==== Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports.L.F. Lamel, J.L. Gauvain, B. Prouts, C. Bouhier, R. Boesch. Generation and Synthesis of Broadcast Messages, ''Proceedings ESCA-NATO Workshop and Applications of Speech Technology'', September 1993. The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.{{Fact|date=February 2007}} Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language however can still cause problems unless the many variations are taken into account. For example, in [[Rhotic and non-rhotic accents|non-rhotic]] dialects of English the=== Formant synthesis === [[Formant]] synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model. Parameters such as [[fundamental frequency]], [[phonation|voicing]], and [[noise]] levels are varied over time to create a [[waveform]] of artificial speech. This method is sometimes called ''rules-based synthesis''; however, many concatenative systems also have rules-based components. Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a [[screen reader]]. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in [[embedded system]]s, where [[data storage device|memory]] and [[microprocessor]] power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and [[Intonation (linguistics)|intonation]]s can be output, conveying not just questions and statements, but a variety of emotions and tones of voice. Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the [[Texas Instruments]] toy [[Speak & Spell (game)|Speak & Spell]], and in the early 1980s [[Sega]] [[Video arcade|arcade]] machines.Examples include [[Astro Blaster]], [[Space Fury]], and [[Star Trek (arcade game)|Star Trek: Strategic Operations Simulator]]. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.John Holmes and Wendy Holmes. ''Speech Synthesis and Recognition, 2nd Edition''. CRC: 2001. ISBN 0748408568. === Articulatory synthesis === [[Articulatory synthesis]] refers to computational techniques for synthesizing speech based on models of the human [[vocal tract]] and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at [[Haskins Laboratories]] in the mid-1970s by [[Philip Rubin]], Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at [[Bell Laboratories]] in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues. Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the [[NeXT]]-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the [[University of Calgary]], where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by [[Steve Jobs]] in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the [[GNU General Public License]], with work continuing as ''gnuspeech''. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model". === HMM-based synthesis === HMM-based synthesis is a synthesis method based on [[hidden Markov model]]s. In this system, the [[frequency spectrum]] ([[vocal tract]]), [[fundamental frequency]] (vocal source), and duration ([[prosody]]) of speech are modeled simultaneously by HMMs. Speech [[waveforms]] are generated from HMMs themselves based on the [[maximum likelihood]] criterion.The HMM-based Speech Synthesis System, http://hts.sp.nitech.ac.jp/ === Sinewave synthesis === [[Sinewave synthesis]] is a technique for synthesizing speech by replacing the [[formants]] (main bands of energy) with pure tone whistles.Remez, R.E., Rubin, P.E., Pisoni, D.B., & Carrell, T.D. Speech perception without traditional speech cues. ''Science'', 1981, 212, 947-950. == Challenges == === Text normalization challenges === The process of normalizing text is rarely straightforward. Texts are full of [[Heteronym (linguistics)|heteronym]]s, [[number]]s, and [[abbreviation]]s that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project". Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various [[heuristic]] techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence. Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words, like "1325" becoming "one thousand three hundred twenty-five." However, numbers occur in many different contexts; when a year or part of an address, "1325" should likely be read as "thirteen twenty-five", or, when part of a [[social security number]], as "one three two five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.{{Fact|date=February 2007}} Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs. === Text-to-phoneme challenges === Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion ([[phoneme]] is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or [[synthetic phonics]], approach to learning reading. Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary.{{Fact|date=February 2007}} As dictionary size grows, so too does the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches. Some languages, like [[Spanish language|Spanish]], have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful.{{Fact|date=February 2007}} Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like [[English language|English]], which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that aren't in their dictionaries. === Evaluation challenges === It is very difficult to evaluate speech synthesis systems consistently because there is no subjective criterion and usually different organizations use different speech data. The quality of a speech synthesis system highly depends on the quality of recording. Therefore, evaluating speech synthesis systems is almost the same as evaluating the recording skills. Recently researchers start evaluating speech synthesis systems using the common speech dataset.Blizzard Challenge http://festvox.org/blizzard This may help people to compare the difference between technologies rather than recordings. === Prosodics and emotional content === A recent study reported in the journal "'''Speech Communication'''" by Amy Drahota and colleagues at the [[University of Portsmouth]], [[UK]], reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling.[http://www.port.ac.uk/aboutus/newsandevents/news/title,74220,en.html The Sound of Smiling] It was suggested that identification of the vocal features which signal emotional content may be used to help make synthesized speech sound more natural. == Dedicated hardware == *Votrax **SC-01A (analog formant) **SC-02 / SSI-263 / "Arctic 263" *General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000) *Magnevation SpeakJet (www.speechchips.com TTS256) *Savage Innovations SoundGin *National Semiconductor DT1050 Digitalker (Mozer) *Silicon Systems SSI 263 (analog formant) *Texas Instruments **TMS5110A (LPC) **TMS5200 *Oki Semiconductor **MSM5205 **MSM5218RS (ADPCM) *Toshiba T6721A *Philips PCF8200 == Computer operating systems or outlets with speech synthesis == === Apple === The first speech system integrated into an [[operating system]] was [[Apple Computer]]'s [[PlainTalk#The original MacInTalk|MacInTalk]] in 1984. Since the 1980s Macintosh Computers offered text to speech capabilities through The MacinTalk software. In the early 1990s Apple expanded its capabilities offering system wide text-to-speech support. With the introduction of faster PowerPC based computers they included higher quality voice sampling. Apple also introduced [[speech recognition]] into its systems which provided a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of Apple [[Macintosh (computer)|Macintosh]] has evolved into a cutting edge fully-supported program, [[PlainTalk]], for people with vision problems. [[VoiceOver]] was included in Mac OS Tiger and more recently Mac OS Leopard. The voice shipping with Mac OS X 10.5 ("Leopard") is called "Alex" and features the taking of realistic-sounding breaths between sentences, as well as improved clarity at high read rates. === AmigaOS === The second operating system with advanced speech synthesis capabilities was [[AmigaOS]], introduced in 1985. The voice synthesis was licensed by [[Commodore International]] from a third-party software house (Don't Ask Software, now Softvoice, Inc.) and it featured a complete system of voice emulation, with both male and female voices and "stress" indicator markers, made possible by advanced features of the [[Amiga]] hardware audio [[chipset]].[[Jay Miner|Miner, Jay]] et al (1991). ''Amiga Hardware Reference Manual: Third Edition''. [[Addison-Wesley]] Publishing Company, Inc. ISBN 0-201-56776-8. It was divided into a narrator device and a translator library. Amiga [[AmigaOS#Speech synthesis|Speak Handler]] featured a text-to-speech translator. AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console output to it. Some Amiga programs, such as word processors, made extensive use of the speech system. === Microsoft Windows === Modern [[Microsoft Windows|Windows]] systems use [[Speech Application Programming Interface#SAPI 1-4 API family|SAPI4]]- and [[Speech Application Programming Interface#SAPI 5 API family|SAPI5]]-based speech systems that include a [[speech recognition]] engine (SRE). SAPI 4.0 was available on Microsoft-based operating systems as a third-party add-on for systems like [[Windows 95]] and [[Windows 98]]. [[Windows 2000]] added a speech synthesis program called [[Microsoft Narrator|Narrator]], directly available to users. All Windows-compatible programs could make use of speech synthesis features, available through menus once installed on the system. [[Microsoft Speech Server]] is a complete package for voice synthesis and recognition, for commercial applications such as [[call centers]]. === Internet === Currently, there are a number of [[Application software|applications]], [[plugin]]s and [[gadget]]s that can read messages directly from an [[e-mail client]] and web pages from a [[web browser]]. Some specialized [[Computer software|software]] can narrate [[RSS|RSS-feeds]]. On one hand, online RSS-narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to [[podcast]]s. On the other hand, on-line RSS-readers are available on almost any [[Personal computer|PC]] connected to the Internet. Users can download generated audio files to portable devices, e.g. with a help of [[podcast]] receiver, and listen to them while walking, jogging or commuting to work. A growing field in internet based TTS technology is web-based assistive technology, e.g. Talklets. {{Fact|date=March 2008}} This web based approach to a traditionally locally installed form of software application can afford many of those requiring software for accessibility reason, the ability to access web content from public machines, or those belonging to others. While responsiveness is not as immediate as that of applications installed locally, the 'access anywhere' nature of it is the key benefit to this approach. === Others === * Some models of Texas Instruments home computers produced in 1979 and 1981 ([[TI-99/4A|Texas Instruments TI-99/4 and TI-99/4A]]) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral. TI used a proprietary [[codec]] to embed complete spoken phrases into applications, primarily video games.[http://www.mindspring.com/~ssshp/ssshp_cd/ss_home.htm Smithsonian Speech Synthesis History Project (SSSHP) 1986-2002] * Systems that operate on free and open source software systems including [[Linux|GNU/Linux]] are various, and include [[open-source]] programs such as the [[Festival Speech Synthesis System]] which uses diphone-based synthesis (and can use a limited number of [[MBROLA]] voices), and gnuspeech which uses articulatory synthesis[http://www.gnu.org/software/gnuspeech/ gnuspeech] from the [[Free Software Foundation]]. Other commercial vendor software also runs on GNU/Linux. * Several commercial companies are also developing speech synthesis systems (this list is reporting them just for the sake of information, not endorsing any specific product): [http://www.acapela-group.com Acapela Group], [[AT&T]], [[Cepstral]], [[DECtalk]], [[IBM ViaVoice]], [[IVONA|IVONA TTS]], [http://www.loquendo.com Loquendo TTS], [http://www.neospeech.com NeoSpeech TTS], [[Nuance Communications]], Rhetorical Systems, [http://www.svox.com SVOX] and [http://www.yakitome.com YAKiToMe!]. * Companies which developed speech synthesis systems but which are no longer in this business include BeST Speech (bought by L&H), [[Lernout & Hauspie]] (bankrupt), [[SpeechWorks]] (bought by Nuance) == Speech synthesis markup languages == A number of [[markup language]]s have been established for the rendition of text as speech in an [[XML]]-compliant format. The most recent is [[Speech Synthesis Markup Language]] (SSML), which became a [[W3C recommendation]] in 2004. Older speech synthesis markup languages include Java Speech Markup Language ([[JSML]]) and [[SABLE]]. Although each of these was proposed as a standard, none of them has been widely adopted. Speech synthesis markup languages are distinguished from dialogue markup languages. [[VoiceXML]], for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup. ==Applications== ===Accessibility=== Speech synthesis has long been a vital [[assistive technology]] tool and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest application has been in the use of [[screenreaders]] for people with [[visual impairment]], but text-to-speech systems are now commonly used by people with [[dyslexia]] and other reading difficulties as well as by pre-literate youngsters. They are also frequently employed to aid those with severe [[speech impairment]] usually through a dedicated [[voice output communication aid]]. ===News service=== Sites such as [[Ananova]] have used speech synthesis to convert written news to audio content, which can be used for mobile applications. ===Entertainment=== Speech synthesis techniques are used as well in the entertainment productions such as games, anime and similar. In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries, able to generate narration and lines of dialogue according to user specifications.[http://animenewsnetwork.com/news/2007-05-02/speech-synthesis-software Speech Synthesis Software for Anime Announced] Software such as [[Vocaloid]] can generate singing voices via lyrics and melody. This is also the aim of the Singing Computer project (which uses the [[GNU General Public License|GPL]] software [[GNU LilyPond|Lilypond]] and [[Festival Speech Synthesis System|Festival]]) to help blind people check their lyric input.[http://www.freebsoft.org/singing-computer Free(b)soft - Singing Computer] == References == [http://www.thws.cn/articles/tag/text_to_speech text to speech and RSS to Podcast ]reviewed by [http://twitter.com/thw thw] {{reflist}} *[http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html Dennis Klatt's History of Speech Synthesis] == Specific programs ==in words like {{IPA|/ˈkliːə/}} is usually only pronounced when the following word has a vowel as its first letter (e.g. is realized as {{IPA|/ˌkliːəɹˈɑʊt/}}). Likewise in [[French language|French]], many final consonants become no longer silent if followed by a word that begins with a vowel, an effect called [[Liaison (French)|liaison]]. This [[alternation (linguistics)|alternation]] cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be [[context-sensitive]].