Reviews of Popular Speech Synthesizers
By Veli-Pekka Tätilä
Alasdair - broken links and unavailable files have been removed from this page and are lost forever.
One day it struck me that I should make a Web page containing samples of software speech synthesizers that can be used with screen reading programs, along with real-world comparisons and my opinions of their sound quality. And, as a result, here's that page. It should give you some clues as to what advantages and disadvantages a certain speech synth has, and it also contains samples of these synths in mp3 format (64 kbps, 44.1 kHz, mono). No background knowledge of speech synths is assumed. Though synths for various platforms are included, I've restricted the selection to commonly available, widely known synths or those that are interesting because of their intelligibility or historical significance.
Although this may initially seem boring, I used exactly the same text for all synths when creating the samples, because identical source material makes it easier to compare the different synths. I also figured the text should have some shouting, questions and ordinary sentences to demonstrate how speech synths react to them (changes in intonation in questions, among other things). So I chose a dialog from The Hitchhiker's Guide to the Galaxy by Douglas Adams. It is a snippet of the passage where Arthur and Ford look up Earth in the Hitchhiker's Guide to the Galaxy (for those of you who don't know, a kind of galactic encyclopedia with lots of "attitude"). I did remove the quote characters because many speech synths seem to have pronunciation problems with quoted text, particularly in Supernova's document read. Other than that, however, the text is untouched.
Here's the source text in case some of you want to contribute by sending me a sample of how it sounds on a particular speech synth. This should also help in identifying where the questions and exclamation marks are in the text:
What? Harmless? Is that all it's got to say? Harmless! One word! Ford shrugged. Well, there are a hundred billion stars in the Galaxy, and only a limited amount of space in the book's microprocessors, he said, and no one knew much about the Earth of course. Well for God's sake I hope you managed to rectify that a bit. Oh yes, well I managed to transmit a new entry off to the editor. He had to trim it a bit, but it's still an improvement. And what does it say now? asked Arthur. Mostly harmless, admitted Ford with a slightly embarrassed cough.
Different Speech Users
There are basically two types of speech users out there, as far as I know. Casual users listen to synthesized speech now and then for a variety of reasons: because it's cool, or perhaps to increase productivity. For most casual users (not to mention those completely new to synthesized speech) it's important that the voice sounds as human as possible and that the speech is intelligible at normal speaking rates.
The other type are "serious" speech users, who are, for some reason, using synthesized speech on a daily basis. I myself fall into the latter category and have been using speech daily for eight years or so, because I'm using a screen reader (a program that tells you what's happening on the screen by speech and/or braille). For serious users the most important qualities in a speech synth are intelligible speech at very high rates (up to 300 words per minute or more), good pronunciation and customizability; the quality of the voice usually does not matter that much. Personally I tend to keep the speech rate so high that you don't really have time to concentrate on intonation, apart from noticing whether a sentence contains special punctuation or not.
In this document, I use the term interface, or API for short, to mean the software conventions that allow speech synthesizers and applications to communicate with each other, e.g. Microsoft Speech API (SAPI). The software at the receiving end is the speech synthesizer (say RealSpeak), and the apps that use it are simply called applications (e.g. Text Aloud) or sometimes clients. Inside the synthesizer is the underlying speech technology, which I refer to as the engine. For example, both ViaVoice and Eloquence use the same engine, although the speech synthesizers have different names, parameters and configuration dialogs. Lastly, a speech synth has multiple voices. Some synths use the term voice to mean the distinctly different, sampled or synthesis-based tones the synth can generate. Others additionally allow the user to customize secondary voice parameters, e.g. the pitch of the voice Amy, and call user-defined snapshots of such settings voices, too. Confusingly, a synth may or may not show the user-defined voices in the voice list it exposes via the speech API.
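To make this terminology concrete, here is a small illustrative sketch in Python. The class names, voice names and parameters are all made up for the example; no real speech API looks exactly like this.

```python
# Illustrative only: hypothetical classes modelling the terms used above
# (engine, synthesizer, voice). Not based on any real speech API.

class Engine:
    """The underlying speech technology, possibly shared by several synths."""
    def __init__(self, name):
        self.name = name

class Voice:
    """A tone the synth can generate, or a user-defined snapshot of settings."""
    def __init__(self, name, pitch=50, user_defined=False):
        self.name = name
        self.pitch = pitch
        self.user_defined = user_defined

class Synthesizer:
    """What an application talks to through the speech API."""
    def __init__(self, name, engine, voices):
        self.name = name
        self.engine = engine      # e.g. ViaVoice and Eloquence share one engine
        self.voices = voices

    def list_voices(self, include_user_defined=True):
        # Some synths expose user-defined voices via the API, some hide them.
        return [v.name for v in self.voices
                if include_user_defined or not v.user_defined]

engine = Engine("SharedEngine")
synth = Synthesizer("ExampleSynth", engine,
                    [Voice("Amy"), Voice("MyLowAmy", pitch=30, user_defined=True)])
print(synth.list_voices(include_user_defined=False))  # ['Amy']
```

The point of the sketch is just the layering: an application never talks to the engine directly, only to a synthesizer through the API, and the user-defined voice may or may not appear in the list the API exposes.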
Samples at different speeds
To make the speech synth samples enjoyable to everyone, I have included samples of most speech synths at both normal and high speaking rates. I have noticed, however, that once you get used to high rate speech, the intonation at normal speaking rates often sounds stupid and irritating. So I increased the speed somewhat to compensate for this, even for the normal samples. On reflection, I've noticed this adjustment is clearly overkill in some instances, as what's slow for me is quite fast for the average speech user; sorry for the inconvenience.
Types of Speech Synthesis
Here's a really brief introduction to the basic speech synthesis methods (nothing technical; frankly speaking, I don't know very much about speech synthesis myself).
For most casual speech users, sample based speech synths are the way to go. They contain samples of real human voices which are processed and put together to create the impression of a human voice. Sampled voices sound pretty good and are pleasant to listen to. However, they take quite a lot of space, are not really customizable (pitch, volume and speed are about it as far as voice parameters go) and are not very good at really high speaking rates. The technical term for this type of synthesis is concatenation of diphones.
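The "put together" part can be sketched in a few lines of Python. This is a toy illustration only: real diphone synthesizers store recorded transitions between phoneme pairs, whereas here the "units" are just made-up lists of samples joined with a short crossfade so the seams are less audible.

```python
# Toy illustration of concatenative synthesis: stitch prerecorded units
# together, linearly crossfading a few samples at each seam.

def crossfade_concat(units, overlap=4):
    """Join sample lists, crossfading `overlap` samples at each seam."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)          # fade-in weight for the new unit
            out[-overlap + i] = out[-overlap + i] * (1 - w) + unit[i] * w
        out.extend(unit[overlap:])
    return out

# Two fake "diphone" units; real ones would be waveform snippets of speech.
a = [0.0] * 8
b = [1.0] * 8
joined = crossfade_concat([a, b])
print(len(joined))  # 8 + 8 - 4 = 12 samples after one 4-sample overlap
```

The crossfade is also why sampled synths degrade at high rates: speeding up the playback squeezes the recorded units and their transitions in ways they were never recorded for.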
The alternative approach to speech synthesis is to create all sound from scratch, without any samples of real human voices. These fully synthetic speech synths are preferred by most serious speech users. As the voice is fully synthetic, it's also customizable, meaning you can make your own voices by adjusting parameters like head size, breathiness and roughness. Fully synthetic speech is generally more responsive (less delay before it speaks) and scales better to high rates. It is also easier to make multilingual synths if you generate all the sound on the fly. The technical term for this type of synthesis is formant based or rule based. Mixing samples and formants is also possible though rare, as is physically modelling the organs that actually produce speech; the latter tends to be computationally intensive.
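The "all sound from scratch" idea can likewise be shown in miniature. Real formant synthesizers filter a glottal source signal through resonances; this sketch merely sums sine partials near assumed formant frequencies (the values are ballpark figures for an "ah"-like vowel, not taken from any synth reviewed here).

```python
# Toy formant-style sketch: the waveform is computed by rule as a sum of
# sine partials near assumed formant frequencies, with no recorded samples.
import math

def formant_vowel(formants, duration=0.01, rate=16000):
    """Return `duration` seconds of a vowel-ish tone built from formant peaks."""
    n = int(duration * rate)
    samples = []
    for t in range(n):
        x = sum(math.sin(2 * math.pi * f * t / rate) for f in formants)
        samples.append(x / len(formants))   # keep amplitude within [-1, 1]
    return samples

# Rough first two formants of an "ah" vowel (ballpark values).
wave = formant_vowel([700, 1100])
print(len(wave))  # 160 samples at 16 kHz for 10 ms
```

Because everything is computed from parameters, changing "head size" or pitch is just a matter of scaling the rules, which is why this kind of synth customizes and speeds up so gracefully.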
Little Issues in the Sound Files
Although I did try to watch the levels, some of the sounds include slight digital clipping. I hope you don't mind too much. It's really pretty unnoticeable unless you start looking for it in particular. Oh well, now that I've mentioned these glitches, I'm sure you'll pay attention to them, grin.
I used the Supernova screen reader in the document read mode to read the text. As some of the older releases said "end of document" at the end without inserting a pause first, I had to manually cut that part off the end of the audio file. Because some speech synths considered Supernova's prompting to be part of the last sentence of the text, you might get the feeling that the voice is cut off mid-sentence in some of the old audio files here. There was not much I could do about it.
Features to Be Compared
In addition to the main verbal comments, I'll be comparing different aspects of the synths numerically. The scale runs from 4 (total failure) to 10 (excellent, could not be much better), rounded to the nearest half point. As there are a number of synths and it is hard to balance the effect of each and every pro and con, the numbers may not be exact. The maximum value of 10 doesn't mean an aspect of a synth could not be improved at all; rather, it's outstanding and one of the best in the lot. I might rescale the scores or switch to a more verbal and ambiguous presentation in the future, provided that speech technology advances in strides, e.g. AI for understanding meaning, which affects intonation.
In these speech synth reviews, the following aspects of the synths will be judged:
- Sound quality: how pleasant the basic voice character of the synth is.
- Pronunciation: How well different words are pronounced, and how abbreviations are handled.
- Intonation: measures how natural the changes in pitch over time appear to be, as well as taking into account how the synth reacts to questions and exclamations.
- Customizability: The range of pitch settings that are usable and other parameters that can be customized. Also, the number of really different, usable voices and the language count affect this score. Hmm, versatility might be a better term for this aspect after all.
- Usability at high rates: This measure tells you how good the synth is in daily use if high speech rates are desired.
- Overall: The average of the above scores, representing the synth's overall score. This may be a straightforward average, plus or minus 0.5 points given for some special features or omissions.
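The scoring rule above is simple enough to write down. Here is a small Python sketch of it; the aspect names and example numbers are made up for illustration, not taken from any review below.

```python
# Sketch of the scoring rule described above: average the aspect scores,
# optionally apply a +/-0.5 adjustment for special features or omissions,
# and round to the nearest half point on the 4..10 scale.

def overall_score(aspects, adjustment=0.0):
    avg = sum(aspects.values()) / len(aspects) + adjustment
    rounded = round(avg * 2) / 2              # nearest half point
    return max(4.0, min(10.0, rounded))       # clamp to the 4..10 scale

# Made-up example values.
scores = {"sound quality": 8.0, "pronunciation": 7.0, "intonation": 6.5,
          "customizability": 7.5, "high rates": 8.0}
print(overall_score(scores))  # 37.0 / 5 = 7.4, which rounds to 7.5
```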
Free or Very Common Synths
OK, we'll be starting with the Microsoft voices. They are sample based and native to Windows in the sense that 2k and XP include some Microsoft SAPI voices by default. SAPI stands for Speech Application Programming Interface. There are two widely used SAPI versions, SAPI 4 and SAPI 5. This Microsoft synth uses the older SAPI 4 standard and is shipped with Windows 2000.
Note that I've deliberately lowered the pitch somewhat from the default setting (I like lower voices better, as you'll see).
The sound quality is quite pleasant as this one is sample based. However, notice how wildly the intonation varies; that's going to be a problem when listening to the speech at high rates. Mary's way of subtly hinting at a question is OK, but I cannot detect any difference between ordinary sentences and those ending in an exclamation mark. The rate of the speech varies strangely, too; it sounds a bit "jumpy" to me.
Although the sample text doesn't really show it, the pronunciation is not good. The synth has difficulties saying words like Cakewalk, and there are some really irritating built-in abbreviations that cannot be removed, although it does include an exception dictionary.
Despite the fact that you cannot really customize the individual voices, there's a large assortment of different voices. There are two males and one female, a couple of robots plus whacky echoing voices (based on simple delay effects).
Sound quality: 8.0
Usability at high rates: 6.5
For comparison, let's see how the same speech synth and voice has evolved in SAPI 5 (preinstalled with Windows XP). Here are the same samples in the same voice, but this time it's the SAPI 5 version.
Well, did you detect any difference?
The SAPI 5 version retains the good qualities of its predecessor, sounding about as pleasant as the SAPI 4 voice did. Microsoft has also fixed some of the problems of the previous synth.
The voice sounds much calmer now; no wildly varying intonation anymore. It's perhaps too clean and lifeless; the SAPI 4 version is actually a little better at lower rates. Notice how the SAPI 5 version marks the exclamation mark by suddenly raising the pitch. One final nag about the new intonation: the voice sounds pretty sad and shaky to me, judging by the variations in pitch. The SAPI 4 version can actually sound unintentionally happy (try making it say "sam to sapi 4"), so I'd prefer it at lower speaking rates.
But the SAPI 5 version has other enhancements in addition to improved intonation. They have managed to almost completely eliminate the jumpy quality of the SAPI 4 version; the SAPI 5 version takes long, nice breaks where it should, rather than hastily blurting things out. Pronunciation is also a lot better, and the horrible habit of auto-abbreviating is mostly, though not entirely, gone.
Sound quality: 8.0
Usability at high rates: 8.0
Microsoft shipped a new SAPI 5 voice with Windows Vista which is a step up from the previous default voice, Sam. In keeping with the common female names, Mary's SAPI 5 successor is called Microsoft Anna, and it is the only SAPI 5 voice in the system out of the box. Although much of the underlying SAPI interface technology appears to have stayed the same, a lot has changed in the engine.
The new voice, Anna, sounds much more natural, and the high frequencies in particular sound more pronounced in this voice. The sampling rate was set to 16 kHz when speaking to the wave file, however, so the sample does not convey the quality very well. I very much like the voice character myself; it doesn't sound quite as exciting as AT&T's Crystal, and is much lighter and more soothing to listen to than Apple's Vicki.
Whereas Microsoft's SAPI 4 intonation varied wildly and that of Mary in SAPI 5 did not, Microsoft Anna is of the more fluttery variety. The intonation reminds me of TV channels like Euronews. The voice sounds particularly pleasant in rather formal contexts such as announcements, neutral Windows user interface text as spoken by a screen reader, and in passages like "still an improvement". This intonation is one of the reasons why I like the speech in Vista so much. But then again, at times the intonation goes over the top. Sometimes the voice sounds angry or just wildly random; at other times it manages to be very bland. Listen to the snippet "one word", though do keep in mind that the exclamation mark affects matters, too. I don't seem to be able to detect much difference in intonation between questions and exclamations, either, which is a definite minus. The two do use an intonation markedly different from normal sentences; however, it does not convey to me, as a second-language English speaker, any sense of questioning or shouting, so I'd say it's rather unintuitive after all.
In terms of pronunciation, I don't like the new voice all that much. It pronounces many very common user interface terms, as used in screen readers, all wrong. Combobox becomes com-by-box, and verbosity is said incorrectly, too. In the sample passage, to me at least, it says rectify with a distinct n, and always pronounces document read in the past tense. Another thing that earlier voices like Mary handled better appears to be auto-abbreviating, though I have not done any extensive testing, just a few random sentences. Stuff like sw and oct is pronounced south west and October which, again, is pretty bad in screen reading. Oct can be octave, octal or even Octavian, for that matter.
The voice parameters are the usual pitch, speed and volume. However, Anna scales very well for a sample based synth. You can get a variety of different pitch settings that all sound all right, and there are few artifacts even at very slow speeds. For screen reader use, the voice is much more intelligible at high rates compared to the earlier offerings. Not quite in the same league as the fully synthetic or formant voices, but one of the most intelligible sample synths, I'd say.
Lastly, one big drawback of the synth is its response time. There's a noticeable lag before it speaks in screen readers like Narrator or Supernova, which is quite serious. It's not as bad as in some other synths, but perceivable and thus irritating nevertheless.
Sound quality: 9.0
Usability at high rates: 8.5
This is another free speech synthesizer, but not by Microsoft. It's also SAPI 4 compatible and sample based like the MS synths, but is multilingual. L&H supports a variety of languages; my L&H sample is a British female, however.
The basic quality of the voice is kind of cute. It sounds pretty pleasant, although it has a LoFi sound to it, suggesting a sampling rate of only 11 kHz or even less, which is a definite shortcoming. Although the voices cannot be customized at all, the synth still includes separate female and male voices, which is a nice touch.
The synth does not seem to be able to make head or tail of questions, shouting or anything else as far as intonation is concerned. The intonation seems to be a little "wrong" all the time. In all honesty, however, the L&H synth seemed to do very poorly with this particular text. Normally the intonation is fortunately somewhat better.
The pronunciation is OK, somewhere between MS SAPI 4 and 5. But the synth is not as responsive as the MS ones (perhaps a little too sluggish for screen reader use) and it has some serious trouble at high rates. Even if the speaking rate is relatively slow, as in the above example, some of the phonemes sound muddy to me, and the synth also has a peculiar habit of stringing words together in an odd way. You might get used to these quirks, and the synth could be OK to good. However, for high rate speech and daily use I cannot really recommend it. Might make a nice mIRC voice, though.
Sound quality: 8.0
Usability at high rates: 6.0
RealSpeak is another sample based speech synthesizer from L&H. The synth is supplied with Omni Page 12, a popular optical character recognition package. Many scanners include Omni Page as an added bonus. The reason why Omni Page comes with a speech synth is that version twelve includes text to speech functionality for reading recognized text aloud.
This voice is really pleasant and certainly one of the best I've heard. It sounds much more human and lively than the average speech synth and reads text at a steady rate, apart from a couple of strange speedups. The speech is clear and intelligible at high rates, too, although not as good as in ViaVoice, Orpheus or Dectalk.
The intonation is OK, but just like the previously reviewed L&H synth it does not react to questions and exclamations at all, which is odd. The samples have a somewhat metallic edge to them, but that's due to a different mp3 encoder being used and not a fault of this voice.
The pleasant sound quality comes at a price, however. The CPU usage is one of the highest in the lot, and the response time is the longest of all, around a second. This is OK for casual use like in Omni Page, but it is unacceptable when using the synth with a screen reader. The voice is not customizable, either, apart from the usual volume, pitch and speed. You cannot set up abbreviations, and even the about dialog box is missing. To make things worse, there's only a single voice.
Sound quality: 9.0
Usability at high rates: 6.0
Highly Intelligible Engines
IBM ViaVoice is IBM's version of the popular Eti Eloquence synth. I have heard samples of both and they seem to contain exactly the same voice engine. Another speech synthesizer using the engine, for Linux, is Voxin. The synth is free in the sense that once you purchase a professional screen reader, it is most likely included with it. Jaws and Window-Eyes come with Eloquence, while Supernova users can purchase ViaVoice at a fraction of the original, very high retail price. You can also get ViaVoice for free by downloading a demo of IBM's speaking Web browser called Homepage Reader.
Unlike the other synths so far, ViaVoice is a fully synthetic speech synthesizer. It has a dry but clear voice by default. The following ViaVoice samples, however, use a customized male voice which is a lot softer than the default voice (e-mail me if you want the exact parameters).
The basic character of the voice is very pleasant for a synthetic one. In fact, it sounds almost as natural as if it were a sample synth, which is pretty amazing. Definitely the best sounding fully synthetic speech synth I have ever heard; a big plus for this. One slight gripe, though: the Eti voices seem to always sound either too dry and nasal, or husky, depending on the breathiness setting.
The speech is also much more responsive than the MS voices and uses relatively little CPU, making the synthesizer an option even for embedded systems like mobile phones. The timing is pretty stable; in other words, the rate of the speech does not vary too much, so the synth should do well at high speaking rates. ViaVoice does have a bizarre, very rhythmic way of speaking; listen to how it says "only a limited amount of memory" to get a feel for this.
The intonation is also OK. It can subtly hint at a question, but seems to ignore the exclamation mark completely. Despite all of its advantages, I personally find ViaVoice's intonation a bit boring after having listened to it for a while.
More about the pros of ViaVoice. The character of the voices is extensively customizable. You can adjust parameters like speaker type (male/female, child/adult/elderly person), speed, volume, pitch, pitch fluctuation, head size, breathiness and roughness, and these seem to really affect the sound. You can create several very different voices; head size, pitch and breathiness are especially interesting.
The synthesizer is also multilingual, even including Finnish, which is my mother tongue. The support for other languages is not too good, though. This is because the intonation is global in ViaVoice, but in the real world it differs from language to language. In Finnish, the pitch should not rise at the end of a question, but it does in ViaVoice, because the global intonation seems to be tailored for English in particular. It also has some trouble with the hard rrr sound, and umlauts like ä and ö sound a little funny. Overall, I don't like ViaVoice's Finnish; it sounds like a classic "my Finnish is poor" computer voice.
Sound quality: 8.0
Usability at high rates: 8.5
Dolphin is a company making screen reading and magnification software. Dolphin's screen reading products like Supernova (which I'm using), HAL and Lunar Plus have for years included Dolphin's own speech synth called Orpheus. Dolphin has also made a hardware synth called Apollo. Orpheus cannot be used much outside Supernova, however, apart from a few programs that support Dolphin's very own SAM standard.
Just like ViaVoice, Orpheus is not sample based (although Orpheus 2 will optionally be) but generates all sound synthetically. The default voice, Dave, is pretty unpleasant and very robotic, but fortunately it can be tweaked. My custom voice, Hal, is lower in pitch than Dave and a lot smoother (see my Orpheus page for details). The Orpheus samples have been recorded with the Hal voice.
The basic quality of the voice is definitely robotic in comparison to ViaVoice, not to mention the sample based synths. However, the speech is very clear, and the pronunciation is clearly second to none (and it's getting better in each subsequent release). Sure, there are a couple of glitches with some rare words like (pineout), but not nearly as many as with ViaVoice.
Timing also seems to be very stable, and the response time is short. In addition, there's an option of switching to 11 kHz for really old computers (sounds pretty horrible, but a nice feature nonetheless). By saying that the timing is stable, I actually mean that the speech flows at a constant rate, punctuated by commas, periods and question marks quite nicely. Whether ViaVoice's rhythmic way of speaking or Orpheus's constant babbling is better comes down to personal preference. If you ask me, I prefer the "constant flow" approach, at least at higher speaking rates.
Getting used to Orpheus's intonation takes some time. It's not the most natural sounding on the planet but, on the other hand, it makes it certainly clear which type of sentence is in question. The pitch rises almost comically in questions, and exclamations are marked by an extraordinarily low voice.
Orpheus is also customizable. The parameter names differ quite a bit from ViaVoice; there are parameters like pitch, intonation, word pause, phrase pause, voicing (breathiness), speaker table (from 0 to 7), balance (panorama), hyper mode (skipping of articles and prepositions for quick browsing), mark/space ratio and finally voice source (male or female default). Despite Orpheus's great number of voice parameters, ViaVoice is really much more customizable in most cases, and its parameters are pretty straightforward in comparison to such abstractions as mark/space ratio or the numeric values of the speaker table.
To be honest, you can really get only one or two different, good sounding male voices out of Orpheus, and that's about it, whereas ViaVoice can do several pleasant sounding voices, both male and female. Orpheus just cannot pull off female-sounding voices, although the voice source parameter has the option female default. All of Orpheus's default voices are male as well.
Because Orpheus is fully synthetic, it is also multilingual. It supports even more languages than ViaVoice does, including Finnish as well. Orpheus's Finnish is certainly better than ViaVoice's, although it suffers from most of the same basic problems, like global intonation.
It is really difficult to say which one is better, Orpheus 1.x or ViaVoice. Both synths have their strong points and weaknesses. If I were to choose a synth for both casual and serious use, it would almost certainly be ViaVoice due to its better sound quality. For daily use I would rather pick Orpheus despite its robotic sound quality, because of the clear speech and good pronunciation (and because I've gotten so used to it over the past years).
Sound quality: 7.0
Usability at high rates: 9.0
Orpheus version 2 comes with the latest Dolphin software and includes several major enhancements over version one. Most notably, it is finally SAPI 4 and 5 compliant, includes many new languages with an exception dictionary, and features both synthetic and sample-based voices to get the best of both worlds. According to the help file, Orpheus 2.x has also been ported to Windows CE and includes a variety of new languages such as Chinese.
As Orpheus 2 has both sampled and synthetic voices, which are fundamentally different, I consider it unfair not to review both. However, as you can only use one type of voice at a time, I'll review the synthetic and the sampled side separately, giving scores for both kinds in naturalness and intelligibility. There will be only one overall rating, however, though the versatility of technologies and voices will be considered an advantage. As for the samples, Synthetic Dave (US English) handles the synthetic side, while UK English Alan serves as an example of sampled speech. (Orpheus is now by Meridian.)
The synthetic side of Orpheus 2 hasn't changed much as far as naturalness goes; that is, it is still very robotic sounding. It's hard to assess the quality exactly, as I originally disliked Orpheus 2 greatly but have nowadays grown used to it and like it better than its predecessor, so much so that Orpheus 1 now sounds strange. As for the sampled version, it is certainly a lot more natural than the synthetic one and roughly on par with the Microsoft SAPI synths. The basic quality of the voice is a pleasant Brit, but the problem with it is the actual units of sound. The transitions could be a lot smoother, and in particular hard sounds like k, p, t, s and x leave a lot to be desired in terms of clarity. When it comes to responsiveness, Orpheus 2 supports DirectSound and is about the fastest synth so far. You don't notice any delay even when using the sampled voices, though loading the voices into memory for the first time takes a while.
Pronunciation is virtually the same for both voices and represents a clear improvement over the previous version. There are a couple of glitches, but Orpheus 2 includes a language specific exception dictionary that can be used to remedy them, and also to add support for abbreviations if desired. Speaking of which, Orpheus 2 takes an even more drastic approach to abbreviating than the 1.x line. Even such mundane abbreviations as Mr. and Mrs. have to be defined separately, which is a good thing as far as context-free speech goes. Improvements in skipping over unimportant glue-words, as well as in reading context better regarding money, dates and such, have also been made, but I won't go into them here.
There are a number of changes regarding intonation and intelligibility in Orpheus 2. In particular, the synthetic voices seem to be even more intelligible than before; even the fastest speed of 700 is somewhat comprehensible to me (the fast samples were recorded at a rate of 307). The sampled voices, on the other hand, suffer badly at higher speeds, but seem to be a bit more natural sounding if you increase the speech rate. Still, I find Alan's voice to be a bit on the thin side when it's trying to talk very rapidly. Intonation is the same for both the synthetic and sampled voices, even though it differs slightly between UK and US English, and is roughly of the same quality as in V1. However, though the synth emphasizes exclamation marks and questions somewhat, the effect is still a bit too mild and can go unnoticed at higher rates from time to time. The problem used to be much worse in the early 2.x releases. Nowadays I think the questions are pretty easy to spot, although the changes could be even more drastic. But when it comes to exclamation marks, I can detect very little difference in intonation; the pitch is only slightly lower near the end of the sentence. Another noticeable change is in the way Orpheus 2 speaks. It sounds closer to ViaVoice nowadays, speaking more rhythmically than it used to. I believe this is partly why the intelligibility is better, and it seems to be an ideal middle ground between the Orpheus 1 and ViaVoice styles. Still, Orpheus's tone appears to be tailored for fast rates, and I find that at very slow speaking rates it stretches many of the sounds so much that the effect is almost funny.
Finally, as far as customizability is concerned, Orpheus 2 scores very high. There are the familiar options from Orpheus V1 as well as new controls such as basic EQ parameters and head size. As for the voices, each supported sampled language has one male and one female voice, and the synthetic voices Dave and Andy are available in every supported language. Though the array of customizable parameters is big, I've found many of the controls don't sound as good as in ViaVoice or Orpheus 1, especially head size and voicing.
Despite all of the shortcomings listed so far, I still view Orpheus 2 as a major change for the better. In particular, the option of using sampled voices if desired, the exception dictionary, and the improvements in language support and intonation are welcome changes. On the other hand, the sampled voices could still be further developed, question highlighting should be even more pronounced, and the effects of the voice parameters could be more, er, usable in practice.
In the following scoring, whenever two numbers are given, the first one is the score for the synthetic voices and the second one for the sample based.
Sound quality: 7.0, 8.0
Usability at high rates: 9.5, 7.5
The Dectalk family of speech synths is a popular option as far as hardware synths go. Nowadays the same synthesis is achieved in software as well. The software version is available for Windows 9X, NT and CE as well as for a number of Unix operating systems.
Although the basic character of this voice is a little robotic and the mp3 encoding generates some nasty artifacts, the sound quality is still very nice and comparable to ViaVoice, I dare say. In addition, there are several other voices, some of them sounding a lot more natural but not as intelligible as this one. I chose the voice Paul for its intelligibility, which is very good, almost as good as in Orpheus or ViaVoice. My only real complaint about the voice is that it is a little unclear and muddy when it says nasal sounds like "mm" and "nn".
I also like the intonation; it's lively for a speech synth and it really makes a difference between questions and shouting (requires some getting used to, though).
The pronunciation is not good. It's about at the same level as in Microsoft's SAPI 4 without auto-abbreviations. I cannot comment on responsiveness because I've only got a standalone demo of the synth, but it should be fairly good because Dectalk is one of the smallest sample based synths around and is being used in some handheld devices, I think (judging by Windows CE).
There are not any special voice parameters, just volume and speed; I couldn't even find the pitch slider anywhere. However, Dectalk has got a huge number of voices, eight for each supported language, which compensates for the lack of customizability very nicely. The voices are also really different. The default voice Paul has got a LoFi edge to it, lacking bass and not being very natural, but it's ideal for small speakers and daily high-rate use. Frank and Dennis are smooth, nice and natural but not as intelligible. Some voices also suffer from slight clicks and glitches and excessively boosted bass. Still a nice collection of voices. In brief, a pretty neat sample synth that can really compete with Orpheus and ViaVoice.
Sound quality: 8.5
Usability at high rates: 8.5
Particularly Human-sounding Voices
Cepstral offers human-sounding sample based voices for various platforms and devices ranging from servers to mobile, embedded systems. Though the voices are from 40 to about 130 MB, they significantly surpass most speech synths and can even compete with AT&T's Natural Voices. In the following review I make many references to AT&T, so you might want to read its review, too.
The basic character of both Callie and Lawrence is very natural, although a bit metallic due to heavy mp3 compression. Callie's 22 kHz sampling rate results in some clear highs and overall I find the voice quality very pleasant, especially in the higher register. Lawrence and most other voices are 16 kHz and natural sounding as well. Curiously, the S-sound in Lawrence is, to my taste, annoyingly long in places. Each individual sound is very clear, maybe even more so than in the AT&T voices.
Not all voices are born equal, however. Especially those with a small memory footprint suffer from various artifacts. On the other hand, it seems many voices have improved loads in this regard in version 4; Amy used to be somewhat unclear in V3 but is not any more. As a quick diversion regarding the lower quality voices, let's spend a while with Diane. It has a definite bass boost in it; many EQed radio voices come to mind. I also find this voice not to be as intelligible as the two I've sampled, and it is stuttery. Come to mind, for example, is said come PAUSE to mind with a truly strange G-man-ish intonation. There are various pops and clicks which manifest themselves easily if you interrupt the speech while it is speaking. Maybe snapping Diane to a zero crossing might help. The EQed sound is, on the whole, most likely intentional and I feel this voice would suit commercials well.
The basic intonation of Callie is pretty good and neutral. It is able to handle a variety of texts without attaching too much unintentional significance. It's rather like a more lively version of AT&T Crystal. The voice doesn't imply punctuation at all, which is a big minus for a screen reader user. Another gripe which tends to get annoying over time is the handling of some short words. It says the word harmless always the same way and has a queer, large and quantized drop in pitch in the middle. It's also noticeable in passages such as well I managed ... At first the problem didn't seem that bad, but the quirky intonation does grow annoying over time.
One definite merit of this synth is that the intonation is not only language but also voice specific. The UK English voice Lawrence is a wonderful example of this. It speaks slower than Callie, with a distinctly British singsong intonation. The voice brings some BBC radio plays to mind and is a classic case of an elderly absent-minded professor. The style isn't for everyone or for all kinds of text. It works extremely well in some computer contexts (e.g. Smart Humor in Programming Perl) and fairy tales, for example. Lawrence is definitely one of my all-time favorites as far as intonation is concerned. Still, like the AT&T voices, achieving this great intonation does involve the risk of guessing wrong. Placing heavy emphasis on words like of course, or the varying rate in sentences such as off to the editor, is sometimes out of place. In addition, do notice the vibrato in the word what. Yet again, I can see marked improvements from the 3.x release in general, though.
I think both the voices I demoed scale quite well when you up their speaking speed. I can understand them about as well as the AT&T voices. Lawrence seems to even benefit from a slight speed boost. It sounds as though it is excited about something and you don't have too much time to be bothered by the quirky intonation.
Pronunciation seems all right, although I've encountered a couple of issues. All voices apart from Lawrence say the word intentional with a heavy t rather than s sound in it. Also, the synth spells out stuff too eagerly to my taste, which can be problematic in trying to type command languages with a screen reader.
The Cepstral voices integrate nicely into other applications supporting SAPI 5. The pitch, speed and volume can be managed as usual. The engine supports the speech markup language for emphasizing certain words, changing pronunciation and switching voices on the fly. That markup won't work in SAPI 5, which appears to be an interface issue fixed in Windows Vista.
As for the voice parameters, volume and speed are supported as expected but there's also pitch. It works fine within SAPI 5 if you only shift it small amounts to slightly vary the voice. However, via SAPI and especially the supplied Swift Talker speaking applet, the sliders adjust the parameters too coarsely over an extremely large range. Many voices sound strained even after moving the slider a single notch, and the extremely high and low settings are virtually unusable. Pitch shifting and time stretching artifacts are as nasty as ever; even the whispery voice gets a flanging sound to it when you shift it around. My point is that the adjustments are useful, per se, but don't work over as large a range as one might expect.
There are some pitch shifter, flanger and other delay effect presets which can be applied to a voice. The feature is rather boring once the novelty wears off. You can associate a set of parameter deviations with a voice, which is called a voice alias. The aliases aren't accessible within SAPI 5, though. This means the FX voices are out for SAPI users, meaning most speaking apps out there. MS did it much better in their SAPI 4 voices, in which FX is tied to individual SAPI 4 voices. It would help if you could tweak the effect parameters yourself. I've also discovered some issues regarding the aliases. A straight alias of Callie which is slightly slower and has the old robot effect doesn't reapply the effect when you pick it in the included Swift Talker applet (V4.1.4). Furthermore, I think it's pretty bad that some of the FX actually result in digital clipping distortion, such as patching old robot to Lawrence, on this machine at least. It might be a volume control issue if I'm unlucky, however.
Frankly speaking, you don't get very far with the pitch and FX parameters. The best way to customize is to get new voices. Each voice has a different intonation and, apart from quite a number of rather similar sounding US English females, the range is quite impressive. There are some Macintalk-like specialty voices which might be useful in limited domains: a dog voice, impressive whispering and shouting as well as a true bad guy. There are not that many male voices in general and UK English only has two voices at the time of writing.
The last point I'd like to address is responsiveness, which I ranked too favorably the last time. The latency associated with voices is bad enough to render them hard to use with a screen reader. Due to a small memory footprint, switching between voices doesn't take all that much time. Neither does reading documents or patiently waiting for the synth to complete before more speaking is queued to it. The trouble is, this is rarely the case with daily speech users. Screen reader users are lazy and speech is still rather slow and linear. Thus, people tend to skip to the next item in the tab order or list as soon as they know the current one is not desired. This is done by interrupting the speech, by pressing tab for instance. So the lag from the interruption to the start of the next phrase should be as small as possible. The synth appears to fail rather miserably in such scenarios. The delay is subjectively at least half a second with a gig of RAM and a 1.8 GHz Pentium M. And it's not only navigating dialogs and windows but extends to typing in words rapidly or cursoring around a document. I'm taking 1.5 points from the high rates score due to this.
Sound quality: 9.5
Usability at high rates: 6.5
AT&T is a well-known firm as far as speech synthesis is concerned and many people have been surprised by how natural their Natural Voices synthesizers sound. All of these synths are sample based but each voice has got about 600 MB of samples, whereas ordinary speech synths are only tens of megs in size. There are also multiple languages available and a couple of voices per language. The sampling rate refers to the quality of the voices, a higher number meaning a brighter voice with more high frequencies in it. The 8 and 16 kHz versions are commonly available but I haven't seen the 22 kHz version on sale anywhere. For the samples, I've chosen a 16 kHz Brit female voice titled Audrey.
The first thing worthy of mention is the naturalness of the voice. Half a gig of samples does seem to make a difference, really. The 16 kHz sample rate is sufficient for basic speech use and, unlike in nearly all sample based synthesizers, I cannot detect any nasty artifacts in the transitions between the sounds. Suffice it to say that I nearly always use US English speech, as UK English is only usable if the voice is pleasant and natural sounding. AT&T is one of the few synths in which I routinely use UK English instead, and it also sets a new record for the most natural sounding synth of the ones reviewed.
Another strong point of the AT&T synths is the intonation. It is very pleasant sounding to say the least, and quite exceptional too. AT&T is a rare synth in that it has got truly both language and speaker specific intonation. This is an important aspect of making the UK English voices sound really British and also means that individual voices have got more character than usual.
I was also pleasantly surprised to find that the intonation is quite good in general. This is even though, of the various punctuation, only the question mark seems to make a difference, and even so you usually need a longer sentence to hear it easily. However, AT&T takes a unique approach to flowing the speech. It isn't overly rhythmical or evenly babbling, either, but it tries to take pauses in the right places, probably by analysing the grammar. This is one step further than what conventional speech synths do, that is, deterministically adding pauses whenever commas are encountered. In the case of AT&T, however, phrasing is based on simple language structures, e.g. the word that will usually give a long pause. This helps in understanding the speech at blazingly fast rates, and the quality of samples also means better scalability and sample interpolation than usual. Overall, AT&T is probably the easiest to understand sampled speech synth I've come across, even with extra high speaking rates. Another exceptional characteristic is that the synth has got hard-wired intonational guesses on how certain words should be pronounced. Sometimes this works very well but at other times these educated guesses may bring unwanted emotions to sentences, so we have a bit of a mixed blessing here.
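To make the phrasing difference concrete, here's a tiny Python sketch contrasting the two strategies. It is purely illustrative and not AT&T's actual algorithm; the <pause> marker and the choice of that as a clause cue are my own assumptions.

```python
import re

def comma_pauses(text):
    """Conventional phrasing: deterministically pause at every comma."""
    return re.sub(r",\s*", ", <pause> ", text)

def structure_pauses(text):
    """Toy grammar-aware phrasing: pause at commas and also before the
    clause marker 'that', roughly mimicking structure-based pausing."""
    return re.sub(r"\bthat\b", "<pause> that", comma_pauses(text))

print(structure_pauses("Ford said that the Earth, mostly harmless, was demolished."))
# prints: Ford said <pause> that the Earth, <pause> mostly harmless, <pause> was demolished.
```

A real system would of course use part-of-speech analysis rather than a single keyword, but the effect on the speech stream is the same kind of extra phrase boundary.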
Even though there's a huge amount of samples in each and every AT&T voice, it is only reflected in the long loading times when you activate a voice for the first time. Once it has been loaded, the synth is very responsive considering its quality. However, there's still room for improvement and while the response time is a lot shorter than say in Real Speak, it should be a bit more responsive for day to day screen reading. Still, it is sufficient for occasional use and quite all right for reading text documents. I also had some issues with the SAPI implementation of AT&T. In the case of SAPI 4, the volume must be brought as low as 20 (max 100) to avoid a horrible clipping effect. And as to SAPI 5, sometimes when using it with Supernova, it just stops speaking and requires that the synth be reloaded to work around the problem.
On the dark side, pronunciation is one of the weaknesses of AT&T's Natural Voices synths. It is about the same quality as Microsoft's SAPI 4 and even has the annoying habit of reading out abbreviations fully, which hampers the synth's usefulness for daily screen reading. Fortunately, there's an exception dictionary. The synth isn't too customizable, the only adjustable parameters being volume and speed. Pitch would have been a nice option, as well as some control over pauses and intonation. To compensate for the lack of customizability, the voices themselves are pretty unique. Crystal resembles a classic US English sci-fi computer voice while Mike always brings certain American radio commercials to mind. I've found Audrey to be the most pleasant sounding of the bunch, especially over time.
To summarize, AT&T is a great synth for reading e-texts at conventional speeds because of the superb intonation and the natural voice quality, let alone things like language and voice specific intonation. It's also got some pretty impressive features for screen reading such as good responsiveness and intelligibility. However, AT&T falls short on pronunciation and customizability in particular, and I also had some technical difficulties with the SAPI interface.
Sound quality: 10
Usability at high rates: 8.0
Infovox, currently Acapella, manufactures multi-lingual, sample based natural speech synths. This review was made with the downloadable demo version, which is still called Infovox, although it appears Acapella still uses much the same voices. The size of each voice is usually between 100 and 150 MB. The samples are from the US English female voice Heather, although British voices are also provided.
The default sampling rate in the synth is as high as 22 kHz. The sound quality is quite good, one of the best considering the size of the voice. The highs have excellent clarity as well and I think the synth emphasizes sounds like s or t more than most. Not sure about native English speakers but for me the extra emphasis just makes the voice easier to understand. Once again the mp3 compression masks that effect a little, however.
Speaking of the individual voices, I think Heather is Infovox's answer to voices like Crystal or Callie, to mention a few. There are very few artifacts related to single sounds and none in the sample passage as far as I can tell. One of the most disturbing is the second syllable in the stand-alone product name Sonic Foundry Sound Forge. You'll also get a click in the middle of single words like United States. The British voice Lucy is quite pleasant to listen to as well. I think it's about as good as AT&T Audrey, although its way of speaking is more predictable and somehow reminds me of some English teachers. Nope, that's not a bad thing at all. Frankly speaking, as with some other Brit voices, Lucy's intonated way of speaking may become irritating over time. The only male voice in the demo is called Ryan and is OK, although not quite as good as the females in my view. The voice has too much bass in it and short sounds are a bit muddy and blocked, if you will. The intonation isn't quite my cup of tea, either.
But back to Heather: I think they got that voice's intonation pretty much right. It is not too flat like Crystal, neither is it wide enough to produce awkward emphasis when guessing poorly. It's a middle ground between the two and somewhat reminds me of an improved Microsoft SAPI 4 Mary. Heather speaks in a rather varying, upbeat voice in general, particularly in words such as well. Also notice how well it handles questions like harmless or is that all it's got to say. The rising intonation unambiguously implies a question and manages to convey it rather pleasantly, too. On the down side, the rise in pitch is too sudden in very short questions and the synth appears to completely ignore the exclamation mark. Furthermore, at times the intonation will become very flat, too. The last sentence in the sample clearly demonstrates the effect.
I'm not very fond of the pronunciation in Infovox. It auto-abbreviates as badly as Microsoft voices, saying 3rd as "free road", for example. So far I've yet to find a way to turn off the abbreviation processing. And a lonely a is said more like uh than ay, and effective and affective sound exactly the same to me, which is definitely not a good thing. Speech synths are ambiguous enough without synth specific quirks.
When it comes to intelligibility and responsiveness, Infovox is doing pretty well. It is quite easy to understand and scales better than most sampled voices, I think. It does speak a bit more slowly than most synths, but then again this makes it more intelligible. That is, the pause inserted when punctuation is encountered is a constant independent of speed, and a long one at that, too. This makes it easier to grasp the structure of the text but does slow down reading a little. Another small touch I like is the fact that it says "trim it a bit" such that the words are separated clearly. So trim it rather than trim-it. The synth is responsive enough to be used with a screen reader. There's a small pause similar to other high quality voices but it is not so bad as to make the synth unusable. Be warned, interruptions in speech produce audible clicks. Another issue I'm having with Supernova is that sometimes the synth responds to indexing rather poorly. It occasionally starts with the wrong speed and has to catch up by jumping around in the speech stream, which doesn't sound nice at all. This might be a bug in Supernova's SAPI handling.
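A fixed-length punctuation pause has a simple consequence: doubling the speaking rate does not halve the total reading time. Here's a back-of-the-envelope Python model; the pause length and the rates are made-up numbers, not measured from Infovox.

```python
def reading_time(text, wpm, punct_pause=0.35):
    """Seconds to speak `text` when each punctuation mark adds a
    fixed pause regardless of the speaking rate (invented numbers)."""
    words = len(text.split())
    pauses = sum(text.count(p) for p in ",.;:?!")
    return words / wpm * 60 + pauses * punct_pause

sample = "Harmless, is that all it's got to say? Harmless!"
slow = reading_time(sample, 150)
fast = reading_time(sample, 300)
# The fast reading is less than twice as quick, because the pauses stay constant.
print(slow, fast)
```

The effect grows with heavily punctuated text, which matches the observation that the pauses help structure but slow down reading a little.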
The last point I'm going to address is customizability. You can have pretty good control over volume and speed, with the synth supporting both SAPI 4 and SAPI 5. However, that's all you get. As with synths like AT&T, the rest of the tweaking happens by switching voices. There are few out there, so the score in this area won't be that high.
Sound quality: 9.0
Usability at high rates: 8.0
Neospeech is a natural sounding speech engine whose voices are smaller than AT&T's but still easily hundreds of megs in size. The screen magnifier and reader ZoomText offers the Neospeech voices as a more natural alternative to ViaVoice which is normally supplied with the product. The following voice samples are recorded using the ZoomText 9 demo. The intonation of the initial question is a bit different from the usual because of a leading screen reader prompt that I've snipped.
Again, as you've probably expected, the sound quality of this engine is top-notch. There are few if any artifacts in the voice as such, and the basic character actually reminds me of some audio books (such as I, Robot). Furthermore, the intonation of Paul is very natural. Neospeech seem to have achieved a fine balance of liveliness even in arbitrary text. Put another way, this synth sounds more vivid than the AT&T voices but guesses wrong very rarely. I don't seem to get bored listening to this one, great. I particularly like how it emphasizes questions, though again I have a hard time picking out the exclamations.
I've only briefly tested this engine so I cannot say much about the pronunciation. It handles the passage quite well overall, apart from one annoying quirk. It always says the article a as ay, which needlessly emphasizes the noun following it, e.g. ay bit. There's an exception dictionary so correcting the problem ought to be easy.
I haven't properly tried the SAPI interface but at least in ZoomText there's a significant delay before it starts speaking content. Though you can live with it, it is long enough to be noticeable and slow down your work. That's why I would not recommend this engine for daily use, at least with an older machine (namely an AMD Thunderbird). As the delays are more noticeable at high rates, I've dropped the fast intelligibility score a half point because of it. Voice switching causes further delays, too, but they aren't as bad as in AT&T. Paul sounds highly intelligible at blazingly fast speaking rates. Although I don't find it nearly as intelligible as some of the formant voices, it beats both AT&T and Cepstral easily.
As with many of the other natural voices, customizability is in short supply. Speed and volume can be changed as with virtually any synth. I'm not sure if the engine supports pitch changes but they don't seem to be available via ZoomText, at least. The biggest issue is the number of voices. There's only one male and one female voice for US English. No Brit voices or variations. This is a pretty bad blow as I don't really think highly of the female voice compared to Paul. On the other hand, they do include many Asian languages which is pretty rare in my experience.
Sound quality: 10.0
Usability at high rates: 8.0
Speech for Other Platforms
As far as personal computers are concerned, Apple was one of the first manufacturers to include speech as part of a mainstream operating system. Historically, most of the early 90s Apple voices are synthetic in nature. However, more recently Apple has also added a couple of sample based voices, namely Vicki and Bruce. As I don't have a Mac at home, the first two samples are from the synthetic Macintalk voices running in Basilisk II (an early 90s 68k Mac emulator). Be warned, the samples contain a bit of stuttering and clipping because of emulation, I think, as the Mac volume slider had no effect at all and I recorded in the background. Thanks to a kind Mac user, there's also a sample of the high quality Mac voice Vicki.
For a synthetic voice, Macintalk Fred doesn't sound too bad. And I know Fred is synthetic, having heard the fact straight from the proverbial horse's mouth. Wikipedia, for instance, falsely claimed at the time of writing that all Macintalk voices are diphone based (read sampled). But back to Fred.
When listening with headphones, there's an annoying amount of bass, some 8-bit hiss can be heard even though the sound system is internally 16-bit, and s-sounds are slightly synthetic and distorted in nature. Some sounds are joined rather crudely, the th sound in cough is buzzy and there are a lot of clunky sounding audio artifacts once you increase the speed. Though the voice may seem straining in heavy use, it appears to work much better with your average computer speakers and has most likely been optimized for such a setup. It should also be noted that this voice is over ten years old but has stood the test of time very well. That's why Apple hasn't improved the old Macintalk voices much, though they may have upped the sound quality or intelligibility a bit. Still, most of the voices on offer in a modern Mac are exactly the same as the ones in my emulated MacOS 7.5.3 from the early 90s.
Apple is well-known for the range of voices available, and as an extreme example I present the pipe organ voice. It sounds like a good vocoder effect and is quite impressive, if a little useless, even today. The sampled Macintalk voices, such as the above introduced Vicki, have a very pleasant sound to them overall, too.
I'm not sure if all of the voices use the same model for intonation and pronunciation, so I'm using good old Fred as a case study. It would seem that both exclamations and questions do affect intonation, though a little too subtly to my taste. I believe this could be fixed by changing the intonation, though. One quirk is that even parentheses affect intonation, lowering the volume slightly at least. It's a good thing in general, but there should be an option of disabling it as you may be reading math or program code in which intonation changes may be undesirable.
I don't find the intelligibility of sampled voices such as Vicki that good for several reasons. Firstly, some sounds add buzzy and robotic defects to the voice, which is a definite minus (e.g. the s in sake). Secondly and more importantly, some transitions from one sound to another are either muddy or grainy, as in the passage and no-one knew ... off to the editor, and make the voices less intelligible than most sampled ones. I would go so far as to claim that Vicki is hard to understand even at ordinary speeds. As a synthetic voice, Fred beats Vicki clearly but falls a bit short of Orpheus, ViaVoice or Dectalk both in terms of intelligibility and the number of audio artifacts present. Still, Fred does it better than most sampled synths, even if some transitions have metallic artifacts in them.
The pronunciation is a little behind Microsoft as far as I can tell. It says both microprocessor and embarrassed incorrectly and spells out words a little too keenly, such as speak2me. On the other hand, I don't recall seeing any of the horrible auto-abbreviating so prominent in the MS synths. As a general rule, all uppercased words are spelled out letter by letter. While this may be a good overall estimate, the rule is too blunt for screen reading. In program code some names are conventionally in upper case, and even in literature Pratchett's Death speaks in all caps. Again, I wish you could turn off spelling completely.
The parameters that are shown in Macintalk's GUI include only speed and volume. However, apparently the speech API offers other options, as some third party software lets you change the intonation amount (modulation) as well as the base pitch of the voice. Though the latter two parameters do work, the voices work best with the defaults and only a few other pitch settings are truly usable, but still it is nice to have this control. Though there's no exception dictionary and English is the only supported language, the number of voices is huge. In addition to normal sampled and synthetic voices, there are singing synthetic voices and voices that have special boing and robot effects in them. They aren't obvious and lame delay effects as in Microsoft's synths but rather something likely formant based, which would be nearly impossible to add afterwards. I think these funny sounding voices are a nice addition but they are not very useful for daily speech use. The fun factor wears off rather quickly and you cannot really use the more exotic voices in dialog, as the pipe organ voice clearly demonstrates. It seems the rhythm and harmony of the singing voices is also predefined. If you could control the melody and lyrics over MIDI in real time, you could use the singing voices as a musical tool. Again, this would be useful but doesn't benefit your average speech user in any way. In scoring, I consider the plethora of voices a minor advantage. Having both synthetic and sampled voices, as in Dolphin's Orpheus, is a far greater benefit as such, however.
NOTE: Whenever two numbers are given in scoring, the first is for the synthetic voices (Fred and Pipe organ) whereas the second is the score for the sampled stuff (Vicki).
Sound quality: 7.0, 8.0
Usability at high rates: 7.5, 6.0
The Festival speech synthesis system is an open-source speech synthesizer which started in academic circles. In addition to regular speech synthesis, the package offers various kinds of programming interfaces, low-level components and other things such as the ability to sample your own voices if you've got the time and expertise. In addition to serving as an open platform for speech research, Festival has been adopted by many open source projects, including most screen readers running in Linux. Festival has also been ported to Windows, so these samples are from the Windows version. It should sound the same as any other version, however.
The character of Kal is a little dry and the pitch or sound changes feel sort of grainy or quantized. The effect is a bit like that of the Speak and Spell toy, if you've ever heard it, though fortunately a lot subtler. Also, I feel Kal sounds a little depressed, just like Microsoft's SAPI 5 voices. These quirks together significantly decrease the naturalness score in my view, though the sound quality is still usable.
It would also seem that by default, Festival doesn't pay much attention to punctuation. Sure, it does add pauses, but I cannot detect any difference in questions or exclamations. Another curious habit of Festival's is wildly varying the speech rate unpredictably. It is as though it read normally, stopped to think for a while and then rattled on in burst mode for a while, finally settling to the normal speaking rate again. These speed changes are large enough to affect intelligibility. Even though it sounds a bit muddy at high rates, I still find I can understand the synth moderately well. Pronunciation is OK, too, though it skipped the article the before the word Earth and tends to spell out names that have a lot of consonants in them. Getting the balance between spelling and reading as a word just right is notoriously difficult, but I think Festival spells out things a little too eagerly.
When it comes to speech parameters, volume and speed can be changed, and I think the synth also supports pitch changes and modulation amount, though not all user interfaces show these parameters and some artificially limit the speed range, too. It is very hard to assess the customizability of the synthesizer because of the flexibility of Festival. As far as the end user goes, you can get about half a dozen English voices for Festival, some of which approach Natural Voices in quality, and it supports numerous other languages already. If one has got the knowledge, however, it is quite feasible to create your own voices, as well as add XML tags for emphasis, speed, voice and intonation changes and so on. With the built-in LISP interpreter, one could also likely add emphasis to questions and exclamations on a regular basis. As the synthesizer is open-source, one could even remedy the hugely varying speed issue. In brief, the possibilities are endless. In scoring the synth, I'm reviewing the default voice Kal and considering the most likely customization options for end users.
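For the curious, Festival's interpreter can be driven from any language over a pipe. Below is a hedged Python sketch that builds a couple of Scheme commands; SayText and the Kal diphone voice selector are standard Festival calls, but the escaping here is deliberately minimal and the subprocess invocation is an untested assumption about your installation.

```python
def festival_commands(text, voice="voice_kal_diphone"):
    """Return Scheme commands that select a voice and speak `text`."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'({voice})\n(SayText "{escaped}")'

# Assuming the festival binary is installed and on the PATH, the
# commands could be piped to it roughly like this (untested sketch):
#   import subprocess
#   subprocess.run(["festival", "--pipe"],
#                  input=festival_commands("Mostly harmless."), text=True)
print(festival_commands("Mostly harmless."))
```

The same mechanism is what makes the endless customization possible: anything you can express in the interpreter, a wrapper script can send to it.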
They say whether or not you like a given speech synth comes down to taste and experience in the end. Perhaps it's just me, but frankly and subjectively speaking, I've never been much of a Festival fan, not even when I had my own Debian box running for a while. I just feel that Festival may offer some great possibilities for academia, but its default voice is far behind Festival's commercial, closed-source competitors.
Sound quality: 6.5
Customizability: 8.5 to 10 depending on your experience
Usability at high rates: 7.5
Like Festival, eSpeak is another open-source speech synth for Linux which is gaining some popularity. In fact, it is possibly included by default in some future accessible Linux releases such as Ubuntu Edgy. The main problems with Festival for me are its low intelligibility, lack of Finnish support and the rather poor response to punctuation. In this review I'm going to evaluate these aspects in addition to telling you about the synth in general. By the way, the name of the synth application and package is speak, but the term eSpeak is used on the SourceForge pages.
eSpeak produces most of its voice synthetically, just like Dolphin Orpheus, for example. It actually uses additive synthesis made up of sine waves, plus a combination of synthetic and sampled material for sounds like s or t. The synthetic nature of eSpeak is apparent in the sample, and I think it sounds even less natural than Orpheus. However, the voice does beat Amiga Narrator easily and works OK for screen reader applications, once you get used to it. Another curious characteristic of the voice is that it is rather dry and definitely robotic sounding. One aspect, which some people might like, is that sampled sounds like s or t are clearly emphasized in speech. Despite its rather machine-like way of speaking, the eSpeak voice has very few artifacts. The only ones I could pick out right away are the two bits god's sake and it's still, both of which sound a bit choppy, as though the synth were stuttering slightly.
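If additive synthesis sounds abstract, the idea is simply that a voiced sound is approximated by summing sine waves at multiples of a fundamental frequency, with per-harmonic amplitudes shaping the timbre. The toy Python sketch below demonstrates the principle only; it is not eSpeak's actual code, and the frequencies and amplitudes are made-up values.

```python
import math

# Toy illustration of additive synthesis (NOT eSpeak's implementation):
# sum sine waves at integer multiples of the fundamental f0, each
# scaled by its own amplitude, to build one short frame of audio.

def voiced_frame(f0, harmonic_amps, sample_rate=22050, n_samples=220):
    """Return one short frame of a voiced sound as a list of floats."""
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        s = sum(a * math.sin(2 * math.pi * f0 * (k + 1) * t)
                for k, a in enumerate(harmonic_amps))
        frame.append(s)
    return frame

# A 120 Hz fundamental with four harmonics of decreasing strength.
frame = voiced_frame(120.0, [1.0, 0.5, 0.25, 0.125])
```

In a real formant synth the harmonic amplitudes change over time to trace the formants of each phoneme, while unvoiced sounds like s or t are handled separately, here with sampled material.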
The intonation of eSpeak is rather even and pretty flat, which reminds me of Orpheus a little. I think it is a bit too flat by default, so it would be nice if there were an easy way of controlling the amount of pitch change applied to the current voice. The speech synth reacts to questions by slightly raising the pitch, but this can easily go unnoticed at higher speech rates. I find that I need to listen for the changes in particular before I'm able to spot the questions. Other punctuation affects speaking as well; most of it adds pauses to give the text more structure. Just like Festival, eSpeak sadly appears to pay no attention to exclamation marks.
For a free, open-source synth, the pronunciation is rather good. It makes no mistakes in the sample passage and won't spell very easily or resort to auto-abbreviating, apart from Mr. and Mrs. As a screen reader user, I consider that a big plus. I wish there were an easy way to turn off all abbreviations, though, or to control how eagerly words are spelled. In books it would be nice to get CIA spelled out, for instance. I think the spelling is also controllable within the language files if you do spot mistakes. Normally, it would be unreasonable to expect a modern GUI user to edit text files, but as this is Linux, it seems user expectations are quite different.
Like many other formant synths, eSpeak scores highly in both intelligibility and responsiveness. I can understand it at roughly 300 words per minute without losing too much detail. That's light-years ahead of Festival, although some sample-based synths and many formant voices do significantly better still. In my testing of an unofficial Windows version, I detected no perceptible lag in response times, so eSpeak suits screen reader applications well. Within my Linux virtual machine running Ubuntu and the Orca screen reader, there was a small latency that slows things down a little but doesn't prevent screen reader usage. Most likely this is due to virtualization, cumulative sound card latency from the host and guest machines, the Orca screen reader itself and/or the Speech Dispatcher component that does the interfacing to the command-line driven eSpeak. Gnome isn't the fastest desktop environment, either.
As far as parameters go, I've only played around with the pitch, speed and volume settings, which work well. I haven't yet gotten the Speech Dispatcher voices working, but my guess is that the voices vary about as much as in Orpheus. In particular, I do know there's no easy way to simulate the voicing or breathiness parameters that are commonly found in other formant synths. You can additionally tweak the intonation on a per-language basis and create new voices by adjusting the amplitudes of individual formants and various other voice parameters. This method isn't very intuitive for your average end user but will probably keep adventurous Linux folks happy. And as the synth is open source, the sky's the limit in the end. As an example, I'm co-operating with the author of eSpeak to add Finnish support to it. With relatively little effort it already speaks Finnish passably, which means I can finally read my native language in Linux. That's just great.
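To give you a flavour of what creating a new voice involves: eSpeak voices are plain-text files whose lines set things like the language, base pitch range and per-formant adjustments. The Python sketch below just assembles such a file as text. The keywords mirror the ones eSpeak documents for its voice files, but the helper and all the numbers are my own illustrative guesses, so don't take them as a working voice.

```python
# A hedged sketch of an eSpeak voice variant file as a text string.
# espeak_voice is a hypothetical helper; "name", "language", "pitch"
# and "formant" are real voice-file keywords, the values are made up.

def espeak_voice(name, language, pitch_range=(82, 118), formants=()):
    """Build the text of a simple eSpeak voice variant file.

    formants is a sequence of (index, freq_pct, height_pct, width_pct)
    tuples, each scaling one formant relative to the base voice.
    """
    lines = [f"name {name}",
             f"language {language}",
             f"pitch {pitch_range[0]} {pitch_range[1]}"]
    for idx, freq, height, width in formants:
        lines.append(f"formant {idx} {freq} {height} {width}")
    return "\n".join(lines) + "\n"

# Shift the first formant's frequency up by ten percent.
voice = espeak_voice("demo", "en", formants=[(1, 110, 100, 100)])
print(voice)
```

Dropping a file like this into eSpeak's voices directory is, as I understand it, how those adventurous Linux folks would go about making their own variants.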
Sound quality: 6.5
Customizability: 8.0 to 10 depending on your experience
Usability at high rates: 8.0
I think the Narrator speech synthesizer, not to be confused with the screen reader of the same name, is one of the earliest widely seen speech synths on personal computers. It's part of the early AmigaOS versions and has been around at least since Workbench 1.3 and the Amiga 500. I'm not totally sure about the year, but I think we are talking late 80s here and a synth that's even earlier than the first Macintalk voices. Additionally, unlike the Apple voices, the speech capability was no optional extra but a widely accessible utility, called Say, on the OS disk. The following samples of the Narrator speech synthesizer were taken on an Amiga emulator and are brief because of a character limit in the Say utility. Still, the sound should be fairly authentic, as I'm keeping the emulation as close to the original as possible:
It is clear right from the start that the voice is synthetic in nature, and it sounds a tad more primitive than the earliest of the Macintalk voices (Macintalk 1.x, not reviewed here). Compared with other early efforts like SAM on the Commodore 64, the quality is actually pretty good. But set against modern synthetic voices such as Dolphin's Orpheus and IBM ViaVoice, Narrator really shows its age. Even as a daily user of Orpheus, I think the Narrator voice is distinctly robotic. It has a bit of the same lo-fi quality as some of the first, experimental synths from the 50s. But, mind you, only a bit, and I think the Amiga speech synth could work well in a science-fiction setting.
The pronunciation in Narrator is somewhat faulty. It has trouble with very common English words such as amount, and it reads contractions such as it's as it s, which I find annoying. Clearly, Narrator hasn't really been designed for reading long texts, at least not in the Say utility. However, the Amiga has a Unix-like notion of devices as files, and there's a dedicated speech device that can be used, say, on the command line with ease. I think one could specify the pronunciation exactly, and there are some 3rd party libraries, but these aren't options that your average end user would take advantage of. They might come in handy for programmer types, however.
As in many really old speech synths, there seems to be some distinct randomness in the intonation. Although it can add liveliness to the voice at times, it sometimes accidentally raises the pitch toward the end of a sentence, even if it is not a question at all. For daily speech use, the ability to hint at punctuation with intonation is important, so these false positives drop the score. To make things worse, exclamations and questions don't seem to make any difference. On the other hand, the synth does add pauses upon encountering punctuation.
One thing I like in Narrator is the intelligibility of the voice. I think it is almost on par with Macintalk's Fred, though the lack of high frequencies in the output makes it a little muddy. Still, I find that I can understand the synth surprisingly well at high speech rates.
Yet another thing that Narrator has in common with many synthetic, formant-based speech synthesizers is customizability. The speaking rate and pitch can be changed, and there are special female and robot voice choices. Pitch changes sound OK to me, and the voice scales well in terms of speed, as it is not based on samples. The female option is very unconvincing, though, and I find even the natural setting robotic enough.
When it comes to user interaction, the speech synthesizer can be tested in the Say application. Though it is technically a graphical application, in practice the interface is totally command-oriented (compare it to a DOS window). The syntax is very easy, but there's an input limit of only a couple of sentences, which is why the samples are so brief. The Say application can read text files, too, but only when it's launched from the command line. I don't know very much about the Amiga, so it is unclear to me whether the speech synth has found other uses besides reading short text snippets. I've never seen any screen reader applications for AmigaOS, though.
When you compare Narrator with modern speech synthesizers, it loses in almost every respect. However, the synth was quite an achievement in the 80s and fit on the OS floppy. Additionally, the speech wasn't hardware-supported but software-based instead. I reckon Narrator predated Macintalk, sounded better, and was more intelligible and more customizable. On the PC, the earliest examples I can remember are from the early 90s: the Finnish Mikropuhe 1.0 talking through the PC speaker, as well as Dr. Sbaitso, which came with the 8-bit Sound Blaster.
Sound quality: 6.5
Usability at high rates: 7.5