A Review of TextAloud
By Veli-Pekka Tätilä
The use of computer-generated, synthetic speech is becoming more and more mainstream. Popular operating systems such as Windows and Mac OS X include built-in text-to-speech (TTS) capabilities, and synthetic speech is also used in automated weather reports and numerous telephone services, to name but a few applications. Text file readers are a common class of speech apps, and as a nifty extra many of them let you turn text files into spoken audio directly. The audio file or files can then be listened to on the road using a hardware mp3 player, efficiently stored on the computer for later listening or burned on a set of regular audio CDs.
As to the usefulness of such applications, I'd roughly divide the user base in two. Firstly, there are ordinary people who want to save a bit of time and give their eyes a rest, learn how a foreign language should be pronounced or simply think synthetic speech is cool. The other class consists of people who must rely on synthetic speech on a daily basis, such as sight impaired screen reader users. I'm personally a sight impaired power user who knows a lot about synths and audio and is also familiar with programming, so I'll be concentrating on aspects that usually get very little attention, such as keyboard usability, screen reader accessibility, batch conversion capabilities and transparent SAPI 4 and SAPI 5 support. This page is a review of Nextup TextAloud, a popular try-before-you-buy text reader for Windows. I'm only covering the features I found most useful, though I do attempt to describe my experiences in a detailed and objective manner. I'm using the TextAloud trial version, and the version number at the time of writing is 2.064.
Reading Text and Changing Voices
Reading text files out loud and changing voices is very easy. The user interface has a familiar text box for editing and selecting text, from now on referred to as the text editor. Above this control is a toolbar-like portion of the window that lets you quickly select the desired voice and its parameters as well as read text aloud or dump it directly to an audio file. The voice controls are surprisingly accessible; there are push buttons with labels for common actions and all controls appear to be part of the tab order. Oddly enough, however, pressing shift+tab multiple times only goes through a few of the controls and returns to the text editor after the title text field. The correct way to tab around, though it is slightly counterintuitive, is to press shift+tab to get to the voice controls and then tab forwards. The currently selected voice isn't available in the menus but can be set in the options dialog in addition to the list box in the voice controls.
Including a voice panel right in the document window can save a great deal of time in contrast to some freeware programs that only offer the voice parameters in preferences. This is because despite Microsoft's standardization efforts, synths have differing notions of speaking rate, pitch and volume values and often you need to tweak the speed a bit to get it just right.
In addition to reading the whole document, the speak menu offers other choices such as reading the selection or starting from the cursor. However, there's no read paragraph or read to cursor option, both of which are commonly seen in screen reading software. Another minor gripe is that while the voice is speaking, there is no means of quickly stepping back or forward in units of words, sentences, paragraphs and so on. Access to such real time navigation using hotkeys would be highly useful because it is easy to miss a sentence or wonder what a badly pronounced word really is. Sure, you can pause the voice, even using the menus, manually go back and finally re-initiate reading, but it just isn't very convenient.
On the bright side, controlling voice parameters like speed and pitch is possible even during playback and works better than I anticipated. But there's a significant flaw in keyboard usability. When you change the speed slider on the keyboard, as soon as the change takes effect, the keyboard focus jumps to the text editor instead of remaining in the slider being adjusted. This makes real time speed adjustment from the keyboard a very slow and frustrating procedure. As a work-around, the voice parameters can be changed one step at a time in the edit menu of the program. Additionally, I wish the voice sliders had accompanying text fields for typing in the desired values directly, provided that you do know the exact value you'd like.
Importing Files and Using the Clipboard
Typing in text is not the only way to read it; you can also capture stuff from the clipboard automatically as well as open files in various formats. In addition to the obligatory plain text, clean plain text conversions of RTF, DOC, PDF and HTML files are supported and they work moderately well. Not all DOC or RTF files seem to be openable, however, and some of them throw a "class not registered" exception.
One major point in usability is speaking a language that the user understands and making the error messages clear and supportive. A straight error text from Windows is a good example of how not to do things. Such error messages are not unique to unsuccessfully opening DOC and RTF files but plague many of the speech related errors, too. Technical information can help in troubleshooting situations and it should be included. However, it ought not to be the primary focus of the message as far as the average end user is concerned. Even as a programmer, the only significant detail I've learned from the errors is that TextAloud is using Microsoft's OLE Automation for controlling speech synths, the same API I've been using from Perl.
Another shortcoming of the import facility is that even though we are dealing with a text editor, big and little endian Unicode doesn't seem to be properly importable. Mac, Unix and DOS text seems to be, however, even without user involvement, which is cool. By default capturing the clipboard will also prompt the user, but this prompt can be disabled so that data is copied directly to TextAloud. The trouble with the clipboard capture confirmation dialog is that it jumps to the foreground with a very short timeout, yet the keyboard focus is not moved to the dialog. This makes it very difficult for sight impaired persons to interact with the clipboard dialog, and the timeout is by default all too short for screen reader users to get a picture of what's going on.
Reading to an Audio File
Reading text to an audio file in TextAloud is a breeze. There's a dedicated button for it and a choice for the same thing in the Speech menu. The only thing you need to provide is the folder in which the audio file should be saved, and even this prompt can be disabled in the options. The old DOS term directory is used needlessly, though, and the ability to specify a different audio file name would have been nice. Fortunately, the defaults are smart, using the same base name as an imported file or the first few words of the text. Another thing missing in the dialog that starts audio file writing is the choice of format. While you can usually go with the defaults and the format can be changed in the options, sometimes a per-conversion override of the defaults could come in handy. Two examples are creating both a high and low quality version for streaming or previewing different amounts of compression. A dedicated preview function for compressed audio files would be a welcome addition, too, though sure, you can't have everything.
One thing I see as a definite advantage of TextAloud is the amazing audio writing speed. It can be adjusted between 1x and 150x, where 1x corresponds to the speed you'd get when recording the spoken audio directly. Naturally, higher values take more CPU time. But even the highest takes about 30 percent on a fast machine, so chances are it could go even faster if the implementation permitted. Extremely fast reading to a wave file is a true time saver and not found in any of the free text readers I've come across. A big plus for this feature alone.
As far as audio formats go, uncompressed wave files with different sampling rate and channel options, as well as mp3 and wma files, are supported, unlike in most, if not all, of the free competitors. The choice of audio parameters is very wide, though oddly enough mp3 files can be written in stereo, which is rarely necessary and most likely a user mistake. When you select the WMA option, a message box pops up leading you to the necessary WMA download. While this is nice, the means of providing the information go against common usability principles. A keyboard user usually cursors through the format list, and on arriving at WMA, the focus is suddenly taken away from the list, which prevents you from cursoring over the WMA choice should there be other formats after it. It would have been a lot subtler to handle this special case without focus stealing. One way is showing the message in a read-only text box and disabling the OK button when WMA is selected.
Another minor gripe is that not all Windows supported audio formats can be selected. You cannot select the mp3 codec (compressor/decompressor) used, and other wave variants like TrueSpeech or ADPCM are unsupported. While it is true that mp3 and wma are the most popular formats, even Sound Recorder is able to interface with the rest of the codecs on offer. For an even wider palette of format support, running external encoder programs such as oggenc ought to be supported. However, it should be noted that one can convert to wave first and then do any desired post processing afterwards, so the lack of sound formats is not a real issue. One small but elegant touch is that, if desired, TextAloud can make the sampling rate, dynamics and number of channels used in the audio file match those of the speech synth automatically. For example, AT&T Natural Voices use 16 kHz whereas Microsoft Voices use 22 kHz. Not requiring the user to know this kind of detailed info is a sign of a smart program.
Another slight niggle of mine is that internally the audio is first written to a file called temp.wav, which is then converted to the desired format if necessary and deleted when processing has completed. While this isn't a problem for most people, it does mean you'll have to have the disk space for the uncompressed audio, unlike in some CD rippers, namely CDex. I see two ways of getting around the problem. The first is to allocate physical RAM for small files up to a user configurable limit. The second way is feeding a small buffer of audio data through the codec in real time like mp3 audio recorders do. Again, the lack of straight mp3 or wma writing is no biggie unless you are converting large files and are tight on disk space.
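To get a feel for how much room such a temporary uncompressed file needs, here is a rough back-of-the-envelope sketch in Python. It assumes 16-bit samples (the bit depth TextAloud actually uses is not something I have verified) and ignores the small wave file header. The function name is made up for illustration.

```python
def wav_data_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size of raw PCM audio data in bytes (wave header not included)."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * seconds


one_hour = 60 * 60
# 16 kHz mono is the rate mentioned for AT&T Natural Voices;
# the 16-bit sample size is an assumption.
att_hour = wav_data_bytes(16_000, 16, 1, one_hour)
print(f"{att_hour / 2**20:.0f} MiB per hour of speech")
```

Under these assumptions an hour of 16 kHz mono speech comes to roughly 110 MiB of temporary data, which is why the intermediate file only starts to hurt with very long texts.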
SAPI Support, Pronunciation Editing and Tags
Microsoft SAPI, the speech application programming interface, is a way for programmers to handle all speech synths the same way in an application. To further complicate matters, though, there are two incompatible versions: the older SAPI 4 and the new SAPI 5.x line. Not all SAPI 4 synths support SAPI 5 yet, so it is important for an application to support both. Though SAPI 4 is perhaps the more popular, SAPI 5 adds support for XML speech tags and fine pronunciation control with a phonetic alphabet.
To its merit, TextAloud does in fact support both SAPI standards fairly well. The user can choose either of the two standards or specify that he or she wants voices from both. If both are selected and there are some overlapping voices, SAPI 4 takes precedence, even though some features of TextAloud require a SAPI 5 compatible engine. The speech synthesizers and their voices are arranged in a neat two-level tree structure which lets you preview or disable voices on a per synth basis. The idea is again great but the implementation falls short on accessibility. The tree control is custom, so screen readers like Dolphin Supernova 6.x are unable to recognize the branches and read their state. On a similar note, enter doesn't press OK and esc doesn't cancel a dialog, unlike in virtually all properly coded Win32 applications. Another issue making the SAPI support less transparent is that advanced pronunciation editing and voice change tags in the text require a SAPI 5 compatible voice. Rather than graying out the respective menu items, crude message boxes are popped up to prevent you from choosing the action in question. The same attitude is annoyingly common across the board in TextAloud, including showing both single and multi-article commands, for instance.
Two neat speech options added in TextAloud are delays that are inserted on sentence changes or when a new paragraph is started. Handling these delays at the application level emulates the feature for all SAPI compatible speech synthesizers. If there are more than two consecutive new-lines, the pause is even longer, which is a very good thing in general and might be a unique feature in a text reader, though it's been implemented in some screen readers before. Another highly useful feature that's on by default is a punctuation filter. It does come in handy in a table of contents list, where tens of periods are often used to separate the chapter name and page number. Even foreign characters such as umlauts are handled correctly. On the downside, there's no option to specify a custom exclude list for punctuation. Thus ASCII headings made of numerous equals signs are read out verbatim.
The pronunciation control offered by TextAloud is very versatile. It supports easy-to-use basic word replacements as well as wildcards, regular expressions (though probably not Perl 5 compatible) and even phoneme based pronunciation. Again, all this is very useful in theory, but there's one major oversight that really decreases the usability of custom pronunciation: the changes are global and thus applied to each and every synthesizer and voice. More often than not, one would like to correct the pronunciation only for a particular synth, and sometimes for a particular voice provided that the language or accent is different. This is not possible, so correcting the pronunciation of something like Cakewalk for the Microsoft synths forces other synths to use the same pronunciation. Should Murphy's law kick in, using the basic pronunciation editor could fix a word on one synth but make it incorrect for another. Most screen readers offer synthesizer specific pronunciation, and many synthesizers have their own custom editors which could be available within TextAloud as well.
SAPI 5 compatible synths support voice tags, and TextAloud makes the pause and voice change tags available through the menus. Adding pauses or voice data in a piece through the menus can be tedious, so I've found that it's just easier to copy and modify the tags by hand. Still, I wish all supported tags had a graphical interface in TextAloud. Another thing worth adding might be a proper XML validator. Currently, if you make a mistake with some tags, it will throw a highly technical sounding message box that gives very little info on the possible culprit. As TextAloud has got regular expression support already built-in, it would be cool if it had a regexp based tagger that could be used to automatically insert voice changes or other tags in documents. This facility would work best for structured text such as scripts of plays or different levels of quoting in an e-mail message. Sure, you can use another regular expression capable text editor for the job, but I think tighter integration with TextAloud would be rather useful.
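To illustrate what a punctuation filter with a user-definable character list could look like, here is a small Python sketch. This is not how TextAloud implements its filter, which I have no way of inspecting. The function name, the `chars` parameter and the default run length of four are all assumptions of mine.

```python
import re


def filter_punctuation(text, chars=".", min_run=4):
    """Replace runs of the same punctuation character with a single space.

    chars   -- the characters to filter (hypothetical user setting)
    min_run -- shortest run to remove, so a three-dot ellipsis survives
    """
    run = re.compile("([" + re.escape(chars) + r"])\1{" + str(min_run - 1) + ",}")
    cleaned = []
    for line in text.split("\n"):
        line = run.sub(" ", line)
        # Tidy up the spaces left behind by the removed run.
        line = re.sub(r"[ \t]+", " ", line).strip()
        cleaned.append(line)
    return "\n".join(cleaned)


print(filter_punctuation("Introduction ........ 5"))
print(filter_punctuation("=====\nHeading\n=====", chars=".="))
```

With the default list only the dots of a table of contents line are dropped. Adding the equals sign to `chars` is the kind of custom setting that would keep ASCII headings from being read out verbatim.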
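As a rough idea of what such a regexp based tagger might do, here is a Python sketch that wraps the lines of a play script in voice change tags. It assumes the SAPI 5 XML form `<voice required="Name=...">`; the exact tags TextAloud inserts may differ. The speaker names, the voice mapping and the function name are purely hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# A speaker line looks like "ANNA: some text" (upper case name, then a colon).
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z ]*):\s*(.*)$")


def tag_script(text, voices):
    """Wrap each known speaker's line in a voice tag; leave other lines alone."""
    out = []
    for line in text.splitlines():
        m = SPEAKER_LINE.match(line)
        if m and m.group(1) in voices:
            attr = quoteattr("Name=" + voices[m.group(1)])
            # escape() keeps characters like & from breaking the XML.
            out.append(f"<voice required={attr}>{escape(m.group(2))}</voice>")
        else:
            out.append(escape(line))
    return "\n".join(out)


script = "ANNA: Hi & bye\nThe door closes."
print(tag_script(script, {"ANNA": "Microsoft Mary"}))
```

Escaping the text before tagging also sidesteps one common source of the cryptic XML errors mentioned above, namely stray ampersands and angle brackets in the source text.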
Dealing with Very Long or Numerous Files
For converting a number of files at once, TextAloud offers two useful approaches. You can use the batch converter to quickly turn an arbitrary set of text files into audio. The other choice is splitting the text into multiple articles and reading them to a file all at once.
Normally TextAloud operates in single article mode, meaning that only one file is open and ready for editing at a time. It is also possible to switch to multi-article mode, in which you can switch between different documents using an articles list box that's part of the main window. Some options like opening or reading more than one file are only available in multi-article mode. As single article mode is a special case of multi-article mode, however, I wish the two modes were consolidated into one.
In reading multiple files at once, TextAloud can either output one large audio file or make a new file for each article, both being options that are frequently used. Additionally, you may use a single voice for all articles, pick a random voice for each one or cycle through the voices with a wrap-around. As I anticipated, processing multiple files isn't very fault tolerant. If a SAPI error is encountered for some reason, the whole process is aborted. For unattended processing, skipping over any problematic files and temporarily disabling the problematic voice would be preferable. The batch converter in Sound Forge does things about right, if you ask me. On the plus side, one thing I like in multi-article processing is that the status bar clearly shows which file is currently being processed, so the user is kept posted on what's happening.
It would seem that TextAloud is able to cope with very large files well. Though the manual warns about problems related to files larger than 100 kilobytes, I was able to successfully convert half a megabyte of text at once as a single article. There were no errors during the conversion, and at least the beginning and end of the audio file seemed to match the original text as expected.
But 100 kilobytes of text converted into a single audio file isn't a convenient format to deal with on mp3 players that don't support jump-to-time or bookmarks. This most likely includes hardware based solutions such as mp3 capable CD players. TextAloud offers two separate utilities for dealing with the problem. File Splitter can split a file into several smaller files, while the Batch Converter can be used to efficiently read dozens of files without having to load them up as articles. Unlike the document window, the batch converter lets you specify the voice parameters directly on a per conversion basis and is thus often preferable to multi-article mode, at least in my experience.
The file splitter is smart enough to split on sentence boundaries. It can accept a file name prefix or suffix as well as a part number. Splitting criteria include one or more strings that delimit the parts to be split or a maximum part size in kilobytes. Though these choices are useful and get the job done, a couple of additional modes would help a lot.
From a usability point of view, the unit kilobyte is a little vague to most people and its meaning is also character set dependent. A more intuitive choice would be to specify an estimated maximum running time based on the current voice and pause settings. Similarly, most texts are complex enough that you cannot give a text string that will unambiguously and certainly delimit only chapters or other convenient units. As the pronunciation dialog has got regular expression support built-in, that support should be extended to specifying part delimiters, too. You should be able to use back references, memorized bits of text, as part of the output file name. Not only that, but a regular expression would let you specify most heading types unambiguously. One example for splitting on headings might be at least two line breaks, one or more numbers, a period, one or more numbers followed by as many white space or word characters as possible and a line break character. Naturally, all this can be expressed concisely by a well-crafted regular expression.
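The kind of fault tolerance I have in mind can be sketched in a few lines of Python. This is only an illustration of the idea, not TextAloud's or Sound Forge's actual logic. The `convert` callable stands in for whatever does the real text to audio work and is assumed to raise an exception on a speech error.

```python
def batch_convert(paths, voices, convert):
    """Cycle through voices with wrap-around, skipping files that fail.

    convert(path, voice) is a hypothetical callable that raises on error.
    A voice that causes an error is disabled for the rest of the run.
    """
    active = list(voices)
    done, skipped = [], []
    for index, path in enumerate(paths):
        if not active:
            # No usable voices are left, so the remaining files are skipped.
            skipped.append(path)
            continue
        voice = active[index % len(active)]
        try:
            convert(path, voice)
        except Exception:
            active.remove(voice)
            skipped.append(path)
        else:
            done.append((path, voice))
    return done, skipped


def fake_convert(path, voice):
    # Stand-in converter: the voice called "Broken" always fails.
    if voice == "Broken":
        raise RuntimeError("speech engine error")


done, skipped = batch_convert(["a.txt", "b.txt", "c.txt"], ["Good", "Broken"], fake_convert)
print(done, skipped)
```

The point is that one bad file or voice ends up in a list of skipped items that can be reported at the end, instead of aborting the whole run.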
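The heading pattern described above can be written out as follows. This is a sketch in Python's regular expression syntax, which is not necessarily the dialect TextAloud uses. I have narrowed "white space" to spaces and tabs so that a heading cannot swallow the following lines, and the file naming scheme is just one made-up possibility.

```python
import re

# At least two line breaks, digits, a period, digits, then the rest of the
# heading line (spaces, tabs and word characters) and a closing line break.
HEADING = re.compile(r"\n{2,}(\d+)\.(\d+)([ \t\w]*)\n")


def split_on_headings(text, prefix="part"):
    """Return a list of (file_name, body) pairs, one per numbered heading."""
    matches = list(HEADING.finditer(text))
    parts = []
    # Anything before the first heading is kept as a preamble part.
    preamble = text[:matches[0].start()] if matches else text
    if preamble.strip():
        parts.append((f"{prefix}_0_preamble.txt", preamble.strip()))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # The captured groups (chapter, section, title) become the file name.
        name_bits = [prefix, m.group(1), m.group(2)] + m.group(3).split()
        parts.append(("_".join(name_bits) + ".txt", text[m.end():end].strip()))
    return parts


sample = "Intro\n\n\n1.1 Getting Started\nHello there.\n\n1.2 Voices\nPick one.\n"
for name, body in split_on_headings(sample):
    print(name, "->", body)
```

The three captured groups play the role of the memorized bits of text: the chapter number, section number and title all end up in the output file name.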
Features for Low-Vision Users
Though the TextAloud interface is not self-voicing, there are two features that might be of interest to low-vision users. The first is the ability to customize the font used in the text editor to make it big enough to read. The reading position is followed in the document, and this tracking can be disabled if desired. As a special trick, one could turn the volume to 0, set up the speed to one's liking and use the tracking of the magnified text as a poor man's line reading mode.
The other option, though it usually decreases accessibility, is the optional use of skinning. It is great that skins are not forced on each and every user, so you can rely on the Windows classic or XP look if you want to. But by selecting a good skin, you can get a good contrast between the dialog and its buttons, which is something not possible in the Windows classic look. One example of this is the skin named Grab Saphire. Skinned controls in TextAloud are usually as accessible as their Windows counterparts, but the big exception here is menus. I'm not sure if it's the custom selection color or what, but Supernova fails to track the menus when skinning is on. As soon as this is fixed, though, I'd say the skinning option could work well for low-vision users who also rely on speech. If only there were an easy way of making more of these skins. And if you could override button, text field and selection colors, you could get around the Windows limitation of a shared text color for dialogs and windows. This could let one create true high-contrast schemes in which buttons, text fields and dialogs stand out in a way never before seen in the Windows world.
TextAloud is a very useful and easy-to-use text-to-speech utility for an attractively low price (about 20 US dollars). It shines in turning multiple files to compressed audio very quickly and also offers basic text editing, pronunciation and voice changing features. Unlike most free readers, it supports both SAPI 4 and SAPI 5 for maximal coverage and offers some low-vision features, too. As for improvements, the keyboard interface, menu logic and implementation of skinning are a bit patchy in places, and TextAloud could do with some extra features. In brief, a nice and feature-rich text to audio reader for Windows.