Digital Signal Processor and Text-to-Speech
This is the second post in a series on Text-to-Speech for eLearning written by Dr. Joel Harband and edited by me (which turns out to be a great way to learn). The first post, Text-to-Speech Overview and NLP Quality, introduced the text to speech voice and discussed issues of quality related to its first component – the natural language processor (NLP). In this post we’ll look at the second component of a text to speech voice: the digital signal processor (DSP) and its measures of quality.
Digital Signal Processor (DSP)
The digital signal processor translates the phonetic language specification of the text produced by the NLP into spoken speech. The main challenge of the DSP is to produce a voice that is both intelligible and natural. Two methods are used:
- Formant Synthesis. Formant synthesis models the human voice with computer-generated sounds based on an acoustic model. Typically, this method produces intelligible, but not very natural, speech. These are the robotic voices, like MS Mike, that people often associate with text-to-speech. Although not acceptable for eLearning, these voices are small, fast programs, so they are used in embedded systems and in applications where naturalness is not required, such as toys and assistive technology.
- Concatenative Synthesis. To achieve the remarkable naturalness of Paul and Heather, concatenative synthesis is used. A recording of a real human voice is broken down into acoustic units (phonemes, syllables, words, phrases and sentences), which are stored in a database. The processor retrieves acoustic units from the database in real time and connects (concatenates) them together to best match the input text.
Concatenative Synthesis and Quality
When you think about how concatenative synthesis works (joining together a lot of smaller sounds to form the voice), it suggests where glitches can arise. A glitch occurs either because the database has no recording of exactly the sound that is needed, or because two segments don't come together quite right where they are joined. The main strategy is to choose database segments that are as long as possible (phrases and even sentences) to minimize the number of joins and therefore the number of connection glitches.
Here is an example of a glitch in Paul when joining the two words “bright” and “eyes”. (It wasn’t easy to find a glitch in Paul. I finally found one in a Shakespeare sonnet!)
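To make the "longest segment" strategy concrete, here is a minimal sketch in Python. It is a toy under loud assumptions: the database is just a set of word strings (real voices store recorded audio units), the function names are made up for illustration, and the greedy longest-match rule is a simplification I'm swapping in, not how any particular vendor selects units.

```python
def select_units(words, database):
    """Greedily cover `words` with the longest recorded units in `database`."""
    units = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first, then shrink it.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in database:
                units.append(candidate)
                i = j
                break
        else:
            # No recording covers this word at all: the first kind of glitch.
            raise KeyError(f"no recorded unit covers {words[i]!r}")
    return units


def count_joins(units):
    """Each boundary between two units is a place where a glitch can occur."""
    return max(len(units) - 1, 0)


# Hypothetical databases: the larger one also holds two-word phrases.
small_db = {"thine", "own", "bright", "eyes"}
large_db = small_db | {"thine own", "bright eyes"}

text = "thine own bright eyes".split()
print(count_joins(select_units(text, small_db)))  # 3 joins
print(count_joins(select_units(text, large_db)))  # 1 join
```

With the phrase "bright eyes" stored as a single unit, there is no join between the two words at all, which is exactly the spot where the glitch would otherwise show up.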
- Mike - bright eyes
- Heather - bright eyes
- Paul - bright eyes
The output from the best concatenative systems is often indistinguishable from real human voices. Maximum naturalness typically requires a very large speech database: the larger the database, the higher the quality. Typical TTS voice databases that are acceptable for eLearning are on the order of 100-200 MB. For lower-fidelity applications like telephony, the acoustic unit files can be made smaller by using a lower sampling rate without sacrificing intelligibility and naturalness, which gives a smaller database (a smaller footprint).
By the way, the database is only used to generate the sounds which are then stored as .wav, .mp3, etc. It is not brought along with the eLearning piece itself. So a large database is generally a good thing.
Here is a list of the TTS voices offered by NeoSpeech, Acapela and Nuance with their file sizes and sampling rates.
Voice | Vendor | Sampling rate (kHz) | File size (MB) | Applications |
Paul | NeoSpeech | 8 | 270 (Max DB) | Telephone |
Paul | NeoSpeech | 16 | 64 | Multi-media |
Paul | NeoSpeech | 16 | 490 (Max DB) | Multi-media |
Kate | NeoSpeech | 8 | 340 (Max DB) | Telephone |
Kate | NeoSpeech | 16 | 64 | Multi-media |
Kate | NeoSpeech | 16 | 610 (Max DB) | Multi-media |
Heather | Acapela | 22 | 110 | Multi-media |
Ryan | Acapela | 22 | 132 | Multi-media |
Samantha | Nuance | 22 | 48 | Multi-media |
Jill | Nuance | 22 | 39 | Multi-media |
The file size is a combination of the sampling rate and the database size, where the database size is related to the number of acoustic units stored. For example, the second and third rows of the table (both Paul at 16 kHz) have the same sampling rate, but the third has a much bigger file size because of its larger database. In general, the higher sampling rates are used for multimedia applications and the lower sampling rates for telecommunications. Often larger sizes also indicate a higher price point.
The DSP voice quality is then a combination of two factors: the sampling rate, which determines the voice fidelity, and the database size, which determines the quality of concatenation and the frequency of glitches. The more acoustic units stored in the database, the better the chances of achieving a perfect concatenation without glitches.
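To see why the sampling rate drives file size, here's a rough back-of-the-envelope sketch. It assumes uncompressed 16-bit mono audio, which is my assumption and not necessarily how any of these vendors store their voices, so the numbers are illustrative only and are not meant to reproduce the table. The function name is hypothetical.

```python
def pcm_size_mb(hours, sample_rate_khz, bytes_per_sample=2):
    """Size in MB of uncompressed mono audio (assumed 16-bit by default)."""
    seconds = hours * 3600
    samples = seconds * sample_rate_khz * 1000
    return samples * bytes_per_sample / 1_000_000


# The same one hour of recorded units at multimedia and telephone rates.
print(pcm_size_mb(1, 16))  # 115.2
print(pcm_size_mb(1, 8))   # 57.6
```

Under these assumptions, halving the sampling rate halves the storage for the same set of acoustic units, while adding more units at a fixed rate grows the file in proportion to the recorded time.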
And don’t forget to factor in Text-to-Speech NLP Quality. Together with DSP quality you get the overall quality of different Text-to-Speech solutions.
Learning Flash
My posts around the Beginning of Long Slow Death of Flash and my post from a CTO perspective that I Cannot Bet on Flash for new development stirred up quite a bit of response. A lot of it said, quite correctly, that HTML5 is not there yet and that Flash provides things that you can’t do in HTML/JavaScript. However, there are some pretty amazing things you can do without Flash.
The bottom line is that none of the feedback I’ve received has convinced me that choosing Flash as a delivery option for a new product or project would be a good idea today, especially if I want it to play on mobile and live for 5 years.
But then I received a great question via a comment:
I am a Masters student enrolled in an Instructional Design course with Walden University. I am somewhat new to the field and this article intrigues me. Should I hold off on learning Flash... and focus more on learning HTML5? Or would it be best to learn both? I know a very little about Flash and made it a goal to learn more, but now I wonder. You input is greatly appreciated.
What a great question and kudos to this student for being so on top of things to ask it!
And it was somewhat the inspiration for this month’s Big Question - Tools to Learn. If you’ve not done so already, you should go read each of the posts there. They have different perspectives and taken together they provide a pretty good roadmap of how to think about what tools you should learn.
Jeff Goldman in Development Tools I Would Learn If I Were You - Jeff's response to June’s Big Question tells us:
Flash: Yes, Flash is still very much alive and well in e-learning and because it is so embedded in our industry and there is nothing at this time that can provide the rich interactive elements that it provides, I do not see it being “dead” in our field anytime soon. The fact is HTML5 is not there yet and if it ever does get there it will probably be more than 5 years before it is at the level of quality and ease of development that Flash currently provides. However, see my comments under HTML/HTML5.
To me the question is more about where you choose to spend your time. The lists of tools that Harold and Holly provide are pretty lengthy. And Jeff suggests both Flash and HTML5. If you have so much time that you can afford to learn all of these tools, then go ahead.
However, if you have to prioritize Flash vs. HTML5 vs. ??? … then I would put learning Flash (especially scripting in Flash) way down on the priority list at this point. Remember End of an Era – Authorware – another Macromedia/Adobe product. These things do eventually die out. How valuable are your Authorware scripting skills at this point?
Learning Flash today is like learning Authorware in 1997.
So, yes, hold off on learning Flash and focus more on learning HTML5.
Online Exam Preparation and Tutoring – Hot Market
Inc. Magazine published an article, The Best Industries for Starting a Business In 2010. I'm not sure what to make of most of the article, but they did include Exam Preparation and Tutoring as one of the top ten.
Parents always want their kids to do better on tests. A large number of adults returning to school are also looking for an edge. Given the low barrier to entry, this field is competitive. But if you carve out the right niche, it could be lucrative.
The industry, which includes tutoring in such fields as special education, language, and music, grew about 7 percent last year.
And it seems like there are lots of eLearning Startups that are taking aim at different aspects of the Business of Learning. My 12 eLearning Predictions for 2009 included
Increase in Consumer/Education Social Learning Solutions
2008 was an interesting year that saw a myriad of new start-ups offering content through interesting new avenues. Social learning solutions like social homework help provided by Cramster; CampusBug, Grockit, TutorVista, EduFire, English Cafe, and the list goes on and on.
And it seems like Inc. is maybe just a little bit late, as there are a bunch of startups going after online exam preparation and online tutoring. Some eLearning startups roughly in this space:
- Knewton focuses on test preparation online using test experts to help students study.
- TutorJam offers online tutoring programs for students in K-12, AP classes, and college.
- Brightstorm focuses on helping students prepare for AP tests, as well as standardized tests.
- Sums Online provides a wide range of math activities to help at home learners.
- DreamBox Learning is an education start-up that provides math games for kids. This was recently acquired by Netflix founder Reed Hastings.
- ProProfs – SAT and certification quizzes.
- PrepMe – personalized prep for SAT, ACT, PSAT.
- Tutor.com – online tutoring.
And there are a bunch more out there. As Inc. tells us – low barrier to entry. So we should expect lots more.
eLearning Learning Sponsored by Rapid Intake
As you probably know, eLearning Learning has been steadily growing and is now one of the top eLearning sites on the web. I wanted to let you know about an exciting development for eLearning Learning that’s being announced this morning in the eLearning DevCon Keynote.
Garin Hess and the team from Rapid Intake have stepped in to help me keep the site going from both an effort and a financial standpoint.
I'm very happy to have Garin involved because I've known him for years and he's always done a good job of helping to build the larger eLearning community through conferences that you probably already know about:
Garin was really excited to support this broad community of bloggers. We both believe that while this is a loose network, it provides an important and really valuable voice. It's somewhat the whole reason I started eLearning Learning - many people in the world of eLearning miss the great stuff that is going on in blogs. Of course, if you are reading this, that’s probably not you. That said – I still believe that everyone should be Subscribed to Best of eLearning Learning.
Otherwise you’ve been missing things like:
- Top 75 eLearning Posts - May 2010
- Top 68 eLearning Posts from April - Hot Topics iPad Google Buzz
- Top 125 Workplace eLearning Posts of 2009
- Hot Topics in eLearning for 2009
And even though I subscribe to most of the blogs that are part of eLearning Learning, I still use the Best Of to make sure I’ve not been missing really good content.
By the way, if you want to know more about the site and/or see ways you could be involved, take a look at: Curator Editor Research Opportunities on eLearning Learning.
Garin - thanks for stepping up to help!
Text-to-Speech Overview and NLP Quality
This post is a new kind of thing for me. Dr. Joel Harband wrote most of this post and I worked with him on the focus, the content and a little bit of editing - actually I couldn't help myself and I edited this a lot. So this is really a combined effort at this point.
As you know, Text-to-Speech is something that's very interesting to me, and Joel knows a lot about it as CEO of Tuval Software Industries, maker of Speech-Over Professional. This software adds text-to-speech voice narration to PowerPoint presentations and is used for training and eLearning at major corporations.
Joel was nice enough to jump in and share his knowledge of applying text-to-speech technology to eLearning.
Please let me know if this kind of thing makes sense and maybe I'll do more of it. It certainly makes sense given all that's going on in my personal life.
Text-to-Speech Poised for Rapid Growth in eLearning
Text-to-speech (TTS) is now where virtual classrooms were about 4 years ago, when they reached the technological maturity that made them mainstream. It took a couple more years for me to say (in 2009) that virtual classrooms reached a tipping point.
Text-to-speech has reached the point of technical maturity. As such, we are standing at the threshold of a technology shift in our industry: text-to-speech voices are set to replace professional voice talents for adding voice narration in e-learning presentations. Text-to-speech can create professional voice narration without any recording, which provides significant advantages:
- keeps narrated presentations continuously up to date (it's too time consuming/expensive to rerecord human narration)
- faster development - streamlined workflow
- lower costs.
It's being adopted today in major corporations, but it's still early in the adoption cycle. That said, at a developer’s conference in 2004, Bill Gates made the statement that although speech technology was one of the most difficult areas, even partial advances can spawn successful applications. This is now the case for text-to-speech: it’s not yet perfect, but it is good enough for a whole class of applications, especially eLearning and training. The reason is that most people learn out of necessity and will accept a marginal reduction in naturalness as long as the speech is clear and intelligible.
There's a lot going on behind the scenes to make text-to-speech work in eLearning. Like most major innovations it needs to be accompanied by a slew of minor supporting innovations that make it practical, easy to use and effective: modulating the voice with speed, pitch and emphasis, adding silent delays, adding subtitles, pronouncing difficult words and coordinating voice with visuals.
Over the course of a few posts, we will attempt to bring readers up to speed on different aspects of this interesting and important subject. The focus of this post is the quality of text-to-speech as determined by natural language processing.
Text-to-speech Basics
To understand how to think about text-to-speech voices and how they compare, it's important to have some background about what they are. Text-to-speech (TTS) is the automatic production of spoken speech from any text input.
The quality criteria for Text-to-Speech Voices are pretty simple. They are:
- Naturalness
- Intelligibility
Due to recent improvements in processing speed, speech recognition and synthesis, and the availability of large text and speech databases for modeling, text-to-speech systems now exist that meet both criteria to an amazing degree.
A TTS voice is a computer program that has two major parts:
- a natural language processor (NLP) which reads the input text and translates it into a phonetic language and
- a digital signal processor (DSP) that converts the phonetic language into spoken speech.
Each of these parts has a specific role and by understanding a bit more about what they do, you can better evaluate quality of the result.
Natural Language Processor (NLP) and Quality
The natural language processor is what knows the rules of English grammar and word formation (morphology). The natural language processor is able to determine the part of speech of each word in the text and thus to determine its pronunciation. More precisely, here's what the natural language processor does:
- Expands abbreviations and similar shorthand to full text according to a dictionary.
- Determines all possible parts of speech for each word, according to its spelling (morphological analysis).
- Considers the words in context, which allows it to narrow down and determine the most probable part of speech of a word (contextual analysis).
- Translates the incoming text into a phonetic language, which specifies exactly how each word is to be pronounced (Letter-To-Sound (LTS) module).
- Assigns a “neutral” prosody based on division of the sentence into phrases.
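The middle steps above (candidate parts of speech, contextual analysis, then pronunciation) can be sketched in a few lines of Python. Everything here is a toy assumption for illustration: the two-word lexicon, the rough respellings, and the single context rule (a word is a verb at the start of a sentence or after "to", otherwise a noun). Real natural language processors use full dictionaries and statistical models.

```python
# Hypothetical mini-lexicon: each homograph lists its candidate parts of
# speech, with a rough respelled pronunciation for each one.
LEXICON = {
    "present": {"noun": "PREZ-uhnt", "verb": "pri-ZENT"},
    "record": {"noun": "REK-erd", "verb": "ri-KORD"},
}


def pronounce_homographs(sentence):
    """Return (word, part of speech, pronunciation) for each known homograph."""
    words = [w.strip('.,!?"').lower() for w in sentence.split()]
    results = []
    for i, word in enumerate(words):
        candidates = LEXICON.get(word)  # step 2: all possible parts of speech
        if candidates is None:
            continue
        # Step 3 (toy contextual rule): sentence-initial or after "to" -> verb.
        if i == 0 or words[i - 1] == "to":
            pos = "verb"
        else:
            pos = "noun"
        # Step 4: the chosen part of speech selects the pronunciation.
        results.append((word, pos, candidates[pos]))
    return results


for item in pronounce_homographs(
    "No time like the present to present this present to you."
):
    print(item)
```

For the test sentence this picks noun, verb, noun, which is the reading a listener expects. The rule breaks easily on other sentences, and that is the point: the quality of a voice depends on how well its contextual analysis copes with ambiguity.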
This will make more sense by going through examples. And this also provides a roadmap to test quality.
We’ll compare the quality of three TTS voices:
- Mike - a voice provided by Microsoft in Windows XP (old style).
- Paul - a voice by NeoSpeech (the voice used in Adobe Captivate).
- Heather - a voice by Acapela Group.
Actually, let me have them introduce themselves. Click on the link below to hear them:
- I'm Mike, an old style robotic voice provided by Microsoft in Windows XP.
- I'm Paul, a state of the art voice provided by NeoSpeech.
- I'm Heather, a state of the art voice provided by Acapela-Group.
So, let's put these voices through their paces to see how they do. Actually, in this section, we are going to be testing the natural language processor and its ability to resolve ambiguities of parts of speech in the text.
1. Ambiguity in noun and verb
“Present” can be a noun or a verb, depending on the context. Let’s see how the voices do with the sentence:
“No time like the present to present this present to you.”
- Mike
- Paul
- Heather
Paul and Heather resolve this ambiguity with ease.
Another example: “record” can be a noun or a verb:
“Record the record in record time.”
- Mike
- Paul
- Heather
Again, Paul and Heather resolve this ambiguity with ease.
2. Ambiguity in verb and adjective
The word “separate” can be a verb or an adjective.
“Separate the cards into separate piles”
- Mike
- Paul
- Heather
Only Paul gets it right.
3. Word Emphasis (Prosody)
Another type of ambiguity is word emphasis in a sentence. The intended meaning of a spoken sentence often depends on the word that is emphasized, as in “HE reads well”, “He READS well” and “He reads WELL”. This is called prosody, and it is impossible to determine from plain text alone. The voices aim for a “neutral” prosody that covers all possible meanings. A better way is to use modulation tags to directly emphasize a word. We’ll discuss that in a later post.
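As a rough preview of what a modulation tag can look like, here is a small Python sketch that wraps one word in an emphasis tag. It assumes the voice accepts SSML (the W3C Speech Synthesis Markup Language) and its `emphasis` element; that is my assumption for illustration, since engines differ in which tags they support. The function name is hypothetical, and the bare `<speak>` root leaves out the attributes a strict SSML document would carry.

```python
from xml.sax.saxutils import escape


def emphasize(sentence, target_index):
    """Wrap the word at `target_index` in an SSML emphasis tag."""
    parts = []
    for i, word in enumerate(sentence.split()):
        text = escape(word)  # keep characters like & and < valid inside the markup
        if i == target_index:
            text = f'<emphasis level="strong">{text}</emphasis>'
        parts.append(text)
    return "<speak>" + " ".join(parts) + "</speak>"


# One marked-up version per word, giving the three readings of the sentence.
for index in range(3):
    print(emphasize("He reads well", index))
```

Each output carries the same words but a different intended meaning, which is information the plain text cannot hold.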
4. Abbreviations
Most voices are equipped to translate common abbreviations.
The temperature was 30F, which is -1C.
It weighed 2 kg, which is about 4.5 lb.
Let's meet at 12:00
- Mike
- Paul
- Heather
Heather does the best job.
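The dictionary-based expansion behind these examples can be sketched as follows. This is a minimal illustration with a hypothetical four-entry unit table. It leaves the numbers as digits and ignores singular versus plural, both of which a real voice has to handle.

```python
import re

# Hypothetical unit dictionary: abbreviation -> spoken form.
UNITS = {
    "F": "degrees Fahrenheit",
    "C": "degrees Celsius",
    "kg": "kilograms",
    "lb": "pounds",
}

# A number, an optional space, then one of the known unit abbreviations.
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s?(kg|lb|F|C)\b")


def expand_units(text):
    """Replace forms like '30F' or '2 kg' with their spoken expansion."""
    return UNIT_PATTERN.sub(
        lambda m: f"{m.group(1)} {UNITS[m.group(2)]}", text
    )


print(expand_units("It weighed 2 kg, which is about 4.5 lb."))
```

The word boundary at the end of the pattern keeps it from touching ordinary words that merely start with one of the unit letters.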
5. Technical Words
Unless they are equipped with specialized dictionaries, TTS voices will occasionally fail to read technical words correctly. However, they can always be taught to say them correctly by using a phonetic language. Here are some examples. Each voice says the word twice: first by itself (incorrectly) and second after being taught (correctly).
Deoxyribonuclease (dee-ok-si-rahy-boh-noo-klee-ace)
- Mike
- Paul
Chymotrypsinogen (kahy-moh-trip-sin-uh-juh
- Mike
- Paul
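One simple way to picture "teaching" a voice is a user lexicon that is applied to the text before it reaches the voice. The Python sketch below is only an illustration: the lexicon and function names are hypothetical, and it substitutes the respelling shown above as a stand-in for whatever phonetic language a given engine actually uses.

```python
import re

# Hypothetical user lexicon: technical word -> respelling taken from the example above.
USER_LEXICON = {
    "deoxyribonuclease": "dee-ok-si-rahy-boh-noo-klee-ace",
}


def apply_lexicon(text, lexicon=USER_LEXICON):
    """Swap each word found in the lexicon for its taught pronunciation."""
    def swap(match):
        word = match.group(0)
        # Look the word up case-insensitively; leave unknown words untouched.
        return lexicon.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)


print(apply_lexicon("Deoxyribonuclease degrades DNA."))
```

Only words listed in the lexicon change, so the rest of the narration still goes through the voice's normal pronunciation rules.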