How long will it be before your computer looks into your eyes and begins to express realistic feelings? In theory, it could happen now, since virtually all modern Windows PCs have a voice synthesizer: a computerized voice that transforms written text into speech, primarily to help visually impaired people who cannot read small text on a screen. How exactly does speech synthesis work?
What is Google Voice Synthesis?
There is more than one voice synthesis system, and we have decided to pick the Google system as our example, but the working principle is more or less the same in all of them.
Computers do their jobs in three distinct stages called input (where you feed in information, often with a keyboard or mouse), processing (where the computer responds to your input, say by adding up some numbers you typed) and output (where you see how the computer processed your input, usually on a screen or printed on paper). Speech synthesis is simply a form of output, where a computer or other machine reads words aloud to you in a real or simulated voice played through a speaker. The technology is often called text-to-speech (TTS).
Talking machines are nothing new, but computers that routinely talk to their operators are still extremely unusual. It is true that we drive our cars with the aid of computerized navigators, hear computerized announcements on public transport, and listen to computerized apologies at railway stations when our trains are late. But almost none of us talk to our computers or sit waiting for them to respond. This may change in the future, as computer-generated speech becomes less robotic and more human.
How does the text-to-speech engine work?
Let’s say you have a paragraph of written text that you want your computer to speak out loud. How do you turn written words into sounds you can actually hear? There are essentially three stages involved, which we will call text to words, words to phonemes, and phonemes to sound.
Text to words
Reading words sounds easy, but if you’ve ever heard a young child reading a book that was too difficult for them, you’ll know it’s not as trivial as it seems. The main problem is that written text is ambiguous: the same written information can often mean more than one thing, and usually you have to understand the meaning, or make an educated guess, to read it correctly. The initial stage in speech synthesis, often called preprocessing or normalization, is therefore about reducing ambiguity: narrowing the many ways a text could be read down to the one that is most appropriate.
Preprocessing involves parsing the text and cleaning it up so that the computer makes fewer errors when it actually reads the words aloud. Things like numbers, dates, times, abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned into words, and that’s harder than it sounds. A number can refer to a quantity of items, a year, a time, or the combination of a safe, each of which is read somewhat differently. While humans follow the meaning of what is written and work out the pronunciation that way, computers generally do not have the power to do so, so they use statistical probability techniques or neural networks (computer programs structured to learn to recognize patterns) to guess the most likely pronunciation instead. If the word “year” occurs in the same sentence as a number, it may be reasonable to guess that the number is a date and pronounce it accordingly; if there were a decimal point before the digits, they would be read differently.
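The normalization step described above can be sketched in a few lines of Python. The abbreviation table, the digit words, and the digit-by-digit fallback here are invented purely for illustration; real TTS engines use far richer rule sets and statistical models:

```python
import re

# A minimal text-normalization sketch (the "preprocessing" stage).
# The expansion rules and word lists are illustrative assumptions,
# not how any real TTS engine works.

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Saint", "etc.": "et cetera"}

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spell_out_number(token: str) -> str:
    """Read a number digit by digit (crude, but unambiguous)."""
    return " ".join(DIGIT_WORDS[int(d)] for d in token if d.isdigit())

def normalize(text: str) -> str:
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d+", token):
            # Context decides how a number is really read; with no
            # context, fall back to spelling out the digits.
            words.append(spell_out_number(token))
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith arrived in 1969"))
# prints: Doctor Smith arrived in one nine six nine
```

A real normalizer would look at surrounding words (as in the “year” example above) before deciding whether 1969 is a date, a quantity, or something else.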
Preprocessing also has to handle homographs: words that are spelled the same but pronounced in different ways according to what they mean (such as “read” in “I will read” versus “I have read”). By examining the tense of the sentence, the context, and the words that precede and follow, the synthesizer can identify the best pronunciation for each of them.
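The idea of using neighboring words to pick a pronunciation can be shown with a toy example. The trigger words and the ARPAbet-style phoneme strings below are illustrative assumptions, not a real disambiguation model:

```python
# A toy homograph disambiguator: choose a pronunciation for "read"
# by looking at the word immediately before it, as described above.
# The rule list and phoneme spellings are invented for illustration.

def pronounce_read(sentence: str) -> str:
    words = sentence.lower().replace(".", "").split()
    i = words.index("read")
    # Future/infinitive markers before the word suggest present tense.
    if i > 0 and words[i - 1] in {"will", "to", "can", "must"}:
        return "R IY D"   # rhymes with "reed"
    return "R EH D"       # past tense, rhymes with "red"

print(pronounce_read("I will read the book"))   # R IY D
print(pronounce_read("I have read the book"))   # R EH D
```

Real systems use part-of-speech taggers and statistical models over much wider contexts, but the principle is the same.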
Words to phonemes
Having worked out the words that need to be said, the speech synthesizer now has to generate the speech sounds that make up those words. In theory, this is a simple problem: all the computer needs is a huge alphabetical list of words, with details of how to pronounce each one. For each word, we need a list of the phonemes that make up its sound.
If a computer has a dictionary of words and phonemes, all it has to do to read a word is look it up in the list and then read out the corresponding phonemes, right? In practice, it’s harder than it looks. As any good actor can demonstrate, a single sentence can be read in many different ways according to the meaning of the text, the person speaking, and the emotions they want to convey (in linguistics, this idea is known as prosody, and it is one of the hardest problems for speech synthesizers to address). Within a sentence, even a single word can be read in many ways because it has multiple meanings. And even within a word, a particular phoneme will sound different depending on the phonemes that come before and after it.
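The basic dictionary lookup is straightforward to sketch. The tiny dictionary below uses ARPAbet-style phoneme symbols, but the entries are simplified examples, not a real pronouncing dictionary:

```python
# A minimal sketch of dictionary-based word-to-phoneme lookup.
# The entries are illustrative, simplified ARPAbet-style spellings.

PRONOUNCING_DICT = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def words_to_phonemes(words):
    phonemes = []
    for word in words:
        entry = PRONOUNCING_DICT.get(word.lower())
        if entry is None:
            raise KeyError(f"'{word}' is not in the dictionary")
        phonemes.extend(entry)
    return phonemes

print(words_to_phonemes(["Hello", "world"]))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

Note the `KeyError` for unknown words: a pure dictionary approach fails on anything it has never seen, which motivates the rule-based alternative described next.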
An alternative approach involves breaking written words down into their graphemes (written component units, usually the individual letters or syllables that make up a word) and then generating the phonemes that correspond to them using a set of simple rules. This is a bit like a child trying to read words he or she has never encountered before. The advantage is that the computer can make a reasonable attempt at reading any word, whether or not it is a real word stored in the dictionary, a foreign word, or an unusual technical name or term. The disadvantage is that languages have a large number of irregular words that are pronounced very differently from the way they are written.
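A crude letter-to-sound sketch of this idea maps each grapheme to a phoneme using a fixed rule table, trying multi-letter graphemes first. The rule table is an invented miniature; real rule sets are far larger and context-sensitive:

```python
# A toy grapheme-to-phoneme converter. Multi-letter graphemes such as
# "sh" are listed first so they match before their single letters.
# The table is an invented illustration, not a real rule set.

RULES = [("ch", "CH"), ("sh", "SH"), ("th", "TH"),
         ("a", "AE"), ("e", "EH"), ("i", "IH"), ("o", "AA"), ("u", "AH"),
         ("b", "B"), ("d", "D"), ("f", "F"), ("g", "G"), ("k", "K"),
         ("l", "L"), ("m", "M"), ("n", "N"), ("p", "P"), ("r", "R"),
         ("s", "S"), ("t", "T"), ("v", "V"), ("z", "Z")]

def grapheme_to_phoneme(word):
    phonemes, i = [], 0
    word = word.lower()
    while i < len(word):
        for grapheme, phoneme in RULES:
            if word.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            i += 1   # skip letters the rules don't cover
    return phonemes

print(grapheme_to_phoneme("ship"))   # ['SH', 'IH', 'P']
```

Feed it an irregular word like “yacht” and the output will be wrong, which is exactly the weakness the paragraph above describes.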
Phonemes to sound
We have now converted our text (our sequence of written words) into a list of phonemes (a sequence of sounds to be spoken). But where do we get the basic phonemes that the computer reads aloud when it turns text into speech? There are three different approaches. One is to use recordings of humans saying the phonemes, another is for the computer to generate the phonemes itself by producing basic sound frequencies (rather like a music synthesizer), and a third is to imitate the mechanism of the human voice.
Voice synthesizers that use recorded human voices have to be preloaded with small snippets of human speech that can be rearranged. In other words, a programmer has to record many examples of a person saying different things, then break the spoken sentences into words and the words into phonemes. If there are enough speech samples, the computer can rearrange these bits in many different ways to create entirely new words and phrases. This type of speech synthesis is called concatenation (from Latin words that simply mean linking bits together in a series or chain). Since it is based on human recordings, concatenation is the most natural-sounding type of speech synthesis, and it is widely used by machines that have only a limited range of things to say (for example, automated corporate phone menus).
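The core of concatenative synthesis can be sketched as joining stored waveforms end to end. The “recordings” below are just short made-up lists of samples; real systems store thousands of actual recordings and smooth the joins between them, none of which is attempted here:

```python
# A sketch of concatenative synthesis: pre-recorded phoneme waveforms
# (here, placeholder lists of samples) are joined end to end to build
# new words. The sample values are invented for illustration.

RECORDED_PHONEMES = {
    "K":  [0.1, 0.3, -0.2],
    "AE": [0.5, 0.4, 0.3, 0.2],
    "T":  [-0.1, 0.2],
}

def concatenate(phonemes):
    """Join the stored waveform for each phoneme into one signal."""
    signal = []
    for p in phonemes:
        signal.extend(RECORDED_PHONEMES[p])
    return signal

samples = concatenate(["K", "AE", "T"])   # the word "cat"
print(len(samples))   # 9
```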
If you consider that speech is just a pattern of sound that varies in pitch (frequency) and volume (amplitude), like the noise coming out of a musical instrument, it should be possible to make an electronic device that can generate any kind of voice. This type of speech synthesis is known as formant synthesis, because formants are the three to five key (resonant) sound frequencies that the human vocal apparatus generates and combines to produce the sound of speech or singing. Unlike concatenative synthesizers, which simply rearrange pre-recorded sounds, formant speech synthesizers can say absolutely anything, even words that do not exist or foreign words they have never encountered. This makes formant synthesis a good choice for GPS (navigation) computers, which need to be able to read out many thousands of names of different (and often uncommon) places that would be hard to store as recordings. In theory, formant synthesizers can easily switch from a male voice to a female voice (by roughly doubling the frequency) or to a child’s voice (by tripling it), and they can speak in any language. In practice, concatenative synthesizers now use huge libraries of sounds, so they too can say just about anything. A more obvious difference is that concatenative synthesizers sound much more natural than formant ones, which still tend to seem relatively artificial and robotic.
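The principle of formant synthesis, building a vowel-like sound by adding sine waves at a few resonant frequencies, can be sketched directly. The formant frequencies below are typical published values for an /a/-like vowel, but the amplitudes, duration, and sample rate are arbitrary choices for illustration:

```python
import math

# A sketch of formant synthesis: sum sine waves at a few resonant
# (formant) frequencies to approximate a vowel. Amplitudes and
# duration are arbitrary illustrative choices.

SAMPLE_RATE = 8000                                   # samples per second
FORMANTS = [(730, 1.0), (1090, 0.5), (2440, 0.25)]   # (Hz, amplitude)

def synthesize_vowel(duration=0.1):
    n = int(SAMPLE_RATE * duration)
    total_amp = sum(a for _, a in FORMANTS)
    samples = []
    for i in range(n):
        t = i / SAMPLE_RATE
        s = sum(a * math.sin(2 * math.pi * f * t) for f, a in FORMANTS)
        samples.append(s / total_amp)   # keep samples within [-1, 1]
    return samples

wave = synthesize_vowel()
print(len(wave))                             # 800
print(all(-1.0 <= s <= 1.0 for s in wave))   # True
```

Doubling or tripling the formant frequencies is, roughly, the “female voice” and “child’s voice” trick mentioned above.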
The most complex approach to generating sounds is called articulatory synthesis, and it means making computers speak by modeling the incredibly complex human vocal apparatus. In theory, this should give the most realistic and human-sounding voice of the three methods. Although numerous researchers have experimented with mimicking the human voice box, articulatory synthesis is still by far the least explored method, largely because of its complexity. The most elaborate form of articulatory synthesis would be to engineer a “talking head” robot with a moving mouth that produces sound in a manner similar to a person, combining mechanical, electrical, and electronic components as needed.
What are voice synthesis and text readers used for?
On a typical day you can hear all kinds of recorded voices, but as the technology advances, it becomes increasingly difficult to tell whether you are listening to a simple recording or a speech synthesizer. You may have an alarm clock that wakes you up by speaking the time, probably using crude formant speech synthesis.
If you have a talking GPS system in your car, it may use concatenative speech synthesis (if it has only a relatively limited vocabulary) or a formant synthesizer (if the voice is adjustable and can read out place names).
If you have an ebook reader, it may have a built-in read-aloud feature. If you are visually impaired, you can use a screen reader that speaks the words on your computer screen out loud (most modern Windows computers have a program called Narrator that you can activate to do just that).
Whether you use it or not, your cell phone is likely to be able to listen to your questions and respond via a smart personal assistant such as Siri (iPhone), Cortana (Microsoft) or Google Assistant / Now (Android).
If you use public transportation, you will hear recorded voices all the time, making security announcements or telling you which trains and buses are coming next. Are they simple recordings of humans, or are they concatenated, synthesized speech? See if you can figure it out!
A really interesting use of speech synthesis is in teaching foreign languages. Speech synthesizers are now so realistic that they are good enough for language learners to use for practice.