State of Speech Synthesis and Text-To-Speech?
Gnulix asks: "Are there any products, preferably open source, that produce realistic speech from arbitrary (English) text? Projects such as Festival don't sound all that much better than SAM (Software Automatic Mouth) did on a Commodore 64 back in 1982, nor do SoftVoice's or IBM's new products sound very good. I mean, we all know that Stephen Hawking is a fun-loving guy, but I bet you that he didn't choose his unrealistic, robotic voice just for the heck of it. With all the amazing advances we have seen in real-time graphics, shouldn't speech synthesis have come much, much further than what is, seemingly, available today?" Ask Slashdot last handled the Voice-To-Text issue in January of this year.
AT&T Natural Voices (Score:5, Informative)
Check out http://www.naturalvoices.att.com/
Re:AT&T Natural Voices (Score:4, Informative)
I don't know about Open Source TTS, but the commercial versions (AT&T, Speechworks, and others) are sitting on the threshold of truly natural speech. I work in the speech industry, so I follow progress and have seen some of the unreleased demos of upcoming versions. In the next couple years, we can expect amazing things. It won't be long before the Speechify Challenge will truly be impossible to beat.
By the way, for those of you who don't know, the newest and best-sounding engines don't use purely synthesized sounds as older and small-footprint engines do (Festival and Stephen Hawking's synthesizer). The engines are built using actual recordings: a "voice actor" will sit in a studio and record dozens of hours of speech, and then, over the course of several months, the recordings are cut and spliced into individual phonemes, which are reassembled by the engine. This means that the voices actually sound like real people, and the only unrealistic part is the inflection when generating complete sentences. You can order custom voices (for several tens of thousands of dollars) and get a voice that sounds identical to that of your celebrity of choice.
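To make the cut-and-splice idea concrete, here is a minimal sketch in Python. It assumes a toy inventory in which each phoneme label maps to one short list of samples; the names `UNITS`, `crossfade`, and `concatenate` and all sample values are hypothetical, and a real engine would store many recorded variants per unit and choose among them.

```python
from typing import Dict, List, Sequence

# Hypothetical inventory: phoneme label -> recorded samples (toy values, not real audio).
UNITS: Dict[str, List[float]] = {
    "HH": [0.0, 0.2, 0.4, 0.2],
    "AY": [0.2, 0.6, 0.8, 0.6, 0.2],
}


def crossfade(left: List[float], right: List[float], overlap: int) -> List[float]:
    """Join two sample lists, linearly blending `overlap` samples at the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    head = left[:len(left) - overlap]
    tail = right[overlap:]
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming unit
        a = left[len(left) - overlap + i]
        b = right[i]
        blended.append((1 - w) * a + w * b)
    return head + blended + tail


def concatenate(phonemes: Sequence[str],
                units: Dict[str, List[float]] = UNITS,
                overlap: int = 2) -> List[float]:
    """Look up each phoneme's recording and splice the recordings in order."""
    out: List[float] = []
    for p in phonemes:
        out = crossfade(out, units[p], overlap)
    return out
```

Calling `concatenate(["HH", "AY"])` splices the two toy units with a two-sample blend at the joint. The blend is the simplest possible smoothing; the unnatural inflection mentioned above comes from everything this sketch leaves out, such as pitch and duration control across a whole sentence.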
Re:AT&T Natural Voices (Score:3, Informative)
Re:AT&T Natural Voices (Score:1)
Re:AT&T Natural Voices (Score:2)
Re:AT&T Natural Voices (Score:2)
Okay, so I just made fun of you. Actually, I still agree with what you said. It would take a LOT of talent that is really interested in text-to-speech for this to happen. But really, I think the pay-offs to society would make it time and energy very well spent.
Re:AT&T Natural Voices (Score:1)
Re:AT&T Natural Voices (Score:1)
Now, if someone wants to go and prove me wrong, then go for it! You will do the Open Source community a great service. Of course, there's more to the TTS than splicing recordings, so good luck with that too. Anyway, I hope I am wrong, but I'll believe it when I see it.
Tagging and marking voice corpus data (Score:1)
Yes, this operation is pretty boring, but so is putting together GUI installers, or building unit tests, or writing API documentation -- boring, but important. Interestingly, there are a large number of undergrad/grad Linguistics students who do this sort of thing in college laboratories all the time. They're not CS students, though -- they're Linguistics (or Psychology, or Cognitive Science, or whatever wacky department their field was attached to) students.
Someone with a CS professorship, an `open science' bent, and enough funding for 2-3 undergrads (probably $10k each, once you include `overhead') needs to find his or her local Linguistics department and make the world a better place.
Please?
Re:Tagging and marking voice corpus data (Score:1)
sigh.
Re:AT&T Natural Voices (Score:1)
Oops - sorry. That was sarcastic, wasn't it?
Re:AT&T Natural Voices (Score:1)
If you want to prove me wrong, then more power to you!
Re:AT&T Natural Voices (Score:2)
Related ? (Score:2, Insightful)
Look at the older article, it's a completely different question.
Re: (Score:1)
Hawking... (Score:5, Interesting)
He does relate, however, in A Brief History of Time, that at first people had trouble understanding "his voice", so that when he would speak or answer questions at lectures, he would have an interpreter who was more familiar with his voice repeat what he just said.
Interesting stuff...
Re:Hawking... (Score:4, Funny)
He declined, saying he and his friends had gotten used to the voice, and it was "his".
Not to mention the legions of fans who follow his side-career as a gangsta rapper [mchawking.com] with due vigor! Changing his voice would give his music a very different sound!
GMD
Re:Hawking... (Score:3, Funny)
They've got an MP3 [imarc.net] relating Mr. Hawking's experience playing Grand Theft Auto 3. Quite funny. Another about his Drivebys [imarc.net]. Find more here. [mchawking.com]
Re:Hawking... (Score:1)
Re:Hawking... (Score:1)
Tough One... (Score:2, Funny)
If it existed, it'd be in government (though the Bush model is obviously a pre-alpha leak...)
AT&T Labs Research (Score:2, Informative)
Check out the National Weather Service (Score:3, Informative)
The National Weather Service describes their new system. [205.156.54.206]
Re:Check out the National Weather Service (Score:1)
Re:Check out the National Weather Service (Score:1)
And yes, it's pretty natural-sounding.
Apple, and MS (Score:3, Insightful)
MS has had text-to-speech as an object you can embed in your program with one line of VB code (same as you can embed IE) for a while now.
Apple has had text-to-speech extensions in tons of different voices for a long time. Some of the G4s used to read dialog boxes to you by default if you didn't click on them fast enough. Pretty unnerving the first couple of times.
Several voice activated automated attendant systems I have called for my credit card and bank are amazing these days. They have insanely accurate speech recognition and really good text-to-speech.
So I wouldn't say the field is not advancing... it is.
Of course, a Google search for "open source text to speech" without quotes yields many promising-looking hits, which I haven't evaluated. Why didn't you search there before asking Slashdot?
Re:Apple, and MS (Score:2)
Re:Apple, and MS (Score:1)
I can't remember the exact command, but I think it was:
echo Hello there | festival --tts
Unfortunately, it always took a while to start...
Re:Apple, and MS (Score:1)
Oh, I didn't? Thanks for telling me! I thought I did and I'd probably have had to Ask Slashdot to tell me that I didn't...
The larger issue is NLP (Score:5, Interesting)
Nor, to harp on my pet peeve, do we have a theory of semantics that can put XML to any important use on the average webpage. These all need a model of the human psyche, because all human language is flavored with metaphors from the realm of motives and plans, etc (the psychological realm). Psychological science isn't delivering the sorts of models that NLP-etc need, and probably won't for many decades yet. [My AI FAQ] [robotwisdom.com]
I have an interest in this (Score:3, Informative)
For voice control stuff I found a little program called cvoicecontrol to be quite nice.
AT&T has done a lot (Score:3, Informative)
http://www.naturalvoices.com/
quite a big step in the right direction in my opinion.
Re:is linux insecure? (Score:1)
I hope to have my EAL4 certified clay brick later this year.
Actually, graphics hasn't come 'that far' (Score:4, Insightful)
We haven't had that many amazing advances in graphics. Natural speech is to advanced raytracing what current text to speech is to current graphics. We still cannot raytrace in a single system in real time at the resolution of our eyes, and we still cannot produce natural speech in a single system in real time at the resolution of our ears.
Furthermore, we know less about the math of speech than we know about the math of light. Go visit a local university that has a good CS program, and browse the bookstore for the books used to teach speech recognition. In those books you will find that the average sound a human makes starts as a complex, multitonal sound produced by the vocal cords and passes through as many as five complex natural filters (body cavities between the vocal cords and lips) before it reaches the ears of the recipient.
Modeling these filters for one sound is hard enough. Each letter in our alphabet, except simple vowels, changes the filters throughout the letter. Furthermore the filters for a given letter may also change depending on the previous and next letter.
A system to create speech, therefore, must generate hundreds (perhaps thousands) of different filtered 'noises' just to reproduce the English language. Other languages can be much more complex.
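The source-and-filters picture described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source is a crude impulse train rather than a realistic vocal-cord waveform, each cavity is stood in for by a fixed two-pole resonator, and the function names and the formant frequencies and bandwidths in `AH` are assumed values chosen for the example. It uses three filters rather than the five mentioned, and it ignores the way the filters change within and between letters.

```python
import math


def impulse_train(f0, sample_rate, n):
    """Crude vocal-cord source: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole resonant filter standing in for one vocal-tract cavity."""
    r = math.exp(-math.pi * bandwidth / sample_rate)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sample_rate)
    a2 = -r * r
    gain = 1.0 - a1 - a2  # normalizes the filter's gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, f0=120.0, sample_rate=16000, duration=0.25):
    """Pass the source through each resonator in series."""
    n = int(sample_rate * duration)
    signal = impulse_train(f0, sample_rate, n)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    return signal


# Assumed (frequency Hz, bandwidth Hz) pairs for an "ah"-like vowel.
AH = [(730.0, 130.0), (1090.0, 70.0), (2440.0, 160.0)]
samples = synthesize_vowel(AH)
```

Even this static vowel needs six filter parameters plus a pitch. A consonant would need those parameters to move over time, which is where the hundreds or thousands of distinct filtered sounds come from.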
Current common technology is to simply record the hundreds of 'simple' sounds and add them together. Really good programs use hundreds of hours of speech by voice actors to get several hundred sounds.
The ultimate goal is to mathematically recreate every part of the human vocal system from the lungs to the lips. This has obviously not occurred. The computers may well be powerful enough, but the understanding of the vocal tract is extremely limited.
In other words, wait 5-10 years. There still isn't a killer application for text to speech, but with devices getting smaller and smaller, there will be soon enough.
-Adam
Re:Actually, graphics hasn't come 'that far' (Score:5, Funny)
TTS Synthesizers (Score:3, Informative)
DECTalk [fonix.com] (One of the most widely used)
Eloquent (http://www.eloq.com - dead URL?) (fairly natural-sounding with dialects)
Elan [elantts.com] (European languages)
They've all been improving over the years.
VoiceXML (Score:2)
Call me picky if you will.. (Score:1, Informative)
State of the art in TTS (Score:3, Informative)
Festival with MBrola (Score:2, Informative)
In particular, there is one high-res female voice in MBrola that is very good. If you need any help setting it up (I can happily give you my festival config file) just mail me at netgrok @at@ yahoo . de
That said, I think text output is very underrated technology and is quite useful, if used in moderation for the right purposes. One sometimes reads overexcited hyping about reading your emails out loud in the car or at breakfast, but that ain't gonna happen with current technology.
For one, the synthesis is a bit monotonous for long texts (but then, now that I think about it, having your SO or kids read a letter out loud would probably not be any better...)
Secondly, you do not necessarily want a user interface where a computer reads out things the whole time (logs, for instance) because a) it's annoying and b) it will not work in an office with multiple people.
Where it DOES work and is trivial to implement is for things that are singular events that occur during the day or alarms. Similar to the sort of PA announcements that you would get in a department store. They do not read aloud the whole time, do they?
In our case we use festival with Mbrola for two things:
"The backup will start at 4 this afternoon" is the announcement I just heard. Sometimes I add "Please insert the tape, pleeeeease", just to hear the damn computer beg ME for a change
Also, if I FORGET to insert the tape the computer starts moaning continuously about it. Nothing like a whining b.tch to convince you to get off your butt to put the tape in
It is also trivial to insert this into a standard sysv start/stop script at boot time so that you have some tag when critical servers are shut down for some reason.
Costs for this setup? 2 hours of install time (installing MBrola on festival took some digging through the docs. If you need it, mail me). Writing a script that does "say x" instead of "echo x" took about 2 minutes. And putting these commands into the cron job took about 10. So for about 3 hours' worth of time and a set of very cheapo computer speakers you get a good, useful, functional system which works, and the voice is very, very good. This is pretty neat for critical or semi-critical announcement kind of events, not continuous interaction.
Since the only command I use to activate it is "say x", calling it from Unix shell scripts in your current setup is trivial.
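A "say x" helper of this kind might look like the following Python sketch. It is not the poster's script, which is not shown in the thread; it assumes only that the TTS program reads text on standard input, as `festival --tts` does in the `echo Hello there | festival --tts` example quoted earlier in the discussion. The function name `say`, the `command` parameter, and the 127 return code for a missing program are choices made for this illustration.

```python
import subprocess
import sys


def say(text, command=("festival", "--tts")):
    """Pipe `text` to a TTS command on stdin and return its exit status."""
    try:
        result = subprocess.run(list(command), input=text.encode("utf-8"))
    except FileNotFoundError:
        # The TTS program is not installed or not on PATH.
        print(f"say: {command[0]} not found", file=sys.stderr)
        return 127
    return result.returncode


# Example call, e.g. from a job scheduled before the backup:
#   say("The backup will start at 4 this afternoon")
```

Keeping the command as a parameter means the same helper works with whichever engine or voice configuration is installed, and a cron job or start/stop script only ever needs the one call.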
Btw, the MBrola website has a demo of a german voice reading a weather and traffic report which is even better than the English one.
Oh yeah, it was fun to watch the cleaning lady almost get a heart attack when the computer greeted her...
Winbond text to speech chip (Score:1)
Devantech (a small company in England that makes boards for robot builders) has built a little board around it that can hold 30 canned phrases and do text-to-speech over RS-232 or I2C. Info on their board can be found at http://www.robot-electronics.co.uk/shop/Speech_Sy
The board goes for about $83 from US distributors
http://www.acroname.com/robotics/parts/R184-SP0
While this might be a bit expensive, this guy is making small quantities, and this is designed to be run from a small robot driven by a microcontroller, not a full computer.
Speech (Score:2)
Re: (Score:1)