
Ask Slashdot: Effective, Reasonably Priced Conferencing Speech-to-Text?
First time accepted submitter DeafScribe writes "Every year during the holidays, many people in the deaf community lament the annual family gathering ritual because it means they sit around bored while watching relatives jabber. This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving. It would've been nice if conference-level speech-to-text had been available this evening for the family dinner. So how about it? Is group speech to text good enough now, and available at reasonable cost for a family dinner scenario?"
There isn't any... (Score:2)
That said, there are companies working on it, like Dragon, but they're not there yet, and when they get closer, they won't be cheap.
TL;DR If it existed we'd be using it already.
Re:There isn't any... (Score:5, Insightful)
Based on what I get on my TV when I press the Mute button, they really shouldn't be...
Re: (Score:2)
No kidding. I'm always amazed at the low quality of the captioning: poor accuracy, misspellings, and typos. News broadcasts seem to be the worst, presumably because they're live, but even canned content is often poorly done.
Re: (Score:2)
Re: (Score:2)
That's because your pinkie is the easiest finger to make a mistake with, and the semicolon is right under your ;pinkie. I do it all the time.
Speed is more im;portant than accuarcy.
Court stenographers are pretty good (Score:3)
I've sat in trials in federal courts a few times. You can watch the stenographer type the transcript on a monitor in front of you, and it's much better than the TV captions.
These were pharmaceutical patent cases and FDA litigation, mostly technical stuff, chemists being cross-examined.
They had a system with a court stenographer typing into a computerized stenotype machine, and the judge and both parties watching the result on monitors.
I was last in court a few years ago, but I don't think it's changed much.
Re: (Score:2)
The thing is that in TV, you don't need a full-time stenographer. You need at most someone for broadcasts without transcripts, and even the talking heads on news programs are often working from scripts on teleprompters. Probably the cheapest thing they could do is hire a person who is training to become a stenographer. They get real-world experience for some pay and the broadcaster gets some semblance of professionalism. In most areas you are talking about four hours of live broadcast a day a
Re: (Score:2)
Re: (Score:2)
I don't like this idea of, "Hire a trainee to work cheap."
Some of the show is scripted, but when I watch a news program, the most interesting thing is the on-air comments by the invited guests, and the back-and-forth debate. How is Newt Gingrich (or whomever) going to talk his way out of this one? That's also the most difficult thing to get down (as I know from often taking notes on panel debates).
I think a trainee would make so many mistakes that it would cost more to go back and correct the mistakes than
Re: (Score:2)
You could always hire a few of them and run the results through a differential analysis program so that it uses the parts that match between the transcribers when in doubt.
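A crude version of that differential approach can be sketched as a word-level majority vote. This is just an illustration under a big simplifying assumption: the transcripts are already word-aligned and equal length, which a real system would need a dynamic-programming alignment pass to guarantee.

```python
from collections import Counter

def merge_transcripts(transcripts):
    """At each word position, keep the word most transcribers agree on.

    Assumes all transcripts are word-aligned and the same length -- a
    real system would need an alignment step (edit-distance style)
    before voting.
    """
    columns = zip(*(t.split() for t in transcripts))
    return " ".join(Counter(col).most_common(1)[0][0] for col in columns)
```

With three transcribers, a single mishearing by one of them is outvoted by the other two.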
Re: (Score:2)
Most of the time when you view closed captions, it is typed up, not automatically transcribed by a computer program - link [wikipedia.org]
Further, for live events, it is typically typed live by a stenographer, which explains the inherent delay
As for errors, I personally have mostly seen errors when I'm watching over the air and the reception isn't very clear (though I don't often use closed captioning so my sample size is limited)
Re:There isn't any... (Score:4, Interesting)
Take ten or so friends to a restaurant. It can be that you're the only patrons there so it's relatively quiet, that's fine. Seat everyone along two sides of a long table, and put a person at each end. Seat yourself in the middle of one of the long sides. Now, as your party is served, attempt to pay attention to all of the conversation going on among the friends. You'll probably find that the friends break into three or four distinct conversations, with some people floating between conversations depending on what's being talked about. Now, in turn, try to focus on or participate in every distinct conversation at the table.
Even for someone with good hearing, this is a difficult task. With as few as four people it's possible to have two distinct conversations going on in parallel, and with six people it's almost guaranteed that there will be at least some moments with two simultaneous conversations.
Unless a family operates their dinners with parliamentary rules for who has the floor, it would be almost impossible for software to successfully monitor and differentiate so many speakers, even if the hardware were ideally installed so that each speaker could be sampled individually. Fully able-bodied humans with years of experience at sorting through the chatter still struggle with this; I don't see how software is going to make it work. I also don't see how a hearing-impaired individual is going to be able to read fast enough to keep up with that many simultaneous conversations and really enjoy the experience, while eating.
Re:There isn't any... (Score:4, Interesting)
More fun audio tricks:
Look at each person speaking as you're trying to listen to them, then look at someone else while still trying to follow the same person's conversation. Usually, a few moments after looking away, you'll find the other conversations are more distracting. Your brain is trying to match up what you're hearing with the mouths moving in front of you.
Also, put in one earplug and close your eyes, so you lose spatial awareness. Again, the voices will blend together much more. That's because the brain also uses spatial cues (visual placement, stereo hearing) to separate sound sources.
Re: (Score:2)
Re: (Score:2)
There's a reason why radio personalities often sound similar: they have good diction to excess. Even shock-jocks and others who seem counter-culture or edgy must have excellent verbal communication skills in order to work in that industry; they have to overcome the problem of communicating when the listener cannot read their lips. Vowel sounds and other 'long' sounds are easy to make out, but the staccato consonants are what give meaning to the sounds, and lots of lazy speakers u
Re: (Score:2)
Counterargument: I don't know much about audio analysis, but imagine you have two people talking with different-pitched voices. I wouldn't be at all surprised if a computer, via some FFT and frequency analysis or something, would be able to do a far better job at separating them than a person could. Actually, I'd be a bit surprised if that wasn't true.
Things become more difficult of course with similar-sounding voices, but I still suspect there's a fair bit of potential. There have been a lot of things peop
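The parent's FFT idea can be sketched in a few lines. This is a toy example on synthetic sine-wave "voices" at different pitches, not real speech (real overlapping speech has moving pitch and overlapping harmonics, so actual separation is far harder):

```python
import numpy as np

def dominant_pitch(signal, fs):
    """Return the strongest frequency component of a signal, via FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

def isolate_band(signal, fs, lo, hi):
    """Crude frequency-domain separation: zero all bins outside [lo, hi] Hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs > hi)] = 0
    return np.fft.irfft(spectrum, n=len(signal))
```

Mixing a 120 Hz tone with a 210 Hz tone and band-passing around 210 Hz recovers the higher "voice", which is the essence of the frequency-analysis argument.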
Re: (Score:2)
A microphone array and some DSP should have no trouble distinguishing between sound sources at distinct points in space, in addition to removing background noise.
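For illustration, the simplest mic-array technique is a delay-and-sum beamformer. This is a minimal sketch that assumes the per-mic delays (in samples) are already known from the array geometry:

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Shift each mic channel back by its known delay, then average.

    A source in the steered direction adds coherently across channels;
    sources elsewhere (and uncorrelated noise) partially cancel.
    Note: np.roll wraps around -- a real implementation would zero-pad
    and support fractional delays.
    """
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays_samples)]
    return np.mean(aligned, axis=0)
```

Steering toward a speaker is then just choosing the delay set that matches that speaker's position.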
Re: (Score:2)
Re:There isn't any... (Score:5, Insightful)
There's no perfect solution, but something that works for 60% might already be better than nothing.
I work in the closed captioning industry, and I'd say anything less than 95% accuracy is actually WORSE than nothing. Automatic Speech Recognition (ASR) has no concept of context or situational awareness. The mistakes it makes tend to be not in the simple common words and phrases, but concentrated in the nouns, especially proper nouns: names of people, places, companies, products, etc. Even at 80% accuracy, which is quite good for the current best speaker-independent ASR systems, you're looking at 2 words out of every 10 being substituted with the wrong word, completely changing the meaning of the phrases. Imagine the chaos if (major news network)'s closed captioning reported some celebrity or politician as saying "I'm not a fan of Jews." when they actually said "I'm not a fan of juice." (Which would be 83% accurate!) Wars have been started over one misheard word out of a thousand; imagine how bad 200 out of 1000 would be.
Here's an article about a HUMAN transcription error that caused a pretty major ruckus. Now imagine this kind of problem being an order of magnitude worse:
http://www.people.com/people/article/0,,20693447,00.html [people.com]
People who lost hearing later in life tend to do better with high error rate ASR because they know what words sound like and can figure out easy substitutions, e.g. Juice vs. Jews, Election vs. Erection, etc., but people who were born deaf or lost hearing before language acquisition cannot easily make these substitutions in their head because they don't "hear" the word sounds when they read them.
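The accuracy figures above can be made concrete with the standard word error rate (WER) metric, which is just edit distance computed over words. A small sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with the classic dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

The juice/Jews example is one substitution in six words, i.e. about 17% WER (83% accurate), which is exactly why a numerically "good" score can still invert the meaning.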
Re: (Score:2)
I can use automatic translation to read foreign web sites. I notice when the translation is weird and most probably wrong, but still I more or less understand what is being discussed. How on earth can that be worse than absolutely nothing?
Re: (Score:1)
Re: (Score:2)
On the other hand, conversations between the relatives at Christmas have massive amounts of context - we *know* Uncle Joe, who has been sending money to the Zionist Freedom Front ever since his older sister was liberated from Auschwitz, couldn't possibly have said "I'm not a fan of the Jews", whereas we don't have the personal-level context needed to decide whether some idiot celebrity or politician might or might not have said "I'm not a fan of the Jews". So for the purpose of the original poster, an 80% accu
Re: (Score:1)
Oh really?
Original sentence: After the party, your mom and I swept up together.
Translated sentence: After the party, your mom and I slept together.
Now the translated sentence is well over 60% accurate, but do you see how the meaning has completely changed?
Re:There isn't any... (Score:5, Informative)
No, but to expand on your car analogy, they have to be able to meet certain minimum standards and customer requirements.
And dropping out of analogy, the hypothetical courtroom automatic stenographer would probably have it easiest, as the rules of the court dictate that only one person may speak at a time, and most courts have individual microphones for every speaking party for acoustically recording the proceedings anyway. The same cannot be said for the dinner table.
Even the most rudimentary system for sampling several participants would cost hundreds of dollars. A half-way accurate comparison would be the equipment needed to record a drum-set, with individual microphones for each drum, cymbal, and accessory, and a processor that monitors line-levels and individually records each input separately. Replace the function of recording each input and turn it into processing each input for discrete words, and only then are you even getting to the hard part, interpreting what the sounds actually are.
The low-end equipment to record drums is hundreds of dollars. High end equipment to do the same thing costs thousands of dollars. Now tack on the cost of the processing side, and you're probably at tens of thousands of dollars. Just to attempt to participate in a large group conversation as opposed to small-party conversation where polite participants will probably work to simplify the flow of conversation to allow the impaired individual to participate.
A friend of mine in a social club has a son with some form of developmental disability. I've heard that it's Aspergers, but I'm not entirely certain as many of the traits commonly associated with Aspergers don't seem to manifest with him. When he's party to our conversations we modify our conversation to accommodate him. We attempt to avoid speaking over each other or over him, and we increase the amount of time that one considers a pause by a given speaker, so that we don't interrupt him while he's talking.
If we had a substantially hearing-impaired member, we would probably modify our conversations accordingly, slowing our speech enough that lips could be read, attempting to avoid talking over each other, and attempting to keep our faces oriented to where the individual could see those faces. Given the nature of our vocabulary in this social setting (a speculative fiction group) it would be highly unlikely that a speech-to-text system would correctly interpret any of the truly important words in the conversation anyway, so such a system would be useless.
Re: (Score:2)
It hasn't advanced a bit in the 15 years since IBM advertised ViaVoice. Unless you're the person a particular speech recognition tool was tailored for, or someone with almost identical speech, all you get is nonsense poetry that vaguely resembles the rhythm and rhyme of what you said.
Re: (Score:3)
Is Dragon actually still working on it? The original authors who were the true R&D geniuses behind the technology were locked out of improving the product (and pretty much the industry) because of copyright after the botched sale of their company [wikipedia.org]. The recent demonstrations of the software I've seen look like not much has improved since 2000 except computers have gotten faster.
Re: (Score:2)
Is Dragon actually still working on it? The original authors who were the true R&D geniuses behind the technology were locked out of improving the product (and pretty much the industry) because of copyright after the botched sale of their company [wikipedia.org]. The recent demonstrations of the software I've seen look like not much has improved since 2000 except computers have gotten faster.
I doubt you have actually been paying attention or used speech recognition. It has gotten so much better in the last fourteen years. Back then, it was basically worthless, and now accuracy really isn't any more of an issue than it is with a human listener, even with various foreign accents speaking English.
Re: (Score:2)
There are arrays of microphones installed in indoor arenas that can isolate the words of individuals in audiences of tens of thousands. One assumes the audio stream can be split to allow isolating more than one person simultaneously.
So it isn't impossible, tho it may require a lot of DSP equipment.
This.
All the people going on about putting a mic on each person, or all around the table, or whatever are way off. Those mics will pick up everyone else's voice as well, and speech-to-text capability goes way down with extra voices at even low volumes. Phased arrays are where it's at -- with enough mics, you can do a much better job at getting the voice you want, and at nulling out any other concurrent speakers -- and of course, as you say, you can use the same mic array to simultaneously monitor several sp
captions (Score:5, Insightful)
Good luck.
Re: (Score:2)
Youtube is far from the best speech-to-text technology available. The best STT technology is probably owned by the NSA or companies that work with them. Part of the secret sauce to good STT is voice training and speaker recognition, which I don't believe youtube's STT is capable of yet. As far as Youtube is concerned, it's only one person talking throughout each video, so when you have a French dude speaking one sentence, followed by an Irish woman the next sentence, youtube may not dynamically adapt to
"Listnote" for android (Score:2)
The main gap I see is that this is really only practical for one-on-one conversations; group settings will require exponentially more sophistication to identify and differentiate between speakers.
I would love to say that loved ones should fucking learn their child/parent/frie
Re: (Score:1)
Now you've got to learn Mandarin or Cantonese or shunt her out of your "loved-ones-group".
Re: (Score:2)
Re: (Score:2)
ASL is my son's first language, and there are plenty of people in his life who refuse to learn to speak with him.
If it's a hard sell, there's probably a reason for that.
Technical solutions are the focus of this article.
ASL to text is drastically harder problem, but it appears to be under development [ndtv.com].
Text to ASL is starting to be available [google.com] but is probably only useful for people too young to read (showing them the text would be quicker if they could read). However, it might serve as a teaching aid for others learning ASL.
The deaf seldom speak clearly enough for any speech recognition to work. Siri and Android speech recogni
Holiday rituals (Score:5, Funny)
Turns out being able to hear doesn't actually help here.
Re: (Score:1)
Turns out being able to hear doesn't actually help here.
Amusing, yes, but we have a duty as a society to help our vulnerable. You may have the gift of hearing or sight today, but tomorrow anything can happen. We're not just helping them with this technology, we're helping ourselves.
As well, some people have auditory processing disorders that make groups of people essentially unintelligible. I happen to be one of them -- I can hear and see just fine but I cannot separate individual words or conversations in a crowd. As a result, the only person I regularly go out
Re: (Score:3)
Ask most Deaf people if they think they're "vulnerable" and you'll get a pretty strong "No" (or a shake of a head with the first two fingers pinched together with the thumb).
YMMV, of course.
Re: (Score:1)
> Smoke detectors,
That's why they flash. High pitched sounds don't do a good job waking up infants and old people. Or the deaf.
> oncoming trains,
Look both ways and feel the vibration traveling through the ground. Also a good practice where there's electric cars.
> cops yelling “Stop or I’ll shoot...”
Expecting people to hear is silly.
> I don’t know why so many in the deaf community insist that being unable to hear is not a disability.
You came closest with that last one: bei
Re: (Score:3)
I expect that they object to being viewed as persons of intrinsically lesser capability.
It's wrong, although not malicious, to say "deaf people are vulnerable," because vulnerability is not a permanent attribute; it's an ephemeral state that can usually be engineered out of the environment. A level crossing with only a crossbuck sign can be fitted with flashing lights.
multiple conversations (Score:2)
The problem with dinner conversations is that there are usually a number of them going on at one time. A computer has enough trouble following one voice let alone multiple voices at the same time.
Language (Score:2)
"This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving."
So, let me just ask, what was hindering this before? If one of you can hear and talk, and the other can read, then surely it's not a huge leap to one of you typing and the other reading? I know it's not a given, but it seems pretty obvious. And, from the deaf person's side, surely nothing is lost? Hell, you could have "talked with" them while chatting to
speech to text (Score:2)
Daughter and her friend communicate via texting a lot. They've got speech to text turned on at one end, and on the other end, text to speech.
Waaaait a minute...
not yet (Score:2)
My impression is, commercial speech to text works really well right now -- in fact, just well enough to lull you into complacency before it really screws you up.
Re: (Score:2)
Humans instinctively develop a language faculty at a very young age - it's easy to underestimate what a difficult problem it is for computers and difficult to appreciate how good even young children are at it. 99% accuracy sounds good, but a person who could read at that accuracy would be called illiterate.
Re: (Score:1)
Common problem for the Elderly.. please test this! (Score:1)
Freelance interpreters (Score:2)
No, there is nothing good enough. In fact, automated systems got a good, old-fashioned drubbing at the captioning challenge at the ASSETS 2013 conference. Nothing came even close to a human steno captioner.
If you want to make family events accessible and the person in question signs, my recommendation is to hire freelance interpreters. They often charge on a sliding scale, depending on the event and means of the client, and tend to run much cheaper than anyone you can get through referral services. That is
Re: (Score:2)
Well-said ... by someone who has no clue about what it is like.
Nope, unless ... (Score:3)
Are you able to do all of the following at your dinner conversation?:
1) Provide everyone with a decent close-talking directional microphone.
2) Require each person to take turns speaking, so there is very little overlap.
3) Have no pre-adolescents speaking.
4) Eliminate noticeable background noises.
5) Have no one with a strong non-native dialect speaking.
6) Require everyone to speak in full, grammatical sentences.
To the extent you say no to any of the above, you will get increasingly poor output. They are listed approximately in order of importance (1 being the most important). If you can say yes to all of those, you can probably get in the vicinity of 90% accuracy. This might be usable, depending on your ultimate purpose. If you were to additionally train acoustic and language models for all of the speakers, and then tell the software which user was speaking (i.e. switch the user on the fly during the conversation), you could probably get 95% accuracy and that would be quite usable.
So, in other words ... nope.
You might try... (Score:2)
Record the conversation and then play it into Dragon; it works, but you need a good-quality audio feed. I've also tinkered with Julius [sourceforge.jp], and although it takes a bit of setup, it works in most cases, but you have to tweak it a bit more than Dragon, at least for what I was dealing with.
Thanks for the suggestions (Score:3, Informative)
I've been following the field for a while, so I'm aware of the barriers to success; it is, as the engineers like to say, a non-trivial problem. But I can't possibly be aware of every development, so it's really helpful to get your perspective.
I agree with the general consensus that we're a ways off from accurate machine transcription of group discussions, for the reasons discussed: several conversations can be active at once, background noise interferes, context is hard to comprehend, etc.
The point about late-deafened people being able to work with lower accuracy is a good one. I'm like that; I can recognize phonetic mistakes and mentally substitute the correct word because I know what was intended, but I have a lot of born-deaf friends who would be lost.
One reader took umbrage at rusotto's joke that being able to hear doesn't help here. Me, I laughed. I know what it means to be bored silly even when everything is clear, and even ASL conversations can have the same problem. Another point was made about referring to deaf folks as "vulnerable" - yes, most people would resent that sort of label, even among those who understand it's not applied with malice.
About communications with my mother - yes, we can converse by text, or through an ASL interpreter, and via video relay, and we've done all those things. But each of them is mediated to some degree, and working through Siri is too, but with an important difference; the other mediated techniques are more intrusive and divert focus from the person you're conversing with.
In video relay, I don't see my mother at all - I see an interpreter. Text - typing or writing - is also face to face, but it's slow. An ASL interpreter divides focus between the person I'm conversing with and the 'terp. All of these options work. The difference with Siri is, I can see her as she's speaking, focus on HER, read the text generated by Siri and match that with the facial expressions and body language.
One point made was the capacity for reading fast enough to keep up with transcription of a full table of rapid-fire conversation; I agree that would be tough.
Probably the most practical solution now is an ASL 'terp for those (like me) who know ASL. This is one area where the human capacity for a complex task trumps current tech.
Glass (Score:1)
Something should be possible... if it's done in "Wet" electronics (Brain-Body) with enough processing power and sensitivity, discernment should be somewhat possible in "Dry" electronics (
Multiple Mics Can Separate Conversations (Score:1)
I've not noticed anyone else mention using multiple microphones, but with two microphones spaced a few metres apart, a computer can separate out two people talking over each other by exploiting speed-of-sound differences (i.e. conversation A reaches mic 1 then mic 2, while conversation B reaches mic 2 then mic 1).
I've seen two people talking separated using just two microphones; I don't think you need more microphones for more people. This is a common machine learning demo (that's how I've seen it), but it is just signal process
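The time-difference trick described above starts with estimating the delay between the two mics, typically via the peak of their cross-correlation. A toy sketch on synthetic signals (real recordings would need windowing and noise handling):

```python
import numpy as np

def estimate_delay(mic_a, mic_b):
    """Lag (in samples) of mic_a relative to mic_b, taken from the peak
    of their full cross-correlation. A positive result means the source
    reached mic_b first."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    return int(np.argmax(corr)) - (len(mic_b) - 1)
```

At ~343 m/s, mics a few metres apart see delays of several milliseconds, so an ordinary audio sample rate resolves them easily; each conversation's distinct delay signature is what lets the two be pulled apart.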
Sometimes being deaf is a good thing... (Score:2)
... Especially when family members are yelling and stuff. Ugh!