Technology

Ask Slashdot: Effective, Reasonably Priced Conferencing Speech-to-Text? 81

Posted by samzenpus
from the keep-talking dept.
First time accepted submitter DeafScribe writes "Every year during the holidays, many people in the deaf community lament the annual family gathering ritual because it means they sit around bored while watching relatives jabber. This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving. It would've been nice if conference-level speech-to-text had been available this evening for the family dinner. So how about it? Is group speech to text good enough now, and available at reasonable cost for a family dinner scenario?"
This discussion has been archived. No new comments can be posted.

  • ...because if there were, it'd be put into immediate use for TV closed captioning for live programs, for live presentations with a text crawl at the podium of the speaker, and in courtrooms, replacing the stenographer.

    That said, there are companies working on it, like Dragon, but they're not there yet, and when they get closer, they won't be cheap.

    TL;DR If it existed we'd be using it already.
    • by KiloByte (825081)

It hasn't advanced a bit in the 15 years since IBM was advertising ViaVoice. Unless you're the person a particular speech-recognition tool was tailored for, or someone with almost identical speech, all you get is nonsense poetry that vaguely resembles the rhythm and rhyme of what you said.

    • by Solandri (704621)

      there are companies working on it, like Dragon, but they're not there yet

      Is Dragon actually still working on it? The original authors who were the true R&D geniuses behind the technology were locked out of improving the product (and pretty much the industry) because of copyright after the botched sale of their company [wikipedia.org]. The recent demonstrations of the software I've seen look like not much has improved since 2000 except computers have gotten faster.

      • there are companies working on it, like Dragon, but they're not there yet

        Is Dragon actually still working on it? The original authors who were the true R&D geniuses behind the technology were locked out of improving the product (and pretty much the industry) because of copyright after the botched sale of their company [wikipedia.org]. The recent demonstrations of the software I've seen look like not much has improved since 2000 except computers have gotten faster.

I doubt you have actually been paying attention or used speech recognition. It has gotten so much better in the last fourteen years. Back then, it was basically worthless; now accuracy really isn't any more of an issue than it is with a human listener, even with various foreign accents in English.

  • captions (Score:5, Insightful)

    by phantomfive (622387) on Monday December 30, 2013 @02:13PM (#45820779) Journal
Go find some youtube videos with auto-captioning. That is the upper limit on the quality you will get with today's technology.

    Good luck.
    • by snsh (968808)

      Youtube is far from the best speech-to-text technology available. The best STT technology is probably owned by the NSA or companies that work with them. Part of the secret sauce to good STT is voice training and speaker recognition, which I don't believe youtube's STT is capable of yet. As far as Youtube is concerned, it's only one person talking throughout each video, so when you have a French dude speaking one sentence, followed by an Irish woman the next sentence, youtube may not dynamically adapt to

  • it exposes the raw functionality of Android's speech recognition better than anything else I have seen. "I just want something that will put on screen what is said aloud" is a feature set that is surprisingly hard to find.

The main gap I see is that this is really only practical for one-on-one conversations; group settings will require exponentially more sophistication to identify and differentiate between speakers.

    I would love to say that loved ones should fucking learn their child/parent/frie
  • by russotto (537200) on Monday December 30, 2013 @02:21PM (#45820885) Journal

    it means they sit around bored while watching relatives jabber

    Turns out being able to hear doesn't actually help here.

    • Turns out being able to hear doesn't actually help here.

      Amusing, yes, but we have a duty as a society to help our vulnerable. You may have the gift of hearing or sight today, but tomorrow anything can happen. We're not just helping them with this technology, we're helping ourselves.

      As well, some people have auditory processing disorders that make groups of people essentially unintelligible. I happen to be one of them -- I can hear and see just fine but I cannot separate individual words or conversations in a crowd. As a result, the only person I regularly go out

      • by rk (6314)

        Ask most Deaf people if they think they're "vulnerable" and you'll get a pretty strong "No" (or a shake of a head with the first two fingers pinched together with the thumb).

        YMMV, of course.

  • The problem with dinner conversations is that there are usually a number of them going on at one time. A computer has enough trouble following one voice let alone multiple voices at the same time.

  • "This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving."

    So, let me just ask, what was hindering this before? If one of you can hear and talk, and the other can read, then surely it's not a huge leap to one of you typing and the other reading? I know it's not a given, but it seems pretty obvious. And, from the deaf person's side, surely nothing is lost? Hell, you could have "talked with" them while chatting to

  • Daughter and her friend communicate via texting a lot. They've got speech to text turned on at one end, and on the other end, text to speech.

    Waaaait a minute...

  • My impression is, commercial speech to text works really well right now -- in fact, just well enough to lull you into complacency before it really screws you up.

    • by Livius (318358)

      Humans instinctively develop a language faculty at a very young age - it's easy to underestimate what a difficult problem it is for computers and difficult to appreciate how good even young children are at it. 99% accuracy sounds good, but a person who could read at that accuracy would be called illiterate.

  • When we talk to each other, we speak very differently to when we're recording ourselves into a microphone or talking to a speech to text algorithm. Computers aren't at the point where they can interpret communicative intent and so they are unable to transcribe what you mean to say as opposed to the sounds that came out of your mouth. Even when people talk to their speech to text phones, we often see some very strange interpretations. Just check out the regular and frequent postings on Fail Blog.
www.Stay-in-Touch.ca includes a "Hard-of-Hearing" function. It is text or speech-to-text at your end, which comes through as text to "Grandma." She picks up the phone and speaks; then you reply. Runs on an Android tablet for Grandma and a web page for you! I'd love you to test this.
  • No, there is nothing good enough. In fact, automated systems got a good, old-fashioned drubbing at the captioning challenge at the ASSETS 2013 conference. Nothing came even close to a human steno captioner.

    If you want to make family events accessible and the person in question signs, my recommendation is to hire freelance interpreters. They often charge on a sliding scale, depending on the event and means of the client, and tend to run much cheaper than anyone you can get through referral services. That is

  • by rogerz (78608) <roger@3playmedi a . com> on Monday December 30, 2013 @04:38PM (#45822345)

    Are you able to do all of the following at your dinner conversation?:

    1) Provide everyone with a decent close-talking directional microphone.
    2) Require each person to take turns speaking, so there is very little overlap.
    3) Have no pre-adolescents speaking.
    4) Eliminate noticeable background noises.
    5) Have no one with a strong non-native dialect speaking.
    6) Require everyone to speak in full, grammatical sentences.

    To the extent you say no to any of the above, you will get increasingly poor output. They are listed approximately in order of importance (1 being the most important). If you can say yes to all of those, you can probably get in the vicinity of 90% accuracy. This might be usable, depending on your ultimate purpose. If you were to additionally train acoustic and language models for all of the speakers, and then tell the software which user was speaking (i.e. switch the user on the fly during the conversation), you could probably get 95% accuracy and that would be quite usable.

    So, in other words ... nope.
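To make the 90%/95% accuracy figures above concrete, here is a minimal sketch (plain Python, no external libraries; the example sentences are made up) of the standard word-error-rate calculation used to score transcription output. At a typical ~150 words per minute of dinner conversation, even 90% accuracy means roughly 15 garbled words every minute.

```python
# Word error rate (WER): edit distance over words, divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance, computed over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "please pass the mashed potatoes and the gravy"
hyp = "please pass a mashed potato and the gravy"
print(f"WER: {wer(ref, hyp):.0%}")  # 2 substitutions over 8 words -> 25%
```

Note that WER penalizes insertions and deletions too, so a recognizer that drops crosstalk entirely scores just as badly as one that mis-hears it.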

Record the conversation and then play it into Dragon; it works, but you need a good-quality audio feed. I've also tinkered with Julius [sourceforge.jp] and although it takes a bit of setup it works in most cases, but you have to tweak it a bit more than Dragon, at least in terms of what I was dealing with.

  • by DeafScribe (639721) on Monday December 30, 2013 @05:43PM (#45822967) Homepage

I've been following the field for a while, so I'm aware of the barriers to success; it is, as the engineers like to say, a non-trivial problem. But I can't possibly be aware of every development, so it's really helpful to get your perspective.

I agree with the general consensus that we're a ways off from accurate machine transcription of group discussions, for the reasons discussed: several conversations can be active at once, there's interference from background noise, context has to be comprehended, etc.

The point about late-deafened people being able to work with lower accuracy is a good one. I'm like that; I can recognize phonetic mistakes and mentally substitute the correct word because I know what was intended, but I have a lot of born-deaf friends who would be lost.

One reader took umbrage at russotto's joke that being able to hear doesn't help here. Me, I laughed. I know what it means to be bored silly even when everything is clear, and even ASL conversations can have the same problem. Another point was made about referring to deaf folks as "vulnerable" - yes, most people would resent that sort of label, even those who understand it's not meant with malice.

About communications with my mother - yes, we can converse by text, through an ASL interpreter, or via video relay, and we've done all those things. But each of them is mediated to some degree. Working through Siri is too, but with an important difference: the other mediated techniques are more intrusive and divert focus from the person you're conversing with.

    In video relay, I don't see my mother at all - I see an interpreter. Text - typing or writing - is also face to face, but it's slow. An ASL interpreter divides focus between the person I'm conversing with and the 'terp. All of these options work. The difference with Siri is, I can see her as she's speaking, focus on HER, read the text generated by Siri and match that with the facial expressions and body language.

    One point made was the capacity for reading fast enough to keep up with transcription of a full table of rapid-fire conversation; I agree that would be tough.

    Probably the most practical solution now is an ASL 'terp for those (like me) who know ASL. This is one area where the human capacity for a complex task trumps current tech.

  • by jhumkey (711391)
    Not "Google Glass" as is but . . . some future version of that, would seem to be the ideal. HIGHLY directional microphone, lets you "look" at the speaker of interest to help #1 see facial expressions/body language, #2 discern that "voice" from among the background noise clutter, with interpreted output onto the display.

Something should be possible... if it's done in "Wet" electronics (brain-body), then with enough processing power and sensitivity, discernment should be somewhat possible in "Dry" electronics (
  • I've not noticed anyone else mention using multiple microphones, but with two microphones spaced a few metres apart a computer can separate out two people talking over each other due to speed-of-sound differences (i.e. conversation A reaches mic 1 then mic 2, conversation B reaches mic 2 then mic 1).

    I've seen two people talking separated using two microphones, I don't think you need more microphones for more people. This is a common machine learning demo (how I've seen it), but it is just signal process
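The two-microphone trick described above is the classic "cocktail party" demo, usually solved with independent component analysis. A minimal sketch using scikit-learn's FastICA, with synthetic signals standing in for the two speakers (this toy setup assumes instantaneous mixing; real rooms add delays and echoes, which need more elaborate methods):

```python
# Blind source separation: recover two "speakers" from two "microphones"
# that each hear a different mix of both, using FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                        # "speaker A": sinusoid
s2 = np.sign(np.sin(3 * t))               # "speaker B": square wave
S = np.c_[s1, s2]
S += 0.02 * rng.standard_normal(S.shape)  # a little room noise

A = np.array([[1.0, 0.5],                 # mixing matrix: each mic
              [0.5, 1.0]])                # hears both speakers
X = S @ A.T                               # observed microphone signals

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)              # recovered sources
                                          # (up to sign, scale, and order)

# each recovered component should correlate strongly with one true source
corr = np.abs(np.corrcoef(S.T, S_hat.T)[:2, 2:])
print(corr.max(axis=1))                   # both values near 1.0
```

ICA only tells the speakers apart; it doesn't transcribe them, so in practice this would sit in front of a recognizer as a preprocessing step.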

  • ... Especially when family members are yelling and stuff. Ugh!
