Ask Slashdot: Effective, Reasonably Priced Conferencing Speech-to-Text?

Posted by samzenpus
from the keep-talking dept.
First time accepted submitter DeafScribe writes "Every year during the holidays, many people in the deaf community lament the annual family gathering ritual because it means they sit around bored while watching relatives jabber. This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving. It would've been nice if conference-level speech-to-text had been available this evening for the family dinner. So how about it? Is group speech to text good enough now, and available at reasonable cost for a family dinner scenario?"
This discussion has been archived. No new comments can be posted.

  • by TWX (665546) on Monday December 30, 2013 @04:28PM (#45821605)

    To put this into a car analogy, electric cars don't need to surpass ICE cars in every conceivable scenario to make one worth buying for a given individual.

    No, but to expand on your car analogy, they have to be able to meet certain minimum standards and customer requirements.

    And dropping out of analogy, the hypothetical courtroom automatic stenographer would probably have it easiest, as the rules of the court dictate that only one person may speak at a time, and most courts have individual microphones for every speaking party for acoustically recording the proceedings anyway. The same cannot be said for the dinner table.

    Even the most rudimentary system for sampling several participants would cost hundreds of dollars. A half-way accurate comparison would be the equipment needed to record a drum-set, with individual microphones for each drum, cymbal, and accessory, and a processor that monitors line-levels and individually records each input separately. Replace the function of recording each input and turn it into processing each input for discrete words, and only then are you even getting to the hard part, interpreting what the sounds actually are.

    The low-end equipment to record drums is hundreds of dollars. High-end equipment to do the same thing costs thousands of dollars. Now tack on the cost of the processing side, and you're probably at tens of thousands of dollars, just to attempt to participate in a large group conversation, as opposed to a small-party conversation where polite participants will probably work to simplify the flow of conversation so that the impaired individual can participate.

    A friend of mine in a social club has a son with some form of developmental disability. I've heard that it's Asperger's, but I'm not entirely certain, as many of the traits commonly associated with Asperger's don't seem to manifest with him. When he's party to our conversations we modify our conversation to accommodate him. We attempt to avoid speaking over each other or over him, and we increase the amount of time that one considers a pause by a given speaker, so that we don't interrupt him while he's talking.

    If we had a substantially hearing-impaired member, we would probably modify our conversations accordingly, slowing our speech enough that lips could be read, attempting to avoid talking over each other, and attempting to keep our faces oriented to where the individual could see those faces. Given the nature of our vocabulary in this social setting (a speculative fiction group) it would be highly unlikely that a speech-to-text system would correctly interpret any of the truly important words in the conversation anyway, so such a system would be useless.

  • by DeafScribe (639721) on Monday December 30, 2013 @06:43PM (#45822967) Homepage

    I've been following the field for a while, so I'm aware of the barriers to success; it is, as the engineers like to say, a non-trivial problem. But I can't possibly be aware of every development, so it's really helpful to get your perspective.

    I agree with the general consensus that we're a ways off from accurate machine transcription of group discussions, for the reasons discussed: several conversations can be active at once, background noise interferes, context has to be comprehended, etc.

    The point about late-deafened people being able to work with lower accuracy is a good one. I'm like that: I can recognize phonetic mistakes and mentally substitute the correct word because I know what was intended, but I have a lot of born-deaf friends who would be lost.

    One reader took umbrage at rusotto's joke that being able to hear doesn't help here. Me, I laughed. I know what it means to be bored silly even when everything is clear, and even ASL conversations can have the same problem. Another point was about referring to deaf folks as "vulnerable" - yes, most people would resent that sort of label, even among those who understand it's not done with malice.

    About communications with my mother - yes, we can converse by text, or through an ASL interpreter, and via video relay, and we've done all those things. But each of them is mediated to some degree, and working through Siri is too, but with an important difference; the other mediated techniques are more intrusive and divert focus from the person you're conversing with.

    In video relay, I don't see my mother at all - I see an interpreter. Text - typing or writing - is also face to face, but it's slow. An ASL interpreter divides focus between the person I'm conversing with and the 'terp. All of these options work. The difference with Siri is, I can see her as she's speaking, focus on HER, read the text generated by Siri and match that with the facial expressions and body language.

    One point made was the capacity for reading fast enough to keep up with transcription of a full table of rapid-fire conversation; I agree that would be tough.

    Probably the most practical solution now is an ASL 'terp for those (like me) who know ASL. This is one area where the human capacity for a complex task trumps current tech.
